Subtomogram Embeddings Generation

The subtomogram embeddings module extracts fixed-length feature representations from segmented subtomograms. These embeddings capture the 3D structural and textural properties of individual macromolecular complexes and enable unsupervised comparison, visualization, and clustering.

Embeddings are computed using a self-supervised SimSiam model fine-tuned with contrastive learning and can be generated from either:

  • Instance segmentation masks, or
  • Particle center coordinates (e.g. from particle identification or external pickers)

Supported pipelines:

  • Instance Segmentation → Subtomogram Embeddings → Visualization → (Optional Clustering)
  • Provided center coordinates (e.g. from Particle Identification) → Subtomogram Embeddings → Visualization → (Optional Clustering)


Overview

For each detected instance, CryoSiam extracts a local 3D subtomogram and maps it into a high-dimensional embedding space.

Quick guide

  • You have instance masks → use instance-based embeddings
  • You only have particle coordinates → use center-based embeddings

Input modes for subtomogram extraction

CryoSiam supports two alternative input modes.

1. Instance-based embeddings

Subtomograms are extracted from instance segmentation masks.

Required input

  • instances_mask_folder

Characteristics

  • Object shapes define subtomogram regions
  • Supports masking strategies (masking_type)
  • Recommended when instance segmentation is available

2. Center-based embeddings

Subtomograms are extracted as fixed-size cubes centered on particle coordinates.

This mode is intended for:

  • Particle Identification outputs
  • External particle pickers
  • Manually curated coordinates

Required input

  • centers_file
  • centers_patch_size

Instance segmentation is not required in this mode.
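The fixed-size cube extraction described above can be sketched in a few lines. This is only an illustration, not CryoSiam's implementation: the helper name `cube_bounds` is hypothetical, and shifting the window inward at tomogram borders is an assumed policy (the tool itself may pad or skip such particles instead). Coordinates follow the (z, y, x) order used by the centers file.

```python
def cube_bounds(center, patch_size, volume_shape):
    """Return per-axis (start, stop) index pairs for a cube around `center`.

    `center` and `volume_shape` are given as (z, y, x) in voxels. When the
    cube would cross a border it is shifted inward (an assumed policy), so it
    always keeps its full edge length.
    """
    half = patch_size // 2
    bounds = []
    for coord, dim in zip(center, volume_shape):
        if dim < patch_size:
            raise ValueError("volume is smaller than the requested patch")
        start = int(round(coord)) - half
        # Clamp the start so that [start, start + patch_size) stays inside the volume.
        start = max(0, min(start, dim - patch_size))
        bounds.append((start, start + patch_size))
    return bounds


print(cube_bounds(center=(100, 5, 250), patch_size=32, volume_shape=(200, 512, 512)))
# [(84, 116), (0, 32), (234, 266)]
```

The returned pairs can be turned into slices to crop the tomogram array, for example `volume[84:116, 0:32, 234:266]`.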

Important constraint

When using center-based embeddings:

  • centers_patch_size should be close to the expected physical size of the particle.

Overly large patches include excessive background signal and significantly degrade embedding quality.
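One way to pick a starting value for centers_patch_size is to convert the expected particle diameter into voxels. The sketch below assumes that centers_patch_size is an edge length in voxels and that an even value is convenient; neither is stated by CryoSiam, and the helper `suggest_patch_size` and its `margin` parameter are hypothetical.

```python
import math


def suggest_patch_size(particle_diameter_angstrom, voxel_size_angstrom, margin=1.0):
    """Convert a physical particle diameter into an even edge length in voxels.

    `margin` scales the diameter (1.0 = tight fit). Rounding up to an even
    number is an assumption made for this sketch, not a CryoSiam requirement.
    """
    if particle_diameter_angstrom <= 0 or voxel_size_angstrom <= 0:
        raise ValueError("diameter and voxel size must be positive")
    voxels = math.ceil(particle_diameter_angstrom * margin / voxel_size_angstrom)
    return voxels + (voxels % 2)  # bump odd values up to the next even number


# A particle about 320 Å across in a tomogram sampled at 10 Å per voxel.
print(suggest_patch_size(320, 10))
# 32
```

Keeping `margin` near 1.0 follows the constraint above: a larger margin adds surrounding background to every patch.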


Embedding models and masking strategies

CryoSiam provides three embedding variants, controlled by the masking_type parameter (when working with masks provided from instance segmentation). Each variant corresponds to a different strategy for masking background signal when extracting subtomograms:

| Masking type | Description | When to use |
|---|---|---|
| 0 | No masking – the raw subtomogram is extracted without applying an instance mask | Use when instance masks are unreliable or when full surrounding context is desired |
| 1 | Convex hull masking – the instance mask is expanded to its convex hull | Recommended default; balances object focus with local context |
| 2 | Strict masking – only voxels inside the instance mask are retained | Use when isolating object shape and internal structure is critical |

Each masking strategy corresponds to a separately trained embedding model. Make sure that the selected masking_type matches the trained model specified by trained_model.

Note:
For center-based embeddings, only masking_type: 0 (no masking) is supported. Masking types 1 and 2 require instance masks and are incompatible with center-based extraction.

Recommendation:
Use masking_type: 1 (convex hull masking) for most datasets: it is more tolerant of imperfect instance masks than strict masking while still suppressing background noise.
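The compatibility rules in this section can be checked before launching a run. The function below is a hypothetical pre-flight check written for illustration; it is not part of CryoSiam. It only encodes the rules stated above and operates on a plain dictionary holding the top-level config keys.

```python
def check_embedding_config(cfg, mode):
    """Return a list of problems for `mode` ("instances" or "centers").

    `cfg` is a dict with the top-level config keys. An empty list means the
    rules described in this section are satisfied.
    """
    problems = []
    masking_type = cfg.get("masking_type")
    if mode == "instances":
        if not cfg.get("instances_mask_folder"):
            problems.append("instances_mask_folder is required for instance-based embeddings")
        if masking_type not in (0, 1, 2):
            problems.append("masking_type must be 0, 1 or 2")
    elif mode == "centers":
        for key in ("centers_file", "centers_patch_size"):
            if not cfg.get(key):
                problems.append(f"{key} is required for center-based embeddings")
        if masking_type != 0:
            problems.append("center-based embeddings only support masking_type: 0")
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return problems


cfg = {"centers_file": "centers.star", "centers_patch_size": 32, "masking_type": 1}
print(check_embedding_config(cfg, "centers"))
# ['center-based embeddings only support masking_type: 0']
```

The check does not verify that trained_model matches the chosen masking_type; that still has to be confirmed by hand.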


Example Results

  • UMAP visualization: each point represents a subtomogram embedding
  • KMeans clustering: coarse grouping of similar structures
  • Spectral clustering: captures fine-grained structural variability

Embedding space projection (UMAP)

[Figure: 2D UMAP embedding]

KMeans clustering

[Figure: KMeans clusters]

Spectral clustering

[Figure: Spectral clusters]


Trained Model

Pre-trained embedding models are available for the different masking strategies.

Example model (convex hull masking): CryoSiam subtomogram embedding convex-hull model (v1.0)

Example model for centers (no masking): CryoSiam subtomogram embedding no masking model (v1.0)

A list of all provided models is available on the Trained models page.


Running subtomogram embeddings

Generate embeddings from instance masks

cryosiam simsiam_embeddings_predict --config_file=configs/subtomo_embeddings.yaml

To process a single tomogram only:

cryosiam simsiam_embeddings_predict --config_file=configs/subtomo_embeddings.yaml --filename TS_01.mrc

Generate embeddings from particle centers

cryosiam simsiam_embeddings_from_centers_predict --config_file=configs/subtomo_embeddings.yaml

To process a single tomogram only:

cryosiam simsiam_embeddings_from_centers_predict --config_file=configs/subtomo_embeddings.yaml --filename TS_01.mrc

This command:

  • Reads particle centers from a .star or .csv file
  • Extracts fixed-size subtomograms around each center
  • Computes embeddings using the selected SimSiam model

Particle centers file format

When using center-based embeddings, particle coordinates must be provided in a .star or .csv file.

Mandatory fields

| Field name | STAR equivalent | Description |
|---|---|---|
| tomo | rlnMicrographName | Tomogram name or path |
| centroid-0 | rlnCoordinateZ | Z coordinate (voxel) |
| centroid-1 | rlnCoordinateY | Y coordinate (voxel) |
| centroid-2 | rlnCoordinateX | X coordinate (voxel) |

  • One row per particle
  • Coordinates must be in voxel space
  • All tomograms to be processed must be listed in the same file
  • Additional columns are allowed and ignored
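A centers file in .csv form can be produced with Python's standard csv module. The sketch below assumes a comma-delimited file with a header row that uses the mandatory field names; the tomogram names and coordinates are made-up examples, and `write_centers` is a hypothetical helper.

```python
import csv
import io

FIELDS = ["tomo", "centroid-0", "centroid-1", "centroid-2"]


def write_centers(handle, particles):
    """Write one row per particle; each particle is (tomo, z, y, x) in voxels."""
    writer = csv.writer(handle)
    writer.writerow(FIELDS)
    for tomo, z, y, x in particles:
        # Column order is Z, Y, X to match centroid-0, centroid-1, centroid-2.
        writer.writerow([tomo, z, y, x])


particles = [
    ("TS_01.mrc", 120, 340, 512),
    ("TS_01.mrc", 98, 77, 401),
    ("TS_02.mrc", 64, 250, 130),
]

buffer = io.StringIO()
write_centers(buffer, particles)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`.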

Visualize Embeddings

cryosiam simsiam_visualize_embeddings --config_file=configs/subtomo_embeddings.yaml

This command generates PCA/UMAP projections and distance maps for qualitative inspection.


(Optional) Cluster embeddings

KMeans clustering:

cryosiam simsiam_embeddings_kmeans_clustering --config_file=configs/subtomo_embeddings.yaml

Spectral clustering:

cryosiam simsiam_embeddings_spectral_clustering --config_file=configs/subtomo_embeddings.yaml
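The commands above can be chained from a small script so that a failing step stops the run. This is a convenience sketch rather than a CryoSiam feature: `build_commands` and `run_pipeline` are hypothetical helpers, and only the subcommands and the --config_file flag shown in this section are used.

```python
import subprocess

CONFIG = "configs/subtomo_embeddings.yaml"


def build_commands(config_file, from_centers=False, cluster=None):
    """Return the argv lists for embeddings → visualization → optional clustering."""
    predict = (
        "simsiam_embeddings_from_centers_predict"
        if from_centers
        else "simsiam_embeddings_predict"
    )
    steps = [predict, "simsiam_visualize_embeddings"]
    if cluster == "kmeans":
        steps.append("simsiam_embeddings_kmeans_clustering")
    elif cluster == "spectral":
        steps.append("simsiam_embeddings_spectral_clustering")
    elif cluster is not None:
        raise ValueError(f"unknown clustering method: {cluster!r}")
    return [["cryosiam", step, f"--config_file={config_file}"] for step in steps]


def run_pipeline(commands):
    """Run each command in order; check=True raises on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Print the plan without executing it; call run_pipeline(...) to actually run.
for cmd in build_commands(CONFIG, cluster="kmeans"):
    print(" ".join(cmd))
```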

Example Configuration (configs/config_subtomo_embeddings.yaml)

data_folder: '/scratch/stojanov/dataset1/predictions/denoised'
instances_mask_folder: '/scratch/stojanov/dataset1/predictions/instances'
centers_file: '/scratch/stojanov/dataset1/ribosome_centers.star'
centers_patch_size: 32
prediction_folder: '/scratch/stojanov/dataset1/predictions/subtomo_embeds'
trained_model: '/g/zaugg/stojanov/simulated_datasets/final_models/simsiam_contrastive/version_1/model/last.ckpt'
contrastive: True
file_extension: '.mrc'

test_files: null
clustering_files: null
visualization_files: null

min_particle_size: 10
max_particle_size: null
masking_type: 1
expand_labels: 3

clustering_kmeans:
  num_clusters: 6
  visualization: True

clustering_spectral:
  num_clusters: 6
  estimate_num_clusters: False
  visualization: True

visualization:
  prediction_folder: '/scratch/stojanov/dataset1/predictions/subtomo_embeds/vis'
  distance: 'euclidean'
  pca_components: null
  visualization_suffix: 'instance_regions.csv'
  visualize_umap: True
  3d_umap: False

parameters:
  data:
    patch_size: [ 64, 64, 64 ]
    patch_overlap: null
    min: 0
    max: 1
    mean: 0
    std: 1
  network:
    spatial_dims: 3
    in_channels: 1
    dim: 1024

hyper_parameters:
  batch_size: 10

Config Reference

Top‑level keys

| Key | Type | Description |
|---|---|---|
| data_folder | str | Path to denoised tomograms |
| instances_mask_folder | str | Path to instance segmentation masks; null when working with centers |
| centers_file | str | Path to particle centers file; null when working with instance masks |
| centers_patch_size | int | Patch size around particle centers; null when working with instance masks |
| prediction_folder | str | Output directory for embeddings |
| trained_model | str | SimSiam embedding model checkpoint (.ckpt) |
| contrastive | bool | Indicates contrastive (SimSiam) training |
| file_extension | str | Input file extension (.mrc or .rec, default: .mrc) |
| test_files | list[str] or null | Specific tomograms to process; null processes all files |
| min_particle_size | int | Minimum voxel size of valid instances |
| max_particle_size | int or null | Maximum voxel size (optional); null = no limit |
| masking_type | int | Mask generation method (0 = no masking, 1 = convex hull masking, 2 = strict masking) |
| expand_labels | int | Number of voxels to expand around mask boundaries for convex hull or strict masking |
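The effect of min_particle_size and max_particle_size can be pictured as a filter on instance sizes. The sketch below makes several assumptions that the reference does not confirm: that "size" means the number of voxels carrying a label, that label 0 is background, and that both limits are inclusive. `filter_instances` is a hypothetical helper.

```python
from collections import Counter


def filter_instances(labels, min_size, max_size=None):
    """Return the instance labels whose voxel count lies within the limits.

    `labels` is a flat iterable with one integer label per voxel; 0 is
    treated as background (an assumption of this sketch).
    """
    counts = Counter(label for label in labels if label != 0)
    return sorted(
        label
        for label, n_voxels in counts.items()
        if n_voxels >= min_size and (max_size is None or n_voxels <= max_size)
    )


# Toy mask: instance 1 has 3 voxels, instance 2 has 1, instance 3 has 5.
toy_mask = [0, 0, 1, 1, 1, 2, 3, 3, 3, 3, 3]
print(filter_instances(toy_mask, min_size=2, max_size=4))
# [1]
```

Under these assumptions, lowering min_size admits smaller instances, which is why the troubleshooting advice for "few embeddings" is to lower min_particle_size.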

clustering_kmeans

| Key | Type | Description |
|---|---|---|
| num_clusters | int | Number of clusters for KMeans algorithm |
| visualization | bool | If true, generate scatter/UMAP plots of the embeddings |

clustering_spectral

| Key | Type | Description |
|---|---|---|
| num_clusters | int | Expected number of spectral clusters |
| estimate_num_clusters | bool | If true, automatically estimate cluster number |
| visualization | bool | Enable cluster visualizations |

visualization

| Key | Type | Description |
|---|---|---|
| prediction_folder | str | Directory for saving visualizations and projections |
| distance | str | Metric for pairwise similarity (euclidean, cosine, etc.) |
| pca_components | int or null | Number of PCA components before projection |
| visualization_suffix | str | CSV file containing mapping between IDs and embedding vectors |
| visualize_umap | bool | Run 2D UMAP projection for visualization |
| 3d_umap | bool | Run 3D UMAP visualization (interactive) |
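The choice of distance matters because the two named metrics treat vector length differently. The sketch below uses the standard definitions of Euclidean and cosine distance in plain Python; it does not show how CryoSiam computes them internally, and the two short vectors stand in for real embeddings.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to the magnitude of the vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; depends only on the angle between vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / norm


# Same direction, different length: far apart in Euclidean terms, identical in cosine terms.
a, b = [1.0, 0.0], [2.0, 0.0]
print(euclidean_distance(a, b))
print(cosine_distance(a, b))
```

If embeddings differ mainly in norm rather than direction, the two settings can therefore produce noticeably different distance maps.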

parameters

| Key | Type | Description |
|---|---|---|
| data.patch_size | list[int] | Sliding-window patch size for 3D inference |
| data.min | float | Intensity minimum value for data scaling |
| data.max | float | Intensity maximum value for data scaling |
| data.mean | float | Mean used for normalization |
| data.std | float | Std used for normalization |
| network.in_channels | int | Number of input channels (usually 1) |
| network.spatial_dims | int | Dimensionality of the model (3 for tomograms) |
| network.dim | int | Dimension of embedding space (e.g., 1024) |

hyper_parameters

| Key | Type | Description |
|---|---|---|
| batch_size | int | Number of subtomograms per batch (default 10) |

Troubleshooting

| Symptom | Suggested Fix |
|---|---|
| Empty embedding CSV | Check instance masks and instances_mask_folder |
| Few embeddings | Lower min_particle_size |
| GPU memory error | Reduce batch_size |
| Clusters overlap visually | Increase num_clusters or use spectral clustering |
| Embeddings dominated by background | Reduce centers_patch_size |

Next Steps