Stage 2: Model training

[1]:
import anndata
import itertools
import networkx as nx
import pandas as pd
import scanpy as sc
import scglue
import seaborn as sns
from matplotlib import rcParams
[2]:
scglue.plot.set_publication_params()
rcParams["figure.figsize"] = (4, 4)

Read preprocessed data

First, read the preprocessed data as produced by stage 1.

[3]:
rna = anndata.read_h5ad("rna_preprocessed.h5ad")
atac = anndata.read_h5ad("atac_preprocessed.h5ad")
graph = nx.read_graphml("prior.graphml.gz")

Configure data

(Estimated time: negligible)

Before model training, we need to configure the datasets using scglue.models.configure_dataset. For each dataset to be integrated, we specify the probabilistic generative model to use. Here we model the raw counts of both scRNA-seq and scATAC-seq using the negative binomial distribution ("NB").

Optionally, we can specify whether only the highly variable features should be used (use_highly_variable), which data layer to use (use_layer), as well as which preprocessing embedding (use_rep) to use as the first encoder transformation.

  • For the scRNA-seq data, we use the previously backed up raw counts in the “counts” layer, and use the PCA embedding as the first encoder transformation.

  • For the scATAC-seq data, the raw counts are just atac.X, so it’s unnecessary to specify use_layer. We use the LSI embedding as the first encoder transformation.

[4]:
scglue.models.configure_dataset(
    rna, "NB", use_highly_variable=True,
    use_layer="counts", use_rep="X_pca"
)
[5]:
scglue.models.configure_dataset(
    atac, "NB", use_highly_variable=True,
    use_rep="X_lsi"
)

Accordingly, we also subset the prior graph to retain highly variable features only.

[6]:
graph = graph.subgraph(itertools.chain(
    rna.var.query("highly_variable").index,
    atac.var.query("highly_variable").index
))
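
Since model training uses only features present in the guidance graph, it can be worth verifying the subsetting before moving on. A minimal sanity check (illustrative only, not part of the scglue API):

# Sanity check (illustrative): every highly variable feature should
# remain a node in the subsetted guidance graph.
hv_features = set(rna.var.query("highly_variable").index) \
    | set(atac.var.query("highly_variable").index)
missing = hv_features - set(graph.nodes)
assert not missing, f"{len(missing)} features missing from the graph"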

Build and train GLUE model

(Estimated time: 15-60 min, depending on computation device)

Next we train a GLUE model for integrating the two omics layers.

  • The datasets to be integrated are specified as a dict, where the keys are domain names. The domain names can be set at your discretion, as long as they are kept consistent (see below).

  • Here we specify a directory for the fit function, where model snapshots and training logs will be stored.

  • For more advanced usages, please refer to the function documentation.

[7]:
glue = scglue.models.fit_SCGLUE(
    {"rna": rna, "atac": atac}, graph,
    fit_kws={"directory": "glue"}
)
[INFO] fit_SCGLUE: Pretraining SCGLUE model...
[INFO] autodevice: Using GPU 2 as computation device.
[INFO] SCGLUEModel: Setting `graph_batch_size` = 27025
[INFO] SCGLUEModel: Setting `max_epochs` = 186
[INFO] SCGLUEModel: Setting `patience` = 16
[INFO] SCGLUEModel: Setting `reduce_lr_patience` = 8
[INFO] SCGLUETrainer: Using training directory: "glue/pretrain"
[INFO] SCGLUETrainer: [Epoch 10] train={'g_nll': 0.447, 'g_kl': 0.005, 'g_elbo': 0.452, 'x_rna_nll': 0.164, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.171, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.041, 'dsc_loss': 0.691, 'vae_loss': 0.229, 'gen_loss': 0.195}, val={'g_nll': 0.445, 'g_kl': 0.005, 'g_elbo': 0.45, 'x_rna_nll': 0.17, 'x_rna_kl': 0.007, 'x_rna_elbo': 0.177, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.041, 'dsc_loss': 0.693, 'vae_loss': 0.235, 'gen_loss': 0.201}, 6.0s elapsed
[INFO] SCGLUETrainer: [Epoch 20] train={'g_nll': 0.429, 'g_kl': 0.004, 'g_elbo': 0.433, 'x_rna_nll': 0.162, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.168, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.693, 'vae_loss': 0.226, 'gen_loss': 0.191}, val={'g_nll': 0.426, 'g_kl': 0.004, 'g_elbo': 0.43, 'x_rna_nll': 0.169, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.176, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.694, 'vae_loss': 0.233, 'gen_loss': 0.198}, 5.7s elapsed
[INFO] SCGLUETrainer: [Epoch 30] train={'g_nll': 0.421, 'g_kl': 0.004, 'g_elbo': 0.425, 'x_rna_nll': 0.161, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.167, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.693, 'vae_loss': 0.224, 'gen_loss': 0.189}, val={'g_nll': 0.421, 'g_kl': 0.004, 'g_elbo': 0.425, 'x_rna_nll': 0.168, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.174, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.041, 'dsc_loss': 0.694, 'vae_loss': 0.231, 'gen_loss': 0.197}, 5.8s elapsed
[INFO] SCGLUETrainer: [Epoch 40] train={'g_nll': 0.417, 'g_kl': 0.004, 'g_elbo': 0.421, 'x_rna_nll': 0.161, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.167, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.693, 'vae_loss': 0.224, 'gen_loss': 0.189}, val={'g_nll': 0.416, 'g_kl': 0.004, 'g_elbo': 0.419, 'x_rna_nll': 0.167, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.173, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.693, 'vae_loss': 0.23, 'gen_loss': 0.195}, 5.6s elapsed
Epoch    42: reducing learning rate of group 0 to 2.0000e-04.
Epoch    42: reducing learning rate of group 0 to 2.0000e-04.
[INFO] LRScheduler: Learning rate reduction: step 1
[INFO] SCGLUETrainer: [Epoch 50] train={'g_nll': 0.414, 'g_kl': 0.004, 'g_elbo': 0.418, 'x_rna_nll': 0.16, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.166, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.693, 'vae_loss': 0.222, 'gen_loss': 0.187}, val={'g_nll': 0.415, 'g_kl': 0.004, 'g_elbo': 0.418, 'x_rna_nll': 0.165, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.171, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.693, 'vae_loss': 0.228, 'gen_loss': 0.193}, 6.3s elapsed
[INFO] SCGLUETrainer: [Epoch 60] train={'g_nll': 0.414, 'g_kl': 0.004, 'g_elbo': 0.418, 'x_rna_nll': 0.16, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.166, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.693, 'vae_loss': 0.222, 'gen_loss': 0.188}, val={'g_nll': 0.414, 'g_kl': 0.004, 'g_elbo': 0.418, 'x_rna_nll': 0.165, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.171, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.693, 'vae_loss': 0.228, 'gen_loss': 0.193}, 6.2s elapsed
Epoch    65: reducing learning rate of group 0 to 2.0000e-05.
Epoch    65: reducing learning rate of group 0 to 2.0000e-05.
[INFO] LRScheduler: Learning rate reduction: step 2
[INFO] SCGLUETrainer: [Epoch 70] train={'g_nll': 0.414, 'g_kl': 0.004, 'g_elbo': 0.417, 'x_rna_nll': 0.159, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.165, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.039, 'dsc_loss': 0.693, 'vae_loss': 0.221, 'gen_loss': 0.187}, val={'g_nll': 0.413, 'g_kl': 0.004, 'g_elbo': 0.416, 'x_rna_nll': 0.164, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.17, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.693, 'vae_loss': 0.227, 'gen_loss': 0.192}, 5.9s elapsed
2022-01-18 21:54:08,349 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
[INFO] EarlyStopping: Restoring checkpoint "70"...
[INFO] fit_SCGLUE: Estimating balancing weight...
[INFO] estimate_balancing_weight: Clustering cells...
[INFO] estimate_balancing_weight: Matching clusters...
[INFO] estimate_balancing_weight: Matching array shape = (17, 18)...
[INFO] estimate_balancing_weight: Estimating balancing weight...
[INFO] fit_SCGLUE: Fine-tuning SCGLUE model...
[INFO] SCGLUEModel: Setting `graph_batch_size` = 27025
[INFO] SCGLUEModel: Setting `align_burnin` = 31
[INFO] SCGLUEModel: Setting `max_epochs` = 186
[INFO] SCGLUEModel: Setting `patience` = 16
[INFO] SCGLUEModel: Setting `reduce_lr_patience` = 8
[INFO] SCGLUETrainer: Using training directory: "glue/fine-tune"
[INFO] SCGLUETrainer: [Epoch 10] train={'g_nll': 0.411, 'g_kl': 0.004, 'g_elbo': 0.414, 'x_rna_nll': 0.16, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.166, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.691, 'vae_loss': 0.222, 'gen_loss': 0.188}, val={'g_nll': 0.411, 'g_kl': 0.004, 'g_elbo': 0.414, 'x_rna_nll': 0.167, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.173, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.694, 'vae_loss': 0.229, 'gen_loss': 0.195}, 5.9s elapsed
[INFO] SCGLUETrainer: [Epoch 20] train={'g_nll': 0.408, 'g_kl': 0.004, 'g_elbo': 0.412, 'x_rna_nll': 0.16, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.166, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.692, 'vae_loss': 0.222, 'gen_loss': 0.187}, val={'g_nll': 0.408, 'g_kl': 0.004, 'g_elbo': 0.412, 'x_rna_nll': 0.168, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.174, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.695, 'vae_loss': 0.23, 'gen_loss': 0.195}, 6.0s elapsed
[INFO] SCGLUETrainer: [Epoch 30] train={'g_nll': 0.407, 'g_kl': 0.004, 'g_elbo': 0.411, 'x_rna_nll': 0.16, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.165, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.694, 'vae_loss': 0.221, 'gen_loss': 0.187}, val={'g_nll': 0.407, 'g_kl': 0.004, 'g_elbo': 0.411, 'x_rna_nll': 0.167, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.173, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.679, 'vae_loss': 0.229, 'gen_loss': 0.195}, 6.0s elapsed
[INFO] SCGLUETrainer: [Epoch 40] train={'g_nll': 0.406, 'g_kl': 0.004, 'g_elbo': 0.41, 'x_rna_nll': 0.161, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.166, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.039, 'dsc_loss': 0.691, 'vae_loss': 0.222, 'gen_loss': 0.187}, val={'g_nll': 0.406, 'g_kl': 0.004, 'g_elbo': 0.409, 'x_rna_nll': 0.166, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.172, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.68, 'vae_loss': 0.228, 'gen_loss': 0.194}, 5.9s elapsed
Epoch    42: reducing learning rate of group 0 to 2.0000e-04.
Epoch    42: reducing learning rate of group 0 to 2.0000e-04.
[INFO] LRScheduler: Learning rate reduction: step 1
[INFO] SCGLUETrainer: [Epoch 50] train={'g_nll': 0.406, 'g_kl': 0.004, 'g_elbo': 0.409, 'x_rna_nll': 0.159, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.165, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.039, 'dsc_loss': 0.695, 'vae_loss': 0.221, 'gen_loss': 0.186}, val={'g_nll': 0.406, 'g_kl': 0.004, 'g_elbo': 0.41, 'x_rna_nll': 0.165, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.17, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.702, 'vae_loss': 0.227, 'gen_loss': 0.192}, 6.1s elapsed
[INFO] SCGLUETrainer: [Epoch 60] train={'g_nll': 0.406, 'g_kl': 0.004, 'g_elbo': 0.409, 'x_rna_nll': 0.159, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.165, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.691, 'vae_loss': 0.221, 'gen_loss': 0.186}, val={'g_nll': 0.406, 'g_kl': 0.004, 'g_elbo': 0.409, 'x_rna_nll': 0.165, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.17, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.691, 'vae_loss': 0.227, 'gen_loss': 0.192}, 6.1s elapsed
Epoch    65: reducing learning rate of group 0 to 2.0000e-05.
Epoch    65: reducing learning rate of group 0 to 2.0000e-05.
[INFO] LRScheduler: Learning rate reduction: step 2
[INFO] SCGLUETrainer: [Epoch 70] train={'g_nll': 0.405, 'g_kl': 0.004, 'g_elbo': 0.409, 'x_rna_nll': 0.159, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.164, 'x_atac_nll': 0.039, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.039, 'dsc_loss': 0.692, 'vae_loss': 0.22, 'gen_loss': 0.185}, val={'g_nll': 0.406, 'g_kl': 0.004, 'g_elbo': 0.409, 'x_rna_nll': 0.164, 'x_rna_kl': 0.006, 'x_rna_elbo': 0.17, 'x_atac_nll': 0.04, 'x_atac_kl': 0.0, 'x_atac_elbo': 0.04, 'dsc_loss': 0.68, 'vae_loss': 0.226, 'gen_loss': 0.192}, 5.4s elapsed
2022-01-18 22:03:00,324 ignite.handlers.early_stopping.EarlyStopping INFO: EarlyStopping: Stop training
[INFO] EarlyStopping: Restoring checkpoint "70"...

If you have tensorboard installed, you can monitor the training progress by running tensorboard --logdir=glue at the command line.

After convergence, the trained model can be saved and loaded as “.dill” files.

[8]:
glue.save("glue.dill")
# glue = scglue.models.load_model("glue.dill")

Check integration diagnostics

(Estimated time: ~2 min)

To check whether the integration is reliable, we provide an “integration consistency score”, which quantifies the consistency between the integration result and the guidance graph. The score can be computed using the scglue.models.integration_consistency function.

  • We need to provide the function with the trained model, data, as well as the guidance graph.

  • We also need to explicitly specify the layers containing raw counts if they are not .X.

[9]:
dx = scglue.models.integration_consistency(
    glue, {"rna": rna, "atac": atac}, graph,
    count_layers={"rna": "counts"}
)
dx
[INFO] get_metacells: Clustering metacells...
[INFO] get_metacells: Aggregating metacells...
[INFO] metacell_corr: Computing correlation on 10 common metacells...
[INFO] get_metacells: Clustering metacells...
[INFO] get_metacells: Aggregating metacells...
[INFO] metacell_corr: Computing correlation on 20 common metacells...
[INFO] get_metacells: Clustering metacells...
[INFO] get_metacells: Aggregating metacells...
[INFO] metacell_corr: Computing correlation on 50 common metacells...
[INFO] get_metacells: Clustering metacells...
[INFO] get_metacells: Aggregating metacells...
[INFO] metacell_corr: Computing correlation on 100 common metacells...
[INFO] get_metacells: Clustering metacells...
[INFO] get_metacells: Aggregating metacells...
[INFO] metacell_corr: Computing correlation on 200 common metacells...
[9]:
   n_meta  consistency
0      10     0.177807
1      20     0.149832
2      50     0.109070
3     100     0.087593
4     200     0.064799

Notice that the consistency score is computed across different numbers of “metacells”, which can be visualized as a curve:

[10]:
_ = sns.lineplot(x="n_meta", y="consistency", data=dx).axhline(y=0.05, c="darkred", ls="--")
[Figure: integration consistency score as a function of the number of metacells]

The higher the curve gets, the more reliable the integration is. Empirically, it is safe to assume that the integration is reliable if the curve is above the 0.05 line.
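
To turn the visual check into a programmatic one, the same threshold can be tested directly on the data frame. A minimal sketch, using the empirical 0.05 cutoff from above (a rule of thumb, not an API constant):

# Flag the integration as reliable if all consistency values
# stay above the empirical 0.05 threshold.
reliable = (dx["consistency"] > 0.05).all()
print(f"Integration considered reliable: {reliable}")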

Apply model for cell and feature embedding

(Estimated time: ~2 min)

With the trained model, we can use the encode_data method to project the single-cell omics data to cell embeddings. The first argument to encode_data specifies the domain to encode (one of the previous domain names), and the second specifies the dataset to be encoded. By convention, we store the cell embeddings in the obsm slot, under the key "X_glue".

[11]:
rna.obsm["X_glue"] = glue.encode_data("rna", rna)
atac.obsm["X_glue"] = glue.encode_data("atac", atac)
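
Because both domains are projected into the same latent space, the two embedding matrices agree in their latent dimensionality, which is what makes them directly comparable. A quick check (illustrative):

# Both domains share one latent space, so the embeddings
# have the same number of latent dimensions.
assert rna.obsm["X_glue"].shape[1] == atac.obsm["X_glue"].shape[1]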

To jointly visualize the cell embeddings from two omics layers, we construct a combined dataset.

[12]:
combined = anndata.concat([rna, atac])
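
Here the preprocessed objects already carry "cell_type" and "domain" columns in obs. If your own data lack a domain label, anndata.concat can create one during concatenation; a sketch (the label and key names are arbitrary):

# Alternative (only if obs["domain"] is absent): let concat add the label.
combined = anndata.concat(
    [rna, atac], label="domain",
    keys=["scRNA-seq", "scATAC-seq"]
)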

Then we use UMAP to visualize the aligned embeddings. We can see that the two omics layers are now correctly aligned.

[13]:
sc.pp.neighbors(combined, use_rep="X_glue", metric="cosine")
sc.tl.umap(combined)
sc.pl.umap(combined, color=["cell_type", "domain"], wspace=0.65)
... storing 'domain' as categorical
[Figure: UMAP of the combined cell embeddings, colored by cell_type and domain]

To obtain feature embeddings, we can use the encode_graph method.

[14]:
feature_embeddings = glue.encode_graph(graph)
feature_embeddings = pd.DataFrame(feature_embeddings, index=glue.vertices)
feature_embeddings.iloc[:5, :5]
[14]:
                      0         1         2         3         4
0610009B22Rik  0.002210  0.001104 -0.003629 -0.002921  0.002139
0610025J13Rik  0.001776 -0.004610 -0.003437 -0.002904 -0.002073
1110002J07Rik  0.000559  0.006707 -0.001212  0.000411 -0.005088
1110006O24Rik  0.004753 -0.001475 -0.000714 -0.001817  0.002758
1110020A21Rik -0.005927  0.003265 -0.005822 -0.005803  0.000757
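
The embedded datasets and feature embeddings are typically saved at this point for downstream analyses such as the regulatory inference mentioned below. A minimal sketch (the file names are arbitrary):

# Save the embedded datasets and the feature embeddings
# for downstream use (file names are arbitrary).
rna.write("rna-emb.h5ad", compression="gzip")
atac.write("atac-emb.h5ad", compression="gzip")
feature_embeddings.to_csv("feature_embeddings.csv")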

For regulatory inference based on the feature embeddings, please refer to our case study.