Stage 1: Data preprocessing

In this tutorial, we show how to prepare the data needed for GLUE model training, using the SNARE-seq data (Chen et al., 2019) as an example. The SNARE-seq data consist of paired scRNA-seq and scATAC-seq profiles, but we treat them as unpaired and try to align the two omics layers using GLUE.

[1]:
import anndata as ad
import networkx as nx
import scanpy as sc
import scglue
from matplotlib import rcParams
[2]:
scglue.plot.set_publication_params()
rcParams["figure.figsize"] = (4, 4)

Read data

First, we need to prepare the scRNA-seq and scATAC-seq data as AnnData objects. AnnData is the standard data class used in scglue. If you are unfamiliar with it, see the AnnData documentation for more details, including how to construct AnnData objects from scratch and how to read data in other formats (csv, mtx, loom, etc.) into AnnData objects.

Here we just load existing h5ad files, which is the native file format for AnnData. The h5ad files used in this tutorial can be downloaded from here:

[3]:
rna = ad.read_h5ad("Chen-2019-RNA.h5ad")
rna
[3]:
AnnData object with n_obs × n_vars = 9190 × 28930
    obs: 'domain', 'cell_type'
[4]:
atac = ad.read_h5ad("Chen-2019-ATAC.h5ad")
atac
[4]:
AnnData object with n_obs × n_vars = 9190 × 241757
    obs: 'domain', 'cell_type'

Preprocess scRNA-seq data

(Estimated time: ~2 min)

To begin with, the scRNA-seq expression matrix is supposed to contain raw UMI counts:

[5]:
rna.X, rna.X.data
[5]:
(<9190x28930 sparse matrix of type '<class 'numpy.float32'>'
        with 8633857 stored elements in Compressed Sparse Row format>,
 array([1., 1., 1., ..., 1., 1., 1.], dtype=float32))
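
One quick way to confirm that a matrix holds raw counts is to check that every stored value is a non-negative whole number. The following is a minimal pure-Python sketch; the helper name looks_like_raw_counts is hypothetical and not part of scglue, and the toy lists stand in for the stored values shown above (rna.X.data).

```python
def looks_like_raw_counts(values):
    """Return True if every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in values)


# Toy stand-ins for the stored values of a sparse matrix (e.g. rna.X.data).
raw = [1.0, 3.0, 1.0, 2.0]
normalized = [0.69, 1.39, 0.0, 2.3]

print(looks_like_raw_counts(raw))         # True
print(looks_like_raw_counts(normalized))  # False
```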

Before any preprocessing, we back up the raw UMI counts in a layer called “counts”. It will be used later during model training.

[6]:
rna.layers["counts"] = rna.X.copy()

Then we follow a minimal scanpy pipeline for data preprocessing (see the scanpy tutorial if you are unfamiliar).

First, we use the “seurat_v3” method to select 2,000 highly variable genes.

[7]:
sc.pp.highly_variable_genes(rna, n_top_genes=2000, flavor="seurat_v3")

Then we normalize, log-transform, and scale the data, and perform dimension reduction via PCA. Here we use 100 principal components.

The PCA embedding will be used in stage 2 as the first encoder transformation to reduce model size.

[8]:
sc.pp.normalize_total(rna)
sc.pp.log1p(rna)
sc.pp.scale(rna)
sc.tl.pca(rna, n_comps=100, svd_solver="auto")
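
To make the first two steps concrete, here is a pure-Python sketch of library-size normalization followed by a log transform on a toy count matrix. It assumes that each cell is scaled to the median total count across cells, which is our understanding of the sc.pp.normalize_total default; the function name and toy data are illustrative only, and the real functions operate on the AnnData matrix in place.

```python
import math
from statistics import median


def normalize_and_log(counts):
    """Scale each cell (row) to the median total count, then apply log(1 + x).

    Assumes no cell has a total count of zero.
    """
    totals = [sum(row) for row in counts]
    target = median(totals)  # assumed default target of sc.pp.normalize_total
    return [
        [math.log1p(value * target / total) for value in row]
        for row, total in zip(counts, totals)
    ]


# Three toy cells with total counts 2, 4 and 8 (median total = 4).
toy_counts = [
    [1, 1, 0],
    [2, 1, 1],
    [4, 0, 4],
]
toy_log = normalize_and_log(toy_counts)

# Undoing the log shows that every cell now sums to the median total.
for row in toy_log:
    print(round(sum(math.expm1(v) for v in row), 6))
```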

Optionally, we can visualize the RNA domain with UMAP.

[9]:
sc.pp.neighbors(rna, metric="cosine")
sc.tl.umap(rna)
sc.pl.umap(rna, color="cell_type")
[Figure: UMAP of the RNA domain, colored by cell type]

Preprocess scATAC-seq data

(Estimated time: ~2 min)

Similar to scRNA-seq, the scATAC-seq accessibility matrix is also supposed to contain raw counts.

[10]:
atac.X, atac.X.data
[10]:
(<9190x241757 sparse matrix of type '<class 'numpy.float32'>'
        with 22575391 stored elements in Compressed Sparse Row format>,
 array([1., 1., 1., ..., 1., 1., 1.], dtype=float32))

For scATAC-seq, we apply latent semantic indexing (LSI) for dimension reduction, using the function scglue.data.lsi, a Python reimplementation of the LSI function in Signac. We also set the dimensionality to 100. The other keyword argument, n_iter=15, is passed to sklearn.utils.extmath.randomized_svd; larger values increase the precision of the randomized SVD.

The LSI embedding will be used in stage 2 as the first encoder transformation to reduce model size.

[11]:
scglue.data.lsi(atac, n_components=100, n_iter=15)
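
LSI is commonly described as a TF-IDF weighting followed by a truncated SVD. The sketch below illustrates only the TF-IDF step on a toy accessibility matrix in pure Python, using the log(1 + TF × IDF × 10^4) form. This follows one common formulation and is an assumption here, not necessarily the exact computation performed by scglue.data.lsi; the function name tfidf and the toy matrix are illustrative.

```python
import math


def tfidf(counts, scale=1e4):
    """Weight a cells x peaks count matrix as log(1 + TF * IDF * scale)."""
    n_cells = len(counts)
    cell_totals = [sum(row) for row in counts]        # total counts per cell
    peak_totals = [sum(col) for col in zip(*counts)]  # total counts per peak
    return [
        [
            math.log1p((value / cell_totals[i]) * (n_cells / peak_totals[j]) * scale)
            for j, value in enumerate(row)
        ]
        for i, row in enumerate(counts)
    ]


# Toy matrix: 3 cells x 3 peaks; peak 0 is open in every cell, peak 2 in one.
toy_atac = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
weighted = tfidf(toy_atac)

# Within the last cell, the rare peak gets a larger weight than the common one.
print(weighted[2][2] > weighted[2][0])  # True
```

The weighted matrix would then be passed to an SVD, keeping the leading components as the low-dimensional embedding.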

Optionally, we may also visualize the ATAC domain with UMAP.

[12]:
sc.pp.neighbors(atac, use_rep="X_lsi", metric="cosine")
sc.tl.umap(atac)
[13]:
sc.pl.umap(atac, color="cell_type")
[Figure: UMAP of the ATAC domain, colored by cell type]

Construct prior regulatory graph

(Estimated time: ~2 min)

Next, we need to construct the guidance graph, which will be utilized by GLUE to orient the multi-omics alignment. The graph should contain omics features as nodes (e.g., genes for scRNA-seq, and peaks for scATAC-seq), and prior regulatory interactions as edges.

GLUE accepts guidance graphs in the form of networkx graph objects (see this introduction if you are unfamiliar). So, in principle, you can manually construct a guidance graph tailored to your needs, as long as the graph complies with the following standards:

  • The graph nodes should cover all omics features in the datasets to be integrated.

  • The graph edges should contain “weight” and “sign” as edge attributes. Weights should lie in the range (0, 1], and signs should be either 1 or -1.

  • The graph should contain a self-loop for each node, with weight = 1, and sign = 1.

  • It is recommended to use undirected graphs (Graph, MultiGraph), or symmetric directed graphs (DiGraph, MultiDiGraph).
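
The standards above can be sketched as a simple check over a toy edge list. This is a pure-Python illustration only: the function check_guidance_edges and the feature names are hypothetical, and this is not the implementation used by scglue, which works on networkx graph objects.

```python
def check_guidance_edges(edges, features):
    """Check a toy edge list against the standards listed above.

    edges: list of (source, target, weight, sign) tuples
    features: iterable of feature names that must appear as nodes
    Returns a list of human-readable problems (empty if all checks pass).
    """
    problems = []
    nodes = {u for u, _, _, _ in edges} | {v for _, v, _, _ in edges}

    # 1. Every omics feature must be covered by the graph.
    missing = set(features) - nodes
    if missing:
        problems.append(f"features not covered: {sorted(missing)}")

    # 2. Weights must lie in (0, 1] and signs must be 1 or -1.
    for u, v, weight, sign in edges:
        if not 0 < weight <= 1:
            problems.append(f"weight out of (0, 1] on {u} -> {v}")
        if sign not in (1, -1):
            problems.append(f"sign not 1 or -1 on {u} -> {v}")

    # 3. Every node needs a self-loop with weight = 1 and sign = 1.
    pairs = {(u, v): (weight, sign) for u, v, weight, sign in edges}
    for node in sorted(nodes | set(features)):
        if pairs.get((node, node)) != (1, 1):
            problems.append(f"missing self-loop on {node}")

    # 4. Recommended: each directed edge has a matching reverse edge.
    for (u, v), attrs in pairs.items():
        if pairs.get((v, u)) != attrs:
            problems.append(f"edge {u} -> {v} has no symmetric counterpart")
    return problems


gene, peak = "GeneA", "chr1:100-200"  # made-up feature names
toy_edges = [
    (gene, peak, 1.0, 1),
    (peak, gene, 1.0, 1),
    (gene, gene, 1.0, 1),
    (peak, peak, 1.0, 1),
]
print(check_guidance_edges(toy_edges, [gene, peak]))       # []
print(check_guidance_edges(toy_edges[:-1], [gene, peak]))  # one self-loop problem
```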

Below, we show how to construct a guidance graph for scRNA-seq and scATAC-seq integration, using built-in functions in scglue.

Obtain genomic coordinates

The most commonly used prior information linking ATAC peaks with genes is genomic proximity. To use it, we need the genomic coordinates of peaks and genes, stored as feature metadata in the var slot.

For typical scRNA-seq datasets, only the gene names/IDs are stored, but not their coordinates, as is the case here:

[14]:
rna.var.head()
[14]:
               highly_variable  highly_variable_rank     means  variances  variances_norm      mean       std
genes
0610005C13Rik            False                   NaN  0.001415   0.001413        0.958918  0.000832  0.024802
0610009B22Rik             True                1528.0  0.017301   0.022228        1.126013  0.010283  0.091634
0610009E02Rik            False                   NaN  0.014799   0.017193        1.023411  0.008182  0.076223
0610009L18Rik            False                   NaN  0.016540   0.020403        1.082735  0.009849  0.088678
0610010F05Rik            False                   NaN  0.157563   0.197176        1.004244  0.086493  0.245927

So, we provide a utility function, scglue.data.get_gene_annotation, to supplement the coordinate information from GTF files. The following usage assumes that rna.var_names correspond to the “gene_name” attribute in the GTF file. For other cases, please check the function documentation.

The GTF file used here can be downloaded from GENCODE.

[15]:
scglue.data.get_gene_annotation(
    rna, gtf="gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz",
    gtf_by="gene_name"
)
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].head()
[15]:
               chrom  chromStart   chromEnd
genes
0610005C13Rik   chr7    45567793   45575327
0610009B22Rik  chr11    51685385   51688874
0610009E02Rik   chr2    26445695   26459390
0610009L18Rik  chr11   120348677  120351190
0610010F05Rik  chr11    23564960   23633639
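
As a rough illustration of what such an annotation step involves, the sketch below pulls BED-style coordinates out of a single GTF gene record in pure Python. The helper gtf_gene_to_bed and the example record are made up for illustration and are not the parsing code used by scglue. Note that GTF coordinates are 1-based and inclusive, while BED coordinates are 0-based and half-open, so the start is shifted by one.

```python
import re


def gtf_gene_to_bed(line):
    """Convert one GTF 'gene' record into BED-style coordinate fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, _, feature, start, end, _, strand, _, attributes = fields
    if feature != "gene":
        return None
    match = re.search(r'gene_name "([^"]+)"', attributes)
    if match is None:
        return None
    return {
        "name": match.group(1),
        "chrom": chrom,
        "chromStart": int(start) - 1,  # GTF is 1-based, BED is 0-based
        "chromEnd": int(end),          # inclusive end in GTF = exclusive end in BED
        "strand": strand,
    }


# A made-up GTF record for illustration (not taken from the GENCODE file).
example = "\t".join([
    "chr1", "HAVANA", "gene", "1001", "2000", ".", "+", ".",
    'gene_id "ENSMUSG_EXAMPLE"; gene_name "ExampleGene";',
])
print(gtf_gene_to_bed(example))
```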

Note that the coordinates have column names “chrom”, “chromStart” and “chromEnd”, corresponding to the first three columns of the BED format. These exact column names are required. Alternative names like “chr”, “start”, “end” are NOT recognized.

For the scATAC-seq data, the coordinates are already in var_names. We just need to extract them.

[16]:
atac.var_names[:5]
[16]:
Index(['chr1:3005833-3005982', 'chr1:3094772-3095489', 'chr1:3119556-3120739',
       'chr1:3121334-3121696', 'chr1:3134637-3135032'],
      dtype='object', name='peaks')
[17]:
split = atac.var_names.str.split(r"[:-]")
atac.var["chrom"] = split.map(lambda x: x[0])
atac.var["chromStart"] = split.map(lambda x: x[1]).astype(int)
atac.var["chromEnd"] = split.map(lambda x: x[2]).astype(int)
atac.var.head()
[17]:
                      chrom  chromStart  chromEnd
peaks
chr1:3005833-3005982   chr1     3005833   3005982
chr1:3094772-3095489   chr1     3094772   3095489
chr1:3119556-3120739   chr1     3119556   3120739
chr1:3121334-3121696   chr1     3121334   3121696
chr1:3134637-3135032   chr1     3134637   3135032
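
If peak names in your own data may deviate from the “chrom:start-end” pattern, it can help to validate them while splitting. Below is a small pure-Python sketch; the helper parse_peak is hypothetical and simply raises an error on names it cannot parse, instead of silently producing wrong columns.

```python
import re

# chrom may itself contain characters such as "_" or ".", so only the
# trailing ":start-end" part is matched strictly.
PEAK_PATTERN = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_peak(name):
    """Split a 'chrom:start-end' peak name into (chrom, start, end)."""
    match = PEAK_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected peak name: {name!r}")
    return match["chrom"], int(match["start"]), int(match["end"])


print(parse_peak("chr1:3005833-3005982"))  # ('chr1', 3005833, 3005982)
```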

Graph construction

Now that we have the genomic coordinates for all omics features, we can use the scglue.genomics.rna_anchored_guidance_graph function to construct the guidance graph.

By default, an ATAC peak is connected to a gene if they overlap in either the gene body or promoter region. See the function documentation for adjustable settings.
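
The overlap rule described above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the scglue implementation: the promoter length of 2000 bp is an illustrative value, the promoter is taken to lie upstream of the gene start according to a “strand” field, and the function names and toy coordinates are made up.

```python
def gene_window(gene, promoter_len=2000):
    """Return (start, end) of the gene body plus its upstream promoter."""
    if gene["strand"] == "+":
        return max(gene["chromStart"] - promoter_len, 0), gene["chromEnd"]
    return gene["chromStart"], gene["chromEnd"] + promoter_len


def peak_links_gene(peak, gene, promoter_len=2000):
    """True if the peak overlaps the gene body or promoter on the same chromosome."""
    if peak["chrom"] != gene["chrom"]:
        return False
    start, end = gene_window(gene, promoter_len)
    # Half-open interval overlap, matching BED-style coordinates.
    return peak["chromStart"] < end and start < peak["chromEnd"]


toy_gene = {"chrom": "chr1", "chromStart": 10000, "chromEnd": 20000, "strand": "+"}
toy_peaks = {
    "promoter": {"chrom": "chr1", "chromStart": 8500, "chromEnd": 9000},
    "gene_body": {"chrom": "chr1", "chromStart": 15000, "chromEnd": 15500},
    "distal": {"chrom": "chr1", "chromStart": 30000, "chromEnd": 30500},
    "other_chrom": {"chrom": "chr2", "chromStart": 15000, "chromEnd": 15500},
}
for label, peak in toy_peaks.items():
    print(label, peak_links_gene(peak, toy_gene))
```

Only the first two toy peaks are linked to the gene; the distal peak and the peak on another chromosome are not.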

[18]:
guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac)
guidance
[18]:
<networkx.classes.multidigraph.MultiDiGraph at 0x7f8e003f1610>

We can verify that the resulting guidance graph complies with all of the standards listed above using the scglue.graph.check_graph function.

[19]:
scglue.graph.check_graph(guidance, [rna, atac])
[INFO] check_graph: Checking variable coverage...
[INFO] check_graph: Checking edge attributes...
[INFO] check_graph: Checking self-loops...
[INFO] check_graph: Checking graph symmetry...
[INFO] check_graph: All checks passed!

Note also that the highly variable feature selection has been propagated to the ATAC domain, by marking peaks reachable from the highly variable genes in the guidance graph:

[20]:
atac.var.head()
[20]:
                      chrom  chromStart  chromEnd  highly_variable
peaks
chr1:3005833-3005982   chr1     3005833   3005982            False
chr1:3094772-3095489   chr1     3094772   3095489            False
chr1:3119556-3120739   chr1     3119556   3120739            False
chr1:3121334-3121696   chr1     3121334   3121696            False
chr1:3134637-3135032   chr1     3134637   3135032            False
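
Conceptually, this propagation marks every peak that is directly connected to at least one highly variable gene. A minimal pure-Python sketch of the idea, using a made-up edge list and a hypothetical helper name (not the scglue implementation):

```python
def propagate_highly_variable(edges, highly_variable_genes):
    """Return the peaks that share an edge with at least one highly variable gene."""
    hvg = set(highly_variable_genes)
    return {peak for gene, peak in edges if gene in hvg}


# Made-up gene-peak edges; only GeneA is treated as highly variable.
toy_links = [
    ("GeneA", "chr1:100-200"),
    ("GeneA", "chr1:300-400"),
    ("GeneB", "chr1:500-600"),
]
marked = propagate_highly_variable(toy_links, ["GeneA"])
print(sorted(marked))  # ['chr1:100-200', 'chr1:300-400']
```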

If the rna_anchored_guidance_graph function does not meet your needs (e.g., if you want to incorporate experimental regulatory evidence), you may need to construct the guidance graph manually. We also provide some lower-level utilities to help (e.g., scglue.genomics.window_graph, scglue.graph.compose_multigraph, scglue.graph.reachable_vertices). Please refer to our case study for an example, where we combined genomic proximity with pcHi-C and eQTL evidence to construct a hybrid prior regulatory graph.

Save preprocessed data files

Finally, we save the preprocessed data, for use in stage 2.

[21]:
rna.write("rna-pp.h5ad", compression="gzip")
atac.write("atac-pp.h5ad", compression="gzip")
nx.write_graphml(guidance, "guidance.graphml.gz")