Identifying SVGs on brain coronal data using SINFONIA¶
The following tutorial demonstrates how to use SINFONIA for identifying spatially variable genes (SVGs) on a mouse brain coronal dataset (10X Genomics, 2021).
There are two parts in this tutorial:
Integrating SINFONIA into SCANPY. This part will show you how to seamlessly integrate SINFONIA into the SCANPY vignette for spatial transcriptomic data.
Evaluating the performance for deciphering spatial domain. This part will show you how to evaluate the performance of identified SVGs for deciphering spatial domain, and reproduce the results in the manuscript of SINFONIA.
[1]:
import numpy as np
import pandas as pd
import scanpy as sc
import sinfonia
import warnings
warnings.filterwarnings("ignore")
On a unix system, you can uncomment and execute the following command to download the brain coronal dataset in AnnData format.
[2]:
# !wget https://health.tsinghua.edu.cn/software/sinfonia/data/10X_Brain_Coronal.h5ad
[3]:
Brain_Coronal = sc.read('10X_Brain_Coronal.h5ad')
Brain_Coronal
[3]:
AnnData object with n_obs × n_vars = 2800 × 32285
obs: 'label'
var: 'gene_id', 'gene_name'
obsm: 'spatial'
Integrating SINFONIA into SCANPY¶
First, we set a random seed for reproducibility.
[4]:
sinfonia.setup_seed(2022)
We then follow the SCANPY vignette for spatial transcriptomic data to process the brain coronal dataset. In order to avoid subjective factors in quality control, we start from the count matrix.
We filter out the genes with zero counts, normalize and logarithmize the data.
[5]:
adata = Brain_Coronal.copy()
sc.pp.filter_genes(adata, min_cells=1)
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
Next, we detect SVGs via SINFONIA with the same n_top_genes as in the SCANPY vignette.
[6]:
adata = sinfonia.spatially_variable_genes(adata, n_top_genes=2000, subset=True)
The mode used to identify SVGs is stored in adata.uns['svg']. Boolean indicators of SVGs are stored in adata.var['spatially_variable']. Moran’s I scores of all the genes are stored in adata.var['moranI'], while rescaled Geary’s C scores of all the genes are stored in adata.var['gearyC'].
[7]:
adata
[7]:
View of AnnData object with n_obs × n_vars = 2800 × 2176
obs: 'label'
var: 'gene_id', 'gene_name', 'n_cells', 'spatially_variable', 'moranI', 'gearyC'
uns: 'log1p', 'svg'
obsm: 'spatial'
We then embed and cluster the manifold encoded by transcriptional similarity.
[8]:
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.louvain(adata, key_added="default_louvain")
sc.tl.leiden(adata, key_added="default_leiden")
We can also specify the number of clusters. SINFONIA will perform a binary search to tune the resolution parameter in clustering to make the number of clusters and the specified number as close as possible.
[9]:
adata = sinfonia.get_N_clusters(adata, n_cluster=adata.obs['label'].nunique(), cluster_method='louvain')
adata.obs['cluster_louvain'] = adata.obs['louvain']
adata = sinfonia.get_N_clusters(adata, n_cluster=adata.obs['label'].nunique(), cluster_method='leiden')
adata.obs['cluster_leiden'] = adata.obs['leiden']
Succeed to find 15 clusters at resolution 0.844.
Succeed to find 15 clusters at resolution 0.656.
Evaluating the performance for deciphering spatial domain¶
We first evaluate the performance for spatial clustering with default resolution in SCANPY.
[10]:
ami, ari, homo, nmi = sinfonia.clustering_metrics(adata, 'label', "default_louvain")
print('Louvain with default resolution:\nAMI: %.3f, \tARI: %.3f, \tHomo: %.3f, \tNMI: %.3f' % (ami, ari, homo, nmi))
Louvain with default resolution:
AMI: 0.890, ARI: 0.827, Homo: 0.906, NMI: 0.892
[11]:
ami, ari, homo, nmi = sinfonia.clustering_metrics(adata, 'label', "default_leiden")
print('Leiden with default resolution:\nAMI: %.3f, \tARI: %.3f, \tHomo: %.3f, \tNMI: %.3f' % (ami, ari, homo, nmi))
Leiden with default resolution:
AMI: 0.881, ARI: 0.787, Homo: 0.918, NMI: 0.883
We then evaluate the performance for spatial clustering with specified number of clusters.
[12]:
ami, ari, homo, nmi = sinfonia.clustering_metrics(adata, 'label', "cluster_louvain")
print('Louvain with searched resolution:\nAMI: %.3f, \tARI: %.3f, \tHomo: %.3f, \tNMI: %.3f' % (ami, ari, homo, nmi))
Louvain with searched resolution:
AMI: 0.877, ARI: 0.803, Homo: 0.885, NMI: 0.878
[13]:
ami, ari, homo, nmi = sinfonia.clustering_metrics(adata, 'label', "cluster_leiden")
print('Leiden with searched resolution:\nAMI: %.3f, \tARI: %.3f, \tHomo: %.3f, \tNMI: %.3f' % (ami, ari, homo, nmi))
Leiden with searched resolution:
AMI: 0.923, ARI: 0.901, Homo: 0.923, NMI: 0.924
Next, we evaluate the performance for domain resolution and latent representation.
[14]:
# We use rpy2 to run R packages from Python.
def LISI(coords, meta, label, perplexity=30, nn_eps=0):
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
from rpy2.robjects.packages import importr
importr("lisi")
if not isinstance(coords, pd.DataFrame):
coords = pd.DataFrame(coords)
if not isinstance(meta, pd.DataFrame):
meta = pd.DataFrame(meta)
meta = meta.loc[:, [label]]
meta[label] = meta[label].astype(str)
coords = robjects.conversion.py2rpy(coords)
meta = robjects.conversion.py2rpy(meta)
as_matrix = robjects.r["as.matrix"]
lisi = robjects.r["compute_lisi"](as_matrix(coords), meta, label, perplexity, nn_eps)
if isinstance(lisi, pd.DataFrame):
lisi = lisi.values
elif isinstance(lisi, np.recarray):
lisi = [item[0] for item in lisi]
return lisi
[15]:
MAP = sinfonia.mean_average_precision(adata.obsm["X_pca"].copy(), adata.obs['label'])
MCVA = sinfonia.mean_cross_validation_accuracy(adata.obsm["X_pca"].copy(), adata.obs['label'])
lisi = LISI(adata.obsm["X_pca"].copy(), adata.obs.copy(), 'label')
print("MAP: %.3f, \tMCVA: %.3f, \tiLISImd: %.3f, \tiLISIm:%.3f"%(MAP, MCVA, 1/np.median(lisi), 1/np.mean(lisi)))
MAP: 0.929, MCVA: 0.964, iLISImd: 0.943, iLISIm:0.811