Identifying SVGs on DLPFC data using SINFONIA¶

The following tutorial demonstrates how to use SINFONIA for identifying spatially variable genes (SVGs) on a human dorsolateral prefrontal cortex (DLPFC) dataset (Maynard, et al., 2021).

There are two parts in this tutorial:

Integrating SINFONIA into SCANPY. This part will show you how to seamlessly integrate SINFONIA into the SCANPY vignette for spatial transcriptomic data.
Evaluating the performance for deciphering spatial domain. This part will show you how to evaluate the performance of identified SVGs for deciphering spatial domain, and reproduce the results in the manuscript of SINFONIA.

[1]:

import numpy as np
import pandas as pd
import scanpy as sc
import sinfonia
import warnings
warnings.filterwarnings("ignore")

On a unix system, you can uncomment and execute the following command to download the DLPFC dataset in AnnData format.

[2]:

# !wget https://health.tsinghua.edu.cn/software/sinfonia/data/10X_DLPFC_151507.h5ad

[3]:

DLPFC_151507 = sc.read('10X_DLPFC_151507.h5ad')
DLPFC_151507

[3]:

AnnData object with n_obs × n_vars = 4221 × 33538
    obs: 'label'
    var: 'gene_id', 'gene_name'
    obsm: 'spatial'

Integrating SINFONIA into SCANPY¶

First, we set a random seed for reproducibility.

[4]:

sinfonia.setup_seed(2022)

We then follow the SCANPY vignette for spatial transcriptomic data to process the DLPFC dataset. In order to avoid subjective factors in quality control, we start from the count matrix.

We filter out the genes with zero counts, normalize and logarithmize the data.

[5]:

adata = DLPFC_151507.copy()
sc.pp.filter_genes(adata, min_cells=1)
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)

Next, we detect SVGs via SINFONIA with the same n_top_genes as in the SCANPY vignette.

[6]:

adata = sinfonia.spatially_variable_genes(adata, n_top_genes=2000, subset=True)

The mode used to identify SVGs is stored in adata.uns['svg']. Boolean indicators of SVGs are stored in adata.var['spatially_variable']. Moran’s I scores of all the genes are stored in adata.var['moranI'], while rescaled Geary’s C scores of all the genes are stored in adata.var['gearyC'].

[7]:

adata

[7]:

View of AnnData object with n_obs × n_vars = 4221 × 3640
    obs: 'label'
    var: 'gene_id', 'gene_name', 'n_cells', 'spatially_variable', 'moranI', 'gearyC'
    uns: 'log1p', 'svg'
    obsm: 'spatial'

We then embed and cluster the manifold encoded by transcriptional similarity.

[8]:

sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.louvain(adata, key_added="default_louvain")
sc.tl.leiden(adata, key_added="default_leiden")

We can also specify the number of clusters. SINFONIA will perform a binary search to tune the resolution parameter in clustering to make the number of clusters and the specified number as close as possible.

[9]:

adata = sinfonia.get_N_clusters(adata, n_cluster=adata.obs['label'].nunique(), cluster_method='louvain')
adata.obs['cluster_louvain'] = adata.obs['louvain']
adata = sinfonia.get_N_clusters(adata, n_cluster=adata.obs['label'].nunique(), cluster_method='leiden')
adata.obs['cluster_leiden'] = adata.obs['leiden']

Succeed to find 7 clusters at resolution 1.125.
Succeed to find 7 clusters at resolution 0.750.

Evaluating the performance for deciphering spatial domain¶

We first evaluate the performance for spatial clustering with default resolution in SCANPY.

[10]:

ami, ari, homo, nmi = sinfonia.clustering_metrics(adata, 'label', "default_louvain")
print('Louvain with default resolution:\nAMI: %.3f, \tARI: %.3f, \tHomo: %.3f, \tNMI: %.3f' % (ami, ari, homo, nmi))

Louvain with default resolution:
AMI: 0.454,     ARI: 0.328,     Homo: 0.463,    NMI: 0.455

[11]:

ami, ari, homo, nmi = sinfonia.clustering_metrics(adata, 'label', "default_leiden")
print('Leiden with default resolution:\nAMI: %.3f, \tARI: %.3f, \tHomo: %.3f, \tNMI: %.3f' % (ami, ari, homo, nmi))

Leiden with default resolution:
AMI: 0.452,     ARI: 0.347,     Homo: 0.478,    NMI: 0.454

We then evaluate the performance for spatial clustering with specified number of clusters.

[12]:

ami, ari, homo, nmi = sinfonia.clustering_metrics(adata, 'label', "cluster_louvain")
print('Louvain with searched resolution:\nAMI: %.3f, \tARI: %.3f, \tHomo: %.3f, \tNMI: %.3f' % (ami, ari, homo, nmi))

Louvain with searched resolution:
AMI: 0.449,     ARI: 0.327,     Homo: 0.458,    NMI: 0.450

[13]:

ami, ari, homo, nmi = sinfonia.clustering_metrics(adata, 'label', "cluster_leiden")
print('Leiden with searched resolution:\nAMI: %.3f, \tARI: %.3f, \tHomo: %.3f, \tNMI: %.3f' % (ami, ari, homo, nmi))

Leiden with searched resolution:
AMI: 0.477,     ARI: 0.379,     Homo: 0.459,    NMI: 0.479

Next, we evaluate the performance for domain resolution and latent representation.

[14]:

# We use rpy2 to run R packages from Python.
def LISI(coords, meta, label, perplexity=30, nn_eps=0):
    import rpy2.robjects as robjects
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()
    from rpy2.robjects.packages import importr
    importr("lisi")
    if not isinstance(coords, pd.DataFrame):
        coords = pd.DataFrame(coords)
    if not isinstance(meta, pd.DataFrame):
        meta = pd.DataFrame(meta)
    meta = meta.loc[:, [label]]
    meta[label] = meta[label].astype(str)

    coords = robjects.conversion.py2rpy(coords)
    meta = robjects.conversion.py2rpy(meta)
    as_matrix = robjects.r["as.matrix"]
    lisi = robjects.r["compute_lisi"](as_matrix(coords), meta, label, perplexity, nn_eps)
    if isinstance(lisi, pd.DataFrame):
        lisi = lisi.values
    elif isinstance(lisi, np.recarray):
        lisi = [item[0] for item in lisi]

    return lisi

[15]:

MAP = sinfonia.mean_average_precision(adata.obsm["X_pca"].copy(), adata.obs['label'])
MCVA = sinfonia.mean_cross_validation_accuracy(adata.obsm["X_pca"].copy(), adata.obs['label'])
lisi = LISI(adata.obsm["X_pca"].copy(), adata.obs.copy(), 'label')
print("MAP: %.3f, \tMCVA: %.3f, \tiLISImd: %.3f, \tiLISIm:%.3f"%(MAP, MCVA, 1/np.median(lisi), 1/np.mean(lisi)))

MAP: 0.657,     MCVA: 0.798,    iLISImd: 0.544,         iLISIm:0.515