integrating single-cell datasets

Introduction
- Load dataset
  - Check quality metrics and remove low-quality cells
SCTransform
Harmony
- Run Harmony
- Data visualization
Liger
Batchelor
BBKNN
- Reformat Seurat object for Scanpy
- Batch correction using BBKNN
  - Load data into Scanpy
  - Run BBKNN

Introduction

In this notebook I will go over several integration techniques for single-cell -omics data. These techniques are suitable for a variety of purposes, such as batch correction, cross-species analysis, and multi-modal analysis. Here are some brief explanations of the methods that I will be covering.

SCTransform
- A normalization technique that uses regularized negative binomial regression to construct a corrected counts matrix whereby technical variation has been removed.
Harmony
- Constructs a shared embedding across datasets using an iterative approach in which cells are grouped using soft k-means clustering, and cluster centroids are computed in order to apply dataset correction factors to each cluster.
Liger
- Uses integrative Non-negative Matrix Factorization (i-NMF) to simultaneously integrate datasets and construct a low-dimensional data space. Additionally the Liger package offers factor specific marker gene analysis.
Batchelor
- Uses a mutual-nearest-neighbors approach to apply dataset specific corrections to a low-dimensional data space such as PCA.
BBKNN
- Similar to Harmony and Batchelor, Batch-balanced k-nearest neighbors (BBKNN) algorithm also applies dataset specific corrections to a low-dimensional data space such as PCA.

Load dataset

Before we do anything, we need to load our dataset. I am going to load the outputs of cellranger count rather than cellranger aggr, and combine into one Seurat object. We will then inspect quality metrics and filter out low-quality cells.

library(Seurat)
library(tidyverse)

in_data_dir <- "/home/sam/AD_NucSeq_2019/cellranger_count_pre-mrna/"
fig_dir <- "/home/sam/AD_NucSeq_2019/batch_correction/liger/non-aggregated/figures/"
out_data_dir <- "/home/sam/AD_NucSeq_2019/batch_correction/liger/non-aggregated/data/"

samples <- dir(in_data_dir)

# create seurat object:
seurat_list <- lapply(samples, function(sample){
  cur_data <- Read10X(paste0(in_data_dir,sample,'/outs/filtered_feature_bc_matrix/'))
  cur_seurat <- CreateSeuratObject(
    counts = cur_data,
    min.cells=3,
    min.features=200,
    project='AD2020'
  )
  cur_seurat$SampleID <- sample
  return(cur_seurat)
})

# merge seurat object
seurat_obj <- merge(x=seurat_list[[1]], y=seurat_list[2:length(seurat_list)])

# clean up
rm(seurat_list)
gc()

# add metadata
sample_metadata <- read.csv(file = "/home/sam/AD_NucSeq_2019/batch_correction/liger/non-aggregated/data/metaData_snRNAseq_fix.csv")
rownames(sample_metadata) <- sample_metadata$Sample.ID
cell_metadata <- sample_metadata[seurat_obj$SampleID,]
for(meta in names(cell_metadata)){
  seurat_obj[[meta]] <- cell_metadata[[meta]]
}

Plotting settings:

library(cowplot)
theme_set(theme_cowplot())
library(RColorBrewer)
library(viridis)

umap_theme <- theme(
  axis.line=element_blank(),
  axis.text.x=element_blank(),
  axis.text.y=element_blank(),
  axis.ticks=element_blank(),
  axis.title.x=element_blank(),
  axis.title.y=element_blank(),
  panel.background=element_blank(),
  panel.border=element_blank(),
  panel.grid.major=element_blank(),
  panel.grid.minor=element_blank(),
  plot.background=element_blank()
)

Check quality metrics and remove low-quality cells

Next, we inspect cell quality metrics such as the percentage of reads mapping to mitochondrial genes, the number of genes detected, and the total number of UMIs detected. Low quality cells are removed based on these criteria. Here you have some freedom to be strict or loose with your filtering criteria, and you have to be careful at this step since these criteria are highly dependent on your particular experiment, which is why we always inspect these distributions before deciding on quality cutoffs.

seurat_obj[["percent.mt"]] <- PercentageFeatureSet(object = seurat_obj, pattern = "^MT-")

pdf(paste0(fig_dir, "qc_violin_plot.pdf"), width=10, height=10)
# png(paste0(fig_dir, "pngs/qc_violin_plot.png"), width=10, height=10, res=250, units='in')
VlnPlot(seurat_obj, group.by="SampleID", features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 2, pt.size=0, )
dev.off()

seurat_obj <- subset(seurat_obj, nFeature_RNA > 200 & nFeature_RNA < 10000 & percent.mt < 10)
seurat_obj <- subset(seurat_obj, SampleID != 'Sample-101')

SCTransform

The first integration technique that I will demonstrate is SCTransform. Out of all of the integration algorithms that I will cover, this is the only one that augments the actual UMI counts matrix, while the others all operate using low-dimensional representations. In practice, SCTransform on its own does not always completely eliminate batch effects, but it works nicely in conjunction with the some of the other integration methods. This analysis is outlined in the following steps:

Run SCTransform to normalize UMI counts
Reduce deminsionality using PCA & UMAP
Clustering

As a warning, I found that SCTransform uses a tremendous amount of memory. On our lab server Zeus, I used over 100 GB of memory to run SCTransform on our 60k cell dataset.

Normalize data using SCTransform

The following block of code shows how to use SCTransform to correct the UMI matrix based on percent.mt and Library.Batch variables.

library(sctransform)

seurat_obj <- SCTransform(
  seurat_obj,
  vars.to.regress = c('percent.mt', 'Library.Batch'),
  verbose=TRUE
)
gc()

saveRDS(seurat_obj, file=paste0(out_data_dir, 'seurat_object_sct.rds'))

Data processing

In the following block of code, we process the normalized counts matrix using PCA, UMAP, and clustering. We will visualize the top genes in the first 12 PCs to demonstrate their ability to separate cells in the data. Additionally we will visualize the standard deviation of the top 20 PCs.

# compute PCA:
seurat_obj <- RunPCA(seurat_obj)

# plot pca heatmap
png(paste0(fig_dir, "pngs/pca_heatmap.png"), width=10, height=10, res=300, units='in')
DimHeatmap(seurat_obj, dims = 1:12, cells = 500, balanced = TRUE)
dev.off()

# plot variance explained by PC:
png(paste0(fig_dir, "pngs/pca_elbow_plot.png"), width=5, height=3, res=300, units='in')
ElbowPlot(seurat_obj)
dev.off()

# UMAP and clustering with top PCs
seurat_obj <- RunUMAP(seurat_obj, reduction='pca', dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction='pca')
seurat_obj <- FindClusters(seurat_obj, resolution = 0.3)

PCA heatmap

PCA elbow

Data visualization

Here we visualize seurat cluster assignments and library batch of origin as UMAP plots.

pdf(paste0(fig_dir, "umap_clusters_sct.pdf"), width=7, height=7)
p1 <- DimPlot(seurat_obj, group.by='seurat_clusters', reduction='umap') +
  ggtitle('seurat clusters') + umap_theme
print(p1)
dev.off()

pdf(paste0(fig_dir, "umap_batch_sct.pdf"), width=7, height=7)
p2 <- DimPlot(seurat_obj, group.by='Library.Batch', reduction='umap') +
  ggtitle('Sequencing Batch') + umap_theme
print(p2)
dev.off()

png(paste0(fig_dir, "pngs/umap_sct_only.png"), width=12, height=6, res=250, units='in')
CombinePlots(list(p1,p2))
dev.off()

SCT + PCA + UMAP

It is quite evident that in low-dimensional space that batch effects are still present.

Harmony

The first method that we will use to find a low-dimensional data representation free from technical artifacts, we will use the Harmony algorithm. Harmony has a convenient wrapper for Seurat objects, and it requires that you have already performed PCA on your data.

Run Harmony

The following block of code demonstrates how to run harmony on the SCT assay to correct for the Library.Batch variable, followed by UMAP and clustering.

library(harmony)

seurat_obj <- seurat_obj %>%
  RunHarmony("Library.Batch", assay.use="SCT")

# UMAP and clustering with harmonized PCs
seurat_obj <- RunUMAP(seurat_obj, reduction='harmony', dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction='harmony')
seurat_obj <- FindClusters(seurat_obj, resolution = 0.3)

# save
saveRDS(seurat_obj, file=paste0(out_data_dir, 'seurat_object_sct_harmony.rds'))

Data visualization

Here we visualize clusters, sequencing batches, and other technical factors in a UMAP constructed from the Harmony corrected PCA.

pdf(paste0(fig_dir, "umap_clusters_sct_harmony.pdf"), width=7, height=7)
p1 <- DimPlot(seurat_obj, group.by='seurat_clusters', reduction='umap') +
  ggtitle('seurat clusters') + umap_theme
print(p1)
dev.off()

pdf(paste0(fig_dir, "umap_batch_sct_harmony.pdf"), width=7, height=7)
p2 <- DimPlot(seurat_obj, group.by='Library.Batch', reduction='umap') +
  ggtitle('Sequencing Batch') + umap_theme
print(p2)
dev.off()

png(paste0(fig_dir, "pngs/umap_sct_harmony.png"), width=12, height=6, res=250, units='in')
CombinePlots(list(p1,p2))
dev.off()


plot_list <- list()
feats <- c('nFeature_RNA', 'percent.mt')
for(i in 1:length(feats)){
  plot_list[[i]] <- FeaturePlot(seurat_obj, features=c(feats[[i]]), reduction='umap') + umap_theme
}

pdf(paste0(fig_dir, "umap_qc_sct_harmony.pdf"), width=12, height=6)
png(paste0(fig_dir, "pngs/umap_qc_sct_harmony.png"), width=12, height=6, res=250, units='in')
CombinePlots(plot_list)
dev.off()

SCT + PCA + Harmony + UMAP

Quality Control Feature plots

Batch effects appear to be eliminated, however it is not so easy to tell by just looking at the UMAP. We will plot a stacked barplot showing the proportion of cells from each batch in each cluster.

cluster_var <- 'seurat_clusters'
clusters <- unique(seurat_obj@meta.data[[cluster_var]])
clusters <- clusters[order(clusters)]
df <- data.frame()
for(i in 1:length(clusters)){
  cur_df <- as.data.frame(seurat_obj@meta.data %>% subset(seurat_clusters == clusters[i]) %>% .$Library.Batch %>% table() /
  table(seurat_obj$Library.Batch))

  cur_df$Freq <- cur_df$Freq * 1/(sum(cur_df$Freq))

  cur_df$cluster <- clusters[i]
  df <- rbind(df, cur_df)
  print(i)
}

pdf(paste0(fig_dir, 'barplot_batches_by_cluster.pdf'), height=4, width=8)
png(paste0(fig_dir, 'pngs/barplot_batches_by_cluster.png'), height=4, width=8, res=250, units='in')
p <- ggplot(df, aes(y=Freq, x=cluster, fill=.)) +
  geom_bar(stat='identity') +
  scale_y_continuous(expand = c(0,0)) +
  ylab('normalized proportion') +
  theme(
    panel.grid.major=element_blank(),
    panel.grid.minor=element_blank(),
    axis.text.x = element_text(angle=45, hjust=1),
    axis.title.x = element_blank(),
    legend.title = element_blank(),
    axis.line.y = element_blank(),
    axis.line.x = element_blank()
  )
print(p)
dev.off()

Batches by cluster

We can see that each cluster is well represented by all three sequencing batches, and thus this processed dataset is suitable for downstream analysis of cell types and cell states.

Liger

The next data integration method that we will use is Liger, which uses integrative Non-Negative Matrix Factorization (i-NMF) to simultaneously reduce data dimensionality and correct technical factors.

Run Liger

There are two very important parameters for Liger. k determines the number of matrix factors in the factorized data, while lambda determines the degree of dataset integration, with a higher value mixing the datasets more thoroughly. Liger has functions to tune both of these parameters (suggestK and suggestLambda), but unfortunately in practice I have found that both of these functions take an incredible amount of runtime and memory. Additionaly, in practice I have found that k=30 and lambda=5 works well for most datasets. I commented out the code that would be used to find the k and lambda parameters.

library(liger)
source("sc_utilities.R")

# isolate SCT corrected counts matrices from each batch and format in a list:
SeuratList <- list(
  b1 = GetAssayData(subset(seurat_obj, subset = Library.Batch == 1), slot="counts", assay='SCT'),
  b2 = GetAssayData(subset(seurat_obj, subset = Library.Batch == 2), slot="counts", assay='SCT'),
  b3 = GetAssayData(subset(seurat_obj, subset = Library.Batch == 3), slot="counts", assay='SCT')
)

# create liger object:
a.NucSeq <- createLiger(SeuratList)

# normalize, select variable genes, scale
a.NucSeq <- normalize(a.NucSeq)
pdf("figures/liger_variable_genes.pdf", width=8, height=8)
a.NucSeq <- selectGenes(a.NucSeq, var.thresh =c(0.38, 0.4, 0.35), do.plot=T)
dev.off()
a.NucSeq <- scaleNotCenter(a.NucSeq)

# select k (not running because it takes a long time)
# k.suggest <- suggestK(a.NucSeq, num.cores=16, gen.new=T, plot.log2=F)
# pdf("figures/liger_k_suggest.pdf", width=10, height=8)
# print(k.suggest)
# dev.off()

# lambda.suggest <- suggestLambda(a.NucSeq, num.cores=32, gen.new=F, k=30)

kval = 30
a.NucSeq <- optimizeALS(a.NucSeq, k=kval)
a.NucSeq <- quantileAlignSNF(a.NucSeq, resolution = 1.0, small.clust.thresh = 20)

# add inmf to seurat object
seurat_obj@reductions$iNMF <- CreateDimReducObject(
  embeddings = a.NucSeq@H.norm[colnames(seurat_obj),],
  loadings = t(a.NucSeq@W ),
  key= "iNMF",
  assay="SCT"
)

Data processing using iNMF

Here we run UMAP and clustering using the iNMF matrix.

# UMAP and clustering using iNMF
seurat_obj <- RunUMAP(seurat_obj, reduction='iNMF', dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction='iNMF')
seurat_obj <- FindClusters(seurat_obj, resolution = 0.3)

saveRDS(seurat_obj, file=paste0(out_data_dir, 'seurat_object_sct_inmf.rds'))

Data visualization

Here we visualize seurat clusters, sequencing batches, technical factors, and cell type marker genes in the UMAP constructed from iNMF.

pdf(paste0(fig_dir, "umap_clusters_sct_inmf.pdf"), width=7, height=7)
p1 <- DimPlot(seurat_obj, group.by='seurat_clusters', reduction='umap') +
  ggtitle('seurat clusters') + umap_theme
print(p1)
dev.off()

pdf(paste0(fig_dir, "umap_batch_sct_inmf.pdf"), width=7, height=7)
p2 <- DimPlot(seurat_obj, group.by='Library.Batch', reduction='umap') +
  ggtitle('Sequencing Batch') + umap_theme
print(p2)
dev.off()

png(paste0(fig_dir, "pngs/umap_sct_inmf.png"), width=12, height=6, res=250, units='in')
CombinePlots(list(p1,p2))
dev.off()

plot_list <- list()
feats <- c('nFeature_RNA', 'percent.mt')
for(i in 1:length(feats)){
  plot_list[[i]] <- FeaturePlot(seurat_obj, features=c(feats[[i]]), reduction='umap') + umap_theme
}

pdf(paste0(fig_dir, "umap_qc_sct_inmf.pdf"), width=12, height=6)
# png(paste0(fig_dir, "pngs/umap_qc_sct_inmf.png"), width=12, height=6, res=250, units='in')
CombinePlots(plot_list)
dev.off()

plot_list <- list()
markers <- c('CSF1R', 'MBP', 'GFAP', 'GAD2', 'SLC17A7', 'PDGFRA')
for(i in 1:length(markers)){
  plot_list[[i]] <- FeaturePlot(seurat_obj, features=c(markers[[i]]), reduction='umap') + umap_theme
}

png(paste0(fig_dir, "pngs/umap_markers_sct_inmf.png"), width=12, height=8, units='in', res=300)
CombinePlots(plot_list, ncol=3)
dev.off()

SCT + iNMF + UMAP

Quality Control Feature plots

Marker gene feature plot

Batch effects appear to be eliminated, however it is not so easy to tell by just looking at the UMAP. We will plot a stacked barplot showing the proportion of cells from each batch in each cluster.

cluster_var <- 'seurat_clusters'
clusters <- unique(seurat_obj@meta.data[[cluster_var]])
clusters <- clusters[order(clusters)]
df <- data.frame()
for(i in 1:length(clusters)){
  cur_df <- as.data.frame(seurat_obj@meta.data %>% subset(seurat_clusters == clusters[i]) %>% .$Library.Batch %>% table() /
  table(seurat_obj$Library.Batch))

  cur_df$Freq <- cur_df$Freq * 1/(sum(cur_df$Freq))

  cur_df$cluster <- clusters[i]
  df <- rbind(df, cur_df)
  print(i)
}

pdf(paste0(fig_dir, 'barplot_batches_by_cluster_inmf.pdf'), height=4, width=8)
# png(paste0(fig_dir, 'pngs/barplot_batches_by_cluster_inmf.png'), height=4, width=8, res=250, units='in')
p <- ggplot(df, aes(y=Freq, x=cluster, fill=.)) +
  geom_bar(stat='identity') +
  scale_y_continuous(expand = c(0,0)) +
  ylab('normalized proportion') +
  theme(
    panel.grid.major=element_blank(),
    panel.grid.minor=element_blank(),
    axis.text.x = element_text(angle=45, hjust=1),
    axis.title.x = element_blank(),
    legend.title = element_blank(),
    axis.line.y = element_blank(),
    axis.line.x = element_blank()
  )
print(p)
dev.off()

Batches by cluster

Batchelor

Here we use Batchelor to correct batches in a manner similar to Harmony. Batchelor is included with the R package monocle3, so we have to convert our Seurat object to a CellDataSet (CDS) object. Additionally, we have to find a way to convert the corrected matrix to a format that we can use in Seurat.

Convert Seurat to CDS and run Batchelor

First I will demonstrate how to convert from a Seurat object to CDS.

# moving over to HPC so re-load some stuff
library(Seurat)
library(tidyverse)
library(monocle3)

fig_dir <- "/dfs3/swaruplab/smorabit/analysis/AD_NucSeq_2019/batch_correction/liger/non-aggregated/figures/"
seurat_obj <- readRDS('data/seurat_object_sct_inmf.rds')

# convert to CDS object:
library(monocle3)
setwd('monocle')

# attempt to make a cell dataset object:
expr_matrix <- GetAssayData(seurat_obj, slot='counts', assay='SCT')
genes <- data.frame(as.character(rownames(expr_matrix)))
rownames(genes) <- rownames(expr_matrix)
genes <- as.data.frame(cbind(genes,genes))
colnames(genes) <- c("GeneSymbol", "gene_short_name")
cds <- new_cell_data_set(
  expr_matrix,
  cell_metadata=seurat_obj@meta.data,
  gene_metadata=genes
)

cds@reducedDims[['PCA']] <- seurat_obj[['pca']]@cell.embeddings

Run Batchelor and convert to Seurat

Next, we will run Batchelor using the align_cds function, and we will add this corrected matrix to our Seurat object.

cds <- align_cds(cds, preprocess_method="PCA", alignment_group = "Library.Batch")

# move aligned matrix back to seurat object:
monocle_aligned <- cds@reducedDims[["Aligned"]]
rownames(monocle_aligned) <- colnames(cds)
colnames(monocle_aligned) <- paste0('MNN_', seq(1:ncol(monocle_aligned)))
all.equal(rownames(monocle_aligned), colnames(seurat_obj))
seurat_obj@reductions$aligned <- CreateDimReducObject(
  embeddings=monocle_aligned,
  key="MNN_",
  assay="SCT"
)

Data visualization

Finally, we visualize the UMAP and clusters constructed using the MNN corrected PCA matri from Batchelor.

# compute umap using MNN
# UMAP and clustering with top PCs
seurat_obj <- RunUMAP(seurat_obj, reduction='aligned', dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction='aligned')
seurat_obj <- FindClusters(seurat_obj, resolution = 0.3)

pdf(paste0(fig_dir, "umap_clusters_mnn.pdf"), width=7, height=7)
p1 <- DimPlot(seurat_obj, group.by='seurat_clusters', reduction='umap') +
 ggtitle('seurat clusters') + umap_theme
print(p1)
dev.off()

pdf(paste0(fig_dir, "umap_batch_mnn.pdf"), width=7, height=7)
p2 <- DimPlot(seurat_obj, group.by='Library.Batch', reduction='umap') +
 ggtitle('Sequencing Batch') + umap_theme
print(p2)
dev.off()

pdf(paste0(fig_dir, "umap_sample_mnn.pdf"), width=7, height=7)
DimPlot(seurat_obj, group.by='SampleID', reduction='umap') +
 ggtitle('Sample') + umap_theme
dev.off()

pdf(paste0(fig_dir, "umap_diagnosis_mnn.pdf"), width=7, height=7)
DimPlot(seurat_obj, group.by='Diagnosis', reduction='umap') +
 ggtitle('Diagnosis') + umap_theme
dev.off()

png(paste0(fig_dir, "pngs/umap_mnn.png"), width=12, height=6, res=250, units='in')
CombinePlots(list(p1,p2))
dev.off()

SCT + PCA + MNN + UMAP

Batch effects appear to be eliminated, however it is not so easy to tell by just looking at the UMAP. We will plot a stacked barplot showing the proportion of cells from each batch in each cluster.

cluster_var <- 'seurat_clusters'
clusters <- unique(seurat_obj@meta.data[[cluster_var]])
clusters <- clusters[order(clusters)]
df <- data.frame()
for(i in 1:length(clusters)){
  cur_df <- as.data.frame(seurat_obj@meta.data %>% subset(seurat_clusters == clusters[i]) %>% .$Library.Batch %>% table() /
  table(seurat_obj$Library.Batch))

  cur_df$Freq <- cur_df$Freq * 1/(sum(cur_df$Freq))

  cur_df$cluster <- clusters[i]
  df <- rbind(df, cur_df)
  print(i)
}

pdf(paste0(fig_dir, 'barplot_batches_by_cluster_mnn.pdf'), height=4, width=8)
png(paste0(fig_dir, 'pngs/barplot_batches_by_cluster_mnn.png'), height=4, width=8, res=250, units='in')
p <- ggplot(df, aes(y=Freq, x=cluster, fill=.)) +
  geom_bar(stat='identity') +
  scale_y_continuous(expand = c(0,0)) +
  ylab('normalized proportion') +
  theme(
    panel.grid.major=element_blank(),
    panel.grid.minor=element_blank(),
    axis.text.x = element_text(angle=45, hjust=1),
    axis.title.x = element_blank(),
    legend.title = element_blank(),
    axis.line.y = element_blank(),
    axis.line.x = element_blank()
  )
print(p)
dev.off()

Batches by cluster

BBKNN

Lastly, we will use BBKNN for batch correction. This method operates in a manner similar to Batchelor and Harmony, however this is the only method that I am demonstrating that is implemented in Python. First, we must reformat our data to work with Scanpy, the leading Python package for the analysis of single-cell data.

Reformat Seurat object for Scanpy

BBKNN is implemented in Python in the Scanpy package, so we have to format our data in a way that will be understood by Scanpy.

# save as .mtx so we can use this for scanpy/paga!

# save metadata:
seurat_obj$barcode <- colnames(seurat_obj)
seurat_obj$UMAP_1 <- seurat_obj@reductions$umap@cell.embeddings[,1]
seurat_obj$UMAP_2 <- seurat_obj@reductions$umap@cell.embeddings[,2]
write.table(seurat_obj@meta.data, file='data/seurat_metadata.tsv', quote=F, row.names=F, sep='\t')

# write sct matrix
library(Matrix)
sct_matrix <- GetAssayData(seurat_obj, assay='SCT', slot='data')
writeMM(sct_matrix, file='data/seurat_sct.mtx')

# write harmonized PCs
write.csv(seurat_obj@reductions$pca@cell.embeddings, file='data/seurat_pca.csv', quote=F, row.names=F)

# write gene names
write.table(
  data.frame('gene'=rownames(sct_matrix)),
  file='data/seurat_gene_names.csv',
  quote=F,
  row.names=F,
  col.names=F
)

Batch correction using BBKNN

The following sections go over how to load the data into Scanpy, assuming that you already have it installed, and how to run BBKNN.

Load data into Scanpy

First, load the data into Scanpy.

import scanpy as sc
import anndata
from scipy import io
from scipy.sparse import coo_matrix, csr_matrix
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import os
import pandas as pd

data_dir = "/home/sam/AD_NucSeq_2019/batch_correction/liger/non-aggregated/data/"

# load sparse matrix:
X = io.mmread("{}/seurat_sct.mtx".format(data_dir))

# create anndata object
adata = anndata.AnnData(
    X=X.transpose().tocsr()
)

# load sample metadata:
sample_meta = pd.read_csv("{}seurat_metadata.tsv".format(data_dir), sep='\t')

# load gene names:
with open("{}/seurat_gene_names.csv".format(data_dir), 'r') as f:
    gene_names = f.read().splitlines()

adata.obs = sample_meta
adata.obs.index = adata.obs['barcode']
adata.var.index = gene_names

# load harmony:
pca = pd.read_csv("{}seurat_pca.csv".format(data_dir))
pca.index = adata.obs.index

# set pca and umap
adata.obsm['X_pca'] = pca.to_numpy()
adata.obsm['X_umap'] = np.vstack((adata.obs['UMAP_1'].to_numpy(), adata.obs['UMAP_2'].to_numpy())).T

# convert certain variables to categorical:
# clusters, partitions, batches
adata.obs['Library.Batch'] = adata.obs['Library.Batch'].apply(str)
adata.obs['seurat_clusters'] = adata.obs['seurat_clusters'].apply(str)

Run BBKNN

Running BBKNN has been made very simple using the scanpy external module. We visualize a UMAP colord by sequening batch made using the BBKNN corrected matrix.

# perform bbknn
sc.external.pp.bbknn(adata, batch_key='Library.Batch')

# re-compute UMAP
sc.tl.umap(adata)

# plot UMAP
fig, axes = plt.subplots(figsize=(6,6))
sc.pl.umap(
  adata, color=['Library.Batch'],
  title='BBKNN batches', ax=axes,
  frameon=False
)
plt.savefig('figures/pngs/umap_bbknn.png', dpi=500)

$SCT + PCA + BBKNN + UMAP$

SCT + PCA + BBKNN + UMAP

This doesn’t appear to have taken care of all of the batch effects in this particular dataset, however this method still holds a lot of merit considering how fast it runs and that it is the only method for batch correction in Python. Perhaps this is just a tricky dataset and BBKNN would have better performance in other tissues / systems.