integrating single-cell datasets

Introduction

In this notebook I will go over several integration techniques for single-cell -omics data. These techniques are suitable for a variety of purposes, such as batch correction, cross-species analysis, and multi-modal analysis. Here are some brief explanations of the methods that I will be covering.

  • SCTransform
    • A normalization technique that uses regularized negative binomial regression to construct a corrected counts matrix whereby technical variation has been removed.
  • Harmony
    • Constructs a shared embedding across datasets using an iterative approach in which cells are grouped using soft k-means clustering, and cluster centroids are computed in order to apply dataset correction factors to each cluster.
  • Liger
    • Uses integrative Non-negative Matrix Factorization (i-NMF) to simultaneously integrate datasets and construct a low-dimensional data space. Additionally the Liger package offers factor specific marker gene analysis.
  • Batchelor
    • Uses a mutual-nearest-neighbors approach to apply dataset specific corrections to a low-dimensional data space such as PCA.
  • BBKNN
    • Similar to Harmony and Batchelor, Batch-balanced k-nearest neighbors (BBKNN) algorithm also applies dataset specific corrections to a low-dimensional data space such as PCA.

Load dataset

Before we do anything, we need to load our dataset. I am going to load the outputs of cellranger count rather than cellranger aggr, and combine into one Seurat object. We will then inspect quality metrics and filter out low-quality cells.

Plotting settings:

Check quality metrics and remove low-quality cells

Next, we inspect cell quality metrics such as the percentage of reads mapping to mitochondrial genes, the number of genes detected, and the total number of UMIs detected. Low quality cells are removed based on these criteria. Here you have some freedom to be strict or loose with your filtering criteria, and you have to be careful at this step since these criteria are highly dependent on your particular experiment, which is why we always inspect these distributions before deciding on quality cutoffs.

SCTransform

The first integration technique that I will demonstrate is SCTransform. Out of all of the integration algorithms that I will cover, this is the only one that augments the actual UMI counts matrix, while the others all operate using low-dimensional representations. In practice, SCTransform on its own does not always completely eliminate batch effects, but it works nicely in conjunction with the some of the other integration methods. This analysis is outlined in the following steps:

  • Run SCTransform to normalize UMI counts
  • Reduce deminsionality using PCA & UMAP
  • Clustering

As a warning, I found that SCTransform uses a tremendous amount of memory. On our lab server Zeus, I used over 100 GB of memory to run SCTransform on our 60k cell dataset.

Normalize data using SCTransform

The following block of code shows how to use SCTransform to correct the UMI matrix based on percent.mt and Library.Batch variables.

Harmony

The first method that we will use to find a low-dimensional data representation free from technical artifacts, we will use the Harmony algorithm. Harmony has a convenient wrapper for Seurat objects, and it requires that you have already performed PCA on your data.

Data visualization

Here we visualize clusters, sequencing batches, and other technical factors in a UMAP constructed from the Harmony corrected PCA.

SCT + PCA + Harmony + UMAP

SCT + PCA + Harmony + UMAP

Quality Control Feature plots

Quality Control Feature plots

Batch effects appear to be eliminated, however it is not so easy to tell by just looking at the UMAP. We will plot a stacked barplot showing the proportion of cells from each batch in each cluster.

Batches by cluster

Batches by cluster

We can see that each cluster is well represented by all three sequencing batches, and thus this processed dataset is suitable for downstream analysis of cell types and cell states.

Liger

The next data integration method that we will use is Liger, which uses integrative Non-Negative Matrix Factorization (i-NMF) to simultaneously reduce data dimensionality and correct technical factors.

Run Liger

There are two very important parameters for Liger. k determines the number of matrix factors in the factorized data, while lambda determines the degree of dataset integration, with a higher value mixing the datasets more thoroughly. Liger has functions to tune both of these parameters (suggestK and suggestLambda), but unfortunately in practice I have found that both of these functions take an incredible amount of runtime and memory. Additionaly, in practice I have found that k=30 and lambda=5 works well for most datasets. I commented out the code that would be used to find the k and lambda parameters.

Data visualization

Here we visualize seurat clusters, sequencing batches, technical factors, and cell type marker genes in the UMAP constructed from iNMF.

SCT + iNMF + UMAP

SCT + iNMF + UMAP

Quality Control Feature plots

Quality Control Feature plots

Marker gene feature plot

Marker gene feature plot

Batch effects appear to be eliminated, however it is not so easy to tell by just looking at the UMAP. We will plot a stacked barplot showing the proportion of cells from each batch in each cluster.

Batches by cluster

Batches by cluster

Batchelor

Here we use Batchelor to correct batches in a manner similar to Harmony. Batchelor is included with the R package monocle3, so we have to convert our Seurat object to a CellDataSet (CDS) object. Additionally, we have to find a way to convert the corrected matrix to a format that we can use in Seurat.

Data visualization

Finally, we visualize the UMAP and clusters constructed using the MNN corrected PCA matri from Batchelor.

SCT + PCA + MNN + UMAP

SCT + PCA + MNN + UMAP

Batch effects appear to be eliminated, however it is not so easy to tell by just looking at the UMAP. We will plot a stacked barplot showing the proportion of cells from each batch in each cluster.

Batches by cluster

Batches by cluster

BBKNN

Lastly, we will use BBKNN for batch correction. This method operates in a manner similar to Batchelor and Harmony, however this is the only method that I am demonstrating that is implemented in Python. First, we must reformat our data to work with Scanpy, the leading Python package for the analysis of single-cell data.

Batch correction using BBKNN

The following sections go over how to load the data into Scanpy, assuming that you already have it installed, and how to run BBKNN.

Run BBKNN

Running BBKNN has been made very simple using the scanpy external module. We visualize a UMAP colord by sequening batch made using the BBKNN corrected matrix.

SCT + PCA + BBKNN + UMAP

SCT + PCA + BBKNN + UMAP

This doesn’t appear to have taken care of all of the batch effects in this particular dataset, however this method still holds a lot of merit considering how fast it runs and that it is the only method for batch correction in Python. Perhaps this is just a tricky dataset and BBKNN would have better performance in other tissues / systems.

Sam Morabito

2020-06-23