单细胞测序数据整合（Seurat V4.0） vignettes

Introduction to scRNA-seq integration • Seurathttps://satijalab.org/seurat/articles/integration_introduction.html

Introduction to scRNA-seq integration

Compiled: January 11, 2022

Source: vignettes/integration_introduction.Rmd

基于R seurat v4.0的内置整合数据方法的R包进行的翻译学习

scRNA-seq整合简介

单细胞数据大量产出使得联合分析两个或多个单细胞数据集成为了独特的挑战。特别是在标准工作流下，识别存在于多个数据集中的细胞群可能会有问题。Seurat v4包含一组方法来匹配（或“对齐”）跨数据集的共享细胞群。这些方法首先识别处于匹配生物状态的跨数据集细胞对（“锚定”），既可用于校正数据集之间的技术差异（即批效应校正），也可用于对跨实验条件进行比较性scRNA序列分析。

ps：

锚定：选择相对保守的细胞群，计算样本间的差异成为迁移标准，继而将之用于校正数据集的差异。

下面，我们展示了Stuart*、Butler*等人2019年所述的scRNA-seq整合方法，以对静息状态或干扰素刺激状态下的人类免疫细胞（PBMC）进行比较分析。

整合目标

以下教程旨在向您概述使用Seurat整合程序可能对复杂细胞类型进行的比较分析。在这里，我们讨论几个关键目标：

构建“整合”数据以进行下游分析
确定两个数据集中存在的细胞类型
获得在对照细胞和干预细胞中都保守存在的细胞类型标记
比较数据集，以找出基于特定细胞类型对刺激所产生的特异性反应

设置Seurat对象

为了方便起见，我们使用SeuratData包的ifnb数据集。

library(Seurat)
library(SeuratData)
library(patchwork)

# install dataset
InstallData("ifnb")

# load dataset
LoadData("ifnb")

# split the dataset into a list of two seurat objects (stim and CTRL)
ifnb.list <- SplitObject(ifnb, split.by = "stim")

# normalize and identify variable features for each dataset independently
ifnb.list <- lapply(X = ifnb.list, FUN = function(x) {
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

# select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = ifnb.list)

执行整合

然后，我们使用FindIntegrationAnchors（）函数识别作为锚的基因，该函数构建一个作为Seurat对象的列表作为输入，并使用这些锚将两个数据集通过IntegratedData（）函数整合在一起。

immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, anchor.features = features)

# this command creates an 'integrated' data assay
immune.combined <- IntegrateData(anchorset = immune.anchors)

进行整合分析

现在我们可以对所有一个细胞文件进行整合分析了！

# specify that we will perform downstream analysis on the corrected data note that the
# original unmodified data still resides in the 'RNA' assay
DefaultAssay(immune.combined) <- "integrated"

# Run the standard workflow for visualization and clustering
immune.combined <- ScaleData(immune.combined, verbose = FALSE)
immune.combined <- RunPCA(immune.combined, npcs = 30, verbose = FALSE)
immune.combined <- RunUMAP(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindNeighbors(immune.combined, reduction = "pca", dims = 1:30)
immune.combined <- FindClusters(immune.combined, resolution = 0.5)

# Visualization
p1 <- DimPlot(immune.combined, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined, reduction = "umap", label = TRUE, repel = TRUE)
p1 + p2

为了将这两种情况并排可视化，我们可以使用split.by。按参数分类显示每个不同颜色。

DimPlot(immune.combined, reduction = "umap", split.by = "stim")

识别保守的细胞类型标记

为了识别在不同条件下保守的典型细胞类型标记基因，我们提供FindConservedMarkers（）函数。该函数对每个数据集/组执行差异基因表达检验，并使用MetaDE R包中的荟萃分析方法所产生p值。例如，无论簇6（NK细胞）中的刺激条件

# For performing differential expression after integration, we switch back to the original
# data
DefaultAssay(immune.combined) <- "RNA"
nk.markers <- FindConservedMarkers(immune.combined, ident.1 = 6, grouping.var = "stim", verbose = FALSE)
head(nk.markers)

##        CTRL_p_val CTRL_avg_log2FC CTRL_pct.1 CTRL_pct.2 CTRL_p_val_adj
## GNLY            0        6.006173      0.944      0.045              0
## FGFBP2          0        3.243588      0.505      0.020              0
## CLIC3           0        3.461957      0.597      0.024              0
## PRF1            0        2.650548      0.422      0.017              0
## CTSW            0        2.987507      0.531      0.029              0
## KLRD1           0        2.777231      0.495      0.019              0
##           STIM_p_val STIM_avg_log2FC STIM_pct.1 STIM_pct.2 STIM_p_val_adj
## GNLY    0.000000e+00        5.858634      0.954      0.059   0.000000e+00
## FGFBP2 3.408448e-165        2.191113      0.261      0.015  4.789892e-161
## CLIC3   0.000000e+00        3.536367      0.623      0.030   0.000000e+00
## PRF1    0.000000e+00        4.094579      0.862      0.057   0.000000e+00
## CTSW    0.000000e+00        3.128054      0.592      0.035   0.000000e+00
## KLRD1   0.000000e+00        2.863797      0.552      0.027   0.000000e+00
##             max_pval minimump_p_val
## GNLY    0.000000e+00              0
## FGFBP2 3.408448e-165              0
## CLIC3   0.000000e+00              0
## PRF1    0.000000e+00              0
## CTSW    0.000000e+00              0
## KLRD1   0.000000e+00              0

如何，我们都可以计算出保守标记的基因。

我们可以为每个组探索这些标记基因，并使用它们将该组细胞注释为特定的细胞类型。

FeaturePlot(immune.combined, features = c("CD3D", "SELL", "CREM", "CD8A", "GNLY", "CD79A", "FCGR3A",
    "CCL2", "PPBP"), min.cutoff = "q9")

immune.combined <- RenameIdents(immune.combined, `0` = "CD14 Mono", `1` = "CD4 Naive T", `2` = "CD4 Memory T",
    `3` = "CD16 Mono", `4` = "B", `5` = "CD8 T", `6` = "NK", `7` = "T activated", `8` = "DC", `9` = "B Activated",
    `10` = "Mk", `11` = "pDC", `12` = "Eryth", `13` = "Mono/Mk Doublets", `14` = "HSPC")
DimPlot(immune.combined, label = TRUE)

带有拆分的DotPlot（）函数.by parameter可用于在不同条件下查看保守的细胞类型标记，显示表达任何给定基因的细胞类型中的表达水平和细胞百分比。在这里，我们为14个细胞类型中的每个细胞类型绘制了2-3个标记基因。

Idents(immune.combined) <- factor(Idents(immune.combined), levels = c("HSPC", "Mono/Mk Doublets",
    "pDC", "Eryth", "Mk", "DC", "CD14 Mono", "CD16 Mono", "B Activated", "B", "CD8 T", "NK", "T activated",
    "CD4 Naive T", "CD4 Memory T"))
markers.to.plot <- c("CD3D", "CREM", "HSPH1", "SELL", "GIMAP5", "CACYBP", "GNLY", "NKG7", "CCL5",
    "CD8A", "MS4A1", "CD79A", "MIR155HG", "NME1", "FCGR3A", "VMO1", "CCL2", "S100A9", "HLA-DQA1",
    "GPR183", "PPBP", "GNG11", "HBA2", "HBB", "TSPAN13", "IL3RA", "IGJ", "PRSS57")
DotPlot(immune.combined, features = markers.to.plot, cols = c("blue", "red"), dot.scale = 8, split.by = "stim") +
    RotatedAxis()

识别不同条件下的差异表达基因

现在我们已经将干预细胞和对照细胞对齐，我们可以开始进行比较分析，并观察刺激引起的差异。宏观观察这些变化的一种方法是绘制受刺激细胞和对照细胞的平均表达图，并在散点图上寻找异常的基因。在这里，我们取受刺激和对照原始T细胞以及CD14单核细胞群的平均表达，并生成散点图，突出显示对干扰素刺激表现出显著反应的基因。

library(ggplot2)
library(cowplot)
theme_set(theme_cowplot())
t.cells <- subset(immune.combined, idents = "CD4 Naive T")
Idents(t.cells) <- "stim"
avg.t.cells <- as.data.frame(log1p(AverageExpression(t.cells, verbose = FALSE)$RNA))
avg.t.cells$gene <- rownames(avg.t.cells)

cd14.mono <- subset(immune.combined, idents = "CD14 Mono")
Idents(cd14.mono) <- "stim"
avg.cd14.mono <- as.data.frame(log1p(AverageExpression(cd14.mono, verbose = FALSE)$RNA))
avg.cd14.mono$gene <- rownames(avg.cd14.mono)

genes.to.label = c("ISG15", "LY6E", "IFI6", "ISG20", "MX1", "IFIT2", "IFIT1", "CXCL10", "CCL8")
p1 <- ggplot(avg.t.cells, aes(CTRL, STIM)) + geom_point() + ggtitle("CD4 Naive T Cells")
p1 <- LabelPoints(plot = p1, points = genes.to.label, repel = TRUE)
p2 <- ggplot(avg.cd14.mono, aes(CTRL, STIM)) + geom_point() + ggtitle("CD14 Monocytes")
p2 <- LabelPoints(plot = p2, points = genes.to.label, repel = TRUE)
p1 + p2

正如你所见，许多相同的基因在这两种细胞类型中都上调，可能代表了一种保守的干扰素反应途径。

因为我们有信心在不同条件下识别出共同的细胞类型，所以我们可以发现相同类型的细胞在不同条件下的基因变化。首先，我们在meta.data中创建一个列，用于保存细胞类型和干预信息，并将当前标识切换到该列。然后我们使用FindMarkers（）来寻找刺激B细胞和对照B细胞之间不同的基因。请注意，这里显示的许多顶级基因与我们之前绘制的核心干扰素反应基因相同。此外，我们看到的单核细胞和B细胞干扰素反应特异性的CXCL10基因在该列表中也显示出差异的高度显著性。

immune.combined$celltype.stim <- paste(Idents(immune.combined), immune.combined$stim, sep = "_")
immune.combined$celltype <- Idents(immune.combined)
Idents(immune.combined) <- "celltype.stim"
b.interferon.response <- FindMarkers(immune.combined, ident.1 = "B_STIM", ident.2 = "B_CTRL", verbose = FALSE)
head(b.interferon.response, n = 15)

##                 p_val avg_log2FC pct.1 pct.2     p_val_adj
## ISG15   1.212995e-155  4.5997247 0.998 0.239 1.704622e-151
## IFIT3   4.743486e-151  4.5017769 0.964 0.052 6.666020e-147
## IFI6    1.680324e-150  4.2361116 0.969 0.080 2.361359e-146
## ISG20   1.595574e-146  2.9452675 1.000 0.671 2.242260e-142
## IFIT1   3.499460e-137  4.1278656 0.910 0.032 4.917791e-133
## MX1     8.571983e-121  3.2876616 0.904 0.115 1.204621e-116
## LY6E    1.359842e-117  3.1251242 0.895 0.152 1.910986e-113
## TNFSF10 4.454596e-110  3.7816677 0.790 0.025 6.260044e-106
## IFIT2   1.290640e-106  3.6584511 0.787 0.035 1.813736e-102
## B2M      2.019314e-95  0.6073495 1.000 1.000  2.837741e-91
## PLSCR1   1.464429e-93  2.8195675 0.794 0.117  2.057961e-89
## IRF7     3.893097e-92  2.5867694 0.837 0.190  5.470969e-88
## CXCL10   1.624151e-82  5.2608266 0.640 0.010  2.282419e-78
## UBE2L6   2.482113e-81  2.1450306 0.852 0.299  3.488114e-77
## PSMB9    5.977328e-77  1.6457686 0.940 0.571  8.399938e-73

另一种可视化基因表达变化的有用方法是分裂。通过FeaturePlot（）或VlnPlot（）函数的选项。这将显示给定基因列表的特征图，由分组变量（此处为刺激条件）分割。CD3D和GNLY等基因是典型的细胞类型标记（针对T细胞和NK/CD8 T细胞），几乎不受干扰素刺激的影响，并且在对照组和刺激组中显示类似的基因表达模式。另一方面，IFI6和ISG15是干扰素反应的核心基因，在所有细胞类型中都相应上调。最后，CD14和CXCL10是显示细胞类型特异性干扰素反应的基因。刺激CD14单核细胞后，CD14表达降低，这可能导致监督分析框架中的错误分类，强调了整合分析的价值。在干扰素刺激后，CXCL10在单核细胞和B细胞中显示出明显的上调，但在其他类型的细胞中没有。

FeaturePlot(immune.combined, features = c("CD3D", "GNLY", "IFI6"), split.by = "stim", max.cutoff = 3,
    cols = c("grey", "red"))

ots <- VlnPlot(immune.combined, features = c("LYZ", "ISG15", "CXCL10"), split.by = "stim", group.by = "celltype",
    pt.size = 0, combine = FALSE)
wrap_plots(plots = plots, ncol = 1)

基于SCTransform对数据集进行归一化并进行整合分析

2019年，在Hafemeister和Satija的研究中我们介绍了一种基于正则化负二项回归的scRNA序列归一化的改进方法。该方法名为“sctransform”，其避免了标准规范化工作流的一些缺陷，包括添加伪计数和对数(log)转换。你可以在文章或我们的sctransform vignette中阅读更多关于sctransform的内容。
下面，我们将演示如何为使用sctransform工作流规范化的数据集修改Seurat集成工作流。这些命令基本相似，但有几个关键区别：

在集成之前，通过SCTransform（）而不是NormalizeData（）单独归一化数据集
正如在我们的SCTransform中进一步讨论的，我们通常使用3000个或更多特征来进行SCTransform的下游分析。
在识别锚之前，运行PrepSCTIntegration（）函数
运行FindIntegrationAnchors（）和IntegrateData（）时，设置normalization.method参数方法参数设置为值SCT。
运行基于sctransform的工作流（包括整合分析）时，不要运行ScaleData（）函数

LoadData("ifnb")
ifnb.list <- SplitObject(ifnb, split.by = "stim")
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform)
features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = 3000)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features)

immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT",
    anchor.features = features)
immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT")

immune.combined.sct <- RunPCA(immune.combined.sct, verbose = FALSE)
immune.combined.sct <- RunUMAP(immune.combined.sct, reduction = "pca", dims = 1:30)

p1 <- DimPlot(immune.combined.sct, reduction = "umap", group.by = "stim")
p2 <- DimPlot(immune.combined.sct, reduction = "umap", group.by = "seurat_annotations", label = TRUE,
    repel = TRUE)
p1 + p2

现在数据集已经整合，您可以按照本部分中前面的步骤识别细胞类型和细胞类型特异性反应。

Session Info

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] cowplot_1.1.1                 ggplot2_3.3.5                
##  [3] patchwork_1.1.1               thp1.eccite.SeuratData_3.1.5 
##  [5] stxBrain.SeuratData_0.1.1     ssHippo.SeuratData_3.1.4     
##  [7] pbmcsca.SeuratData_3.0.0      pbmcMultiome.SeuratData_0.1.1
##  [9] pbmc3k.SeuratData_3.1.4       panc8.SeuratData_3.0.2       
## [11] ifnb.SeuratData_3.1.0         hcabm40k.SeuratData_3.0.0    
## [13] bmcite.SeuratData_0.3.0       SeuratData_0.2.1             
## [15] SeuratObject_4.0.4            Seurat_4.0.6                 
## 
## loaded via a namespace (and not attached):
##   [1] systemfonts_1.0.2     sn_2.0.0              plyr_1.8.6           
##   [4] igraph_1.2.11         lazyeval_0.2.2        splines_4.1.0        
##   [7] listenv_0.8.0         scattermore_0.7       TH.data_1.0-10       
##  [10] digest_0.6.29         htmltools_0.5.2       fansi_1.0.0          
##  [13] magrittr_2.0.1        memoise_2.0.0         tensor_1.5           
##  [16] cluster_2.1.2         ROCR_1.0-11           limma_3.48.0         
##  [19] globals_0.14.0        matrixStats_0.61.0    sandwich_3.0-1       
##  [22] pkgdown_1.6.1         spatstat.sparse_2.1-0 colorspace_2.0-2     
##  [25] rappdirs_0.3.3        ggrepel_0.9.1         rbibutils_2.2        
##  [28] textshaping_0.3.5     xfun_0.25             dplyr_1.0.7          
##  [31] crayon_1.4.2          jsonlite_1.7.2        spatstat.data_2.1-2  
##  [34] survival_3.2-11       zoo_1.8-9             glue_1.6.0           
##  [37] polyclip_1.10-0       gtable_0.3.0          leiden_0.3.9         
##  [40] future.apply_1.8.1    BiocGenerics_0.38.0   abind_1.4-5          
##  [43] scales_1.1.1          mvtnorm_1.1-2         DBI_1.1.1            
##  [46] miniUI_0.1.1.1        Rcpp_1.0.7            plotrix_3.8-1        
##  [49] metap_1.4             viridisLite_0.4.0     xtable_1.8-4         
##  [52] tmvnsim_1.0-2         reticulate_1.22       spatstat.core_2.3-2  
##  [55] stats4_4.1.0          htmlwidgets_1.5.4     httr_1.4.2           
##  [58] RColorBrewer_1.1-2    TFisher_0.2.0         ellipsis_0.3.2       
##  [61] ica_1.0-2             farver_2.1.0          pkgconfig_2.0.3      
##  [64] sass_0.4.0            uwot_0.1.11           deldir_1.0-6         
##  [67] utf8_1.2.2            tidyselect_1.1.1      labeling_0.4.2       
##  [70] rlang_0.4.12          reshape2_1.4.4        later_1.3.0          
##  [73] munsell_0.5.0         tools_4.1.0           cachem_1.0.6         
##  [76] cli_3.1.0             generics_0.1.1        mathjaxr_1.4-0       
##  [79] ggridges_0.5.3        evaluate_0.14         stringr_1.4.0        
##  [82] fastmap_1.1.0         yaml_2.2.1            ragg_1.1.3           
##  [85] goftest_1.2-3         knitr_1.33            fs_1.5.2             
##  [88] fitdistrplus_1.1-6    purrr_0.3.4           RANN_2.6.1           
##  [91] pbapply_1.5-0         future_1.23.0         nlme_3.1-152         
##  [94] mime_0.12             formatR_1.11          compiler_4.1.0       
##  [97] plotly_4.10.0         png_0.1-7             spatstat.utils_2.3-0 
## [100] tibble_3.1.6          bslib_0.3.1           stringi_1.7.6        
## [103] highr_0.9             desc_1.4.0            RSpectra_0.16-0      
## [106] lattice_0.20-44       Matrix_1.3-3          multtest_2.48.0      
## [109] vctrs_0.3.8           mutoss_0.1-12         pillar_1.6.4         
## [112] lifecycle_1.0.1       Rdpack_2.1.2          spatstat.geom_2.3-1  
## [115] lmtest_0.9-39         jquerylib_0.1.4       RcppAnnoy_0.0.19     
## [118] data.table_1.14.2     irlba_2.3.5           httpuv_1.6.5         
## [121] R6_2.5.1              promises_1.2.0.1      KernSmooth_2.23-20   
## [124] gridExtra_2.3         parallelly_1.30.0     codetools_0.2-18     
## [127] MASS_7.3-54           assertthat_0.2.1      rprojroot_2.0.2      
## [130] withr_2.4.3           mnormt_2.0.2          sctransform_0.3.2    
## [133] multcomp_1.4-17       mgcv_1.8-35           parallel_4.1.0       
## [136] grid_4.1.0            rpart_4.1-15          tidyr_1.1.4          
## [139] rmarkdown_2.10        Rtsne_0.15            Biobase_2.52.0       
## [142] numDeriv_2016.8-1.1   shiny_1.7.1

Developed by Paul Hoffman, Satija Lab and Collaborators.