单细胞数据挖掘 P 3.1 QC以及细胞周期判断

「生信技能树」单细胞数据挖掘_哔哩哔哩_bilibilihttps://www.bilibili.com/video/BV1pa4y1s76J?p=10

P3.1 QC空细胞及重复细胞质控以及细胞周期判断

空细胞质控

load("scRNA.Rdata")
library(Seurat)
##1、detect special cells----
# empty droplet
# BiocManager::install("DropletUtils")
library(DropletUtils)
e.out <- emptyDrops(GetAssayData(scRNA,slot="counts",assay="RNA"))
#Error in testEmptyDrops(m, lower = lower, ...) : 
#no counts available to estimate the ambient profile
##https://support.bioconductor.org/p/123554/#123562
#如上回答所说，empty droplet往往在第一步就已经过滤掉了，而一般上传到GEO的也都是过滤掉空液滴的。

先是空细胞，用dropletutils包

GEO数据往往以及完成了质控，因此返回的是个error值。

重复细胞质控（跑不通）

#double droplet
#https://osca.bioconductor.org/doublet-detection.html
# BiocManager::install("scran")
head(scRNA@meta.data)
library(scran)
library(Matrix)
#GetAssayData(scRNA,slot="counts",assay="RNA")[1:8,1:4]
?doubletCluster #检查有无double droplet聚在一起的类
db.test <- doubletCluster(GetAssayData(scRNA,slot="counts",assay="RNA"),
                          clusters=scRNA@meta.data$seurat_clusters)
head(db.test)
table(scRNA@meta.data$seurat_clusters)
library(scater)
chosen.doublet <- rownames(db.test)[isOutlier(db.test$N, 
                                              type="lower", log=TRUE)]
chosen.doublet #结果显示没有
#还有其它多种方法

这里有个问题，

> db.test <- doubletCluster(GetAssayData(scRNA,slot="counts",assay="RNA"),
+                           clusters=scRNA@meta.data$seurat_clusters)
Error in (function (..., row.names = NULL, check.names = TRUE, stringsAsFactors)  : 
  invalid length of row names
此外: There were 14 warnings (use warnings() to see them)

> length(GetAssayData(scRNA,slot="counts",assay="RNA"))
[1] 46950074

怀疑是scRNA的下标太大，超过了边界？
在这里需要说明这里回头需要探索一下解法，无法跑通。但是教程里跑通了，所以后续再看吧。
先记录方法2：
【单细胞测序攻略：二聚体过滤】DoubletDecon包过滤Seurat对象的二聚体（Doublet）_公卫小何的博客-CSDN博客_doublet 单细胞hhttps://blog.csdn.net/weixin_40561293/article/details/108962352回头再看看其他方法进行总结。

细胞周期基因

有些细胞如果进入细胞分裂期，他会高表达，会影响聚类时的分类。

因此要分开。

#细胞周期有关基因
?cc.genes
head(c(cc.genes$s.genes,cc.genes$g2m.genes))
length(c(cc.genes$s.genes,cc.genes$g2m.genes))
# [1] "MCM5" "PCNA" "TYMS" "FEN1" "MCM2" "MCM4"
#查看我们选择的高变基因中有哪些细胞周期相关基因,及打分
CaseMatch(c(cc.genes$s.genes,cc.genes$g2m.genes),VariableFeatures(scRNA))
#在scRNA@meta.data中添加S.Score、G2M.Score和Phase三列有关细胞周期的信息。
# 检查G2M期常用的marker基因，将之与scRNA进行匹配
g2m_genes = cc.genes$g2m.genes
g2m_genes = CaseMatch(search = g2m_genes, match = rownames(scRNA))
# 检查S期的marker基因，将之与scRNA进行匹配
s_genes = cc.genes$s.genes
s_genes = CaseMatch(search = s_genes, match = rownames(scRNA))
# 在seurat对象中基于这些基因进行细胞周期的评分，并储存。
scRNA <- CellCycleScoring(object=scRNA,  g2m.features=g2m_genes,  s.features=s_genes)
head(scRNA@meta.data)
#观察细胞周期相关基因是否影响聚类
scRNA <- RunPCA(scRNA, features = c(s_genes, g2m_genes))
p1 <- DimPlot(scRNA, reduction = "pca", group.by = "Phase")
# 这里p1几乎都聚在一起，因此认为他们的表达没有明显的偏移。
ggsave("3.2cell-cycle.pdf", plot = p1) 
#影响不大，基本重合在一起了

这里的思路是，通过cc.gene这个内置的包，就是包含了许多细胞周期的marker。

用这些marker基因来匹配我们的矩阵中的基因，将其表达量展现出来。看p1看是否细胞分布分的很开。

如果分得很开则认为他们差异大，则可能成为不同类别中的差异明显的基因进而影响细胞分类。