☰ Menu

      Single Cell RNA-Seq Analysis

Home
Introduction and Lectures
Intro to the Workshop
Daily Tasks
Support
Cheat Sheets
Software and Links
Scripts
Prerequisites, R
R
scRNAseq Analysis
Prepare scRNAseq Analysis
scRNAseq Analysis - PART1
scRNAseq Analysis - PART2
scRNAseq Analysis - PART3
scRNAseq Analysis - PART4
scRNAseq Analysis - PART5
scRNAseq Analysis - PART6
scRNAseq Analysis - PART7
Prerequisites, CLI
CLI
Data Reduction
Files and Filetypes
Project setup
Generating Expression Matrix
ETC
Closing thoughts
Discussions Page
Github page

Last Updated: July 14, 2022

Part 2: Some QA/QC, filtering and normalization

Load libraries

library(Seurat)
library(biomaRt)
library(ggplot2)
library(knitr)
library(kableExtra)

Load the Seurat object from part 1

load(file="original_seurat_object.RData")
experiment.aggregate
## An object of class Seurat 
## 21005 features across 30902 samples within 1 assay 
## Active assay: RNA (21005 features, 0 variable features)
set.seed(12345)

Some basic QA/QC of the metadata, print tables of the 5% quantiles.

Show 5% quantiles for number of genes per cell per sample

kable(do.call("cbind", tapply(experiment.aggregate$nFeature_RNA, 
                      Idents(experiment.aggregate),quantile,probs=seq(0,1,0.05))),
      caption = "5% Quantiles of Genes/Cell by Sample") %>% kable_styling()
5% Quantiles of Genes/Cell by Sample
A001-C-007 A001-C-104 B001-A-301
0% 298.0 297 298.00
5% 306.0 308 302.00
10% 315.0 321 304.00
15% 325.0 336 307.00
20% 339.0 351 310.00
25% 356.0 367 312.00
30% 380.0 385 316.00
35% 406.0 406 319.00
40% 437.0 430 323.00
45% 482.0 465 327.00
50% 534.0 515 332.00
55% 616.0 574 338.00
60% 701.6 645 345.00
65% 787.0 737 355.00
70% 916.0 869 373.00
75% 1073.0 1013 429.00
80% 1264.4 1179 550.00
85% 1543.4 1413 868.00
90% 1961.4 1778 1356.00
95% 2871.9 2400 1979.35
100% 11933.0 11925 8761.00

Show 5% quantiles for number of UMI per cell per sample

kable(do.call("cbind", tapply(experiment.aggregate$nCount_RNA, 
                                      Idents(experiment.aggregate),quantile,probs=seq(0,1,0.05))),
      caption = "5% Quantiles of UMI/Cell by Sample") %>% kable_styling()
5% Quantiles of UMI/Cell by Sample
A001-C-007 A001-C-104 B001-A-301
0% 320.0 316 317.00
5% 343.0 347 341.00
10% 356.0 363 346.00
15% 369.0 381 350.00
20% 385.0 400 353.00
25% 407.0 422 357.00
30% 436.0 444 361.00
35% 469.0 468 366.00
40% 508.0 500 370.00
45% 564.7 543 376.00
50% 640.0 606 382.00
55% 740.0 685 389.00
60% 856.6 783 399.00
65% 992.8 912 411.00
70% 1171.0 1098 432.00
75% 1428.0 1328 501.00
80% 1737.0 1590 662.00
85% 2179.1 1991 1118.00
90% 3026.2 2720 1970.40
95% 5176.5 4213 3338.45
100% 150669.0 148953 89690.00

Show 5% quantiles for number of mitochondrial percentage per cell per sample

kable(round(do.call("cbind", tapply(experiment.aggregate$percent.mito, Idents(experiment.aggregate),quantile,probs=seq(0,1,0.05))), digits = 3),
      caption = "5% Quantiles of Percent Mitochondria by Sample") %>% kable_styling()
5% Quantiles of Percent Mitochondria by Sample
A001-C-007 A001-C-104 B001-A-301
0% 0.000 0.000 0.000
5% 0.245 0.247 0.180
10% 0.340 0.402 0.297
15% 0.418 0.543 0.491
20% 0.504 0.703 0.748
25% 0.597 0.856 0.946
30% 0.699 1.024 1.130
35% 0.812 1.222 1.292
40% 0.931 1.425 1.418
45% 1.068 1.686 1.542
50% 1.245 1.976 1.676
55% 1.422 2.345 1.774
60% 1.624 2.836 1.913
65% 1.863 3.297 2.020
70% 2.164 3.792 2.151
75% 2.493 4.199 2.286
80% 2.871 4.582 2.439
85% 3.306 5.000 2.604
90% 3.869 5.542 2.841
95% 4.798 6.266 3.176
100% 46.647 31.370 19.728

Violin plot of 1) number of genes, 2) number of UMI and 3) percent mitochondrial genes

VlnPlot(
  experiment.aggregate,
  features = c("nFeature_RNA", "nCount_RNA","percent.mito"),
  ncol = 1, pt.size = 0.3)

plot ridge plots of the same data

RidgePlot(experiment.aggregate, features=c("nFeature_RNA","nCount_RNA", "percent.mito"), ncol = 2)

Plot the distribution of number of cells each gene is represented by.

plot(sort(Matrix::rowSums(GetAssayData(experiment.aggregate) >= 3), decreasing = TRUE) , xlab="gene rank", ylab="number of cells", main="Cells per genes (reads/gene >= 3 )")

Gene Plot, scatter plot of gene expression across cells, (colored by sample), drawing horizontal an verticale lines at proposed filtering cutoffs.

FeatureScatter(experiment.aggregate, feature1 = "nCount_RNA", feature2 = "percent.mito", shuffle = TRUE) + geom_vline(xintercept = c(600,12000)) + geom_hline(yintercept = 8)

FeatureScatter(experiment.aggregate, feature1 = "nFeature_RNA", feature2 = "percent.mito", shuffle = TRUE) + geom_vline(xintercept = c(400,4000)) + geom_hline(yintercept = 8)

FeatureScatter(
  experiment.aggregate, "nCount_RNA", "nFeature_RNA",
  pt.size = 0.5, shuffle = TRUE)  + geom_vline(xintercept = c(600,12000)) + geom_hline(yintercept = c(400, 4000))

Cell filtering

We use the information above to filter out cells. Here we choose those that have percent mitochondrial genes max of 8%, unique UMI counts under 1,000 or greater than 12,000 and contain at least 700 features within them.

table(experiment.aggregate$orig.ident)
## 
## A001-C-007 A001-C-104 B001-A-301 
##       3047       5881      21974
experiment.aggregate <- subset(experiment.aggregate, percent.mito <= 8)

experiment.aggregate <- subset(experiment.aggregate, nCount_RNA >= 500 & nCount_RNA <= 12000)

experiment.aggregate <- subset(experiment.aggregate, nFeature_RNA >= 400 & nFeature_RNA < 4000)

experiment.aggregate
## An object of class Seurat 
## 21005 features across 10595 samples within 1 assay 
## Active assay: RNA (21005 features, 0 variable features)
table(experiment.aggregate$orig.ident)
## 
## A001-C-007 A001-C-104 B001-A-301 
##       1774       3416       5405

Lets se the ridge plots now after filtering

RidgePlot(experiment.aggregate, features=c("nFeature_RNA","nCount_RNA", "percent.mito"), ncol = 2)


You may also want to filter out additional genes.


When creating the base Seurat object we did filter out some genes, recall Keep all genes expressed in >= 10 cells. After filtering cells and you may want to be more aggressive with the gene filter. Seurat doesn’t supply such a function (that I can find), so below is a function that can do so, it filters genes requiring a min.value (log-normalized) in at least min.cells, here expression of 1 in at least 400 cells.

experiment.aggregate
FilterGenes <-
 function (object, min.value=1, min.cells = 0, genes = NULL) {
   genes.use <- rownames(object)
   if (!is.null(genes)) {
     genes.use <- intersect(genes.use, genes)
     object@data <- GetAssayData(object)[genes.use, ]
   } else if (min.cells > 0) {
     num.cells <- Matrix::rowSums(GetAssayData(object) > min.value)
     genes.use <- names(num.cells[which(num.cells >= min.cells)])
     object = object[genes.use, ]
   }
  object <- LogSeuratCommand(object = object)
  return(object)
}

experiment.aggregate.genes <- FilterGenes(object = experiment.aggregate, min.value = 1, min.cells = 400)
experiment.aggregate.genes
rm(experiment.aggregate.genes)

Next we want to normalize the data

After filtering out cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method LogNormalize that normalizes the gene expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and then log-transforms the data.

?NormalizeData
experiment.aggregate <- NormalizeData(
  object = experiment.aggregate,
  normalization.method = "LogNormalize",
  scale.factor = 10000)

Calculate Cell-Cycle with Seurat, the list of genes comes with Seurat (only for human)

Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq

# this code is for human samples only!
s.genes <- (cc.genes$s.genes)
g2m.genes <- (cc.genes$g2m.genes)

experiment.aggregate <- CellCycleScoring(experiment.aggregate,
                                         s.features = s.genes,
                                         g2m.features = g2m.genes,
                                         set.ident = TRUE)
table(experiment.aggregate@meta.data$Phase) %>%
  kable(caption = "Number of Cells in each Cell Cycle Stage", col.names = c("Stage", "Count"), align = "c") %>%
  kable_styling()
Number of Cells in each Cell Cycle Stage
Stage Count
G1 5465
G2M 2327
S 2803

For mouse data, we would need to convert the gene lists from human genes to their mouse orthologs using Biomart. Skip this code for the workshop data.

# Mouse Code DO NOT RUN
convertHumanGeneList <- function(x){
  require("biomaRt")
  human = useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl", mirror = "uswest")
  mouse = useEnsembl("ensembl", dataset = "mmusculus_gene_ensembl", mirror = "uswest")

  genes = getLDS(attributes = c("hgnc_symbol"), filters = "hgnc_symbol", values = x , mart = human, attributesL = c("mgi_symbol"), martL = mouse, uniqueRows=T)

  humanx <- unique(genes[, 2])

  # Print the first 6 genes found to the screen
  print(head(humanx))
  return(humanx)
}

m.s.genes <- convertHumanGeneList(cc.genes.updated.2019$s.genes)
m.g2m.genes <- convertHumanGeneList(cc.genes.updated.2019$g2m.genes)

# Create our Seurat object and complete the initialization steps
experiment.aggregate <- CellCycleScoring(experiment.aggregate, s.features = m.s.genes, g2m.features = m.g2m.genes, set.ident = TRUE)

table(experiment.aggregate@meta.data$Phase) %>% kable(caption = "Number of Cells in each Cell Cycle Stage", col.names = c("Stage", "Count"), align = "c") %>% kable_styling()

Fixing the defualt “Ident” in Seurat

table(Idents(experiment.aggregate))
## 
##    S  G2M   G1 
## 2803 2327 5465
## So lets change it back to sample name
Idents(experiment.aggregate) <- "orig.ident"
table(Idents(experiment.aggregate))
## 
## A001-C-007 A001-C-104 B001-A-301 
##       1774       3416       5405

Identify variable genes

The function FindVariableFeatures identifies the most highly variable genes (default 2000 genes) by fitting a line to the relationship of log(variance) and log(mean) using loess smoothing, uses this information to standardize the data, then calculates the variance of the standardized data. This helps avoid selecting genes that only appear variable due to their expression level.

?FindVariableFeatures
experiment.aggregate <- FindVariableFeatures(
  object = experiment.aggregate,
  selection.method = "vst")

length(VariableFeatures(experiment.aggregate))
## [1] 2000
top10 <- head(VariableFeatures(experiment.aggregate), 10)

top10
##  [1] "BEST4"   "NRG1"    "TPH1"    "CLCA4"   "CACNA1A" "TRPM3"   "CTNNA2" 
##  [8] "TIMP3"   "CA7"     "LRMP"
vfp1 <- VariableFeaturePlot(experiment.aggregate)
vfp1 <- LabelPoints(plot = vfp1, points = top10, repel = TRUE)
vfp1

FindVariableFeatures isn’t the only way to set the “variable features” of a Seurat object. Another reasonable approach is to select a set of “minimally expressed” genes.

dim(experiment.aggregate)
## [1] 21005 10595
min.value = 2
min.cells = 10
num.cells <- Matrix::rowSums(GetAssayData(experiment.aggregate, slot = "count") > min.value)
genes.use <- names(num.cells[which(num.cells >= min.cells)])
length(genes.use)
## [1] 5986
VariableFeatures(experiment.aggregate) <- genes.use

Question(s)

  1. Play some with the filtering parameters, see how results change?
  2. How do the results change if you use selection.method = “dispersion” or selection.method = “mean.var.plot”

Finally, lets save the filtered and normalized data

save(experiment.aggregate, file="pre_sample_corrected.RData")

Get the next Rmd file

download.file("https://raw.githubusercontent.com/msettles/2022-Uganda-Single-Cell-RNA-Seq-Analysis/main/data_analysis/scRNA_Workshop-PART3.Rmd", "scRNA_Workshop-PART3.Rmd")

Session Information

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.3.4   knitr_1.37         ggplot2_3.3.5      biomaRt_2.50.2    
## [5] SeuratObject_4.0.4 Seurat_4.1.0      
## 
## loaded via a namespace (and not attached):
##   [1] systemfonts_1.0.4      BiocFileCache_2.2.1    plyr_1.8.6            
##   [4] igraph_1.2.11          lazyeval_0.2.2         splines_4.1.2         
##   [7] listenv_0.8.0          scattermore_0.7        GenomeInfoDb_1.30.0   
##  [10] digest_0.6.29          htmltools_0.5.2        fansi_1.0.2           
##  [13] magrittr_2.0.2         memoise_2.0.1          tensor_1.5            
##  [16] cluster_2.1.2          ROCR_1.0-11            globals_0.14.0        
##  [19] Biostrings_2.62.0      matrixStats_0.61.0     svglite_2.1.0         
##  [22] spatstat.sparse_2.1-0  prettyunits_1.1.1      colorspace_2.0-2      
##  [25] rvest_1.0.2            rappdirs_0.3.3         blob_1.2.2            
##  [28] ggrepel_0.9.1          xfun_0.29              dplyr_1.0.8           
##  [31] crayon_1.5.0           RCurl_1.98-1.5         jsonlite_1.8.0        
##  [34] spatstat.data_2.1-2    survival_3.2-13        zoo_1.8-9             
##  [37] glue_1.6.2             polyclip_1.10-0        gtable_0.3.0          
##  [40] zlibbioc_1.40.0        XVector_0.34.0         webshot_0.5.2         
##  [43] leiden_0.3.9           future.apply_1.8.1     BiocGenerics_0.40.0   
##  [46] abind_1.4-5            scales_1.1.1           DBI_1.1.2             
##  [49] miniUI_0.1.1.1         Rcpp_1.0.8.3           progress_1.2.2        
##  [52] viridisLite_0.4.0      xtable_1.8-4           reticulate_1.24       
##  [55] spatstat.core_2.3-2    bit_4.0.4              stats4_4.1.2          
##  [58] htmlwidgets_1.5.4      httr_1.4.2             RColorBrewer_1.1-2    
##  [61] ellipsis_0.3.2         ica_1.0-2              farver_2.1.0          
##  [64] pkgconfig_2.0.3        XML_3.99-0.8           dbplyr_2.1.1          
##  [67] sass_0.4.0             uwot_0.1.11            deldir_1.0-6          
##  [70] utf8_1.2.2             labeling_0.4.2         tidyselect_1.1.2      
##  [73] rlang_1.0.2            reshape2_1.4.4         later_1.3.0           
##  [76] AnnotationDbi_1.56.2   munsell_0.5.0          tools_4.1.2           
##  [79] cachem_1.0.6           cli_3.2.0              generics_0.1.2        
##  [82] RSQLite_2.2.9          ggridges_0.5.3         evaluate_0.14         
##  [85] stringr_1.4.0          fastmap_1.1.0          yaml_2.3.5            
##  [88] goftest_1.2-3          bit64_4.0.5            fitdistrplus_1.1-6    
##  [91] purrr_0.3.4            RANN_2.6.1             KEGGREST_1.34.0       
##  [94] pbapply_1.5-0          future_1.23.0          nlme_3.1-155          
##  [97] mime_0.12              xml2_1.3.3             compiler_4.1.2        
## [100] rstudioapi_0.13        filelock_1.0.2         curl_4.3.2            
## [103] plotly_4.10.0          png_0.1-7              spatstat.utils_2.3-0  
## [106] tibble_3.1.6           bslib_0.3.1            stringi_1.7.6         
## [109] highr_0.9              lattice_0.20-45        Matrix_1.4-0          
## [112] vctrs_0.3.8            pillar_1.7.0           lifecycle_1.0.1       
## [115] spatstat.geom_2.3-1    lmtest_0.9-39          jquerylib_0.1.4       
## [118] RcppAnnoy_0.0.19       data.table_1.14.2      cowplot_1.1.1         
## [121] bitops_1.0-7           irlba_2.3.5            httpuv_1.6.5          
## [124] patchwork_1.1.1        R6_2.5.1               promises_1.2.0.1      
## [127] KernSmooth_2.23-20     gridExtra_2.3          IRanges_2.28.0        
## [130] parallelly_1.30.0      codetools_0.2-18       MASS_7.3-55           
## [133] assertthat_0.2.1       withr_2.4.3            sctransform_0.3.3     
## [136] S4Vectors_0.32.3       GenomeInfoDbData_1.2.7 hms_1.1.1             
## [139] mgcv_1.8-38            parallel_4.1.2         grid_4.1.2            
## [142] rpart_4.1.16           tidyr_1.2.0            rmarkdown_2.11        
## [145] Rtsne_0.15             Biobase_2.54.0         shiny_1.7.1