SCIBER is a simple method that outputs the batch-effect corrected expression data in the original space/dimension. These expression data of individual genes can be directly used for all follow-up analyses. SCIBER has four steps; each step has a clear biological meaning, and the algorithms used for them are k-means clustering, t-test, Fisher’s exact test, and linear regression, respectively, all of which are easily comprehensible.
Install SCIBER with standard commands,
or install the development version of SCIBER with the following commands.
Once SCIBER is installed, load it.
We downloaded two batches of Human dendritic cell data from this paper
Villani, A. C., Satija, R., Reynolds, G., Sarkizova, S., Shekhar, K., Fletcher, J., … & Hacohen, N. (2017). Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science, 356(6335), eaah4573.
We library normalized the cells, log transformed the counts, and selected top 500 highly variable genes for each batch. We pooled all the genes and use them as the genes for both batches. The pre-processed data are available as part of this package.
Please note that for each data frame in the object meta
,
there should be two columns named cell_id
and
cell_type
. For instance, let meta_i
be a data
frame under meta
, and there should be two columns
meta_i$cell_id
and meta_i$cell_type
. If the
cell type information is not available, any values put in
meta_i$cell_type
should work.
We first specify the parameter we want to use in SCIBER. We set omega = 0.5 which is also the default setting in SCIBER. Setting ref_index = 1 indicates the first bacth is treated as the reference batch while the second is the query batch. By using n_core = 1, we only use 1 core to run SCIBER.
Let’s run SCIBER to remove the batch effects.
res <- SCIBER(input_batches = exp, ref_index = ref_index,
batches_meta_data = meta, omega = omega, n_core = n_core)
#> [1] "The available number of cores is 4. SCIBER uses 1 to perform batch effect removal."
The output of SCIBER is a list of batches, which is the same as the input exp. The order of batches in res is the same as that of exp.
Next, we combine the output batches, do PCA and UMAP before plotting them.
library(stats)
library(Matrix)
library(uwot)
do_PCA <- function(dat, PCs){
dat_pca_embeddings <- prcomp(t(as.matrix(dat)), scale. = F)
dat_pca_embeddings <- dat_pca_embeddings$x
dat_pca_embeddings <- dat_pca_embeddings[, 1:as.numeric(PCs)]
return(dat_pca_embeddings)
}
do_umap <- function(V) {
umap(
X = V,
n_threads = 6,
n_neighbors = 30L,
n_components = 2L,
metric = 'cosine',
n_epochs = NULL,
learning_rate = 1.0,
min_dist = 0.3,
spread = 1.0,
set_op_mix_ratio = 1.0,
local_connectivity = 1L,
repulsion_strength = 1,
negative_sample_rate = 1,
a = NULL,
b = NULL,
fast_sgd = FALSE,
verbose = FALSE
)
}
meta_data <- rbind(meta[[1]], meta[[2]])
rownames(meta_data) <- meta_data$cell_id
projected_dat <- cbind(res[[1]], res[[2]])
all(rownames(meta_data) == colnames(projected_dat))
#> [1] TRUE
SCIBER_pca <- do_PCA(projected_dat, PCs = 20)
SCIBER_umap <- do_umap(SCIBER_pca)
Then, we load necessary packages and function for plots.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(ggthemes)
library(cowplot)
#>
#> Attaching package: 'cowplot'
#> The following object is masked from 'package:ggthemes':
#>
#> theme_map
obtain_plot <- function(
umap_use,
meta_data,
label_name,
palette_use = tableau_color_pal()(10),
pt_size = 4, point_size = 0.5, pt_shape = '.',
base_size = 12,
do_points = TRUE,
do_density = FALSE,
legend_position = "top"
){
plt_df <- umap_use %>% data.frame() %>% cbind(meta_data) %>%
sample_frac(1L)
plt <- plt_df %>%
ggplot(aes_string("X1", "X2", col = label_name,fill = label_name)) +
theme_tufte(base_size = base_size) +
theme(panel.background = element_rect(fill = NA, color = "black")) +
guides(color = guide_legend(override.aes = list(stroke = 1,
alpha = 1, shape = 16, size = 4)),
alpha = FALSE) +
scale_color_manual(values = palette_use, guide = "none") +
scale_fill_manual(values = palette_use, guide = "none") +
theme(plot.title = element_text(hjust = 0.5, family = "sans"),
legend.text = element_text(family = "sans"),
legend.title = element_text(family = "sans"),
legend.position= as.character(legend_position)) +
labs(x = "UMAP 1", y = "UMAP 2")
if (do_points)
plt <- plt + geom_point(shape = pt_shape, size = point_size)
if (do_density)
plt <- plt + geom_density_2d()
return(plt)
}
Choose colors for cell types and batches.
colors_cell <- tableau_color_pal("Classic 20",
direction = 1)(length(unique(meta_data$cell_type)))
colors_batch <- tableau_color_pal("Classic Green-Orange 6",
direction = 1)(length(unique(meta_data$dataset)))
Let’s see the umap plots!
SCIBER_plt1 <- obtain_plot(SCIBER_umap, meta_data, "dataset", palette_use = colors_batch,
pt_shape = 19, pt_size = .4, legend_position = "top")
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
#> of ggplot2 3.3.4.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
SCIBER_plt2 <- obtain_plot(SCIBER_umap, meta_data, "cell_type", palette_use = colors_cell,
pt_shape = 19, pt_size = .4, legend_position = "top")
plot_grid(SCIBER_plt1, SCIBER_plt2, nrow = 2)
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] cowplot_1.1.3 ggthemes_5.1.0 ggplot2_3.5.1 dplyr_1.1.4 uwot_0.2.2
#> [6] Matrix_1.7-2 SCIBER_0.2.1 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.9 generics_0.1.3 stringi_1.8.4 lattice_0.22-6
#> [5] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3 grid_4.4.2
#> [9] fastmap_1.2.0 jsonlite_1.9.0 purrr_1.0.4 scales_1.3.0
#> [13] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.4 crayon_1.5.3
#> [17] rlang_1.1.5 RcppAnnoy_0.0.22 munsell_0.5.1 withr_3.0.2
#> [21] cachem_1.1.0 yaml_2.3.10 tools_4.4.2 parallel_4.4.2
#> [25] colorspace_2.1-1 buildtools_1.0.0 vctrs_0.6.5 R6_2.6.1
#> [29] lifecycle_1.0.4 stringr_1.5.1 irlba_2.3.5.1 pkgconfig_2.0.3
#> [33] pillar_1.10.1 bslib_0.9.0 gtable_0.3.6 glue_1.8.0
#> [37] Rcpp_1.0.14 xfun_0.51 tibble_3.2.1 tidyselect_1.2.1
#> [41] sys_3.4.3 knitr_1.49 farver_2.1.2 htmltools_0.5.8.1
#> [45] maketools_1.3.2 labeling_0.4.3 compiler_4.4.2