I've got Chromium Output... Now what?
The output from Chromium has some technical artifacts in it. Here I'll walk you through how to get rid of (at least some of) those artifacts.
Let's look at an example
The Chromium people will tell you that this is just because different cell types have different total RNA. That's true - but what we really want to be clustering on is "cell type markers" or something like that. Not total RNA content. Another problem arises when you want to compare two different Chromium runs since there are major batch effects in chemistry efficiency (as you will see when I walk you through a tutorial on how to post-process the Chromium output). Without correcting for total UMI, you'll just get clustering based on batch rather than by cell type. Below I'll show you how I currently take these factors into account and get a 'cleaned up' expression matrix (that is also PyMINEr compatible!).
Let's work on two Chromium files from the same type of sample
First, we need to get the Chromium software to be able to do anything with these files. There is a notable difference between V2 and V3 of CellRanger, so for working with your own dataset, make sure that you are using the same version of CellRanger that was used to make the output files.
Here, we'll be using CellRanger 3
Note that you don't need to download the references (which are ~11Gb).
Unzip the CellRanger tar.gz file, and keep track of where you have that folder - you'll need it later.
I used: ~/bin/cellranger-3.0.2
because it's compatible with the below two datasets
Download the "Feature / cell matrix HDF5 (filtered)" - this is what you'll use for your own files as well
Put each of these files in their own folder
I used: ~/Downloads/chromium_example/1k_v2 and ~/Downloads/chromium_example/1k_v3
Now we need to download the software I've written (with help reorganizing and cleaning by Anthony Fejes - Thanks!)
Now - unzip the downloaded folder (again keep track of this)
I put this in: ~/bin/pyminer_chromium_processing
Those are all the things that we needed to download... finally... So:
On with the analysis already!
First, open up terminal and cd into the directory with the processing software:
Now we'll use the first script called "process_chromium_h5" (Keep in mind that you can always add -h to the end of one of these command line calls to get more details on how to use them):
Now we'll do the same for the other file:
This generates a .TSV file (that can be opened in Excel) of the original dataset. This is the file that we'll use moving forward.
Now we'll look at the sum of unique molecular identifiers (UMI) that were observed in each cell in each dataset as well as the count of genes that were observed:
There should be some plots made that look like this:
Looking at these plots, you'll have to come up with some cutoffs to get the majority of these samples, but not include the outliers (bearing in mind this is log scale).
The sums are the total number of UMI in each cell, while the count of genes is how many unique genes in the genome showed up in that sample. As shown by the scatter, they're pretty well correlated, but there are some outlier groups too (see green in lower left of scatter). This is why we need cutoffs in both dimensions. In practice, this scatter often looks dirtier as well.
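To make the two quantities concrete, here is a minimal Python sketch of how the per-cell UMI sums and gene counts can be computed. It is not the code from the processing scripts; the toy matrix, the genes-as-rows layout, and the `per_cell_stats` helper are all assumptions made for illustration.

```python
import math

# Toy expression matrix of raw UMI counts.
# Assumed layout: rows are genes, columns are cells.
matrix = [
    [5, 0, 2],   # gene_a
    [0, 0, 7],   # gene_b
    [3, 1, 0],   # gene_c
    [0, 4, 1],   # gene_d
]


def per_cell_stats(matrix):
    """Return (umi_sums, gene_counts), with one entry per cell (column)."""
    n_cells = len(matrix[0])
    # Total UMI observed in each cell.
    umi_sums = [sum(row[j] for row in matrix) for j in range(n_cells)]
    # Number of genes with at least one UMI in each cell.
    gene_counts = [sum(1 for row in matrix if row[j] > 0) for j in range(n_cells)]
    return umi_sums, gene_counts


umi_sums, gene_counts = per_cell_stats(matrix)

# The plots are on a log10 scale, so transform before choosing cutoffs.
log10_sums = [math.log10(s) for s in umi_sums]
log10_counts = [math.log10(c) for c in gene_counts]

print(umi_sums, gene_counts)
```

A cell can have a respectable UMI sum but very few observed genes (or the reverse), which is what the outlier groups in the scatter reflect.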
Looking at the log10(sums), reasonable cutoffs look like 10^3.4 (i.e. 2512, the lower cutoff) and 10^4.25 (i.e. 17782, the upper cutoff). (See the arrows.)
For the log10(counts), reasonable cutoffs look like 10^2.8 (i.e. 631, the lower cutoff) and 10^3.6 (i.e. 3981, the upper cutoff). (See the arrows.)
After applying these cutoffs, we will be left with cells that have between 2512-17782 total UMI and 631-3981 observed genes. In the next script, we'll give it the input .TSV files and these cutoffs.
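As a rough illustration of what these cutoffs do, the Python sketch below converts the log10 values back to raw numbers and keeps only the cells that fall inside both ranges. The example cells and the `passes_cutoffs` helper are hypothetical, and treating the boundaries as inclusive is an assumption; the actual script may handle the edges differently.

```python
# Cutoffs read off the log10 histograms, converted back to raw values.
UMI_MIN, UMI_MAX = 10 ** 3.4, 10 ** 4.25     # roughly 2512 and 17782
GENE_MIN, GENE_MAX = 10 ** 2.8, 10 ** 3.6    # roughly 631 and 3981


def passes_cutoffs(umi_sum, gene_count):
    """True only if the cell is inside BOTH the UMI and the gene-count range."""
    umi_ok = UMI_MIN <= umi_sum <= UMI_MAX
    genes_ok = GENE_MIN <= gene_count <= GENE_MAX
    return umi_ok and genes_ok


# Hypothetical cells: (name, total UMI, observed genes)
cells = [
    ("cell_1", 5000, 1500),    # inside both ranges
    ("cell_2", 900, 400),      # too few UMI and too few genes
    ("cell_3", 30000, 5200),   # too many of both
    ("cell_4", 3000, 500),     # enough UMI, but too few genes
]

kept = [name for name, umi, genes in cells if passes_cutoffs(umi, genes)]
print(kept)
```

Note that `cell_4` is dropped even though its UMI sum is fine, which is the reason for filtering in both dimensions.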
Filtering, Downsampling, and Combining the TSV files
Now that we have good cutoffs, we'll make the output directory, then use these cutoffs in the next script which filters out the bad samples, downsamples (to 95% of the lowest included UMI sum), merges the files, and log transforms the downsampled matrix.
It'll take a few minutes (scaling with the number of cells). After it's done you'll have 4 files in your output folder:
The HDF5 files are your final combined data matrices - we need to use this file format given the size of the datasets often seen with scRNAseq. You can't really look at a dataset of 40,000 cells in Excel, unfortunately. The HDF5 files can be used as PyMINEr input along with the column and row IDs (column_IDs.txt & ID_list.txt respectively).
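If you're curious what the downsampling and log-transform step described above amounts to, here is a minimal Python sketch of the idea. It is not the script's actual implementation: the `downsample_cell` helper and the toy cells are hypothetical, sampling UMI without replacement is one reasonable way to downsample, and the log2(x + 1) transform is an assumption about the kind of log transform used.

```python
import math
import random


def downsample_cell(counts, target, rng):
    """Randomly keep `target` UMI from one cell, sampling without replacement."""
    # Expand the counts into one entry per UMI, labelled by gene index.
    pool = [gene for gene, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(pool, target)
    # Collapse the kept UMI back into per-gene counts.
    out = [0] * len(counts)
    for gene in kept:
        out[gene] += 1
    return out


# Hypothetical cells that already passed the cutoffs
# (each list is one cell's UMI counts per gene).
cells = [[40, 10, 0, 50], [100, 60, 40, 0], [5, 5, 80, 30]]

# Target depth: 95% of the lowest included UMI sum (integer arithmetic).
lowest = min(sum(c) for c in cells)
target = lowest * 95 // 100

rng = random.Random(0)  # fixed seed so the sketch is reproducible
downsampled = [downsample_cell(c, target, rng) for c in cells]

# Assumed transform: log2(x + 1), so zero counts stay at zero.
logged = [[math.log2(x + 1) for x in cell] for cell in downsampled]
```

After this step every cell has the same total UMI, so differences in sequencing depth or chemistry efficiency between runs no longer dominate the comparison.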
Analyzing with PyMINEr
Now that we've removed some of the technical artifacts from Chromium's processing pipeline, we'll actually get into the fun part of the analysis - figuring out some biology! (With PyMINEr - PS: I'm definitely not biased in my tool selection). If you already have PyMINEr installed:
If you don't have PyMINEr installed, here's how