2. Tutorial (human, RSEM)#

This page describes the tutorial how to get the results from FASTQ files. The sample scripts are available at RumBall GitHub.

Note

This tutorial assumes using the RumBall singularity image (rumball.sif). Please add singularity exec rumball.sif before each command below.
Example: singularity exec rumball.sif download_genomedata.sh

2.1. Get data#

Here we use four mRNA-seq samples of HEK293 cells (siCTCF and control from Zuin et al., PNAS, 2014):

mkdir -p fastq
for id in SRR710092 SRR710093 SRR710094 SRR710095
do
    fastq-dump --gzip $id --split-files -O fastq
done

Then download and generate the reference dataset including genome, gene annotation and index files. RumBall contains several scripts to do that:

build=GRCh38  # specify the build (Ensembl) that you need
Ddir=Ensembl-$build/
mkdir -p log

# Download genome and gtf
download_genomedata.sh $build $Ddir 2>&1 | tee log/Ensembl-$build

# make index for STAR-RSEM
ncore=12 # number of CPUs
build-index-RNAseq.sh -p $ncore rsem-star $build $Ddir

2.2. Check Strandedness#

If the strandedness of RNA-seq data is not clear, you can briefly check by check_stranded.sh command:

$ check_stranded.sh human fastq/SRR710092_1.fastq.gz
# reads processed: 56830606
# reads with at least one alignment: 27787970 (48.90%)
# reads that failed to align: 29042636 (51.10%)
Reported 27787970 alignments
 540264 +
27247706 -

In this example, majority of reads were mapped on - strand, so this RNA-seq is stranded.

2.3. Mapping reads by STAR#

RumBall allows STAR, bowtie2, kallisto and salmon for mapping. Here we use STAR. The reads are then parsed by RSEM:

Ddir=Ensembl-GRCh38
mkdir -p log
star.sh paired HEK293_Control_rep1 "fastq/SRR710092_1.fastq.gz fastq/SRR710092_2.fastq.gz" $Ddir reverse > log/star.sh.HEK293_Control_rep1
star.sh paired HEK293_Control_rep2 "fastq/SRR710093_1.fastq.gz fastq/SRR710093_2.fastq.gz" $Ddir reverse > log/star.sh.HEK293_Control_rep2
star.sh paired HEK293_siCTCF_rep1  "fastq/SRR710094_1.fastq.gz fastq/SRR710094_2.fastq.gz" $Ddir reverse > log/star.sh.HEK293_siCTCF_rep1
star.sh paired HEK293_siCTCF_rep2  "fastq/SRR710095_1.fastq.gz fastq/SRR710095_2.fastq.gz" $Ddir reverse > log/star.sh.HEK293_siCTCF_rep2

Of course you can also use a shell loop:

ID=("SRR710092" "SRR710093" "SRR710094" "SRR710095")
NAME=("HEK293_Control_rep1" "HEK293_Control_rep2" "HEK293_siCTCF_rep1" "HEK293_siCTCF_rep2")

mkdir -p log
for ((i=0; i<${#ID[@]}; i++))
do
    echo ${NAME[$i]}
    fq1=fastq/${ID[$i]}_1.fastq.gz
    fq2=fastq/${ID[$i]}_2.fastq.gz
    star.sh paired ${NAME[$i]} "$fq1 $fq2" $Ddir reverse > log/${NAME[$i]}.star.sh
done

2.3.1. Differential analysis#

rsem_merge.sh merges the RSEM output of all samples. The generated matrix can be applied to DESeq2 or edgeR to identify differentially expressed genes between two groups:

Ctrl="star/HEK293_Control_rep1 star/HEK293_Control_rep2"
siCTCF="star/HEK293_siCTCF_rep1 star/HEK293_siCTCF_rep2"

# For DESeq2
mkdir -p Matrix_deseq2
rsem_merge.sh "$Ctrl $siCTCF" Matrix_deseq2/HEK293 $Ddir
DESeq2.sh Matrix_deseq2/HEK293 2:2 Control:siCTCF Human

# For edgeR
mkdir -p Matrix_edgeR
rsem_merge.sh "$Ctrl $siCTCF" Matrix_edgeR/HEK293 $Ddir
edgeR.sh Matrix_edgeR/HEK293 2:2 Control:siCTCF Human

From v0.3.0, DESeq2.sh and edgeR.sh also implement gene onthology (GO) analysis using ClusterProfiler and gprofiler2. They use top-ranked 500 upregulated/downregulated DEGs for the GO analysis. Use -n option the change the gene number.

2.4. Mapping reads by Bowtie2#

Because STAR requires large amounts of memory for mapping, it is not suitable for a non-high performance computing environment. Bowtie2 needs less memory with comparable mapping accuracy, although it is slower than STAR. Here is an example using Bowtie2.

ID=("SRR710092" "SRR710093" "SRR710094" "SRR710095")
NAME=("HEK293_Control_rep1" "HEK293_Control_rep2" "HEK293_siCTCF_rep1" "HEK293_siCTCF_rep2")

mkdir -p log
for ((i=0; i<${#ID[@]}; i++))
do
    echo ${NAME[$i]}
    fq1=fastq/${ID[$i]}_1.fastq.gz
    fq2=fastq/${ID[$i]}_2.fastq.gz
    bowtie2.sh paired ${NAME[$i]} "$fq1 $fq2" $Ddir reverse > log/${NAME[$i]}.bowtie2.sh
done

The results are stored in the bowtie2 directory. The mapping statistics are in the log files bowtie2/${NAME[$i]}.log. Additionally, the log files are parsed by parsebowtielog2.pl to output a summary table of all samples:

mkdir -p log
for ((i=0; i<${#ID[@]}; i++))
do
    parsebowtielog2.pl -p $odir/${NAME[$i]}.log
done

The -p option is needed if the reads are paired-end.

2.4.1. Differential analysis#

The differential analysis step is the same with the STAR example above:

Ctrl="bowtie2/HEK293_Control_rep1 bowtie2/HEK293_Control_rep2"
siCTCF="bowtie2/HEK293_siCTCF_rep1 bowtie2/HEK293_siCTCF_rep2"

# For DESeq2
mkdir -p Matrix_deseq2_bowtie2
rsem_merge.sh "$Ctrl $siCTCF" Matrix_deseq2_bowtie2/HEK293 $Ddir
DESeq2.sh Matrix_deseq2_bowtie2/HEK293 2:2 Control:siCTCF Human

# For edgeR
mkdir -p Matrix_edgeR_bowtie2
rsem_merge.sh "$Ctrl $siCTCF" Matrix_edgeR_bowtie2/HEK293 $Ddir
edgeR.sh Matrix_edgeR_bowtie2/HEK293 2:2 Control:siCTCF Human