Alignment

In order to quantify transcription on a transcript to transcript basis, individual reads called during sequencing must be mapped to the genome. While there are multiple alignment software packages available, XPRESSpipe uses a current version of STAR to perform this step in transcription quantification for several reasons:

- Performance: While computationally greedy (a human genome alignment requires upwards of 30 Gb RAM), the performance and accuracy is superior to the majority of other splice aware aligners currently available
- Splice Junction Aware: STAR is capable of mapping reads spanning a splice junction, where more traditional packages, such as Bowtie, are incapable of doing so and are better suited for tasks such as genome alignment.
- Standardized: The foundation of the pipeline used in XPRESSpipe is based in the TCGA standards for RNAseq alignment. This method utilizes a guided or 2-pass alignment program. In the guided alignment, a GTF with annotated splice junctions is used to guide the alignments over splice juntions. In the 2-pass alignment, reads are mapped across the genome to identify novel splice junctions. These new annotations are then incorporated into the reference index and reads are re-aligned with this new reference. While more time-intensive, this step can aid in aligning across these junctions, especially in organisms where the transcriptome is not as well annotated.
- Variant Aware: The user can provide a VCF, such as those provided by ClinVar and dbSNP. These files are useful in integrating information about common or disease nucleotide variants into the RNA-Seq alignment stage. The files you use should match the build of the genome you are using (i.e., if using Homo Sapiens GRCh38, these builds should match between curated reference files and VCF file).

Note

rRNA depletion using the --remove_rrna option removes rRNA alignments from BAM files. This works by generating a BED file behind the scenes for rRNA transcripts, and removing them from the genome-aligned BAM file using bedtools intersect. For transcriptome-aligned BAM files, a modified GTF file is generated for this step only with rRNA records removed in order to prevent their transcript mapping during this step.

Arguments

The help menu can be accessed by calling the following from the command line:
$ xpresspipe align --help
Required Arguments Description
-i <path>, --input <path> Path to input directory
-o <path>, --output <path> Path to output directory
-r <path>, --reference <path> Path to parent organism reference directory (must have a file called transcripts.gtf within)
-t <SE or PE>, --type <SE or PE> Sequencing type (“SE” for single-end, “PE” for paired-end)
Optional Arguments Description
--suppress_version_check Suppress version checks and other features that require internet access during processing
--two-pass Use a two-pass STAR alignment for novel splice junction discovery
--remove_rrna Provide flag to remove rRNA records from alignment files (BAM files)
--no_multimappers> Include flag to remove multimapping reads to be output and used in downstream analyses
--deduplicate Include flag to quantify reads with de-duplication (will search for files with suffix _dedupRemoved.bam)
--vcf </path/to/file.vcf> Provide full path and file name to VCF file if you would like detect personal variants overlapping alignments
--output_bed Include flag to output BED files for each aligned file
--sjdbOverhang <sjdbOverhang_amount> Specify length of genomic sequences for constructing splice-aware reference. Ideal length is read length - 1, so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of 100 should work in most cases
--mismatchRatio <mismatchRatio> Alignment ratio of mismatches to mapped length is less than this value. See STAR documentation for more information on setting this parameter
--seedSearchStartLmax <seedSearchStartLmax> Adjusting this parameter by providing a lower number will improve mapping sensitivity (recommended value = 15 for reads ~ 25 nts). See STAR documentation for more information on setting this parameter
genome_size Only needs to be changed if this argument was provided curing reference building AND using a two-pass alignment. Enter the size of your organism’s genome in nucleotides
-m Number of max processors to use for tasks (default: No limit)

Single-End RNAseq Alignment

The following runs single-end reads alignment using the specified XPRESSpipe-formatted reference directory.
Notes:
- For the --sjdbOverhang argument, the same value should be entered that was used when creating the STAR reference files.
- Ribosome profiling data can be run as a single-end RNA-seq

Example 1: Single-end RNAseq alignment

- Raw reads are fastq-like and found in the -i /path/to/input/files/ directory. Can be uncompressed or compressed via gz or zip
- A general output directory has been created, -o riboseq_out/
- --type is specified as ‘SE’ and path to parent reference directory is provided
- The value for --sjdbOverhang used in reference creation is provided. Failure to do so will trigger an error
- BED and BIGWIG files will be output in their own directories in output
- All other arguments use the default value
$ xpresspipe align -i /path/to/input/files/ -o riboseq_out/ -t SE -r /path/to/reference/ --sjdbOverhang 49 --output_bed --output_bigwig

Paired-End RNAseq Alignment

The following runs paired-end reads alignment using the specified XPRESSpipe-formatted reference directory.
Notes:
- For the --sjdbOverhang argument, the same value should be entered that was used when creating the STAR reference files.

Example 1: Paired-end RNAseq alignment

- Raw reads are fastq-like and found in the -i pe_test/ directory. Can be uncompressed or compressed via gz or zip
- A general output directory has been created, -o pe_out/
- --type is specified as ‘PE’ and path to parent reference directory is provided
- The value for --sjdbOverhang used in reference creation is provided. Failure to do so will trigger an error. In this case, since the reference was created using default values, the optional flag is not used
- BED and BIGWIG files are not output
- All other arguments use the default value
$ xpresspipe align -i /path/to/input/files/ -o riboseq_out -t PE -r /path/to/reference/