Curating References

In order to quantify transcription levels from RNA-Seq data, reads must be mapped to a reference genome or transcriptome. While there are multiple alignment software packages available, XPRESSpipe performs this step using a current version of STAR for several reasons:

- Splice Junction Aware: STAR is capable of mapping reads spanning a splice junction, where more traditional packages, such as Bowtie, are incapable of doing so and are better suited for tasks such as genome alignment.
- Performance: While computationally greedy (a human genome alignment requires upwards of 30 Gb RAM), the performance and accuracy is excellent compared to the majority of other splice-aware aligners currently available
- Standard: The foundation of the pipeline used in XPRESSpipe is based in the TCGA standards for RNA-Seq alignment. This method utilizes a guided or 2-pass alignment program. In the guided alignment, a GTF with annotated splice junctions is used to guide the alignments over splice juntions. In the 2-pass alignment, reads are mapped across the genome to identify novel splice junctions. These new annotations are then incorporated into the reference index and reads are re-aligned with this new reference. While more time-intensive, this step can aid in aligning across these junctions, especially in organisms where the transcriptome is not as well annotated. If mapping to a well-documented organism, this step can be forgone and STAR will use the GTF annotations to determine intronic regions of transcripts for read mapping.

XPRESSpipe Reference Requirements

An XPRESSpipe compatible reference directory must meet some requirements:

- All chromosomal genome fasta files are in their own directory within the parent reference directory. If a FASTA file with all chromosomes combined is available for your organism, this can be provided, but must be in its own directory.
- A sub-directory, named genome, contains the STAR reference files. If createReference is used to curate the reference, and the parent reference directory was provided as output location, this directory creation and file formatting will be handled automatically by XPRESSpipe.
- A transcript reference (GTF), is located in the reference parent directory and is named transcripts.gtf. If a coding-only or truncated reference GTFs are desired for read quantification, these should also be in this directory (truncate will handle file naming and formatting so long as the output location is specified as this parent directory). This file will then need to be specified within an XPRESSpipe pipeline.

Note

A completed reference directory can be created that follows these requirements by creating a directory, placing the transcripts.gtf and genomic chromosome fasta files in the parent directory and running curateReference as described below**

Get Sequence Files

The following is an example of how to get the reference files needed for generating a human reference:
$ mkdir human_reference
$ mkdir human_reference/genome_fasta
$ cd human_reference/
$ curl ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.gtf.gz -o transcripts.gtf.gz
$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do curl -O ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.${i}.fa.gz; done
$ gzip -d *.gz
$ mv *fasta genome_fasta
The chromosome IDs may vary depending on your organism.

Note

We recommend against using the toplevel Ensembl files. In our experience, this leads to RAM issues in STAR.

Perform Full Reference Curation

The following will create a XPRESSpipe-formatted reference directory containing all STAR reference files and transcript references needs for quantification and meta-analysis.
A parent reference directory containing the transcripts.gtf file and all chromosomal genome fasta files must be present.

More details as to what each specific parameter is doing can be found in the sections below.*

Arguments

The help menu can be accessed by calling the following from the command line:
$ xpresspipe curateReference --help
Required Arguments Description
-o <path>, --output <path> Path to output directory
-f <path>, --fasta <path> Path to genome fasta files (file names should end in .fa, .fasta, or .txt and no other files should exist in the directory with similar extensions)
-g </path/transcripts.gtf>, --gtf </path/transcripts.gtf> Path and file name to transcript reference file names ‘transcripts.gtf’
Optional Arguments Description
--suppress_version_check Suppress version checks and other features that require internet access during processing
-l, --longest_transcript Provide argument to keep only longest transcript per gene record (RECOMMENDED)
-p, --protein_coding Provide argument to keep only gene records annotated as protein coding genes
-t, --truncate Provide argument to truncate gene records
--truncate_5prime Amount to truncate from 5’ end of each transcript, requires –truncate argument (default: 45)
--truncate_3prime Amount to truncate from 3’ end of each transcript, requires –truncate argument (default: 15)
--sjdbOverhang <value> Specify length of genomic sequences for constructing splice-aware reference. Ideal length is read length - 1, so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of 100 should work in most cases
--genome_size <int> If mapping to an organism with a small genome, provide the length in nucleotides. If you are not sure your organism has a small genome, provide the number of bases and XPRESSpipe will decide if this parameter needs to be changed during runtime
--ucsc_format Input GTF is UCSC/refseq formatted. This flag only pertains to GTF modification, such as end truncation, not to STAR curation processes. Errors related to STAR GTF formatting need to be separately addressed.
-m Number of max processors to use for tasks (default: No limit)

Example 1: Create XPRESSpipe-formatted reference for single-end alignment

- Creates a star reference for single-end read mapping (1x50bp reads)
- Keeps the longest transcript for each gene record
- Keeps only protein_coding annotated transcripts
- Truncates the first 45 nucleotides from the first exon of every transcript (default)
- Truncates the last 15 nucleotides from the last exon of every transcript (default)
$ xpresspipe curateReference -o /path/to/se/ref/ -f /path/to/se/ref/ -g /path/to/se/ref/transcripts.gtf --longest_transcript --protein_coding --truncate --sjdbOverhang 49

Example 2: Create refFlat files

- Creates a star reference for paired-end read mapping (2x100bp reads)
- No modifications are made to the GTF file
- Processes are limited to 10 cores
$ xpresspipe curateReference -o /path/to/pe/ref/ -f /path/to/pe/ref/ -g /path/to/pe/ref/transcripts.gtf -m 10

STAR Reference Curation

The following creates a STAR reference compatible with XPRESSpipe. These files are output in a directory created during curation called genome in the specified --output directory.

Arguments

The help menu can be accessed by calling the following from the command line:
$ xpresspipe makeReference --help
Required Arguments Description
-o <path>, --output <path> Path to output directory
-f <path>, --fasta <path> Path to genome fasta files (file names should end in .fa, .fasta, or .txt and no other files should exist in the directory with similar extensions)
-g </path/transcripts.gtf>, --gtf </path/transcripts.gtf> Path and file name to transcript reference file names ‘transcripts.gtf (DO NOT USE MODIFIED GTF HERE)’
Optional Arguments Description
--suppress_version_check Suppress version checks and other features that require internet access during processing
--sjdbOverhang <int> Specify length of genomic sequences for constructing splice-aware reference. Ideal length is read length - 1, so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of 100 should work in most cases
--genome_size <int> If mapping to an organism with a small genome, provide the length in nucleotides. If you are not sure your organism has a small genome, provide the number of bases and XPRESSpipe will decide if this parameter needs to be changed during runtime
-m Number of max processors to use for tasks (default: No limit)

Example 1: Create a single-end sequencing reference

- Paths to output and location of genome fasta files for each chromosome are provided, as well as path and file name to transcripts.gtf file
- Default number of threads are used for preparing reference
$ xpresspipe makeReference -o /path/to/reference/ -f /path/to/reference/ -g /path/to/reference/transcripts.gtf --sjdbOverhang 49

Example 2: Create a paired-end sequencing reference

- 12 threads are specified for reference creation
- The as 2x100bp paired-end sequencing was used, the default value for --sjdbOverhang of 100 is appropriate in this case
$ xpresspipe makeReference -o /path/to/reference/ -f /path/to/reference/ -g /path/to/reference/transcripts.gtf -t 12

Example 3: Create a single-end sequencing reference for Saccharomyces cerevisiae

- Paths to output and location of genome fasta files for each chromosome are provided, as well as path and file name to transcripts.gtf file
- Default number of threads are used for preparing reference
- Genome size is specified
$ xpresspipe makeReference -o /path/to/reference/ -f /path/to/reference/ -g /path/to/reference/transcripts.gtf --sjdbOverhang 49 --genome_size 3000000

Reference Modification

At times, quantification of transcripts or CDSs to a modified reference is desirable. Below are some examples:

1. As ribosomal RNA (rRNA) contamination is common in RNA-seq, even when a depletion step was performed prior to library preparation, it is sometimes desirable to not count these and other non-coding RNAs in the quantification and analysis.
2. During ribosome profiling library preparation, where a 5’ and 3’ pile-up of ribosome footprints due to slow initiation and termination kinetics of footprints is common, it is suggested to exclude the first 45-50 nucleotides from the 5’ end and 15 nucleotides from the 3’ end of each CDS during quantification. This command will automatically curate an Ensembl GTF to meet these demands for read quantification. If a UCSC-formatted GTF is desired, users should supply the --ucsc-format flag; however, the --longest_transcript flag will not work with a UCSC-formatted GTF as longest transcript definitions are dependent on Ensembl annotations.
3. Several genes encode multiple isoforms or transcripts. During quantification, many software packages for counting reads to genes consider a read mapping to multiple transcripts of the same gene as a multi-mapper. Unless interested in alternate isoform usage, it is recommended that transcriptome reference files only contain the longest transcript for each gene.
The modifyGTF sub-module provides the ability to make the above-mentioned modifications to a GTF reference file. The modified GTF file is output at the end and the filename is labeled with the modifications made. Truncations to each transcript or CDS reference are strand-aware.

Arguments

The help menu can be accessed by calling the following from the command line:
$ xpresspipe modifyGTF --help
Required Arguments Description
-g </path/transcripts.gtf>, --gtf </path/transcripts.gtf> Path and file name to reference GTF
Optional Arguments Description
--suppress_version_check Suppress version checks and other features that require internet access during processing
-l, --longest_transcript Provide argument to keep only longest transcript per gene record (not necessary except in cases where the Ensembl canonical transcript is desired)
-p, --protein_coding Provide argument to keep only gene records annotated as protein coding genes
-t, --truncate Provide argument to truncate the CDSs of gene records
--truncate_5prime Amount to truncate from 5’ end of each CDS, requires –truncate argument (default: 45)
--truncate_3prime Amount to truncate from 3’ end of each CDS, requires –truncate argument (default: 15)
--ucsc_format Input GTF is UCSC/refseq formatted. This flag only pertains to GTF modification, such as end truncation, not to STAR curation processes. Errors related to STAR GTF formatting need to be separately addressed.
-m Number of max processors to use for tasks (default: No limit)

Example 1: Create longest transcript, protein coding-only, truncated reference

- Keeps the longest transcript for each gene record
- Keeps only protein_coding annotated transcripts
- Truncates the first 45 nucleotides from the first exon of every CDS (default)
- Truncates the last 15 nucleotides from the last exon of every CDS (default)
- Each modification desired must be implicitly passed to the sub-module
$ xpresspipe modifyGTF -g /path/to/reference/transcripts.gtf --longest_transcript --protein_coding --truncate