Curating References¶

In order to quantify transcription levels from RNA-Seq data, reads must be mapped to a reference genome or transcriptome. While there are multiple alignment software packages available, XPRESSpipe performs this step using a current version of STAR for several reasons:

- Splice Junction Aware: STAR is capable of mapping reads spanning a splice junction, where more traditional packages, such as Bowtie, are incapable of doing so and are better suited for tasks such as genome alignment.
- Performance: While computationally greedy (a human genome alignment requires upwards of 30 Gb RAM), the performance and accuracy is excellent compared to the majority of other splice-aware aligners currently available
- Standard: The foundation of the pipeline used in XPRESSpipe is based in the TCGA standards for RNA-Seq alignment. This method utilizes a guided or 2-pass alignment program. In the guided alignment, a GTF with annotated splice junctions is used to guide the alignments over splice juntions. In the 2-pass alignment, reads are mapped across the genome to identify novel splice junctions. These new annotations are then incorporated into the reference index and reads are re-aligned with this new reference. While more time-intensive, this step can aid in aligning across these junctions, especially in organisms where the transcriptome is not as well annotated. If mapping to a well-documented organism, this step can be forgone and STAR will use the GTF annotations to determine intronic regions of transcripts for read mapping.

XPRESSpipe Reference Requirements¶

An XPRESSpipe compatible reference directory must meet some requirements:

- All chromosomal genome fasta files are in their own directory within the parent reference directory. If a FASTA file with all chromosomes combined is available for your organism, this can be provided, but must be in its own directory.
- A sub-directory, named genome, contains the STAR reference files. If createReference is used to curate the reference, and the parent reference directory was provided as output location, this directory creation and file formatting will be handled automatically by XPRESSpipe.
- A transcript reference (GTF), is located in the reference parent directory and is named transcripts.gtf. If a coding-only or truncated reference GTFs are desired for read quantification, these should also be in this directory (truncate will handle file naming and formatting so long as the output location is specified as this parent directory). This file will then need to be specified within an XPRESSpipe pipeline.

Note

A completed reference directory can be created that follows these requirements by creating a directory, placing the transcripts.gtf and genomic chromosome fasta files in the parent directory and running curateReference as described below**

Get Sequence Files¶

The following is an example of how to get the reference files needed for generating a human reference:

$ mkdir human_reference
$ mkdir human_reference/genome_fasta
$ cd human_reference/
$ curl ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.gtf.gz -o transcripts.gtf.gz
$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do curl -O ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.${i}.fa.gz; done
$ gzip -d *.gz
$ mv *fasta genome_fasta

The chromosome IDs may vary depending on your organism.

Note

We recommend against using the toplevel Ensembl files. In our experience, this leads to RAM issues in STAR.

Perform Full Reference Curation¶

The following will create a XPRESSpipe-formatted reference directory containing all STAR reference files and transcript references needs for quantification and meta-analysis.
A parent reference directory containing the transcripts.gtf file and all chromosomal genome fasta files must be present.

More details as to what each specific parameter is doing can be found in the sections below.*

Arguments¶

The help menu can be accessed by calling the following from the command line:

$ xpresspipe curateReference --help

Required Arguments	Description
`-o <path>, --output <path>`	Path to output directory
`-f <path>, --fasta <path>`	Path to genome fasta files (file names should end in .fa, .fasta, or .txt and no other files should exist in the directory with similar extensions)
`-g </path/transcripts.gtf>`, `--gtf </path/transcripts.gtf>`	Path and file name to transcript reference file names ‘transcripts.gtf’

Optional Arguments	Description
`--suppress_version_check`	Suppress version checks and other features that require internet access during processing
`-l, --longest_transcript`	Provide argument to keep only longest transcript per gene record (RECOMMENDED)
`-p, --protein_coding`	Provide argument to keep only gene records annotated as protein coding genes
`-t, --truncate`	Provide argument to truncate gene records
`--truncate_5prime`	Amount to truncate from 5’ end of each transcript, requires –truncate argument (default: 45)
`--truncate_3prime`	Amount to truncate from 3’ end of each transcript, requires –truncate argument (default: 15)
`--sjdbOverhang <value>`	Specify length of genomic sequences for constructing splice-aware reference. Ideal length is `read length - 1`, so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of `100` should work in most cases
`--genome_size <int>`	If mapping to an organism with a small genome, provide the length in nucleotides. If you are not sure your organism has a small genome, provide the number of bases and XPRESSpipe will decide if this parameter needs to be changed during runtime
`--ucsc_format`	Input GTF is UCSC/refseq formatted. This flag only pertains to GTF modification, such as end truncation, not to STAR curation processes. Errors related to STAR GTF formatting need to be separately addressed.
`-m`	Number of max processors to use for tasks (default: No limit)

Example 1: Create XPRESSpipe-formatted reference for single-end alignment¶

- Creates a star reference for single-end read mapping (1x50bp reads)
- Keeps the longest transcript for each gene record
- Keeps only protein_coding annotated transcripts
- Truncates the first 45 nucleotides from the first exon of every transcript (default)
- Truncates the last 15 nucleotides from the last exon of every transcript (default)

$ xpresspipe curateReference -o /path/to/se/ref/ -f /path/to/se/ref/ -g /path/to/se/ref/transcripts.gtf --longest_transcript --protein_coding --truncate --sjdbOverhang 49

Example 2: Create refFlat files¶

- Creates a star reference for paired-end read mapping (2x100bp reads)
- No modifications are made to the GTF file
- Processes are limited to 10 cores

$ xpresspipe curateReference -o /path/to/pe/ref/ -f /path/to/pe/ref/ -g /path/to/pe/ref/transcripts.gtf -m 10

STAR Reference Curation¶

The following creates a STAR reference compatible with XPRESSpipe. These files are output in a directory created during curation called genome in the specified --output directory.

Arguments¶

The help menu can be accessed by calling the following from the command line:

$ xpresspipe makeReference --help

Required Arguments	Description
`-o <path>, --output <path>`	Path to output directory
`-f <path>, --fasta <path>`	Path to genome fasta files (file names should end in .fa, .fasta, or .txt and no other files should exist in the directory with similar extensions)
`-g </path/transcripts.gtf>`, `--gtf </path/transcripts.gtf>`	Path and file name to transcript reference file names ‘transcripts.gtf (DO NOT USE MODIFIED GTF HERE)’

Optional Arguments	Description
`--suppress_version_check`	Suppress version checks and other features that require internet access during processing
`--sjdbOverhang <int>`	Specify length of genomic sequences for constructing splice-aware reference. Ideal length is `read length - 1`, so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of `100` should work in most cases
`--genome_size <int>`	If mapping to an organism with a small genome, provide the length in nucleotides. If you are not sure your organism has a small genome, provide the number of bases and XPRESSpipe will decide if this parameter needs to be changed during runtime
`-m`	Number of max processors to use for tasks (default: No limit)

Example 1: Create a single-end sequencing reference¶

- Paths to output and location of genome fasta files for each chromosome are provided, as well as path and file name to transcripts.gtf file

- Default number of threads are used for preparing reference

$ xpresspipe makeReference -o /path/to/reference/ -f /path/to/reference/ -g /path/to/reference/transcripts.gtf --sjdbOverhang 49

Example 2: Create a paired-end sequencing reference¶

- 12 threads are specified for reference creation

- The as 2x100bp paired-end sequencing was used, the default value for --sjdbOverhang of 100 is appropriate in this case

$ xpresspipe makeReference -o /path/to/reference/ -f /path/to/reference/ -g /path/to/reference/transcripts.gtf -t 12

Example 3: Create a single-end sequencing reference for Saccharomyces cerevisiae¶

- Paths to output and location of genome fasta files for each chromosome are provided, as well as path and file name to transcripts.gtf file
- Default number of threads are used for preparing reference
- Genome size is specified

$ xpresspipe makeReference -o /path/to/reference/ -f /path/to/reference/ -g /path/to/reference/transcripts.gtf --sjdbOverhang 49 --genome_size 3000000

Reference Modification¶

At times, quantification of transcripts or CDSs to a modified reference is desirable. Below are some examples:

1. As ribosomal RNA (rRNA) contamination is common in RNA-seq, even when a depletion step was performed prior to library preparation, it is sometimes desirable to not count these and other non-coding RNAs in the quantification and analysis.
2. During ribosome profiling library preparation, where a 5’ and 3’ pile-up of ribosome footprints due to slow initiation and termination kinetics of footprints is common, it is suggested to exclude the first 45-50 nucleotides from the 5’ end and 15 nucleotides from the 3’ end of each CDS during quantification. This command will automatically curate an Ensembl GTF to meet these demands for read quantification. If a UCSC-formatted GTF is desired, users should supply the --ucsc-format flag; however, the --longest_transcript flag will not work with a UCSC-formatted GTF as longest transcript definitions are dependent on Ensembl annotations.
3. Several genes encode multiple isoforms or transcripts. During quantification, many software packages for counting reads to genes consider a read mapping to multiple transcripts of the same gene as a multi-mapper. Unless interested in alternate isoform usage, it is recommended that transcriptome reference files only contain the longest transcript for each gene.
The modifyGTF sub-module provides the ability to make the above-mentioned modifications to a GTF reference file. The modified GTF file is output at the end and the filename is labeled with the modifications made. Truncations to each transcript or CDS reference are strand-aware.

Arguments¶

The help menu can be accessed by calling the following from the command line:

$ xpresspipe modifyGTF --help

Required Arguments	Description
`-g </path/transcripts.gtf>`, `--gtf </path/transcripts.gtf>`	Path and file name to reference GTF

Optional Arguments	Description
`--suppress_version_check`	Suppress version checks and other features that require internet access during processing
`-l, --longest_transcript`	Provide argument to keep only longest transcript per gene record (not necessary except in cases where the Ensembl canonical transcript is desired)
`-p, --protein_coding`	Provide argument to keep only gene records annotated as protein coding genes
`-t, --truncate`	Provide argument to truncate the CDSs of gene records
`--truncate_5prime`	Amount to truncate from 5’ end of each CDS, requires –truncate argument (default: 45)
`--truncate_3prime`	Amount to truncate from 3’ end of each CDS, requires –truncate argument (default: 15)
`--ucsc_format`	Input GTF is UCSC/refseq formatted. This flag only pertains to GTF modification, such as end truncation, not to STAR curation processes. Errors related to STAR GTF formatting need to be separately addressed.
`-m`	Number of max processors to use for tasks (default: No limit)

Example 1: Create longest transcript, protein coding-only, truncated reference¶

- Keeps the longest transcript for each gene record
- Keeps only protein_coding annotated transcripts
- Truncates the first 45 nucleotides from the first exon of every CDS (default)
- Truncates the last 15 nucleotides from the last exon of every CDS (default)
- Each modification desired must be implicitly passed to the sub-module

$ xpresspipe modifyGTF -g /path/to/reference/transcripts.gtf --longest_transcript --protein_coding --truncate