XPRESSpipe¶
About¶
Table of contents¶
Overview¶
Ribosome Profiling¶
5’ and 3’ ribosome footprint bias¶
xpresspipe modifyGTF
or xpresspipe curateReference
sub-modules with the flag -t
provided, the user can prepare the required files for appropriate footprint quantification.rRNA contamination¶
xpresspipe rrnaProbe
sub-module, one can determine what the dominant consensus rRNA species are and create depletion probes to prevent their incorporation into future sequence libraries.SE and PE RNA-seq¶
Software¶
fastp¶
STAR¶
Samtools, bedtools, deepTools¶
HTSeq¶
five_prime_utr
or three_prime_utr
for the --feature_type
option.Cufflinks¶
dupRadar¶
riboWaltz¶
SVA¶
DESeq2¶
MultiQC¶
Methodology¶
Transcriptomic Reference Files¶
xpresspipe modifyGTF -l
.xpresspipe modifyGTF -p
.xpresspipe modifyGTF -t
handles this by searching the exon space of each transcript and pruning the given amounts off of each so that these regions are considered non-coding space. This process is performed recursively, so that if you were trimming 45 nt from the 5’ end and exon 1 was only 30 nt, exon 1 would be removed and exon 2 would be trimmed by 15 nt.PCR De-Duplication¶
Meta-Analysis¶
Quickstart¶
Running XPRESSpipe¶
build
module after installation:$ xpresspipe build
Video Walkthroughs¶
Note
The pip install .
method has been replaced with a script that is executed by running bash install.sh
.
Beginner’s Guide¶
First Steps¶
Install XPRESSpipe¶
Note
The pip install .
method has been replaced with a script that is executed by running bash install.sh
.
Generate Reference Files¶
$ cd ~/Desktop
$ mkdir reference_folder
$ mkdir reference_folder/fasta_files
icon is a shortcut for the User directory, and every directory needs to be separated by a /
reference_folder
#
signs. For GTF_URL
, you should change the URL currently provided to the one appropriate for your organism of interest. Make sure you are downloading the GTF file and NOT the GFF file. For FASTA_URL
, you should do the same as before with the URL to the chromosome DNA FASTA files, but you should only copy the URL up to “chromosome”, but not include the chromosome identifier. For CHROMOSOMES
, type out the chromosome identifiers you want to download between the ” characters with a space between each.Note
I do not personally recommend using the toplevel
genome sequence files. Whenever I have used these, I often run into a memory overload error during genome curation.
$ cd reference_folder/
### Change specific organism file names based on your organism of interest ###
$ echo 'GTF_URL=ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz' >> fetch.sh
$ echo 'FASTA_URL=ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome' >> fetch.sh
$ echo 'CHROMOSOMES="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y"' >> fetch.sh
$ echo 'curl -O $GTF_URL' >> fetch.sh
$ echo 'gzip -d Homo_sapiens.GRCh38.97.gtf.gz' >> fetch.sh
$ echo 'mv Homo_sapiens.GRCh38.97.gtf transcripts.gtf' >> fetch.sh
$ echo 'cd fasta_files/' >> fetch.sh
$ echo 'for X in $CHROMOSOMES; ' >> fetch.sh
$ echo 'do curl -O ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.${X}.fa.gz; done ' >> fetch.sh
$ echo 'gzip -d *.gz' >> fetch.sh
$ echo 'cd ../' >> fetch.sh
$ bash fetch.sh
fasta_file
directory to download the raw reference data and unzipped it. Finally, we returned to the main reference directory.Homo_sapiens.GRCh38.dna.chromosome
. You can download them, move them to the appropriate directories within your reference directory, and unzip the files by double-clicking on them.--sjdbOverhang
argument below).$ xpresspipe curateReference \
--output ./ \
--fasta fasta_files/ \
--gtf ./transcripts.gtf \
--protein_coding \
--truncate \
--sjdbOverhang 49
### or ###
$ xpresspipe build
### And then choose the curate option ###
--protein_coding
argument.--sjdbOverhang
argument to be the length of one of the paired-end reads - 1, so if we ran 2x100bp sequencing, we would specify --sjdbOverhang 99
(although in this case, the default of --sjdbOverhang 100
is just fine). If you changed this number, remember this for the next steps as you will need to provide it again if changed here.--genome_size
parameter with the the number of nucleotides in the organism’s genome. If you change this parameter in this step, you will need to provide the parameter and value in the align
, riboseq
, seRNAseq
, and seRNAseq
modules.Process Raw Sequencing Files¶
raw_data
or whatever you like and place your data there.output
--sjdbOverhang
amount we fed into the reference curation step, so in this case we will use --sjdbOverhang 49
$ xpresspipe riboseq --input raw_data/ \
--output output/ \
--reference reference_folder/ \
--gtf reference_folder/transcripts_LCT.gtf
--experiment riboseq_test
--adapter CTGTAGGCACCATCAAT
--method RPKM
--sjdbOverhang 49
### or ###
$ xpresspipe build
### And then choose the appropriate pipeline to build
Sequencing Metrics¶
riboseq_test_multiqc_report.html
. This file will compile the statistics from each processing step of the pipeline for each sample file you provided as input. Things like read quality, mapping, and quantification statistics can be found here. Just double-click the file or execute the following command to open in your default browser window.$ open riboseq_test_multiqc_report.html
Library Complexity¶
complexity
directory in your output folder, you will find summary PDFs for all samples processed analyzing library complexity of each sample.Metagene Analysis¶
metagene
directory in your output folder, you will find summary PDFs for all samples processed analyzing the metagene profile of each sample.Periodicity (Ribosome Profiling)¶
periodicity
directory in your output folder, you will find summary PDFs for all samples processed analyzing ribosome periodicity of each of each sample containing reads 28-30nt.Count Data and Downstream Analysis¶
counts
directory in your output folder, you will find individual counts tables for each sample, as well as compiled tables for each sample that was processed.Supercomputing¶
Install¶
$ cd ~
$ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
$ git clone https://github.com/XPRESSyourself/XPRESSpipe.git
$ conda env create -f ./XPRESSpipe/requirements.yml
$ conda activate xpresspipe
$ pip install ./XPRESSpipe
$ cd ~
$ xpresspipe test
Run Data¶
xpresspipe
, you will use the following as a template. If you named the conda environment something else, you would replace the line conda activate xpresspipe
with conda activate env_name
. If dependencies were installed to the base environment, the source $(conda...
and conda activate ...
lines are unnecessary.#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH -o /scratch/general/lustre/$USER/slurmjob-%j
#SBATCH --partition=this_cluster_has_no_name
source $(conda info --base)/etc/profile.d/conda.sh
source activate xpresspipe
#set up the temporary directory
SCRDIR=/scratch/general/lustre/$USER/$SLURM_JOBID
mkdir -p $SCRDIR
# Provide location of raw data and parent reference directory
SRA=/scratch/general/lustre/$USER/files/your_favorite_experiment_goes_here
REF=/scratch/general/lustre/$USER/references/fantastic_creature_reference
# Send raw data to your Scratch directory
mkdir $SCRDIR/input
cp $SRA/*.fastq $SCRDIR/input/.
# Make an output directory
mkdir $SCRDIR/output
cd $SCRDIR/.
xpresspipe riboseq -i $SCRDIR/input -o $SCRDIR/output/ -r $REF --gtf $REF/transcripts_CT.gtf -e this_is_a_test -a CTGTAGGCACCATCAAT --sjdbOverhang
$ sbatch my_batch_script.sh
$ watch -n1 squeue -u $USER
Explore the Data¶
$ cd ~/Desktop # Or any other location where you want to store and analyze the data
$ scp USERNAME@myCluster.chpc.university.edu:/full/path/to/files/file_name.suffix ./
Installation¶
Install XPRESSpipe¶
# If on a MacOS
$ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
# If on a LinuxOS
$ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash ~/Miniconda3-latest-MacOSX-x86_64.sh
# Enter yes for all successive prompts and allow the script to install Conda into your path
# After installation, the install script can be removed
rm ~/Miniconda3-latest-MacOSX-x86_64.sh
releases
tab on the XPRESSpipe GitHub repository).$ cd ~
$ curl -L -O https://github.com/XPRESSyourself/XPRESSpipe/archive/refs/tags/v0.6.3.zip
$ unzip XPRESSpipe-v0.6.3.zip
$ cd XPRESSpipe-v0.6.3/
Note
20 Oct 2021 - Currently, base conda is having issues resolving the dependencies required for XPRESSpipe. We recommend installing dependencies using mamba instead, which appears to resolve dependencies without issues. mamba is also conveniently faster than base conda.
$ conda env create -f requirements.yml
$ conda activate xpresspipe
$ conda install -c conda-forge mamba
$ mamba env create -f requirements.yml
$ conda activate xpresspipe
conda activate xpresspipe
to use XPRESSpipeNote
v0.6.3
and later employs the bash install.sh
method for installing XPRESSpipe. If using v0.6.2
or earlier, you should instead run pip install .
$ bash install.sh
$ xpresspipe test
xpresspipe --help
to see a list of the available modules within XPRESSpipe. To see specific parameters for a module, type xpresspipe <module_name> --help
.Install in a supercomputing environment¶
#SBATCH
lines and before any calls to XPRESSpipe in the slurm script, as below:#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH ...
...
source $(conda info --base)/etc/profile.d/conda.sh
source activate xpresspipe
... rest of the script
General Usage¶
fastq
file. However, the suffix for these files can be fq
or txt
as well. They can be zipped (zip
or gz
) or unzipped. When using intermediate sub-modules, such as align
or readDistribution
, input will vary and is explicatedin the --help
menu for each sub-module.
File Naming¶
fastq
, fastq.gz
, fq
, fq.gz
, txt
, txt.gz
. We recommend the fastq
or fastq.gz
suffix.Step 2
apply, but must the suffix must be prefaced by the paired read group number as below:ExperimentName_Rep1_a_WT.r1.fastq.gz
ExperimentName_Rep1_a_WT.r2.fastq.gz
ExperimentName_Rep2_a_WT.r1.fastq.gz
ExperimentName_Rep2_a_WT.r2.fastq.gz
ExperimentName_Rep1_a_WT.read1.fastq.gz
ExperimentName_Rep1_a_WT.read2.fastq.gz
ExperimentName_Rep2_a_WT.read1.fastq.gz
ExperimentName_Rep2_a_WT.read2.fastq.gz
Data Output¶
Running seRNAseq
, peRNAseq
, or riboseq
will output all intermediate and final data files as shown in this schematic:

Curating References¶
XPRESSpipe Reference Requirements¶
genome
, contains the STAR reference files. If createReference
is used to curate the reference, and the parent reference directory was provided as output location, this directory creation and file formatting will be handled automatically by XPRESSpipe.transcripts.gtf
. If a coding-only or truncated reference GTFs are desired for read quantification, these should also be in this directory (truncate
will handle file naming and formatting so long as the output location is specified as this parent directory). This file will then need to be specified within an XPRESSpipe pipeline.Note
A completed reference directory can be created that follows these requirements by creating a directory, placing the transcripts.gtf and genomic chromosome fasta files in the parent directory and running curateReference
as described below**
Get Sequence Files¶
$ mkdir human_reference
$ mkdir human_reference/genome_fasta
$ cd human_reference/
$ curl ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.gtf.gz -o transcripts.gtf.gz
$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do curl -O ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.${i}.fa.gz; done
$ gzip -d *.gz
$ mv *fasta genome_fasta
Note
We recommend against using the toplevel
Ensembl files. In our experience, this leads to RAM issues in STAR.
Perform Full Reference Curation¶
Arguments¶
$ xpresspipe curateReference --help
Required Arguments | Description |
---|---|
-o <path>, --output <path> |
Path to output directory |
-f <path>, --fasta <path> |
Path to genome fasta files (file names should end in .fa, .fasta, or .txt and no other files should exist in the directory with similar extensions) |
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to transcript reference file names ‘transcripts.gtf’ |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
-l, --longest_transcript |
Provide argument to keep only longest transcript per gene record (RECOMMENDED) |
-p, --protein_coding |
Provide argument to keep only gene records annotated as protein coding genes |
-t, --truncate |
Provide argument to truncate gene records |
--truncate_5prime |
Amount to truncate from 5’ end of each transcript, requires –truncate argument (default: 45) |
--truncate_3prime |
Amount to truncate from 3’ end of each transcript, requires –truncate argument (default: 15) |
--sjdbOverhang <value> |
Specify length of genomic sequences for constructing splice-aware reference. Ideal length is read length - 1 , so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of 100 should work in most cases |
--genome_size <int> |
If mapping to an organism with a small genome, provide the length in nucleotides. If you are not sure your organism has a small genome, provide the number of bases and XPRESSpipe will decide if this parameter needs to be changed during runtime |
--ucsc_format |
Input GTF is UCSC/refseq formatted. This flag only pertains to GTF modification, such as end truncation, not to STAR curation processes. Errors related to STAR GTF formatting need to be separately addressed. |
-m |
Number of max processors to use for tasks (default: No limit) |
Example 1: Create XPRESSpipe-formatted reference for single-end alignment¶
$ xpresspipe curateReference -o /path/to/se/ref/ -f /path/to/se/ref/ -g /path/to/se/ref/transcripts.gtf --longest_transcript --protein_coding --truncate --sjdbOverhang 49
Example 2: Create refFlat files¶
$ xpresspipe curateReference -o /path/to/pe/ref/ -f /path/to/pe/ref/ -g /path/to/pe/ref/transcripts.gtf -m 10
STAR Reference Curation¶
genome
in the specified --output
directory.Arguments¶
$ xpresspipe makeReference --help
Required Arguments | Description |
---|---|
-o <path>, --output <path> |
Path to output directory |
-f <path>, --fasta <path> |
Path to genome fasta files (file names should end in .fa, .fasta, or .txt and no other files should exist in the directory with similar extensions) |
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to transcript reference file names ‘transcripts.gtf (DO NOT USE MODIFIED GTF HERE)’ |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
--sjdbOverhang <int> |
Specify length of genomic sequences for constructing splice-aware reference. Ideal length is read length - 1 , so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of 100 should work in most cases |
--genome_size <int> |
If mapping to an organism with a small genome, provide the length in nucleotides. If you are not sure your organism has a small genome, provide the number of bases and XPRESSpipe will decide if this parameter needs to be changed during runtime |
-m |
Number of max processors to use for tasks (default: No limit) |
Example 1: Create a single-end sequencing reference¶
$ xpresspipe makeReference -o /path/to/reference/ -f /path/to/reference/ -g /path/to/reference/transcripts.gtf --sjdbOverhang 49
Example 2: Create a paired-end sequencing reference¶
--sjdbOverhang
of 100
is appropriate in this case$ xpresspipe makeReference -o /path/to/reference/ -f /path/to/reference/ -g /path/to/reference/transcripts.gtf -t 12
Example 3: Create a single-end sequencing reference for Saccharomyces cerevisiae¶
$ xpresspipe makeReference -o /path/to/reference/ -f /path/to/reference/ -g /path/to/reference/transcripts.gtf --sjdbOverhang 49 --genome_size 3000000
Reference Modification¶
--ucsc-format
flag; however, the --longest_transcript
flag will not work with a UCSC-formatted GTF as longest transcript definitions are dependent on Ensembl annotations.modifyGTF
sub-module provides the ability to make the above-mentioned modifications to a GTF reference file. The modified GTF file is output at the end and the filename is labeled with the modifications made. Truncations to each transcript or CDS reference are strand-aware.Arguments¶
$ xpresspipe modifyGTF --help
Required Arguments | Description |
---|---|
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to reference GTF |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
-l, --longest_transcript |
Provide argument to keep only longest transcript per gene record (not necessary except in cases where the Ensembl canonical transcript is desired) |
-p, --protein_coding |
Provide argument to keep only gene records annotated as protein coding genes |
-t, --truncate |
Provide argument to truncate the CDSs of gene records |
--truncate_5prime |
Amount to truncate from 5’ end of each CDS, requires –truncate argument (default: 45) |
--truncate_3prime |
Amount to truncate from 3’ end of each CDS, requires –truncate argument (default: 15) |
--ucsc_format |
Input GTF is UCSC/refseq formatted. This flag only pertains to GTF modification, such as end truncation, not to STAR curation processes. Errors related to STAR GTF formatting need to be separately addressed. |
-m |
Number of max processors to use for tasks (default: No limit) |
Example 1: Create longest transcript, protein coding-only, truncated reference¶
$ xpresspipe modifyGTF -g /path/to/reference/transcripts.gtf --longest_transcript --protein_coding --truncate
Single-End RNA-seq Pipeline¶
Arguments¶
$ xpresspipe seRNAseq --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to input directory – if paired-end, file names should be exactly the same except for r1/r2.fastq or similar suffix |
-o <path>, --output <path> |
Path to output directory |
-r <path>, --reference <path> |
Path to parent organism reference directory |
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to GTF used for alignment quantification (only used for HTSeq quantification) |
-e , --experiment |
Experiment name |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
--two-pass |
Use a two-pass STAR alignment for novel splice junction discovery |
-a <adapter1 ...> [<adapter1 ...> ...] , --adapter <adapter1 ...> [<adapter1 ...> ...] |
Specify adapter(s) in list of strings – for single-end, only provide one adapter – if None are provided, software will attempt to auto-detect adapters – if “POLYX” is provided as a single string in the list, polyX adapters will be trimmed. If you want to auto-detect adapters in for paired-end reads, provide None twice |
-q <PHRED_value>, --quality <PHRED_value> |
PHRED read quality threshold (default: 28 ) |
--min_length <length_value> |
Minimum read length threshold to keep for reads (default: 17 ) |
--max_length <length_value> |
Maximum read length threshold to keep for reads (default: 0 ). Setting this argument to 0 will result in no upper length limit. |
--remove_rrna |
Provide flag to remove rRNA records from alignment files (BAM files) |
--front_trim <length> |
Number of base pairs to trim from the 5’ ends of reads (not available for polyX trimming) (default: 1) |
--umi_location <location> |
Provide parameter to process UMIs – provide location (if working with internal UMIs that need to be processed after adapter trimming, provide “3prime”; else see fastp documentation for more details, generally for single-end sequencing, you would provide ‘read1’ here; does not work with -a polyX option) |
--umi_length <length> |
Provide parameter to process UMIs – provide UMI length (must provide the –umi_location argument); does not work with -a polyX option) |
--spacer_length <length> |
Provide UMI spacer length, if exists. (default: 0) |
--no_multimappers> |
Include flag to remove multimapping reads to be output and used in downstream analyses |
--deduplicate |
Include flag to quantify reads with de-duplication (will search for files with suffix _dedupRemoved.bam ) |
--output_bed |
Include flag to output BED files for each aligned file |
-c , --quantification_method |
Specify quantification method (default: htseq; other option: cufflinks. If using Cufflinks, no downstream sample normalization is required) |
--feature_type <feature> |
Specify feature type (3rd column in GTF file) to be used if quantifying with htseq (default: CDS) |
--stranded <fr-unstranded/fr-firststrand /fr-secondstrand||no/yes> |
Specify whether library preparation was stranded (Options before || correspond with Cufflinks inputs, options after correspond with htseq inputs) |
--method <RPM, RPKM, FPKM, TPM> |
Normalization method to perform (options: “RPM”, “TPM”, “RPKM”, “FPKM”) – if using either TPM, RPKM, or FPKM, a GTF reference file must be included |
--vcf </path/to/file.vcf> |
Provide full path and file name to VCF file if you would like detect personal variants overlapping alignments |
--batch </path/filename.tsv> |
Include path and filename of dataframe with batch normalization parameters |
--sjdbOverhang <sjdbOverhang_amount> |
Specify length of genomic sequences for constructing splice-aware reference. Ideal length is read length - 1 , so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of 100 should work in most cases |
--mismatchRatio <mismatchRatio> |
Alignment ratio of mismatches to mapped length is less than this value. See STAR documentation for more information on setting this parameter |
--seedSearchStartLmax <seedSearchStartLmax> |
Adjusting this parameter by providing a lower number will improve mapping sensitivity (recommended value = 15 for reads ~ 25 nts). See STAR documentation for more information on setting this parameter |
genome_size |
Only needs to be changed if this argument was provided curing reference building AND using a two-pass alignment. This should be the length of the organism’s genome in nucleotides |
-m |
Number of max processors to use for tasks (default: No limit) |
Example 1: Run pipeline on single-end RNA-seq sample files¶
$ xpresspipe seRNAseq \
-i se_test \
-o se_out \
-r se_reference \
--gtf transcripts_LC.gtf \
-e se_test \
-a CTGTAGGCACCATCAAT \
--method TPM \
--sjdbOverhang 49
Paired-End RNA-seq Pipeline¶
Arguments¶
$ xpresspipe peRNAseq --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to input directory – if paired-end, file names should be exactly the same except for r1/r2.fastq or similar suffix |
-o <path>, --output <path> |
Path to output directory |
-r <path>, --reference <path> |
Path to parent organism reference directory |
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to GTF used for alignment quantification (only used for HTSeq quantification) |
-e , --experiment |
Experiment name |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
--two-pass |
Use a two-pass STAR alignment for novel splice junction discovery |
-a <adapter1 ...> [<adapter1 ...> ...] , --adapter <adapter1 ...> [<adapter1 ...> ...] |
Specify adapter(s) in list of strings – for single-end, only provide one adapter – if None are provided, software will attempt to auto-detect adapters – if “POLYX” is provided as a single string in the list, polyX adapters will be trimmed. If you want to auto-detect adapters in for paired-end reads, provide None twice |
-q <PHRED_value>, --quality <PHRED_value> |
PHRED read quality threshold (default: 28 ) |
--min_length <length_value> |
Minimum read length threshold to keep for reads (default: 17 ) |
--max_length <length_value> |
Maximum read length threshold to keep for reads (default: 0 ). Setting this argument to 0 will result in no upper length limit. |
--remove_rrna |
Provide flag to remove rRNA records from alignment files (BAM files) |
--front_trim <length> |
Number of base pairs to trim from the 5’ ends of reads (not available for polyX trimming) (default: 1) |
--umi_location <location> |
Provide parameter to process UMIs – provide location (see fastp documentation for more details, generally for single-end sequencing, you would provide ‘read1’ here; does not work with -a polyX option) |
--umi_length <length> |
Provide parameter to process UMIs – provide UMI length (must provide the –umi_location argument); does not work with -a polyX option) |
--spacer_length <length> |
Provide UMI spacer length, if exists. (default: 0) |
--no_multimappers> |
Include flag to remove multimapping reads to be output and used in downstream analyses |
--deduplicate |
Include flag to quantify reads with de-duplication (will search for files with suffix _dedupRemoved.bam ) |
--output_bed |
Include flag to output BED files for each aligned file |
-c , --quantification_method |
Specify quantification method (default: htseq; other option: cufflinks. If using Cufflinks, no downstream sample normalization is required) |
--feature_type <feature> |
Specify feature type (3rd column in GTF file) to be used if quantifying with htseq (default: CDS) |
--stranded <fr-unstranded/fr-firststrand /fr-secondstrand||no/yes> |
Specify whether library preparation was stranded (Options before || correspond with Cufflinks inputs, options after correspond with htseq inputs) |
--method <RPM, RPKM, FPKM, TPM> |
Normalization method to perform (options: “RPM”, “TPM”, “RPKM”, “FPKM”) – if using either TPM, RPKM, or FPKM, a GTF reference file must be included |
--vcf </path/to/file.vcf> |
Provide full path and file name to VCF file if you would like detect personal variants overlapping alignments |
--batch </path/filename.tsv> |
Include path and filename of dataframe with batch normalization parameters |
--sjdbOverhang <sjdbOverhang_amount> |
Specify length of genomic sequences for constructing splice-aware reference. Ideal length is read length - 1 , so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of 100 should work in most cases |
--mismatchRatio <mismatchRatio> |
Alignment ratio of mismatches to mapped length is less than this value. See STAR documentation for more information on setting this parameter |
--seedSearchStartLmax <seedSearchStartLmax> |
Adjusting this parameter by providing a lower number will improve mapping sensitivity (recommended value = 15 for reads ~ 25 nts). See STAR documentation for more information on setting this parameter |
genome_size |
Only needs to be changed if this argument was provided curing reference building AND using a two-pass alignment. This should be the length of the organism’s genome in nucleotides |
-m |
Number of max processors to use for tasks (default: No limit) |
Example 1: Run pipeline on paired-end RNA-seq sample files¶
$ xpresspipe peRNAseq \
-i pe_test \
-o pe_out \
-r pe_reference \
--gtf transcripts.gtf \
-e pe_test \
-a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
--method TPM \
--sjdbOverhang 100
Ribosome Profiling Pipeline¶
Arguments¶
$ xpresspipe riboseq --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to input directory – if paired-end, file names should be exactly the same except for r1/r2.fastq or similar suffix |
-o <path>, --output <path> |
Path to output directory |
-r <path>, --reference <path> |
Path to parent organism reference directory |
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to GTF used for alignment quantification (only used for HTSeq quantification) |
-cdna_fasta </path/cdna_fasta.fa> |
Path and file name to reference cDNA FASTA file for P-site reference generation/location |
-e , --experiment |
Experiment name |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
--two-pass |
Use a two-pass STAR alignment for novel splice junction discovery |
-a <adapter1 ...> [<adapter1 ...> ...] , --adapter <adapter1 ...> [<adapter1 ...> ...] |
Specify adapter(s) in list of strings – for single-end, only provide one adapter – if None are provided, software will attempt to auto-detect adapters – if “POLYX” is provided as a single string in the list, polyX adapters will be trimmed. If you want to auto-detect adapters in for paired-end reads, provide None twice |
-q <PHRED_value>, --quality <PHRED_value> |
PHRED read quality threshold (default: 28 ) |
--min_length <length_value> |
Minimum read length threshold to keep for reads (default: 17 ) |
--max_length <length_value> |
Maximum read length threshold to keep for reads (default: 0 ). Setting this argument to 0 will result in no upper length limit. |
--remove_rrna |
Provide flag to remove rRNA records from alignment files (BAM files) |
--front_trim <length> |
Number of base pairs to trim from the 5’ ends of reads (not available for polyX trimming) (default: 1) |
--umi_location <location> |
Provide parameter to process UMIs – provide location (if working with internal UMIs that need to be processed after adapter trimming, provide “3prime”; else see fastp documentation for more details, generally for single-end sequencing, you would provide ‘read1’ here; does not work with -a polyX option) |
--umi_length <length> |
Provide parameter to process UMIs – provide UMI length (must provide the –umi_location argument); does not work with -a polyX option) |
--spacer_length <length> |
Provide UMI spacer length, if exists. (default: 0) |
--no_multimappers> |
Include flag to remove multimapping reads to be output and used in downstream analyses |
--deduplicate |
Include flag to quantify reads with de-duplication (will search for files with suffix _dedupRemoved.bam ) |
--output_bed |
Include flag to output BED files for each aligned file |
-c , --quantification_method |
Specify quantification method (default: htseq; other option: cufflinks. If using Cufflinks, no downstream sample normalization is required) |
--feature_type <feature> |
Specify feature type (3rd column in GTF file) to be used if quantifying with htseq (default: CDS) |
--stranded <fr-unstranded/fr-firststrand /fr-secondstrand||no/yes> |
Specify whether library preparation was stranded (Options before || correspond with Cufflinks inputs, options after correspond with htseq inputs) |
--method <RPM, RPKM, FPKM, TPM> |
Normalization method to perform (options: “RPM”, “TPM”, “RPKM”, “FPKM”) – if using either TPM, RPKM, or FPKM, a GTF reference file must be included |
--vcf </path/to/file.vcf> |
Provide full path and file name to VCF file if you would like detect personal variants overlapping alignments |
--batch </path/filename.tsv> |
Include path and filename of dataframe with batch normalization parameters |
--sjdbOverhang <sjdbOverhang_amount> |
Specify length of genomic sequences for constructing splice-aware reference. Ideal length is read length - 1 , so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of 100 should work in most cases |
--mismatchRatio <mismatchRatio> |
Alignment ratio of mismatches to mapped length is less than this value. See STAR documentation for more information on setting this parameter |
--seedSearchStartLmax <seedSearchStartLmax> |
Adjusting this parameter by providing a lower number will improve mapping sensitivity (recommended value = 15 for reads ~ 25 nts). See STAR documentation for more information on setting this parameter |
genome_size |
Only needs to be changed if this argument was provided curing reference building AND using a two-pass alignment. This should be the length of the organism’s genome in nucleotides |
-m |
Number of max processors to use for tasks (default: No limit) |
Example 1: Run pipeline on ribosome profiling sample files¶
$ xpresspipe riboseq \
-i riboprof_test \
-o ribopipe_out \
-r se_reference \
--gtf se_reference/transcript_CT.gtf \
--cdna_fasta se_reference/cdna_seqs.fa \
-e riboprof_test \
-a CTGTAGGCACCATCAAT \
--method RPM \
--sjdbOverhang 49
Example 2: Run pipeline on ribosome profiling sample files with UMIs¶
riboseq
sub-module. In this case, they use a 5 nucleotide UMI that is found at the 3’-end of each read, so the --umi_location 3prime
and --umi_length 5
options should be used. If a UMI spacer is part of the UMI structure, this can be provided with the --umi_spacer
option with the spacer length as input.$ xpresspipe riboseq \
-i riboprof_test \
-o ribopipe_out \
-r se_reference \
--gtf se_reference/transcript_CT.gtf \
--cdna_fasta se_reference/cdna_seqs.fa \
-e riboprof_test \
-a CTGTAGGCACCATCAAT \
--method RPM \
--sjdbOverhang 49 \
--umi_location 3prime \
--umi_length 5 \
--umi_spacer 0
Quality Control¶
Read Distribution Analysis¶
fastq
samples within a given directory. It is recommended this analysis be performed on trimmed reads to clean up adapters and get the true distribution of sequence reads in the library. When this is run within the pipeline, it will analyze just the post-trimming fastq
files.html
file for general summary statistics, which include read length distributions for all samples.Arguments¶
$ xpresspipe readDistribution --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to input directory of trimmed fastq (or untrimmed fastq) files |
-o <path>, --output <path> |
Path to output directory |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
-t <SE or PE>, --type <SE or PE> |
Sequencing type (“SE” for single-end, “PE” for paired-end) |
-e <experiment_name>, --experiment <experiment_name> |
Experiment name |
-m |
Number of max processors to use for tasks (default: No limit) |
Example 1: Analyze read distributions from ribosome profiling libraries¶
$ xpresspipe readDistribution -i riboprof_out/trimmed_fastq -o riboprof_out -e se_test

Metagene Analysis¶
$ xpresspipe metagene --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to input directory of transcriptome-mapped BAM files |
-o <path>, --output <path> |
Path to output directory |
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to un-modified reference GTF |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
-e <experiment_name>, --experiment <experiment_name> |
Experiment name |
--feature_type <feature_type> |
Specify feature type (3rd column in GTF file) to be used in calculating metagene coverage (default: exon; alternative: CDS) |
--bam_suffix <suffix> |
Change from default suffix of toTranscriptome.out.bam if transcriptome-mapped files were processed outside of XPRESSpipe |
-m <processors>, --max_processors <processors> |
Number of max processors to use for tasks (default: No limit) |
Example 1: Analyze metagene profiles of sequence libraries¶
$ xpresspipe metagene -i riboprof_out/alignments/ -o riboprof_out -g se_reference/transcripts.gtf -e se_test

Note
As you can probably see, there are systematic 5’ biases in these library preparations. A good RNA-seq library should generally have even coverage across all transcript positions.
Intron-collapsed Gene Coverage Analysis¶
$ xpresspipe geneCoverage --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to input directory of transcriptome-aligned BAM files |
-o <path>, --output <path> |
Path to output directory |
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to reference GTF |
-n <gene_name>, --gene_name <gene_name> |
Gene name (case sensitive) |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
-e <experiment_name>, --experiment <experiment_name> |
Experiment name to save output summaries as |
--bam_suffix <suffix> |
Change from default suffix of toTranscriptome.out.bam if using a different BAM file |
--type <type> |
Record type to map across (i.e. “exon”, “CDS”) (case-sensitive) |
--samples <sample_list> [<sample_list> ...] |
Provide a space-separated list of sample names to include in analysis (will only include those listed, and will plot in the order listed) |
--sample_names <suffix> |
Provide a space-separated list of sample names to use for labels |
--plot_color <color> |
Indicate plotting color |
-m <processors>, --max_processors <processors> |
Number of max processors to use for tasks (default: No limit) |
Example 1: Analyze gene coverage profile of sequence libraries¶
sort.bam
--samples
, these files will be plotted alphabetically, so the listed order should also be alphabetical. If using --samples
, need to specify names in the same order you provided for this argument.$ xpresspipe geneCoverage -i /path/to/bam_files -o ./ -g /path/to/reference.gtf \
-n SLC1A1 --type exon --bam_suffix .sort.bam \
--sample_names SRR1795425 SRR1795433 SRR1795435 SRR1795437

Note
The coverage estimations use a 20 nt rolling window mean method to smoothen the coverage plots. In both A and B in the image above, the top plot was generated with IGV (https://software.broadinstitute.org/software/igv/) and the bottom with xpresspipe geneCoverage
. Green boxes show approximately the same region for comparison.
P-site Analysis¶
$ xpresspipe p_sites --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to input directory of transcriptome-aligned BAM files |
-o <path>, --output <path> |
Path to output directory |
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to reference GTF |
-cdna_fasta </path/cdna_fasta.fa> |
Path and file name to reference cDNA FASTA file for P-site reference generation/location |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
--min_length <length_value> |
Minimum read length threshold to keep for reads (default: 17 ) |
--max_length <length_value> |
Maximum read length threshold to keep for reads (default: 0 ). Setting this argument to 0 will result in no upper length limit. |
-e <experiment_name>, --experiment <experiment_name> |
Experiment name to save output summaries as |
--bam_suffix <suffix> |
Change from default suffix of toTranscriptome.out.bam if using a different BAM file |
-m <processors>, --max_processors <processors> |
Number of max processors to use for tasks (default: No limit) |
Example 1: Analyze P-sites from ribosome profiling libraries¶
$ xpresspipe p_sites \
-i riboprof_out/alignments \
-o riboprof_out \
-g se_reference/transcripts.gtf \
-e se_test
Analysis¶
Differential Expression Analysis¶
Note
If intending to use the diffxpress
sub-module, you need to have used --quantification_method htseq
during read quantification as DESeq2 requires integer count data.
Sample Factor Files¶
sample_info
file. For example, if you wereevaluating a experimental vs control experiment for RNA-sequencing, you would provide a sample_info
file as follows:
**sample_info.txt**
Sample Condition
s1_rna a_WT
s2_rna a_WT
s3_rna b_EXP
s4_rna b_EXP
sample_info
file must be first alphabetically. In the case provided above, we want to compare the experimental condition VS the wild-type control condition, however these labels are not alphabetical. In this case, you can append letters to the beginning to force alphabetical order. For example, if you performed a experiment
vs wild-type
experiment, you would need to use the labels b_experiment
vs a_wild-type
to force a b_experiment
/ a_wild-type
comparison.sample_info
file. Since we want to perform another comparison with the footprint vs RNA-sequencing samples, we need to again ensure that these labels for this “Type” factor are listed in the correct alphabetical order to ensure we are performing a footprint VS RNA-sequencing comparison to reflect translation efficiency.**sample_info.txt**
Sample Condition Type
s1_fp a_WT RPF
s1_rna a_WT RNA
s2_fp a_WT RPF
s2_rna a_WT RNA
s3_fp b_EXP RPF
s3_rna b_EXP RNA
s4_fp b_EXP RPF
s4_rna b_EXP RNA
Note
As stated in the DESeq2 documentation: With no additional arguments to results, the log2 fold change and Wald test p value will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the reference level (see previous note on factor levels). However, the order of the variables of the design do not matter so long as the user specifies the comparison to build a results table for, using the name or contrast arguments of results.
Type+Condition+Type:Condition
where we are interesting in testing the difference in translation efficiencies between conditions, the interaction Type:Condition
is the coefficient being tested for differential expression. The model as designed will also account for differences only seen in the Type
or Condition
co-variates alone (for example, a specific bias to ribosome footprints vs. mRNA fragments).Arguments¶
$ xpresspipe diffxpress --help
Required Arguments | Description |
---|---|
-i <path/filename.tsv> , --input <path/filename.tsv> |
Path and file name of expression counts matrix |
-s <path/filename.tsv> , --sample <path/filename.tsv> |
Path and file name of sample information matrix |
--design <formula> |
Design formula for differential expression analysis (spaces in command line are conserved in input string. DO NOT INCLUDE ~ OR SPACES IN FORMULA IN COMMAND LINE, will be automatically added) |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
--shrink |
Provide argument to perform shrinkage of effect size on log fold changes. Useful for visualization and ranking of hits |
Example 1: Analyze ribosome profiling data¶
Condition
and Type
factor columns in the sample_info
file. If we want to include the RPF
/ RNA
comparison to account for translation efficiency, we would need to include these factor label as a column to ensure the appropriate RPF
/ RNA
evaluation. To perform a comparison between Tm-treated and Untreated cells, we will provide the TM
and UNTR
labels for the Condition
factor. With the provided design formula used below, we will be calculating:**tm_counts.tsv**
ribo_untr_a ribo_untr_b ribo_tm_a ribo_tm_b untr_a_hek untr_b_hek tm_a_hek tm_b_hek
A1BG 29 43 21 11 67 73 56 85
A2M 3 5 2 2 73 57 32 37
AAAS 1441 1981 934 601 1144 1067 1012 1124
AACS 575 727 310 192 351 335 220 291
AADAT 98 120 51 29 322 315 192 292
**tm_deseq.txt**
Sample Condition Type
untr_a_hek UNTR RNA
untr_b_hek UNTR RNA
ribo_untr_a UNTR RPF
ribo_untr_b UNTR RPF
tm_a_hek TM RNA
tm_b_hek TM RNA
ribo_tm_a TM RPF
ribo_tm_b TM RPF
$ xpresspipe diffxpress -i counts_data.tsv --sample sample_info.txt --design Type+Condition+Type:Condition
TM
vs UNTR
and RPF
(footprints) vs RNA
.**tm_counts_diffx.tsv**
baseMean log2FoldChange lfcSE stat pvalue padj
ATF4 3283.072674 2.542784311 0.134284453 18.93580577 5.78E-80 5.03E-76
PTP4A1 460.6444433 2.473962772 0.185061193 13.36834986 9.26E-41 4.03E-37
SPEN 7902.554413 1.192124338 0.109445545 10.89239713 1.25E-27 3.63E-24
RPS15A 1823.967865 -1.391099082 0.152069954 -9.147757652 5.81E-20 1.26E-16
DYNC1H1 11985.60418 0.85282198 0.094425503 9.031691164 1.69E-19 2.56E-16
log2FoldChange
and padj
columns. From this output, we see that ATF4 is the most significantly upregulated gene by translation efficiency between the TM and UNTR conditions, which is what we expect (see the XPRESSyourself manuscript for further discussion of this example). Further explanations of the other columns of this output can be found in the DESeq2 documentation.Example 2: Analyze RNA-seq data¶
EXP
vs WT
. To ensure this comparison if performed correctly, we need to force these Condition
factor labels to be alphabetical. We will thus rename them b_EXP
and a_WT
and do the following:**expression_counts.tsv**
s1 s2 s3 s4 ...
ENSG00000227232 66 59 1 82 ...
ENSG00000240361 35 0 7 72 ...
ENSG00000238009 20 70 85 78 ...
ENSG00000241860 96 7 93 38 ...
ENSG00000187634 73 41 92 77 ...
**sample_info.tsv**
Sample Condition
s1 a_WT
s2 a_WT
s3 a_WT
s4 a_WT
s5 b_EXP
s6 b_EXP
s7 b_EXP
s8 b_EXP
$ xpresspipe diffxpress -i test_r/test_dataset.tsv --sample test_r/sample_info.tsv --design Condition
Example 3: Analyze RNA-seq data that was prepared in different batches¶
Batch
factor column and provide different batch labels. This example below will control for batch effect and compare EXP
vs WT
expression.**expression_counts.tsv**
s1 s2 s3 s4 ...
ENSG00000227232 66 59 1 82 ...
ENSG00000240361 35 0 7 72 ...
ENSG00000238009 20 70 85 78 ...
ENSG00000241860 96 7 93 38 ...
ENSG00000187634 73 41 92 77 ...
**sample_info.tsv**
Sample Condition Batch
s1 a_WT batch1
s2 a_WT batch1
s3 a_WT batch1
s4 a_WT batch1
s5 b_EXP batch2
s6 b_EXP batch2
s7 b_EXP batch2
s8 b_EXP batch2
$ xpresspipe diffxpress -i test_r/test_dataset.tsv --sample test_r/sample_info.tsv --design Batch+Condition
rRNA Probe¶
rrnaProbe
sub-module was designed to facilitate this process.rrnaProbe
works by doing the following:$ xpresspipe rrnaProbe --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to zipped FASTQC files |
-o </path/filename>, --output </path/filename> |
Path and file name to write output |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
-m <value>, --min_overlap <value> |
Minimum number of bases that must match on a side to combine sequences (default: 5) |
--footprint_only |
Only take zip files that are ribosome profiling footprints (file names must contain “FP”, “RPF”, or “FOOTPRINT”) |
Example 1: Generate rank-ordered list of over-represented sequences¶
$ xpresspipe rrnaProbe -i riboprof_out/fastqc_out/ -o riboprof_out/sequences.txt --footprint_only
TTGATGATTCATAATAACTTTTCGAATCGCAT 514832
TATAAATCATTTGTATACGACTTAGAT 121739
TTGATGATTCATAATAACTTTTCGAATCGCAT 15776
TTTGATGATTCATAATAACTTTTCGAATCGCAC 33325
ATAAATCATTTGTATACGACTTAGAC 13603
Read Pre-Processing¶
Read Trimming¶
Arguments¶
$ xpresspipe trim --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to input directory – if paired-end, file names should be exactly the same except for r1/r2.fastq or similar suffix |
-o <path>, --output <path> |
Path to output directory |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
-a <adapter1 ...> [<adapter1 ...> ...] , --adapter <adapter1 ...> [<adapter1 ...> ...] |
Specify adapter(s) in list of strings – for single-end, only provide one adapter – if None are provided, software will attempt to auto-detect adapters – if “POLYX” is provided as a single string in the list, polyX adapters will be trimmed. If you want to auto-detect adapters in for paired-end reads, provide None twice |
-q <PHRED_value>, --quality <PHRED_value> |
PHRED read quality threshold (default: 28 ) |
--min_length <length_value> |
Minimum read length threshold to keep for reads (default: 17 ) |
--max_length <length_value> |
Maximum read length threshold to keep for reads (default: 0 ). Setting this argument to 0 will result in no upper length limit. |
--front_trim <length> |
Number of base pairs to trim from the 5’ ends of reads (not available for polyX trimming) (default: 1) |
--umi_location <location> |
Provide parameter to process UMIs – provide location (if working with internal UMIs that need to be processed after adapter trimming, provide “3prime”; else see fastp documentation for more details, generally for single-end sequencing, you would provide ‘read1’ here; does not work with -a polyX option) |
--umi_length <length> |
Provide parameter to process UMIs – provide UMI length (must provide the –umi_location argument); does not work with -a polyX option) |
--spacer_length <length> |
Provide UMI spacer length, if exists. (default: 0) |
-m |
Number of max processors to use for tasks (default: Max) |
Example 1: Trim ribosome profiling sequence data using default preferences¶
fastq
-like and found in the -i riboprof_test/
directory. Can be uncompressed or compressed via gz
or zip
-o riboprof_out/
$ xpresspipe trim -i riboprof_test/ -o riboprof_out/
Example 2: Predict adapter and trim ribosome profiling sequence data¶
--adapters
argument was not passed, so an attempt to discover adapter sequences will be made (this is not always the most efficient or thorough method of trimming and providing the adapter sequences is recommended)$ xpresspipe trim -i riboprof_test/ -o riboprof_out/ --min_length 22 -m 6
Example 3: Trim adapter from ribosome profiling reads¶
--adapters
argument was passed, so adapter sequences will trimmed explicitly$ xpresspipe trim -i riboprof_test/ -o riboprof_out/ -a CTGTAGGCACCATCAAT
Example 4: Predict adapter and trim paired-end sequence data¶
--adapters
argument was passed as None None
, so an attempt to discover adapter sequences will be made for paired-end reads. The -a None None
syntax is essential for trim
to recognize the reads as paired-end$ xpresspipe trim -i pe_test/ -o pe_out/ -a None None
Example 5: Pass explicit adapter and trim paired-end sequence data¶
--adapters
argument was passed, so adapter sequences will trimmed explicitly$ xpresspipe trim -i pe_test/ -o pe_out/ -a ACACTCTTTCCCTACACGACGCTCTTCCGATC GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
Example 6: Trim single-end sequence data of polyX adapters¶
--adapters POLYX
argument was passed, so adapter sequences will trimmed of polyX sequences$ xpresspipe trim -i se_test/ -o se_out/ -a POLYX
Example 7: Trim adapter from ribosome profiling reads and process UMIs¶
--adapters
argument was passed, so adapter sequences will trimmed explicitly--umi_location
argument was passed, so adapter sequences will trimmed of UMI sequences from, in this case, the 3’-end of reads--umi_length
argument was passed, so adapter sequences will process UMIs as 5 nucleotides long in this case$ xpresspipe trim \
-i riboprof_test/ \
-o riboprof_out/ \
-a CTGTAGGCACCATCAAT \
--umi_location 3prime \
--umi_length 5
Alignment¶
Note
rRNA depletion using the --remove_rrna
option removes rRNA alignments from BAM files. This works by generating a BED file behind the scenes for rRNA transcripts, and removing them from the genome-aligned BAM file using bedtools intersect
. For transcriptome-aligned BAM files, a modified GTF file is generated for this step only with rRNA records removed in order to prevent their transcript mapping during this step.
Arguments¶
$ xpresspipe align --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to input directory |
-o <path>, --output <path> |
Path to output directory |
-r <path>, --reference <path> |
Path to parent organism reference directory (must have a file called transcripts.gtf within) |
-t <SE or PE>, --type <SE or PE> |
Sequencing type (“SE” for single-end, “PE” for paired-end) |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
--two-pass |
Use a two-pass STAR alignment for novel splice junction discovery |
--remove_rrna |
Provide flag to remove rRNA records from alignment files (BAM files) |
--no_multimappers> |
Include flag to remove multimapping reads to be output and used in downstream analyses |
--deduplicate |
Include flag to quantify reads with de-duplication (will search for files with suffix _dedupRemoved.bam ) |
--vcf </path/to/file.vcf> |
Provide full path and file name to VCF file if you would like detect personal variants overlapping alignments |
--output_bed |
Include flag to output BED files for each aligned file |
--sjdbOverhang <sjdbOverhang_amount> |
Specify length of genomic sequences for constructing splice-aware reference. Ideal length is read length - 1 , so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of 100 should work in most cases |
--mismatchRatio <mismatchRatio> |
Alignment ratio of mismatches to mapped length is less than this value. See STAR documentation for more information on setting this parameter |
--seedSearchStartLmax <seedSearchStartLmax> |
Adjusting this parameter by providing a lower number will improve mapping sensitivity (recommended value = 15 for reads ~ 25 nts). See STAR documentation for more information on setting this parameter |
genome_size |
Only needs to be changed if this argument was provided curing reference building AND using a two-pass alignment. Enter the size of your organism’s genome in nucleotides |
-m |
Number of max processors to use for tasks (default: No limit) |
Single-End RNAseq Alignment¶
--sjdbOverhang
argument, the same value should be entered that was used when creating the STAR reference files.Example 1: Single-end RNAseq alignment¶
fastq
-like and found in the -i /path/to/input/files/
directory. Can be uncompressed or compressed via gz
or zip
-o riboseq_out/
--type
is specified as ‘SE’ and path to parent reference directory is provided--sjdbOverhang
used in reference creation is provided. Failure to do so will trigger an erroroutput
$ xpresspipe align -i /path/to/input/files/ -o riboseq_out/ -t SE -r /path/to/reference/ --sjdbOverhang 49 --output_bed --output_bigwig
Paired-End RNAseq Alignment¶
--sjdbOverhang
argument, the same value should be entered that was used when creating the STAR reference files.Example 1: Paired-end RNAseq alignment¶
fastq
-like and found in the -i pe_test/
directory. Can be uncompressed or compressed via gz
or zip
-o pe_out/
--type
is specified as ‘PE’ and path to parent reference directory is provided--sjdbOverhang
used in reference creation is provided. Failure to do so will trigger an error. In this case, since the reference was created using default values, the optional flag is not used$ xpresspipe align -i /path/to/input/files/ -o riboseq_out -t PE -r /path/to/reference/
Read Quantification¶
Quantifying and Collating Reads¶
Arguments¶
$ xpresspipe count --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to input directory of SAM files |
-o <path>, --output <path> |
Path to output directory |
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to GTF used for alignment quantification (if a modified GTF was created, this should be provided here; if using Cufflinks and you want isoform abundance estimates, important that you do not provide a longest transcript only GTF) |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
-e , --experiment |
Experiment name |
-c , --quantification_method |
Specify quantification method (default: htseq; other option: cufflinks. If using Cufflinks, no downstream sample normalization is required) |
--feature_type <feature> |
Specify feature type (3rd column in GTF file) to be used if quantifying with htseq (default: CDS) |
--stranded <fr-unstranded/fr-firststrand /fr-secondstrand||no/yes> |
Specify whether library preparation was stranded (Options before || correspond with Cufflinks inputs, options after correspond with htseq inputs) |
--deduplicate |
Include flag to quantify reads with de-duplication (will search for files with suffix _dedupRemoved.bam ) |
--bam_suffix |
Change from default suffix of _Aligned.sort.bam |
-m |
Number of max processors to use for tasks (default: No limit) |
Example 1: Count ribosome profiling alignments¶
$ xpresspipe count -i riboseq_out/alignments/ -o riboseq_out/ -r se_reference/ -g se_reference/transcripts_codingOnly_truncated.gtf -e se_test
Example 2: Count paired-end alignments¶
$ xpresspipe count -i pe_out/alignments/ -o pe_out/ -r pe_reference/
Normalize¶
Sample Normalization¶
Assumptions¶
batch
argumentNormalization Methods¶
Batch Correction¶
Arguments¶
$ xpresspipe normalizeMatrix --help
Required Arguments | Description |
---|---|
-i <path/filename.tsv>, --input <path/filename.tsv> |
Path and file name of expression counts matrix |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
--method <RPM, RPKM, FPKM, LOG> |
Normalization method to perform (options: “RPM”, “TPM”, “RPKM”, “FPKM”) – if using either TPM, RPKM, or FPKM, a GTF reference file must be included |
-g </path/transcripts.gtf>, --gtf </path/transcripts.gtf> |
Path and file name to reference GTF (RECOMMENDED: Do not use modified GTF file) |
--batch </path/filename.tsv> |
Include path and filename of dataframe with batch normalization parameters |
Example 1: Perform RPKM normalization on single-end RNA-seq data¶
$ xpresspipe normalizeMatrix -i riboprof_out/counts/se_test_counts_table.tsv --method RPKM -g se_reference/transcripts_coding_truncated.gtf
Example 2: Perform batch normalization on RNA-seq data¶
> batch = pd.read_csv('./riboprof_out/counts/batch_info.tsv', sep='\t', index_col=0)
> batch
Sample Batch
0 s1 batch1
1 s2 batch2
2 s3 batch1
3 s4 batch2
$ xpresspipe normalizeMatrix -i riboprof_out/counts/se_test_counts_table.tsv --batch riboprof_out/counts/batch_info.tsv
Other Features¶
Convert Counts Table Gene Names¶
Arguments¶
$ xpresspipe convertNames --help
Required Arguments | Description |
---|---|
-i <path/filename>, --input <path/filename> |
Path and file name to sequence dataframe |
-g </path/transcripts.gtf> , --gtf </path/transcripts.gtf> |
Path and file name to GTF |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
--orig_name_label <label> |
Label of original name (usually “gene_id “) |
--orig_name_location <position> |
Position in last column of GTF where relevant data is found (i.e. 0 would be the first sub-string before the first comma, 3 would be the third sub-string after the second comma before the third comma) |
--new_name_label <label> |
Label of original name (usually “gene_id “) |
--new_name_location <position> |
Position in last column of GTF where relevant data is found (i.e. 0 would be the first sub-string before the first comma, 3 would be the third sub-string after the second comma before the third comma) |
--refill <label> |
In some cases, where common gene names are unavailable, the dataframe will fill the gene name with the improper field of the GTF. In this case, specify this improper string and these values will be replaced with the original name |
Example 1: Convert gene names in count dataframe¶
$ xpresspipe convertNames -i riboprof_out/counts/se_test_counts_table.csv -g se_reference/transcripts.gtf
FAQs¶
A step of the pipeline is erroring for no apparent reason¶
geneCoverage
module on a supercomputing cluster, the pipeline responded saying it couldn’t find the appropriate index file. It turned out the R package, GenomicFeatures was not downloaded due to issues with the rtracklayers package. For this situation, we fixed it by uninstalling Anaconda and reinstalling the dependencies, as below:# Run each of these steps. If a command doesn't work, skip to the next one
$ conda install anaconda-clean
$ anaconda-clean --yes
$ rm -rf ~/miniconda
$ rm ~/.condarc
$ rm -r ~/.conda/
$ cd ~
$ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash ~/Miniconda3-latest-Linux-x86_64.sh
$ conda env base --file XPRESSpipe/requirements.yml
yes
to all prompts. If you wish to install the XPRESSpipe dependencies to their own environment, replace base
with your_environment_name_here
in the last step. If XPRESSpipe continues to malfunction after completion of these steps, please reach out to us on the XPRESSpipe issues forum.The pipeline breaks because of a segmentation fault during alignment.¶
max_processors
with the number in the log file stated as available. For a computing node with 64 GB of RAM available, we generally see that 20 CPUs is stable. See log example below:sh: line 1: 70311 Segmentation fault STAR --runThreadN 30 ...
or
WARNING: fastp uses up to 16 threads although you specified 32
Updates¶
v0.6.3¶
pip install .
method, a new install script is provided that handles installation of some dependencies better. The new method now only requires the user to run bash install.sh
. See the Installation page for an updated walkthrough on this update.args_dict
Previous versions¶
v0.6.2¶
--suppress_version_check
flag to enable use of XPRESSpipe without internet access--smoothen
flag to any module that uses the geneCoverage
sub-module. By default, a sliding window will not be used to smoothen the geneCoverage plots. If provided, a rolling window set at 20 will be used to smoothen the plots.v0.6.1¶
--ucsc_format
flag to the curateReference
or modifyGTF
sub-modules. These modifications in format only apply to XPRESSpipe GTF truncation features. Any formatting errors with the GTF file that pertain to alignment, counting, etc. dependencies will need to be addressed by the user.v0.6.0¶
v0.5.0¶
xpresspipe build
) to include recent additionsR 3.5.1
or greater would create an error where samtools markdup
would freezeR 3.5.1
or greater would create library linking error to stringi
, causing GenomicFeatures
to not function. Added to RbuildIndex.r
to reinstall stringi
, which appears to clear up the issue.v0.4.4¶
v0.4.3¶
v0.4.2¶
v0.4.1¶
v0.4.0¶
periodicity
is now called by p_sites