Ribosome Profiling Pipeline¶

The following pipeline will pre-process, align, and quality check ribosome profiling samples using the sub-modules discussed in other sections of this documentation. For more detailed information concerning these steps, please refer to the Align chapter for the step you are interested in.

Arguments¶

The help menu can be accessed by calling the following from the command line:

$ xpresspipe riboseq --help

Required Arguments	Description
`-i <path>, --input <path>`	Path to input directory – if paired-end, file names should be exactly the same except for `r1/r2.fastq` or similar suffix
`-o <path>, --output <path>`	Path to output directory
`-r <path>, --reference <path>`	Path to parent organism reference directory
`-g </path/transcripts.gtf>`, `--gtf </path/transcripts.gtf>`	Path and file name to GTF used for alignment quantification (only used for HTSeq quantification)
`-cdna_fasta </path/cdna_fasta.fa>`	Path and file name to reference cDNA FASTA file for P-site reference generation/location
`-e`, `--experiment`	Experiment name

Optional Arguments	Description
`--suppress_version_check`	Suppress version checks and other features that require internet access during processing
`--two-pass`	Use a two-pass STAR alignment for novel splice junction discovery
`-a <adapter1 ...> [<adapter1 ...> ...]`, `--adapter <adapter1 ...> [<adapter1 ...> ...]`	Specify adapter(s) in list of strings – for single-end, only provide one adapter – if `None` are provided, software will attempt to auto-detect adapters – if “POLYX” is provided as a single string in the list, polyX adapters will be trimmed. If you want to auto-detect adapters in for paired-end reads, provide `None` twice
`-q <PHRED_value>, --quality <PHRED_value>`	PHRED read quality threshold (default: `28`)
`--min_length <length_value>`	Minimum read length threshold to keep for reads (default: `17`)
`--max_length <length_value>`	Maximum read length threshold to keep for reads (default: `0`). Setting this argument to 0 will result in no upper length limit.
`--remove_rrna`	Provide flag to remove rRNA records from alignment files (BAM files)
`--front_trim <length>`	Number of base pairs to trim from the 5’ ends of reads (not available for polyX trimming) (default: 1)
`--umi_location <location>`	Provide parameter to process UMIs – provide location (if working with internal UMIs that need to be processed after adapter trimming, provide “3prime”; else see fastp documentation for more details, generally for single-end sequencing, you would provide ‘read1’ here; does not work with -a polyX option)
`--umi_length <length>`	Provide parameter to process UMIs – provide UMI length (must provide the –umi_location argument); does not work with -a polyX option)
`--spacer_length <length>`	Provide UMI spacer length, if exists. (default: 0)
`--no_multimappers>`	Include flag to remove multimapping reads to be output and used in downstream analyses
`--deduplicate`	Include flag to quantify reads with de-duplication (will search for files with suffix `_dedupRemoved.bam`)
`--output_bed`	Include flag to output BED files for each aligned file
`-c`, `--quantification_method`	Specify quantification method (default: htseq; other option: cufflinks. If using Cufflinks, no downstream sample normalization is required)
`--feature_type <feature>`	Specify feature type (3rd column in GTF file) to be used if quantifying with htseq (default: CDS)
`--stranded <fr-unstranded/fr-firststrand` `/fr-secondstrand\|\|no/yes>`	Specify whether library preparation was stranded (Options before \|\| correspond with Cufflinks inputs, options after correspond with htseq inputs)
`--method <RPM, RPKM, FPKM, TPM>`	Normalization method to perform (options: “RPM”, “TPM”, “RPKM”, “FPKM”) – if using either TPM, RPKM, or FPKM, a GTF reference file must be included
`--vcf </path/to/file.vcf>`	Provide full path and file name to VCF file if you would like detect personal variants overlapping alignments
`--batch </path/filename.tsv>`	Include path and filename of dataframe with batch normalization parameters
`--sjdbOverhang <sjdbOverhang_amount>`	Specify length of genomic sequences for constructing splice-aware reference. Ideal length is `read length - 1`, so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of `100` should work in most cases
`--mismatchRatio <mismatchRatio>`	Alignment ratio of mismatches to mapped length is less than this value. See STAR documentation for more information on setting this parameter
`--seedSearchStartLmax <seedSearchStartLmax>`	Adjusting this parameter by providing a lower number will improve mapping sensitivity (recommended value = 15 for reads ~ 25 nts). See STAR documentation for more information on setting this parameter
`genome_size`	Only needs to be changed if this argument was provided curing reference building AND using a two-pass alignment. This should be the length of the organism’s genome in nucleotides
`-m`	Number of max processors to use for tasks (default: No limit)

Example 1: Run pipeline on ribosome profiling sample files¶

$ xpresspipe riboseq \
              -i riboprof_test \
              -o ribopipe_out \
              -r se_reference \
              --gtf se_reference/transcript_CT.gtf \
              --cdna_fasta se_reference/cdna_seqs.fa \
              -e riboprof_test \
              -a CTGTAGGCACCATCAAT \
              --method RPM \
              --sjdbOverhang 49

Example 2: Run pipeline on ribosome profiling sample files with UMIs¶

If following a ribosome profiling protocol that utilizes 3’ UMIs (unique molecular identifiers), as in McGlincey, 2017, you will want to provide the relevant details to the riboseq sub-module. In this case, they use a 5 nucleotide UMI that is found at the 3’-end of each read, so the --umi_location 3prime and --umi_length 5 options should be used. If a UMI spacer is part of the UMI structure, this can be provided with the --umi_spacer option with the spacer length as input.

$ xpresspipe riboseq \
              -i riboprof_test \
              -o ribopipe_out \
              -r se_reference \
              --gtf se_reference/transcript_CT.gtf \
              --cdna_fasta se_reference/cdna_seqs.fa \
              -e riboprof_test \
              -a CTGTAGGCACCATCAAT \
              --method RPM \
              --sjdbOverhang 49 \
              --umi_location 3prime \
              --umi_length 5 \
              --umi_spacer 0