Paired-End RNA-seq Pipeline¶

The following pipeline will pre-process, align, and quality check paired-end RNA-seq samples using the sub-modules discussed in earlier chapters. For more detailed information concerning these steps, please refer to the Align chapter.

Arguments¶

The help menu can be accessed by calling the following from the command line:

$ xpresspipe peRNAseq --help

Required Arguments	Description
`-i <path>, --input <path>`	Path to input directory – if paired-end, file names should be exactly the same except for `r1/r2.fastq` or similar suffix
`-o <path>, --output <path>`	Path to output directory
`-r <path>, --reference <path>`	Path to parent organism reference directory
`-g </path/transcripts.gtf>`, `--gtf </path/transcripts.gtf>`	Path and file name to GTF used for alignment quantification (only used for HTSeq quantification)
`-e`, `--experiment`	Experiment name

Optional Arguments	Description
`--suppress_version_check`	Suppress version checks and other features that require internet access during processing
`--two-pass`	Use a two-pass STAR alignment for novel splice junction discovery
`-a <adapter1 ...> [<adapter1 ...> ...]`, `--adapter <adapter1 ...> [<adapter1 ...> ...]`	Specify adapter(s) in list of strings – for single-end, only provide one adapter – if `None` are provided, software will attempt to auto-detect adapters – if “POLYX” is provided as a single string in the list, polyX adapters will be trimmed. If you want to auto-detect adapters in for paired-end reads, provide `None` twice
`-q <PHRED_value>, --quality <PHRED_value>`	PHRED read quality threshold (default: `28`)
`--min_length <length_value>`	Minimum read length threshold to keep for reads (default: `17`)
`--max_length <length_value>`	Maximum read length threshold to keep for reads (default: `0`). Setting this argument to 0 will result in no upper length limit.
`--remove_rrna`	Provide flag to remove rRNA records from alignment files (BAM files)
`--front_trim <length>`	Number of base pairs to trim from the 5’ ends of reads (not available for polyX trimming) (default: 1)
`--umi_location <location>`	Provide parameter to process UMIs – provide location (see fastp documentation for more details, generally for single-end sequencing, you would provide ‘read1’ here; does not work with -a polyX option)
`--umi_length <length>`	Provide parameter to process UMIs – provide UMI length (must provide the –umi_location argument); does not work with -a polyX option)
`--spacer_length <length>`	Provide UMI spacer length, if exists. (default: 0)
`--no_multimappers>`	Include flag to remove multimapping reads to be output and used in downstream analyses
`--deduplicate`	Include flag to quantify reads with de-duplication (will search for files with suffix `_dedupRemoved.bam`)
`--output_bed`	Include flag to output BED files for each aligned file
`-c`, `--quantification_method`	Specify quantification method (default: htseq; other option: cufflinks. If using Cufflinks, no downstream sample normalization is required)
`--feature_type <feature>`	Specify feature type (3rd column in GTF file) to be used if quantifying with htseq (default: CDS)
`--stranded <fr-unstranded/fr-firststrand` `/fr-secondstrand\|\|no/yes>`	Specify whether library preparation was stranded (Options before \|\| correspond with Cufflinks inputs, options after correspond with htseq inputs)
`--method <RPM, RPKM, FPKM, TPM>`	Normalization method to perform (options: “RPM”, “TPM”, “RPKM”, “FPKM”) – if using either TPM, RPKM, or FPKM, a GTF reference file must be included
`--vcf </path/to/file.vcf>`	Provide full path and file name to VCF file if you would like detect personal variants overlapping alignments
`--batch </path/filename.tsv>`	Include path and filename of dataframe with batch normalization parameters
`--sjdbOverhang <sjdbOverhang_amount>`	Specify length of genomic sequences for constructing splice-aware reference. Ideal length is `read length - 1`, so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of `100` should work in most cases
`--mismatchRatio <mismatchRatio>`	Alignment ratio of mismatches to mapped length is less than this value. See STAR documentation for more information on setting this parameter
`--seedSearchStartLmax <seedSearchStartLmax>`	Adjusting this parameter by providing a lower number will improve mapping sensitivity (recommended value = 15 for reads ~ 25 nts). See STAR documentation for more information on setting this parameter
`genome_size`	Only needs to be changed if this argument was provided curing reference building AND using a two-pass alignment. This should be the length of the organism’s genome in nucleotides
`-m`	Number of max processors to use for tasks (default: No limit)

Example 1: Run pipeline on paired-end RNA-seq sample files¶

$ xpresspipe peRNAseq \
              -i pe_test \
              -o pe_out \
              -r pe_reference \
              --gtf transcripts.gtf \
              -e pe_test \
              -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
              --method TPM \
              --sjdbOverhang 100