Paired-End RNA-seq Pipeline

The following pipeline will pre-process, align, and quality check paired-end RNA-seq samples using the sub-modules discussed in earlier chapters. For more detailed information concerning these steps, please refer to the Align chapter.

Arguments

The help menu can be accessed by calling the following from the command line:
$ xpresspipe peRNAseq --help
Required Arguments Description
-i <path>, --input <path> Path to input directory – if paired-end, file names should be exactly the same except for r1/r2.fastq or similar suffix
-o <path>, --output <path> Path to output directory
-r <path>, --reference <path> Path to parent organism reference directory
-g </path/transcripts.gtf>, --gtf </path/transcripts.gtf> Path and file name to GTF used for alignment quantification (only used for HTSeq quantification)
-e, --experiment Experiment name
Optional Arguments Description
--suppress_version_check Suppress version checks and other features that require internet access during processing
--two-pass Use a two-pass STAR alignment for novel splice junction discovery
-a <adapter1 ...> [<adapter1 ...> ...], --adapter <adapter1 ...> [<adapter1 ...> ...] Specify adapter(s) in list of strings – for single-end, only provide one adapter – if None are provided, software will attempt to auto-detect adapters – if “POLYX” is provided as a single string in the list, polyX adapters will be trimmed. If you want to auto-detect adapters in for paired-end reads, provide None twice
-q <PHRED_value>, --quality <PHRED_value> PHRED read quality threshold (default: 28)
--min_length <length_value> Minimum read length threshold to keep for reads (default: 17)
--max_length <length_value> Maximum read length threshold to keep for reads (default: 0). Setting this argument to 0 will result in no upper length limit.
--remove_rrna Provide flag to remove rRNA records from alignment files (BAM files)
--front_trim <length> Number of base pairs to trim from the 5’ ends of reads (not available for polyX trimming) (default: 1)
--umi_location <location> Provide parameter to process UMIs – provide location (see fastp documentation for more details, generally for single-end sequencing, you would provide ‘read1’ here; does not work with -a polyX option)
--umi_length <length> Provide parameter to process UMIs – provide UMI length (must provide the –umi_location argument); does not work with -a polyX option)
--spacer_length <length> Provide UMI spacer length, if exists. (default: 0)
--no_multimappers> Include flag to remove multimapping reads to be output and used in downstream analyses
--deduplicate Include flag to quantify reads with de-duplication (will search for files with suffix _dedupRemoved.bam)
--output_bed Include flag to output BED files for each aligned file
-c, --quantification_method Specify quantification method (default: htseq; other option: cufflinks. If using Cufflinks, no downstream sample normalization is required)
--feature_type <feature> Specify feature type (3rd column in GTF file) to be used if quantifying with htseq (default: CDS)
--stranded <fr-unstranded/fr-firststrand /fr-secondstrand||no/yes> Specify whether library preparation was stranded (Options before || correspond with Cufflinks inputs, options after correspond with htseq inputs)
--method <RPM, RPKM, FPKM, TPM> Normalization method to perform (options: “RPM”, “TPM”, “RPKM”, “FPKM”) – if using either TPM, RPKM, or FPKM, a GTF reference file must be included
--vcf </path/to/file.vcf> Provide full path and file name to VCF file if you would like detect personal variants overlapping alignments
--batch </path/filename.tsv> Include path and filename of dataframe with batch normalization parameters
--sjdbOverhang <sjdbOverhang_amount> Specify length of genomic sequences for constructing splice-aware reference. Ideal length is read length - 1, so for 2x100bp paired-end reads, you would use 100 - 1 = 99. However, the default value of 100 should work in most cases
--mismatchRatio <mismatchRatio> Alignment ratio of mismatches to mapped length is less than this value. See STAR documentation for more information on setting this parameter
--seedSearchStartLmax <seedSearchStartLmax> Adjusting this parameter by providing a lower number will improve mapping sensitivity (recommended value = 15 for reads ~ 25 nts). See STAR documentation for more information on setting this parameter
genome_size Only needs to be changed if this argument was provided curing reference building AND using a two-pass alignment. This should be the length of the organism’s genome in nucleotides
-m Number of max processors to use for tasks (default: No limit)

Example 1: Run pipeline on paired-end RNA-seq sample files

$ xpresspipe peRNAseq \
              -i pe_test \
              -o pe_out \
              -r pe_reference \
              --gtf transcripts.gtf \
              -e pe_test \
              -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
              --method TPM \
              --sjdbOverhang 100