Read Pre-Processing

Read Trimming

Trimming is a necessary part of RNA-seq data processing because of the technological limitations described below:

- During RNA-seq library preparation, RNA is fragmented and adapter sequences are ligated to the fragments. These adapters carry information such as the sample batch and act as primers that let the sequencer recognize each fragment as something to analyze. Once sequenced, however, adapters can prevent alignment to a reference, because large stretches of the read are synthetic sequence not found in the organism’s genome or transcriptome.
- A sequencer reads a fragment base by base and calls the nucleotide at each position. While the technology has greatly improved over the years, some probability of error remains. Mis-called bases can prevent proper alignment of the sequenced fragment to the reference, so low-confidence base calls should be trimmed from each read.
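The two limitations above can be illustrated with a minimal Python sketch. It is not how fastp is implemented (fastp tolerates mismatches and offers several quality-handling modes); the function names are hypothetical, the adapter match is exact, and Phred+33 quality encoding is assumed.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_line]


def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_low_quality_tail(seq, qual, threshold=28):
    """Drop bases from the 3' end while their Phred score is below threshold."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], qual[:end]


# Toy read: 12 nt of insert followed by a 17 nt adapter.
# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
adapter = "CTGTAGGCACCATCAAT"
seq = "ACGTACGTACGT" + adapter
qual = "IIIIIIIIII##" + "I" * len(adapter)

seq, qual = trim_adapter(seq, qual, adapter)
seq, qual = trim_low_quality_tail(seq, qual, threshold=28)
print(seq)  # prints ACGTACGTAC
```

The adapter is removed first, which exposes the two low-confidence bases at the new 3' end so that the quality step can remove them.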

Trimming is performed by fastp.

Arguments

The help menu can be accessed by calling the following from the command line:
$ xpresspipe trim --help
| Required Arguments | Description |
| --- | --- |
| `-i <path>, --input <path>` | Path to input directory. For paired-end data, the file names of each pair should be identical except for the r1/r2.fastq (or similar) suffix |
| `-o <path>, --output <path>` | Path to output directory |

| Optional Arguments | Description |
| --- | --- |
| `--suppress_version_check` | Suppress version checks and other features that require internet access during processing |
| `-a <adapter1 ...> [<adapter1 ...> ...], --adapter <adapter1 ...> [<adapter1 ...> ...]` | Adapter(s) as a list of strings. For single-end reads, provide only one adapter. If none are provided, the software will attempt to auto-detect adapters. If “POLYX” is provided as the only string in the list, polyX adapters will be trimmed. To auto-detect adapters for paired-end reads, provide None twice |
| `-q <PHRED_value>, --quality <PHRED_value>` | PHRED read quality threshold (default: 28) |
| `--min_length <length_value>` | Minimum length a read must have to be kept (default: 17) |
| `--max_length <length_value>` | Maximum length a read may have to be kept (default: 0). Setting this argument to 0 results in no upper length limit |
| `--front_trim <length>` | Number of bases to trim from the 5’ ends of reads (not available for polyX trimming) (default: 1) |
| `--umi_location <location>` | Location of the UMIs to process. For internal UMIs that need to be processed after adapter trimming, provide “3prime”; otherwise see the fastp documentation for details (for single-end sequencing, you would generally provide “read1”). Does not work with the -a POLYX option |
| `--umi_length <length>` | Length of the UMIs to process (the --umi_location argument must also be provided). Does not work with the -a POLYX option |
| `--spacer_length <length>` | UMI spacer length, if one exists (default: 0) |
| `-m` | Maximum number of processors to use for tasks (default: Max) |
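To make the length-related options concrete, here is a minimal Python sketch of how the --front_trim, --min_length, and --max_length values could interact, using the defaults listed above. The function names are hypothetical, and the order in which fastp applies these steps is an assumption made for illustration.

```python
def front_trim(seq, n=1):
    """Remove n bases from the 5' end of a read (--front_trim, default 1)."""
    return seq[n:]


def passes_length_filter(seq, min_length=17, max_length=0):
    """Return True if the read should be kept.

    A max_length of 0 means no upper limit, as described for --max_length.
    """
    if len(seq) < min_length:
        return False
    if max_length > 0 and len(seq) > max_length:
        return False
    return True


# An 18 nt read is still kept after losing one base from its 5' end;
# a 17 nt read falls below the default minimum once trimmed.
read_18 = "A" * 18
read_17 = "A" * 17
print(passes_length_filter(front_trim(read_18)))  # prints True
print(passes_length_filter(front_trim(read_17)))  # prints False
```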

Example 1: Trim ribosome profiling sequence data using default preferences

- Raw reads are in FASTQ-like format and are found in the -i riboprof_test/ directory. Files can be uncompressed or compressed via gz or zip
- A general output directory has been created, -o riboprof_out/
- All other arguments use the default value
$ xpresspipe trim -i riboprof_test/ -o riboprof_out/

Example 2: Predict adapter and trim ribosome profiling sequence data

- A minimum read length of 22 nucleotides after trimming is required in order to keep the read
- A maximum of 6 processors can be used for the task
- The --adapter argument was not passed, so an attempt to discover adapter sequences will be made (this is not always the most efficient or thorough method of trimming, and providing the adapter sequences is recommended)
$ xpresspipe trim -i riboprof_test/ -o riboprof_out/ --min_length 22 -m 6
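If the command needs to be assembled from a script, one possible approach is to build the argument list in Python. The sketch below only constructs and prints the command; it does not run it. The helper name `build_trim_command` is hypothetical, and only flags documented on this page are used.

```python
import shlex


def build_trim_command(input_dir, output_dir, adapters=None,
                       min_length=None, max_processors=None):
    """Assemble an `xpresspipe trim` argument list from optional settings."""
    cmd = ["xpresspipe", "trim", "-i", input_dir, "-o", output_dir]
    if adapters:
        cmd += ["-a", *adapters]
    if min_length is not None:
        cmd += ["--min_length", str(min_length)]
    if max_processors is not None:
        cmd += ["-m", str(max_processors)]
    return cmd


cmd = build_trim_command("riboprof_test/", "riboprof_out/",
                         min_length=22, max_processors=6)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where XPRESSpipe is installed.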

Example 3: Trim adapter from ribosome profiling reads

- The default minimum read length threshold will be used
- The maximum number of processors will be used by default
- The --adapter argument was passed, so the adapter sequence will be trimmed explicitly
$ xpresspipe trim -i riboprof_test/ -o riboprof_out/ -a CTGTAGGCACCATCAAT

Example 4: Predict adapter and trim paired-end sequence data

- The --adapter argument was passed as None None, so an attempt to discover adapter sequences will be made for paired-end reads. The -a None None syntax is essential for trim to recognize the reads as paired-end
$ xpresspipe trim -i pe_test/ -o pe_out/ -a None None

Example 5: Pass explicit adapter and trim paired-end sequence data

- The --adapter argument was passed with two sequences, so the adapter sequences will be trimmed explicitly
$ xpresspipe trim -i pe_test/ -o pe_out/ -a ACACTCTTTCCCTACACGACGCTCTTCCGATC GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
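Paired-end processing relies on mate files whose names differ only in the r1/r2 suffix. Before running the command, it can help to check that every file in the input directory has a mate. The Python sketch below is one way to do that; the suffix pattern is a hypothetical convention (a `.`, `_`, or `-` separator followed by `r1` or `r2`), and XPRESSpipe's own matching logic may differ.

```python
import re

# Hypothetical naming convention: <stem><sep>r1|r2<.fastq|.fq>[.gz]
MATE_PATTERN = re.compile(
    r"^(?P<stem>.+?)[._-]r(?P<mate>[12])(?P<ext>\.(?:fastq|fq)(?:\.gz)?)$",
    re.IGNORECASE,
)


def pair_fastq_files(file_names):
    """Group file names into (r1, r2) tuples; unmatched files are skipped."""
    mates = {}
    for name in sorted(file_names):
        match = MATE_PATTERN.match(name)
        if match is None:
            continue
        key = (match.group("stem"), match.group("ext").lower())
        mates.setdefault(key, {})[match.group("mate")] = name
    return [
        (found["1"], found["2"])
        for _, found in sorted(mates.items())
        if "1" in found and "2" in found
    ]


files = ["sampleA_r1.fastq", "sampleA_r2.fastq", "sampleB_r1.fastq"]
print(pair_fastq_files(files))
# prints [('sampleA_r1.fastq', 'sampleA_r2.fastq')]
```

Here sampleB has no r2 mate, so it is left out of the result and could be reported to the user before trimming starts.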

Example 6: Trim single-end sequence data of polyX adapters

- The --adapter POLYX argument was passed, so polyX sequences will be trimmed from the reads
$ xpresspipe trim -i se_test/ -o se_out/ -a POLYX
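As a rough illustration of what polyX trimming does, the Python sketch below removes a homopolymer run from the 3' end of a read. It is a simplification, not the fastp algorithm: the function name is hypothetical, no mismatches are tolerated, and the minimum run length of 10 is an assumed threshold chosen for the example.

```python
def trim_polyx_tail(seq, min_run=10):
    """Remove a 3' homopolymer run if it is at least min_run bases long."""
    if not seq:
        return seq
    last_base = seq[-1]
    # Length of the run of identical bases at the 3' end.
    run = len(seq) - len(seq.rstrip(last_base))
    if run >= min_run:
        return seq[:-run]
    return seq


print(trim_polyx_tail("ACGTACGT" + "A" * 12))  # prints ACGTACGT
print(trim_polyx_tail("ACGTAAA"))              # prints ACGTAAA
```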

Example 7: Trim adapter from ribosome profiling reads and process UMIs

- The default minimum read length threshold will be used
- The maximum number of processors will be used by default
- The --adapter argument was passed, so the adapter sequence will be trimmed explicitly
- The --umi_location argument was passed, so UMI sequences will be processed from, in this case, the 3’ end of reads
- The --umi_length argument was passed, so UMIs will be processed as 5 nucleotides long in this case
$ xpresspipe trim \
  -i riboprof_test/ \
  -o riboprof_out/ \
  -a CTGTAGGCACCATCAAT \
  --umi_location 3prime \
  --umi_length 5
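Conceptually, processing a 3' UMI after adapter trimming means cutting the last few bases off the read and keeping them alongside the read so that duplicates can be identified later. The Python sketch below illustrates the idea for a 5-nucleotide UMI with no spacer. The function name and the choice to append the UMI to the read name with a `:` separator are assumptions for illustration, not a description of XPRESSpipe's or fastp's exact output format.

```python
def extract_3prime_umi(header, seq, qual, umi_length=5):
    """Move the last umi_length bases of an adapter-trimmed read into its name.

    Returns None when the read is too short to hold an insert plus a UMI.
    """
    if len(seq) <= umi_length:
        return None
    umi = seq[-umi_length:]
    # Keep any FASTQ comment (text after the first space) after the UMI tag.
    name, _, comment = header.partition(" ")
    new_header = f"{name}:{umi}" + (f" {comment}" if comment else "")
    return new_header, seq[:-umi_length], qual[:-umi_length]


record = extract_3prime_umi(
    "@read1 1:N:0",
    "ACGTACGTACGTACGTACGT" + "GGCTA",
    "I" * 25,
)
print(record[0])  # prints @read1:GGCTA 1:N:0
print(record[1])  # prints ACGTACGTACGTACGTACGT
```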