Normalize¶

Note: Sample and batch normalization can be performed in a single command. If this is done, batch normalization will be performed following sample normalization.

Sample Normalization¶

Due to inherent biases in RNA-seq samples (most commonly, different amounts of total RNA per sample in a given lane), samples must be normalized to obtain an accurate representation of transcription per sample. Additional normalization can be performed to normalize for transcript length (“per kilobase million”) as longer transcripts will naturally create more fragments mapping to a given gene, thus potentially making 1 transcript appear as many when quantified.

Assumptions¶

- R is installed on your machine and is in your $PATH if using the batch argument

- All input files are tab-delimited (with .txt or .tsv suffix)

Normalization Methods¶

The following equations summarize different way to normalize samples for RNA-seq:

Reads per Million
\(RPM_{g} = \frac{1e6 \cdot r_{\textit{ge}}}{\sum_{g=1}^{n} r_{\textit{ge}}}\)

Reads per Kilobase of Reads per Million
\(RPKM_{g} = \frac{1e9 \cdot r_{\textit{ge}}}{(\sum_{g=1}^{n} r_{\textit{ge}}) \cdot \textit{l} _{\textit{ge}}}\)

Fragments per Kilobase of Fragments per Million
\(FPKM_{g} = \frac{1e9 \cdot f_{\textit{ge}}}{(\sum_{g=1}^{n} f_{\textit{ge}}) \cdot \textit{l} _{\textit{ge}}}\)

Transcripts per Million  (same as RPKM, but order of operations is different)

\(TPM_{g} = \frac{1e6 \cdot r_{\textit{ge}}}{(\sum_{g=1}^{n} (\frac{1e3 \cdot r_{\textit{ge}}}{l_{\textit{ge}}})) \cdot \textit{l} _{\textit{ge}}}\)

In each of the above, assume g is gene n, ge is cumulative exon space for gene n, r is total reads, f is total fragments, and l is length

Batch Correction¶

When multiple people perform library preparation, or when libraries are prepared on different days, this can lead to inherent biases in count distributions between batches of samples. It is therefore necessary to normalize these effects when appropriate.

Arguments¶

The help menu can be accessed by calling the following from the command line:

$ xpresspipe normalizeMatrix --help

Required Arguments	Description
`-i <path/filename.tsv>, --input <path/filename.tsv>`	Path and file name of expression counts matrix

Optional Arguments	Description
`--suppress_version_check`	Suppress version checks and other features that require internet access during processing
`--method <RPM, RPKM, FPKM, LOG>`	Normalization method to perform (options: “RPM”, “TPM”, “RPKM”, “FPKM”) – if using either TPM, RPKM, or FPKM, a GTF reference file must be included
`-g </path/transcripts.gtf>, --gtf </path/transcripts.gtf>`	Path and file name to reference GTF (RECOMMENDED: Do not use modified GTF file)
`--batch </path/filename.tsv>`	Include path and filename of dataframe with batch normalization parameters

Example 1: Perform RPKM normalization on single-end RNA-seq data¶

$ xpresspipe normalizeMatrix -i riboprof_out/counts/se_test_counts_table.tsv --method RPKM -g se_reference/transcripts_coding_truncated.gtf

Example 2: Perform batch normalization on RNA-seq data¶

Inputs

> batch = pd.read_csv('./riboprof_out/counts/batch_info.tsv', sep='\t', index_col=0)
> batch
  Sample  Batch
0 s1      batch1
1 s2      batch2
2 s3      batch1
3 s4      batch2

Code

$ xpresspipe normalizeMatrix -i riboprof_out/counts/se_test_counts_table.tsv --batch riboprof_out/counts/batch_info.tsv