Analysis¶

Differential Expression Analysis¶

Differential Expression analysis allows one to determine significantly enriched or depleted genes between two conditions. XPESSpipe acts as a wrapper for DESeq2. Please refer to its documentation for more information.

Note

If intending to use the diffxpress sub-module, you need to have used --quantification_method htseq during read quantification as DESeq2 requires integer count data.

Requirements:

- R is installed on your machine and is in your $PATH (this should be handled in the installation)
- All input files are tab-delimited (with .txt or .tsv suffix)
- Design formula does not include the tilde (~) and there are no spaces

Sample Factor Files¶

Different factors to be evaluated in the differential expression analysis should each be denoted as a separate factor column in the sample_info file. For example, if you were

evaluating a experimental vs control experiment for RNA-sequencing, you would provide a sample_info file as follows:

**sample_info.txt**

  Sample      Condition
  s1_rna      a_WT
  s2_rna      a_WT
  s3_rna      b_EXP
  s4_rna      b_EXP

Your base (denominator) parameter in a given factor column in the sample_info file must be first alphabetically. In the case provided above, we want to compare the experimental condition VS the wild-type control condition, however these labels are not alphabetical. In this case, you can append letters to the beginning to force alphabetical order. For example, if you performed a experiment vs wild-type experiment, you would need to use the labels b_experiment vs a_wild-type to force a b_experiment / a_wild-type comparison.

If we want to consider additional factors, such as translation efficiency of footprint vs RNA-sequence samples for ribosome profiling, these should be included as additional factor columns in the sample_info file. Since we want to perform another comparison with the footprint vs RNA-sequencing samples, we need to again ensure that these labels for this “Type” factor are listed in the correct alphabetical order to ensure we are performing a footprint VS RNA-sequencing comparison to reflect translation efficiency.

**sample_info.txt**

  Sample    Condition   Type
  s1_fp     a_WT        RPF
  s1_rna    a_WT        RNA
  s2_fp     a_WT        RPF
  s2_rna    a_WT        RNA
  s3_fp     b_EXP       RPF
  s3_rna    b_EXP       RNA
  s4_fp     b_EXP       RPF
  s4_rna    b_EXP       RNA

The alphabetical order of the factor names (i.e., “Condition”, “Type”) does not matter. Instead, according to the DESeq2 documentation, these design factors are evaluated in the order listed.

The DESeq2 design formula specifies what is being modeled from the data.

Note

As stated in the DESeq2 documentation: With no additional arguments to results, the log2 fold change and Wald test p value will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the reference level (see previous note on factor levels). However, the order of the variables of the design do not matter so long as the user specifies the comparison to build a results table for, using the name or contrast arguments of results.

So, in the case of Type+Condition+Type:Condition where we are interesting in testing the difference in translation efficiencies between conditions, the interaction Type:Condition is the coefficient being tested for differential expression. The model as designed will also account for differences only seen in the Type or Condition co-variates alone (for example, a specific bias to ribosome footprints vs. mRNA fragments).

For more information on factor levels and design parameters, please see the DESeq2 documentation and this note. Any standard design formula that will work in DESeq2 will work within the XPRESSpipe wrapper, as long as the formatted described above is followed.

Other possible variations to DESeq2 analysis are available here, but not all will be compatible with the XPRESSpipe wrapper. In general, the XPRESSpipe wrapper is best suited to simple multi-factor design (Experimental vs Wild-type, Footprints vs RNA-sequencing, plus any other factors relevant to your experiment). For advice in preparing your design formula differently than in the examples listed below, please reach out to us here.

Arguments¶

The help menu can be accessed by calling the following from the command line:

$ xpresspipe diffxpress --help

Required Arguments	Description
`-i <path/filename.tsv>`, `--input <path/filename.tsv>`	Path and file name of expression counts matrix
`-s <path/filename.tsv>`, `--sample <path/filename.tsv>`	Path and file name of sample information matrix
`--design <formula>`	Design formula for differential expression analysis (spaces in command line are conserved in input string. DO NOT INCLUDE ~ OR SPACES IN FORMULA IN COMMAND LINE, will be automatically added)

Optional Arguments	Description
`--suppress_version_check`	Suppress version checks and other features that require internet access during processing
`--shrink`	Provide argument to perform shrinkage of effect size on log fold changes. Useful for visualization and ranking of hits

Example 1: Analyze ribosome profiling data¶

The source files can be found here.

If we want to perform differential expression of translation efficiency for ribosome profiling data, we need to provide Condition and Type factor columns in the sample_info file. If we want to include the RPF / RNA comparison to account for translation efficiency, we would need to include these factor label as a column to ensure the appropriate RPF / RNA evaluation. To perform a comparison between Tm-treated and Untreated cells, we will provide the TM and UNTR labels for the Condition factor. With the provided design formula used below, we will be calculating:

\(\frac{ ( RPF _{\textit{TM}} / RNA _{\textit{TM}} ) } { ( RPF _{\textit{UNTR}} / RNA _{\textit{UNTR}} ) }\)

**tm_counts.tsv**

        ribo_untr_a  ribo_untr_b  ribo_tm_a  ribo_tm_b  untr_a_hek  untr_b_hek  tm_a_hek  tm_b_hek
A1BG    29           43           21         11         67          73          56        85
A2M     3            5            2          2          73          57          32        37
AAAS    1441         1981         934        601        1144        1067        1012      1124
AACS    575          727          310        192        351         335         220       291
AADAT   98           120          51         29         322         315         192       292


**tm_deseq.txt**

Sample         Condition       Type
untr_a_hek     UNTR            RNA
untr_b_hek     UNTR            RNA
ribo_untr_a    UNTR            RPF
ribo_untr_b    UNTR            RPF
tm_a_hek       TM              RNA
tm_b_hek       TM              RNA
ribo_tm_a      TM              RPF
ribo_tm_b      TM              RPF

$ xpresspipe diffxpress -i counts_data.tsv --sample sample_info.txt --design Type+Condition+Type:Condition

The output of this analysis will perform differential expression that reflects both TM vs UNTR and RPF (footprints) vs RNA.

**tm_counts_diffx.tsv**

        baseMean       log2FoldChange        lfcSE             stat            pvalue          padj
ATF4    3283.072674    2.542784311           0.134284453       18.93580577     5.78E-80        5.03E-76
PTP4A1  460.6444433    2.473962772           0.185061193       13.36834986     9.26E-41        4.03E-37
SPEN    7902.554413    1.192124338           0.109445545       10.89239713     1.25E-27        3.63E-24
RPS15A  1823.967865    -1.391099082          0.152069954       -9.147757652    5.81E-20        1.26E-16
DYNC1H1 11985.60418    0.85282198            0.094425503        9.031691164    1.69E-19        2.56E-16

From this output, we can focus on the log2FoldChange and padj columns. From this output, we see that ATF4 is the most significantly upregulated gene by translation efficiency between the TM and UNTR conditions, which is what we expect (see the XPRESSyourself manuscript for further discussion of this example). Further explanations of the other columns of this output can be found in the DESeq2 documentation.

Example 2: Analyze RNA-seq data¶

For a standard two-condition RNA-seq experiment comparison, we are only interested in the differential expression of EXP vs WT. To ensure this comparison if performed correctly, we need to force these Condition factor labels to be alphabetical. We will thus rename them b_EXP and a_WT and do the following:

**expression_counts.tsv**

                s1  s2  s3  s4  ...
ENSG00000227232 66  59  1   82  ...
ENSG00000240361 35  0   7   72  ...
ENSG00000238009 20  70  85  78  ...
ENSG00000241860 96  7   93  38  ...
ENSG00000187634 73  41  92  77  ...


**sample_info.tsv**

Sample  Condition
s1      a_WT
s2      a_WT
s3      a_WT
s4      a_WT
s5      b_EXP
s6      b_EXP
s7      b_EXP
s8      b_EXP

$ xpresspipe diffxpress -i test_r/test_dataset.tsv --sample test_r/sample_info.tsv --design Condition

Example 3: Analyze RNA-seq data that was prepared in different batches¶

If samples were performed in multiple batches and you would like to control for batch effect, you can add a Batch factor column and provide different batch labels. This example below will control for batch effect and compare EXP vs WT expression.

See the DESeq2 documentation example for further information.

**expression_counts.tsv**

                s1  s2  s3  s4  ...
ENSG00000227232 66  59  1   82  ...
ENSG00000240361 35  0   7   72  ...
ENSG00000238009 20  70  85  78  ...
ENSG00000241860 96  7   93  38  ...
ENSG00000187634 73  41  92  77  ...


**sample_info.tsv**

Sample  Condition Batch
s1      a_WT      batch1
s2      a_WT      batch1
s3      a_WT      batch1
s4      a_WT      batch1
s5      b_EXP     batch2
s6      b_EXP     batch2
s7      b_EXP     batch2
s8      b_EXP     batch2

$ xpresspipe diffxpress -i test_r/test_dataset.tsv --sample test_r/sample_info.tsv --design Batch+Condition

rRNA Probe¶

Ribosome RNA (rRNA) contamination is common in RNA-seq library preparation. As the bulk of RNA in a cell at any given time is dedicated to rRNA, and as these rRNA sequences are relatively few and therefore highly repeated, depletion of these sequences is often desired in order to have better depth of coverage of non-rRNA sequences. In order to facilitate this depletion, many commercial kits are available that target specific rRNA sequences for depletion, or that enrich mRNA polyA tails. However, and especially in the case of ribosome profiling experiments, where RNA is digested to create ribosome footprints that commercial depletion kits won’t detect and polyA selection kits are inoperable as footprints will not have the requisite polyA sequence. To this end, custom rRNA probes are recommended, and the rrnaProbe sub-module was designed to facilitate this process.
rrnaProbe works by doing the following:
1. Run FASTQC to detect over-represented sequences
2. Collate these sequences to determine consensus fragments
3. Output rank ordered list of over-represented fragments within the appropriate length range to target for depletion
NOTE: BLAST capability to verify over-represented consensus fragments are indeed rRNA sequences is not yet incorporated, so any sequences that will be used as probes should be BLAST-verified first.

$ xpresspipe rrnaProbe --help

Required Arguments	Description
`-i <path>, --input <path>`	Path to zipped FASTQC files
`-o </path/filename>, --output </path/filename>`	Path and file name to write output

Optional Arguments	Description
`--suppress_version_check`	Suppress version checks and other features that require internet access during processing
`-m <value>, --min_overlap <value>`	Minimum number of bases that must match on a side to combine sequences (default: 5)
`--footprint_only`	Only take zip files that are ribosome profiling footprints (file names must contain “FP”, “RPF”, or “FOOTPRINT”)

Example 1: Generate rank-ordered list of over-represented sequences¶

$ xpresspipe rrnaProbe -i riboprof_out/fastqc_out/ -o riboprof_out/sequences.txt --footprint_only

TTGATGATTCATAATAACTTTTCGAATCGCAT    514832
TATAAATCATTTGTATACGACTTAGAT         121739
TTGATGATTCATAATAACTTTTCGAATCGCAT    15776
TTTGATGATTCATAATAACTTTTCGAATCGCAC   33325
ATAAATCATTTGTATACGACTTAGAC          13603