Analysis¶
Differential Expression Analysis¶
Note
If intending to use the diffxpress
sub-module, you need to have used --quantification_method htseq
during read quantification as DESeq2 requires integer count data.
Sample Factor Files¶
sample_info
file. For example, if you wereevaluating a experimental vs control experiment for RNA-sequencing, you would provide a sample_info
file as follows:
**sample_info.txt**
Sample Condition
s1_rna a_WT
s2_rna a_WT
s3_rna b_EXP
s4_rna b_EXP
sample_info
file must be first alphabetically. In the case provided above, we want to compare the experimental condition VS the wild-type control condition, however these labels are not alphabetical. In this case, you can append letters to the beginning to force alphabetical order. For example, if you performed a experiment
vs wild-type
experiment, you would need to use the labels b_experiment
vs a_wild-type
to force a b_experiment
/ a_wild-type
comparison.sample_info
file. Since we want to perform another comparison with the footprint vs RNA-sequencing samples, we need to again ensure that these labels for this “Type” factor are listed in the correct alphabetical order to ensure we are performing a footprint VS RNA-sequencing comparison to reflect translation efficiency.**sample_info.txt**
Sample Condition Type
s1_fp a_WT RPF
s1_rna a_WT RNA
s2_fp a_WT RPF
s2_rna a_WT RNA
s3_fp b_EXP RPF
s3_rna b_EXP RNA
s4_fp b_EXP RPF
s4_rna b_EXP RNA
Note
As stated in the DESeq2 documentation: With no additional arguments to results, the log2 fold change and Wald test p value will be for the last variable in the design formula, and if this is a factor, the comparison will be the last level of this variable over the reference level (see previous note on factor levels). However, the order of the variables of the design do not matter so long as the user specifies the comparison to build a results table for, using the name or contrast arguments of results.
Type+Condition+Type:Condition
where we are interesting in testing the difference in translation efficiencies between conditions, the interaction Type:Condition
is the coefficient being tested for differential expression. The model as designed will also account for differences only seen in the Type
or Condition
co-variates alone (for example, a specific bias to ribosome footprints vs. mRNA fragments).Arguments¶
$ xpresspipe diffxpress --help
Required Arguments | Description |
---|---|
-i <path/filename.tsv> , --input <path/filename.tsv> |
Path and file name of expression counts matrix |
-s <path/filename.tsv> , --sample <path/filename.tsv> |
Path and file name of sample information matrix |
--design <formula> |
Design formula for differential expression analysis (spaces in command line are conserved in input string. DO NOT INCLUDE ~ OR SPACES IN FORMULA IN COMMAND LINE, will be automatically added) |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
--shrink |
Provide argument to perform shrinkage of effect size on log fold changes. Useful for visualization and ranking of hits |
Example 1: Analyze ribosome profiling data¶
Condition
and Type
factor columns in the sample_info
file. If we want to include the RPF
/ RNA
comparison to account for translation efficiency, we would need to include these factor label as a column to ensure the appropriate RPF
/ RNA
evaluation. To perform a comparison between Tm-treated and Untreated cells, we will provide the TM
and UNTR
labels for the Condition
factor. With the provided design formula used below, we will be calculating:**tm_counts.tsv**
ribo_untr_a ribo_untr_b ribo_tm_a ribo_tm_b untr_a_hek untr_b_hek tm_a_hek tm_b_hek
A1BG 29 43 21 11 67 73 56 85
A2M 3 5 2 2 73 57 32 37
AAAS 1441 1981 934 601 1144 1067 1012 1124
AACS 575 727 310 192 351 335 220 291
AADAT 98 120 51 29 322 315 192 292
**tm_deseq.txt**
Sample Condition Type
untr_a_hek UNTR RNA
untr_b_hek UNTR RNA
ribo_untr_a UNTR RPF
ribo_untr_b UNTR RPF
tm_a_hek TM RNA
tm_b_hek TM RNA
ribo_tm_a TM RPF
ribo_tm_b TM RPF
$ xpresspipe diffxpress -i counts_data.tsv --sample sample_info.txt --design Type+Condition+Type:Condition
TM
vs UNTR
and RPF
(footprints) vs RNA
.**tm_counts_diffx.tsv**
baseMean log2FoldChange lfcSE stat pvalue padj
ATF4 3283.072674 2.542784311 0.134284453 18.93580577 5.78E-80 5.03E-76
PTP4A1 460.6444433 2.473962772 0.185061193 13.36834986 9.26E-41 4.03E-37
SPEN 7902.554413 1.192124338 0.109445545 10.89239713 1.25E-27 3.63E-24
RPS15A 1823.967865 -1.391099082 0.152069954 -9.147757652 5.81E-20 1.26E-16
DYNC1H1 11985.60418 0.85282198 0.094425503 9.031691164 1.69E-19 2.56E-16
log2FoldChange
and padj
columns. From this output, we see that ATF4 is the most significantly upregulated gene by translation efficiency between the TM and UNTR conditions, which is what we expect (see the XPRESSyourself manuscript for further discussion of this example). Further explanations of the other columns of this output can be found in the DESeq2 documentation.Example 2: Analyze RNA-seq data¶
EXP
vs WT
. To ensure this comparison if performed correctly, we need to force these Condition
factor labels to be alphabetical. We will thus rename them b_EXP
and a_WT
and do the following:**expression_counts.tsv**
s1 s2 s3 s4 ...
ENSG00000227232 66 59 1 82 ...
ENSG00000240361 35 0 7 72 ...
ENSG00000238009 20 70 85 78 ...
ENSG00000241860 96 7 93 38 ...
ENSG00000187634 73 41 92 77 ...
**sample_info.tsv**
Sample Condition
s1 a_WT
s2 a_WT
s3 a_WT
s4 a_WT
s5 b_EXP
s6 b_EXP
s7 b_EXP
s8 b_EXP
$ xpresspipe diffxpress -i test_r/test_dataset.tsv --sample test_r/sample_info.tsv --design Condition
Example 3: Analyze RNA-seq data that was prepared in different batches¶
Batch
factor column and provide different batch labels. This example below will control for batch effect and compare EXP
vs WT
expression.**expression_counts.tsv**
s1 s2 s3 s4 ...
ENSG00000227232 66 59 1 82 ...
ENSG00000240361 35 0 7 72 ...
ENSG00000238009 20 70 85 78 ...
ENSG00000241860 96 7 93 38 ...
ENSG00000187634 73 41 92 77 ...
**sample_info.tsv**
Sample Condition Batch
s1 a_WT batch1
s2 a_WT batch1
s3 a_WT batch1
s4 a_WT batch1
s5 b_EXP batch2
s6 b_EXP batch2
s7 b_EXP batch2
s8 b_EXP batch2
$ xpresspipe diffxpress -i test_r/test_dataset.tsv --sample test_r/sample_info.tsv --design Batch+Condition
rRNA Probe¶
rrnaProbe
sub-module was designed to facilitate this process.rrnaProbe
works by doing the following:$ xpresspipe rrnaProbe --help
Required Arguments | Description |
---|---|
-i <path>, --input <path> |
Path to zipped FASTQC files |
-o </path/filename>, --output </path/filename> |
Path and file name to write output |
Optional Arguments | Description |
---|---|
--suppress_version_check |
Suppress version checks and other features that require internet access during processing |
-m <value>, --min_overlap <value> |
Minimum number of bases that must match on a side to combine sequences (default: 5) |
--footprint_only |
Only take zip files that are ribosome profiling footprints (file names must contain “FP”, “RPF”, or “FOOTPRINT”) |
Example 1: Generate rank-ordered list of over-represented sequences¶
$ xpresspipe rrnaProbe -i riboprof_out/fastqc_out/ -o riboprof_out/sequences.txt --footprint_only
TTGATGATTCATAATAACTTTTCGAATCGCAT 514832
TATAAATCATTTGTATACGACTTAGAT 121739
TTGATGATTCATAATAACTTTTCGAATCGCAT 15776
TTTGATGATTCATAATAACTTTTCGAATCGCAC 33325
ATAAATCATTTGTATACGACTTAGAC 13603