Beginner’s Guide

First Steps

If this is your first time doing any programming, congratulations! You are embarking upon a very rewarding path. As with learning any new spoken language, there is a learning curve associated with learning a computer language. While XPRESSpipe is aimed at reducing the majority, if not (hopefully) all of the overhead associated with processing this data, using this software will still require some effort, just as would learning any new language or laboratory technique.
XPRESSpipe is run through the command line interface (or CLI). This may seem daunting, but several free online courses are available to quickly bring you up to speed on the basics required to use this software. We recommend Codecademy’s CLI course, which should only take a couple of hours (Codecademy estimates ~10 hours, but you probably don’t need to finish the course to be ready to use XPRESSpipe; the purpose is simply to help you become more comfortable with the command line). We also recommend watching the walkthrough videos found on the quickstart page.
Once you’re ready to jump into the command line, we can get rolling! For the steps below, we will assume we are on a Mac operating system and provide examples under that assumption, but this software is compatible with any Linux-like operating system and the syntax is largely the same (sorry, Windows users! However, if you have a newer version of Windows, you may be able to use a Linux-flavored environment).
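If you would like a quick refresher, the handful of shell commands below cover most of what this guide uses (a generic example, not specific to XPRESSpipe):

$ pwd                  # print the directory you are currently in
$ ls                   # list the files and folders in the current directory
$ cd ~/Desktop         # move into the Desktop folder within your home directory
$ mkdir my_new_folder  # create a new folder
$ cd ../               # move back up one directory level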

Install XPRESSpipe

Please refer to the installation documentation or the walkthrough video below:

Note

The pip install . method has been replaced with a script that is executed by running bash install.sh.
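As a rough sketch (see the installation documentation for the authoritative steps), installing the development version from GitHub might look something like the following:

$ git clone https://github.com/XPRESSyourself/XPRESSpipe.git
$ cd XPRESSpipe
$ bash install.sh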

Generate Reference Files

Before we can process our raw RNA-seq data, we need to create a reference directory (in other words, a folder for reference files). In this example, we will be working with human-derived RNA-seq data, so let’s perform the following in the command line:
$ cd ~/Desktop
$ mkdir reference_folder
$ mkdir reference_folder/fasta_files
1. The first command navigates to the Desktop. The ~ symbol is a shortcut for your home (user) directory, and each directory in a path is separated by a /
2. The second command creates a new folder in the Desktop directory called reference_folder
3. The third command creates a new folder in the reference directory for intermediate reference files
Now let’s get the reference files. We’re going to do this directly in the command line, but if you have trouble with this, an alternative is explained afterwards. A quick note: because some lines of code in this guide are a bit long, the \ character is used to indicate that a command continues on the next line. You should not include these characters when executing the command; they just help make the code more readable. We will first write the retrieval commands into a file, which will additionally act as a log of the genome version we are using.

You should modify the variable assignments between the # signs. For GTF_URL, change the URL currently provided to the one appropriate for your organism of interest. Make sure you are downloading the GTF file and NOT the GFF file. For FASTA_URL, do the same with the URL to the chromosome DNA FASTA files, but only copy the URL up to "chromosome" and do not include the chromosome identifier. For CHROMOSOMES, type the chromosome identifiers you want to download between the " characters, with a space between each.

Note

I do not personally recommend using the toplevel genome sequence files. Whenever I have used these, I often run into a memory overload error during genome curation.

$ cd reference_folder/

### Change specific organism file names based on your organism of interest ###
$ echo 'GTF_URL=ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz' >> fetch.sh
$ echo 'FASTA_URL=ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome' >> fetch.sh
$ echo 'CHROMOSOMES="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y"' >> fetch.sh

$ echo 'curl -O $GTF_URL' >> fetch.sh
$ echo 'gzip -d Homo_sapiens.GRCh38.97.gtf.gz' >> fetch.sh
$ echo 'mv Homo_sapiens.GRCh38.97.gtf transcripts.gtf' >> fetch.sh
$ echo 'cd fasta_files/' >> fetch.sh
$ echo 'for X in $CHROMOSOMES; ' >> fetch.sh
$ echo 'do curl -O ${FASTA_URL}.${X}.fa.gz; done' >> fetch.sh
$ echo 'gzip -d *.gz' >> fetch.sh
$ echo 'cd ../' >> fetch.sh
$ bash fetch.sh
Let’s discuss what we just did:

1. We navigated into the reference folder, downloaded the GTF reference file and unzipped it, then navigated to the fasta_files directory to download the raw chromosome FASTA files and unzip them. Finally, we returned to the main reference directory (you can inspect the assembled script as shown below).
2. If this didn’t work, we can navigate to Ensembl to download the relevant data. We need to get the GTF and DNA chromosomal FASTA files for our organism of interest. The link to the chromosome sequence files actually contains more files than we need. We just need the files that start with Homo_sapiens.GRCh38.dna.chromosome. You can download them, move them to the appropriate directories within your reference directory, and unzip the files by double-clicking on them.
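For reference, you can inspect the assembled script with cat fetch.sh; it should contain something like the following:

GTF_URL=ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
FASTA_URL=ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome
CHROMOSOMES="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y"
curl -O $GTF_URL
gzip -d Homo_sapiens.GRCh38.97.gtf.gz
mv Homo_sapiens.GRCh38.97.gtf transcripts.gtf
cd fasta_files/
for X in $CHROMOSOMES;
do curl -O ${FASTA_URL}.${X}.fa.gz; done
gzip -d *.gz
cd ../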

Now we need to curate these reference files into something the sequence alignment software can use. Since we are using ribosome profiling data, we want a reference that will allow us to avoid mapping to the 5’ and 3’ ends of genes. We also don’t want to align to anything but protein-coding genes. Finally, we want to quantify to the longest transcript (this is not required, except in certain cases for downstream analysis compatibility); this last bit just helps the software avoid confusion when a gene has multiple splice variants to choose from. Since this is short-read sequencing (let’s say we were doing 50 bp single-end sequencing), we also want to factor this into the curation of the reference (see the --sjdbOverhang argument below).
$ xpresspipe curateReference \
              --output ./ \
              --fasta fasta_files/ \
              --gtf ./transcripts.gtf \
              --protein_coding \
              --truncate \
              --sjdbOverhang 49

### or ###

$ xpresspipe build

### And then choose the curate option ###
- The truncation option is only necessary when using XPRESSpipe to process ribosome profiling samples and their associated RNA-seq samples.
- If interested in quantifying miRNA, etc, leave out the --protein_coding argument.
- If your reads (single-end) or mates (paired-end) are not 100 bp, you will want to change the --sjdbOverhang argument to the read length minus 1. For example, if we ran 2x100 bp sequencing, we would specify --sjdbOverhang 99 (although in this case, the default of --sjdbOverhang 100 is just fine). If you change this number, remember it for the next steps, as you will need to provide it again.
- This may take awhile, and as we will discuss later, you may want to run these steps on a supercomputer, but this will serve as a preliminary guide for now.
- One final consideration: if we are dealing with an organism with a smaller genome, we will want to provide the --genome_size parameter with the number of nucleotides in the organism’s genome. If you change this parameter in this step, you will need to provide the same parameter and value in the align, riboseq, seRNAseq, and peRNAseq modules (see the sketch after this list).
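For instance, a minimal sketch of the same curation command for a hypothetical small-genome organism (the ~12,000,000-nucleotide genome size below is an assumed value for illustration only) might look like:

$ xpresspipe curateReference \
              --output ./ \
              --fasta fasta_files/ \
              --gtf ./transcripts.gtf \
              --protein_coding \
              --truncate \
              --sjdbOverhang 49 \
              --genome_size 12000000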

Process Raw Sequencing Files

Now let’s get our raw data:
1. Make a new folder called something like raw_data (or whatever you like) and place your data there.
2. Make sure the files follow proper naming conventions (see naming conventions at General Usage)
3. Now let’s process the data
4. Let’s also create a folder called something like output
5. Also, make sure you have handy the 3’ adapter sequence that was used when generating your sequencing library
6. We’ll feed the program the new GTF file that contains only the longest-transcript, protein-coding, truncated references generated in the reference curation step
7. We’ll give the experiment a name and also specify what method of sample normalization we want performed on the count data
8. We also need to specify the --sjdbOverhang amount we fed into the reference curation step, so in this case we will use --sjdbOverhang 49
$ xpresspipe riboseq --input raw_data/ \
                    --output output/ \
                    --reference reference_folder/ \
                    --gtf reference_folder/transcripts_LCT.gtf \
                    --experiment riboseq_test \
                    --adapter CTGTAGGCACCATCAAT \
                    --method RPKM \
                    --sjdbOverhang 49

### or ###

$ xpresspipe build

### And then choose the appropriate pipeline to build ###
If you are running a lot of files, especially for human samples, this may take a lot of time. We recommend running this on some kind of server. A situation like yeast with few samples may be feasible to run on a personal computer, but it will likely also take some time.
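As a quick recap of the setup steps in the numbered list above (the source path below is a placeholder; adjust it to wherever your sequencing files actually live):

$ cd ~/Desktop
$ mkdir raw_data   # place your raw sequencing (FASTQ) files here
$ mkdir output     # pipeline output will be written here
$ mv /path/to/your/sequencing/files/*.fastq raw_data/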

Sequencing Metrics

In your output folder, you will see a file named riboseq_test_multiqc_report.html. This file compiles the statistics from each processing step of the pipeline for each sample file you provided as input. Things like read quality, mapping, and quantification statistics can be found here. Just double-click the file, or execute the following command to open it in your default browser window.
$ open riboseq_test_multiqc_report.html

Library Complexity

Within the complexity directory in your output folder, you will find summary PDFs for all samples processed analyzing library complexity of each sample.

Metagene Analysis

Within the metagene directory in your output folder, you will find summary PDFs for all samples processed analyzing the metagene profile of each sample.

Periodicity (Ribosome Profiling)

Within the periodicity directory in your output folder, you will find summary PDFs for all samples processed analyzing the ribosome periodicity of each sample, using reads 28-30 nt in length.

Count Data and Downstream Analysis

Within the counts directory in your output folder, you will find individual count tables for each sample, as well as compiled tables combining all samples that were processed.
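To take a quick look at these tables from the command line (the file name below is hypothetical; the actual names will depend on your experiment name):

$ ls output/counts/
$ head output/counts/riboseq_test_count_table.tsv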

Supercomputing

Install

Many of the same commands will be performed as above, aside from a couple of key modifications.
1. Navigate to your user home directory on the supercomputer:
$ cd ~
2. Install Anaconda if not already done and follow the prompts given when running the bash script. We recommend letting the installer set up the required PATHS needed for interfacing with Anaconda:
$ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
3. Install the XPRESSpipe package. The following will download the current development version of XPRESSpipe. When installing a specific version of XPRESSpipe, do so as you would from the above instructions. You may need to modify the directory name for the XPRESSpipe files if you do so.
$ git clone https://github.com/XPRESSyourself/XPRESSpipe.git
$ conda env create -f ./XPRESSpipe/requirements.yml
$ conda activate xpresspipe
$ pip install ./XPRESSpipe
4. Let’s test this to make sure everything is operating properly:
$ cd ~
$ xpresspipe test

Run Data

1. Assuming you installed the XPRESSpipe dependencies in a conda environment called xpresspipe, you will use the following as a template. If you named the conda environment something else, you would replace the line conda activate xpresspipe with conda activate env_name. If dependencies were installed to the base environment, the source $(conda... and conda activate ... lines are unnecessary.
2. The commands here are the same as above, but likely the method of execution will be different. A lot of supercomputing clusters manage job submission through a system called SLURM. Each supercomputing cluster should have individualized and tailored instructions for proper usage. We will briefly provide an example of how one would submit a job to a SLURM batch system:
#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH -o /scratch/general/lustre/$USER/slurmjob-%j
#SBATCH --partition=this_cluster_has_no_name

source $(conda info --base)/etc/profile.d/conda.sh
conda activate xpresspipe


#set up the temporary directory
SCRDIR=/scratch/general/lustre/$USER/$SLURM_JOBID
mkdir -p $SCRDIR

# Provide location of raw data and parent reference directory
SRA=/scratch/general/lustre/$USER/files/your_favorite_experiment_goes_here
REF=/scratch/general/lustre/$USER/references/fantastic_creature_reference

# Send raw data to your Scratch directory
mkdir $SCRDIR/input
cp $SRA/*.fastq $SCRDIR/input/.

# Make an output directory
mkdir $SCRDIR/output
cd $SCRDIR/.

xpresspipe riboseq -i $SCRDIR/input \
                   -o $SCRDIR/output/ \
                   -r $REF \
                   --gtf $REF/transcripts_CT.gtf \
                   -e this_is_a_test \
                   -a CTGTAGGCACCATCAAT \
                   --sjdbOverhang 49   # use the same value you provided during reference curation
3. To queue this script into the job pool, you would do the following:
$ sbatch my_batch_script.sh
4. To monitor the progress of your job, execute the following:
$ watch -n1 squeue -u $USER
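You can also follow the job’s log file as it runs (the path mirrors the #SBATCH -o line in the script above; substitute your actual job ID for <JOBID>):

$ tail -f /scratch/general/lustre/$USER/slurmjob-<JOBID>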
After the job is finished, you can export the data as shown in the next section.

Explore the Data

Once the data are finished processing, we can start exploring the output. Explanations of each quality control analysis can be found in the Analysis section of the documentation.

In order to get the data from an HPC to your personal computer, you can use a command like the following:
$ cd ~/Desktop # Or any other location where you want to store and analyze the data
$ scp USERNAME@myCluster.chpc.university.edu:/full/path/to/files/file_name.suffix ./
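If you want to pull down an entire directory (for example, the whole output folder) rather than a single file, scp’s recursive flag works the same way (paths are placeholders, as above):

$ scp -r USERNAME@myCluster.chpc.university.edu:/full/path/to/output/ ./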