Running the Uparse pipeline at the UPPMAX cluster

NOTE: I am no longer supporting this pipeline. The programs are still installed and things might work, but there's no guarantee. I have moved my analyses to read correction with the DADA2 package; see Running DADA2 at the UPPMAX cluster for documentation on how to run that.

Before starting, please read the official Uparse documentation as this page is more a cookbook recipe than a scientific discussion.

The instructions below are meant to be performed one at a time. Usually you must wait for one SLURM job to finish before starting the next.

NOTE: In the code examples below (grey boxes), any text starting with a hash sign ("#") is only a comment and does not need to be typed.

Setting up the project

Environment: PATH and other things

Make sure you have the /proj/b2011210/dlbin directory in your PATH variable:

 export PATH=$PATH:/proj/b2011210/dlbin

(Set PATH like above in your .bash_profile, and see setting required environment variables, but note that only the above directory is necessary. You do not need to set R_LIBS_USER, but the module load bioinfo-tools line is handy if you haven't already included it. You can test that it worked by calling which usearch instead of which ampliconnoise.)

Below I have used some further environment variables in places where your input will differ from the general case. You should set them similar to below in your .bash_profile or a project specific .screenrc file.

 export uproject=bnnn
 export email=my.email@nn.se
 export prefix=_projshort_:

uproject is used to specify your UPPMAX project name in SLURM job submissions and elsewhere, email is your email address, and prefix is a string that will be prepended to job names. The prefix is only useful if you're running several projects in parallel; if you don't set it, job names simply won't get a prefix.

(Consider using screen, a window manager for text terminals. In a .screenrc file you can use the setenv command to set project specific variables.)
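A project specific screen configuration could look like this (a sketch, using the placeholder values from above):

 # .screenrc.myproject -- start screen with: screen -c .screenrc.myproject
 setenv uproject bnnn
 setenv email my.email@nn.se
 setenv prefix myproject: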

Project directory

Create a project directory, preferably under your UPPMAX /proj/$uproject/nobackup/ directory. Something like this perhaps:

 $ mkdir /proj/$uproject/nobackup/projects/my_project/2013_MiSeq_16S
 $ cd /proj/$uproject/nobackup/projects/my_project/2013_MiSeq_16S

Makefile

The program calls in this manual are controlled by a program called make. To be able to execute them you have to create a file called "Makefile" (N.B. the capital M) with the following content:

 include /proj/b2011210/nobackup/dllib/make/makefile.uparse

(If you want to follow these guidelines on your own computer, you can download the above library makefile from UPPMAX (https://export.uppmax.uu.se/b2011210/makefile.uparse); you can place it anywhere you wish, as long as you specify the correct path in the include line above. If you're running usearch with a version number less than 7.0.1003 you must also rename your usearch binary to e.g. usearch-v7.0.1001_i86linux32 and download my wrapper usearch script (https://export.uppmax.uu.se/b2011210/usearch). The binary and the wrapper script must be placed in the same directory, and that directory must be in your PATH (perhaps ~/bin; $ echo $PATH to check). Make sure to make the script executable ($ chmod +x usearch).)
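Spelled out, the setup on your own machine might look like this (a sketch; paths and the binary name are examples you should adapt):

 $ mkdir -p ~/bin && cd ~/bin                           # a directory in your PATH
 $ mv /path/to/usearch usearch-v7.0.1001_i86linux32     # rename your usearch binary
 $ wget https://export.uppmax.uu.se/b2011210/usearch    # my wrapper script
 $ chmod +x usearch
 $ wget https://export.uppmax.uu.se/b2011210/makefile.uparse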

Data files

Usearch requires uncompressed fastq files.

To use the tools described here, you need to rename your files so that the file containing read 1 ends with .r1.fastq and the file containing read 2 with .r2.fastq.

You can either do this manually or by setting up a file with your naming scheme. The main advantage of the latter is that you record which original file was used for which sample.

Sequence prefix

The fastq files created by Illumina sequencing have a sequence identifier prefix. This used to be 'HWI'; now it's 'MISEQ'.

In order to get the statistics right, check the first line of one of your files using the head command:

 $ head -n 1 sample0.fastq
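The output is the header line of the first read. On an older instrument it might look something like this (a hypothetical example; identifiers vary by run):

 @HWI-ST1234:86:C01ABACXX:1:1101:1234:2180 1:N:0:TCGCCTTA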

If it starts with '@MISEQ' you're fine; if it doesn't, your statistics files will be wrong. Fix that by adding the below line (or similar, depending on the output above) after the include line in your Makefile:

 SEQ_PREFIX = HWI

Using a name scheme file

Start by listing your INBOX directory with the original fastq files using this command (note that the command will overwrite the sample_list file, so do not run this if you or a colleague already created the file):

 $ ls /proj/$uproject/INBOX/_your_illumina_directory_/fastq/*.fastq* | sed 's/$/:/' > sample_list

Now you have to edit the file using your favourite editor -- e.g. nano, vim or emacs -- by adding the actual sample name plus .r1.fastq or .r2.fastq after the colon on each line. Make sure not to have any spaces in the file, not after the colon nor in the sample file names. The finished file will look something like this:

 /proj/b2013167/INBOX/130814_M00275_0128_000000000-A5BEF/fastq/P478_201_TCGCCTTA-TAGATCGC_L001_R1_001.fastq.gz:P478_201.r1.fastq 
 /proj/b2013167/INBOX/130814_M00275_0128_000000000-A5BEF/fastq/P478_201_TCGCCTTA-TAGATCGC_L001_R2_001.fastq.gz:P478_201.r2.fastq 
 /proj/b2013167/INBOX/130814_M00275_0128_000000000-A5BEF/fastq/P478_202_TCGCCTTA-CTCTCTAT_L001_R1_001.fastq.gz:P478_202.r1.fastq 
 /proj/b2013167/INBOX/130814_M00275_0128_000000000-A5BEF/fastq/P478_202_TCGCCTTA-CTCTCTAT_L001_R2_001.fastq.gz:P478_202.r2.fastq 

When your file is created you can do the copying using the make copy_samples command.

If your original files are compressed (the INBOX file ends with .gz), this is best done as a SLURM job:

 $ sbatch -J ${prefix}copy -A $uproject -t 12:00:00 -p core -n 4 --wrap "make -j 4 copy_samples" --mail-type=ALL --mail-user=$email

Now you have to wait for your job to run. If you provided a correct email address above, you will be notified by email when your job starts, (hopefully not) fails and ends.

See monitoring SLURM jobs.

If your original files are not compressed, just:

 $ make copy_samples

Manual copying of data

If your data files are compressed, do something like this:

 # Set a variable so the following commands get shorter
 $ inbox=/proj/$uproject/INBOX/_your_illumina_directory_/fastq/
 
 # Copy the content of read 1 and read 2 to their respective uncompressed files for sample00
 $ for i in 1 2; do gunzip -c $inbox/P478_217_TTCTGCCT-GTAAGGAG_L001_R${i}_001.fastq.gz > sample00.r$i.fastq; done

If your files are not compressed, replace the last command with:

 $ for i in 1 2; do cp $inbox/P478_217_TTCTGCCT-GTAAGGAG_L001_R${i}_001.fastq sample00.r$i.fastq; done

Stripping barcodes and trimming random bases

SciLifeLab appears to strip barcodes from the sequences, but check this if you're uncertain.

You may have a couple of random bases at the start of each r1 sequence, which must be trimmed away before proceeding. To do so add a line like below to your Makefile:

 FIRST_REAL_BASE = 5

The above example will trim four bases off the start of each 5' read, i.e. everything before base 5. The old file, with untrimmed sequences, will remain in the directory with a .untrimmed suffix.
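Under the hood, the make target uses the Fastx toolkit; conceptually it does something like this for each r1 file (a sketch, not the exact recipe from the library makefile):

 # Back up the original, then keep bases from position FIRST_REAL_BASE onwards
 $ mv sample0.r1.fastq sample0.r1.fastq.untrimmed
 $ fastx_trimmer -Q33 -f 5 -i sample0.r1.fastq.untrimmed -o sample0.r1.fastq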

To run trimming you must first load the Fastx module (unless you have already done so), then run the make command:

 $ module load bioinfo-tools
 $ module load Fastx
 $ make trim_all_r1

Make sure you only run the above command once. A backup of the original file is taken, and it will be overwritten if you run the command twice. The backup files end with .untrimmed. If you need to run the command again, you must move the original file back first. Do the same if you interrupted the command while it was running.

Merging paired reads

Merging pairs means that two overlapping reads -- one from the 5' end of the fragment, the other from the 3' end -- are merged into one consecutive sequence. Each pair is aligned and a new sequence is estimated for the overlap based on nucleotide sequence and quality ("Phred") score. Each nucleotide gets a new quality score, reflecting the consensus or conflict between the two mates of the pair together with their original quality scores.

During merging, the quality of the merged region can be used to filter sequences using the options of the fastq_mergepairs mode of usearch. In particular I recommend (following the author's advice) the fastq_truncqual option, but you should also consider the fastq_maxdiffs option.

What parameters to choose depends on the length of the overlap of the pair of sequences. A 2x250 b run will give you a much shorter overlap for a typical V3V4 16S rRNA library than a 2x300 run. Hence, you should expect more mismatches in the latter case. I recommend you first try strict parameters (the default) and relax them if you find you get too few sequences. Try first to add a -fastq_maxdiffs 5 option (accepts five mismatches in the merged portion of the sequence).
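For reference, the merge target wraps a usearch call per sample roughly like this (a sketch; the actual command and options come from the library makefile):

 $ usearch -fastq_mergepairs sample0.r1.fastq -reverse sample0.r2.fastq -fastq_truncqual 2 -fastq_maxdiffs 5 -fastqout sample0.merged.fastq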

You set options by overriding the MERGE_OPTIONS in your Makefile. To view what the macro is set to by default:

 $ grep "^MERGE_OPTIONS" /proj/b2011210/nobackup/dllib/make/makefile.uparse

To set fastq_truncqual to 2 and allow only 3 mismatches, add the below line to your Makefile after the line starting with include:

 MERGE_OPTIONS = -fastq_truncqual 2 -fastq_maxdiffs 3

Once you have set options (or have accepted the default), merging of paired reads is done with the make merge_all command. It's kind not to run this on the login node, so submit a SLURM job like this:

 $ sbatch -J ${prefix}merge -A $uproject -t 12:00:00 -p core --wrap "make merge_all" --mail-type=ALL --mail-user=$email

If you provided a correct email address above, you will get emails when the job starts and finishes.

If you have many samples, a parallel node job may be a better choice (assuming 16 compute cores (as in milou) per node):

 $ sbatch -J ${prefix}merge -A $uproject -t 12:00:00 -p node --wrap "make -j 16 merge_all" --mail-type=ALL --mail-user=$email

No or very few merged pairs

Check the number of sequences per sample before and after merging:

 $ make merge_stats.tsv

In case you got no or very few sequences from merging pairs, maybe the criteria set in MERGE_OPTIONS were too strict given the sequence quality you have. In this case, set MERGE_OPTIONS to something less strict and rerun. Note, however, that you need to remove the merged files before you run the merge_all make target. You can choose to remove all merged files, or only those that did not produce a sufficient number of sequences. Be aware that if you choose the latter approach, sequences from different samples will have been merged at different stringencies.

 $ rm *.merged.fastq          # Remove all merged files
 $ rm sm.merged.fastq         # Remove only sample sm

Quality filtering

Since we get a lot of sequences from a MiSeq run and thus can afford to discard many, my advice is to use quite stringent quality filtering parameters.

The author of Uparse recommends using a cutoff on the expected number of errors per read as the quality check. The expected number of errors is a probabilistic estimate of how many nucleotides are wrong in the read. You filter out sequences on maximum number of expected errors using the -fastq_maxee option. (I tried a few different values for the -fastq_maxee option: setting it to 0 removed almost all sequences (expected, given the typical error frequency of around one percent in Illumina sequences), a value of 1 removed almost 15%, while 2 and 3 removed less than 5%.)
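The expected number of errors is the sum of the per-base error probabilities, E = sum over all bases of 10^(-Q/10), where Q is the Phred score of the base; a 450 bp read with Q = 20 (1% error probability) throughout thus has E = 4.5. You can compute E for a single quality string like this (a sketch, assuming Phred+33 encoded qualities):

 $ echo 'IIIIIIII55#' | perl -nle '$ee += 10**((33-ord)/10) for split //; print $ee'   # prints ~0.65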

Furthermore, it's advisable to delete sequences that are very short, using the -fastq_minlen option. If you are uncertain about what length to use, the procedure to calculate quality statistics described below is advisable.

Quality statistics

This is an optional step that provides information about sequence quality in your samples. So, if you wish, run:

 $ sbatch -J ${prefix}fastqstats -A $uproject -t 12:00:00 -p core --wrap "make merged_stats" --mail-type=ALL --mail-user=$email

When the above job is finished, you can browse the quality of the sequences in individual samples in files ending with .fastqstats.

Filtering

Setting filtering parameters in the Makefile

I have set default quality filter parameters in the library makefile. You can see the default using this command:

 $ grep "^QC_PARAMETERS" /proj/b2011210/nobackup/dllib/make/makefile.uparse

See the documentation for fastq_filter for options you can use.

To change the default, edit your Makefile by adding something similar to the default after the line starting with include. For instance, if you want to allow three expected errors per read and a minimum length of 450, edit your Makefile to look like this:

 include /proj/b2011210/nobackup/dllib/make/makefile.uparse
 
 QC_PARAMETERS = -fastq_minlen 450 -fastq_maxee 3

You can check that you got the parameters right by issuing this make command (the -n instructs make to just echo the commands it would run to standard out):

 $ make -n qc_all
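With the settings above, the echoed commands should include a usearch call per sample roughly of this form (a sketch):

 $ usearch -fastq_filter sample0.merged.fastq -fastq_minlen 450 -fastq_maxee 3 -fastaout sample0.merged.qc.fasta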

Running the quality filtering

Run the below sbatch command and wait for the job to start and complete:

 $ sbatch -J ${prefix}qcfilter -A $uproject -t 12:00:00 -p core --wrap "make qc_all" --mail-type=ALL --mail-user=$email

Summary statistics of sequences after QC

After the quality control you can run the below command to create a tab separated file with summary statistics of the number of sequences per sample, before and after QC, as well as the proportion of discarded sequences.

 $ make qc_sum_stats.tsv
 $ less qc_sum_stats.tsv

OTU clustering

Sample concatenation, dereplication and sorting

NOTE: Unless you want to understand what you're doing by running the individual steps, you can proceed directly to the subsection "Sorting"; the command there runs all required steps in sequence.

For most projects it's advisable to cluster all samples together.

If your project only consists of samples from a single MiSeq run and all fasta files with merged quality filtered sequences are thus found in the same directory, you can just continue to the "Clustering" section.

If you want to cluster samples from several projects or MiSeq runs together, you need to copy (or, preferably, symlink) the individual sample fasta files into a new directory and create a Makefile like above.

Concatenation

Concatenate all fasta files of merged, quality filtered sequences using this command:

 $ make all_samples.merged.qc.fasta

Dereplication

Dereplication aims to remove identical sequences from the file created above and will likely reduce running time for the clustering substantially:

 $ sbatch -J ${prefix}derep -A $uproject -t 12:00:00 -p core --wrap "make all_samples.merged.qc.derep.fasta" --mail-type=ALL --mail-user=$email

If the above job is cancelled due to memory overflow, add the below line to your Makefile:

 USEARCH64BIT = 64

Then remove the makefile.otu_clustering file and submit like above, but with -n 2:

 $ rm makefile.otu_clustering
 $ sbatch -J ${prefix}derep -A $uproject -t 12:00:00 -p core --wrap "make all_samples.merged.qc.derep.fasta" --mail-type=ALL --mail-user=$email -n 2

Increase the number after -n if this job is also cancelled.
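For reference, the all_samples.merged.qc.derep.fasta target corresponds roughly to the following usearch call (a sketch; -sizeout records how many times each unique sequence was seen, which the sorting step below relies on):

 $ usearch -derep_fulllength all_samples.merged.qc.fasta -sizeout -output all_samples.merged.qc.derep.fasta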

Sorting

Usearch reads the fasta file in order, assigning each sequence to a cluster as it encounters it. Since a sequence that has been read several times is more likely to be correct, starting with the most frequent sequences and working our way down increases the chances that the OTUs found are real. We will therefore sort sequences from most frequent to least frequent.

At this point you can also throw away sequences that haven't been found a minimum number of times. By default, singletons are deleted in this step. If you do not want to throw away singleton sequences, or if you wish to use a threshold other than 2, edit your Makefile to include one of the following lines:

 # No minimum size
 MINSIZE =
 
 # Minimum 5 occurrences per sequence
 MINSIZE = -minsize 5
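For reference, the sorting target wraps a usearch call roughly like this (a sketch; with the default MINSIZE, sequences seen fewer than two times are dropped):

 $ usearch -sortbysize all_samples.merged.qc.derep.fasta -minsize 2 -output all_samples.merged.qc.derep.sorted.fasta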

Now you can run this command and wait for it to finish (as mentioned above, you can run this command directly if you wish, it will run the concatenation and the dereplication if needed):

 $ sbatch -J ${prefix}sort -A $uproject -t 12:00:00 -p core --wrap "make all_samples.merged.qc.derep.sorted.fasta" --mail-type=ALL --mail-user=$email

To see how many unique sequences you got after dereplication and singleton removal, run:

 $ grep -c '>' all_samples.merged.qc.fasta all_samples.merged.qc.derep.fasta all_samples.merged.qc.derep.sorted.fasta

Clustering

OTU clustering using usearch uses a heuristic algorithm to define clusters of sequences where each sequence in a cluster differs by at most a maximum proportion of nucleotides from the centroid sequence, also known as the seed sequence of the cluster. Traditionally, OTUs have been defined as all sequences 97% identical to each other, which can be interpreted as meaning that no pair of sequences in a cluster differs by more than 3% (complete linkage clustering). How Usearch's heuristic, centroid-based approach compares with complete linkage clusters is difficult to determine in detail, but as an approximation the distance from the centre of a cluster to the perimeter (the radius) should be half the longest distance (the diameter). One could thus argue for forming usearch OTUs at the 98.5% identity level -- or a radius of 1.5% as it is specified in usearch -- for comparison with traditional 97% OTUs.

For practical purposes, you will probably try different levels, and I have prepared for that. In the Makefile you can set the RADIUSES macro to a list of radiuses (i.e. maximum distances from the seed sequence, in percent) and run clustering for all of them at once. There is of course a default for the RADIUSES macro:

 $ grep "^RADIUSES" /proj/b2011210/nobackup/dllib/make/makefile.uparse

If you want other levels, type a line like below after the include line:

 RADIUSES = 0.5 1.0 1.5 2.0

and regenerate the separate makefile for OTU clustering:

 $ rm -f makefile.otu_clustering ; make makefile.otu_clustering

Run the actual clustering as a SLURM job:

 $ sbatch -J ${prefix}cluster -A $uproject -t 24:00:00 -p core --wrap "make all_otu_levels_fasta" --mail-type=ALL --mail-user=$email

The process will create files named e.g. all_samples.merged.qc.derep.sorted.otus_r2.0.fasta containing the seed sequences for each cluster. The names of sequences have the form OTU_000091.
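Each clustering level corresponds roughly to a usearch call like this (a sketch for a 2.0% radius):

 $ usearch -cluster_otus all_samples.merged.qc.derep.sorted.fasta -otu_radius_pct 2.0 -otus all_samples.merged.qc.derep.sorted.otus_r2.0.fasta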

Mapping original sequences against OTUs

Since we ran the clustering only on unique sequences and only output seed sequences in the clustering step, we do not at this stage know which sequences in each sample belong to which OTU. To get this information, we need to map back individual sequences from samples to the clusters:

 $ sbatch -J ${prefix}mapback -A $uproject -t 24:00:00 -p core --wrap "make all_mapback_samples" --mail-type=ALL --mail-user=$email

The above will create subdirectories, one for each clustering level (with names ending with .mapback.d), with each sample mapped against the seed sequences of the clusters.
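Under the hood, each sample is (presumably) mapped with usearch_global at an identity threshold matching the cluster radius, roughly like this for a 2.0% radius, i.e. 98% identity (file names are illustrative):

 $ usearch -usearch_global sample0.merged.qc.fasta -db all_samples.merged.qc.derep.sorted.otus_r2.0.fasta -strand plus -id 0.98 -uc all_samples.merged.qc.derep.sorted.otus_r2.0.mapback.d/sample0.uc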

Creating OTU tables

The mapping step above created one cluster file for each sample and cluster level, placed in a subdirectory named after the cluster level. Run this command to list the directories you have:

 $ ls -ld *.mapback.d

The mapping files end with .uc:

 $ ls -l all_samples.merged.qc.derep.sorted.otus_r2.0.mapback.d

The .uc files can be converted into OTU tables, i.e. tab separated files containing OTUs as rows and samples as columns:

 $ make all_otutables
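The resulting files have one row per OTU and one column per sample, e.g. (hypothetical counts):

 OTU         P478_201  P478_202
 OTU_000001  152       98
 OTU_000002  17        0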

Mapping statistics

Once the above step is done, you can create tab separated statistics files using the following command:

 $ make all_otusums

One file for each clustering level will be output. Filenames will end with .otusums.tsv. View the contents using less:

 $ less all_samples.merged.qc.derep.sorted.otus_r1.5.otusums.tsv

Subsampling

Subsampling for alpha diversity measurements is easily done in Explicet with bootstrapping; see "Using Explicet" below.

Taxonomic identification

Statistical analyses of communities are usually based on OTUs as defined above, but in most cases it is also important to get an estimate of the taxonomic affiliation of the OTUs. Typically this is done by comparing the seed sequence with databases such as Silva, RDP or Greengenes. To classify your OTUs you need to run a program -- BLAST or another similarity search -- with the fasta file of seed sequences against the database of your choice. After the search you sometimes need to run a program that produces a classification using a set of good similarity hits as input.

I have setup some tools for common use, see Taxonomic identification of clusters.

After running any of the above you need to create symbolic links ending with .taxonomy to the created tables. For instance, if you ran SINA:

 $ for f in *.silvataxonomy; do nf=`echo $f|sed 's/sid_...//'`; ln -s $f `basename $nf .silvataxonomy`.taxonomy; done

After this you can create OTU tables with taxonomy information:

 $ make all_otutaxtables

Tips for R-based statistical analyses

The page Using R to analyse data from the AmpliconNoise pipeline contains some canned analyses and tips on how to perform simple analyses in R using your OTU clustered data.

Despite its name, it is suitable for Uparse clustered data.

Using Explicet

A tool called Explicet seems to be a nice graphical client for OTU analysis.

You can turn all your otutaxtables into files formatted for Explicet import (format 5 in the handbook, "OTU Table Counts") using this command:

 $ make all_explicets

I have written a small Explicet tutorial suited for data generated with the protocol described here.

Author

--Daniel Lundin 12:41, 22 January 2014 (CET)

Parts of the text were copied, with permission, from Luisa Hugerth, SciLifeLab.