Thursday, February 28, 2013

PandaSeq to QIIME

Recently I have been working with some paired-end amplicon MiSeq data. One important step with paired-end amplicon data is assembling each pair of reads into a single composite sequence. PandaSeq is a good program for doing this, but when it assembles two reads, it seems to keep only the portion of the identifier shared between the two reads as the identifier for the assembled read. This becomes problematic for me when performing downstream analyses with QIIME. Therefore, I wrote the following simple Perl script to make the identifiers of the PandaSeq assembly file jibe with the identifiers in the corresponding indexing reads file generated after the MiSeq run (so they can both be processed together using QIIME):


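#!/usr/bin/perl
use strict;
use warnings;

# The input FASTQ file is the first command-line argument.
my $filename = $ARGV[0] or die "Usage: perl $0 assembly.fastq\n";
open(my $in, '<', $filename) or die "Cannot open $filename: $!\n";

# Name the output after the input, replacing the last extension with .fixed.fastq.
(my $base = $filename) =~ s/\.[^.]*$//;
open(my $out, '>', "$base.fixed.fastq") or die "Cannot write $base.fixed.fastq: $!\n";

# Rewrite each matching identifier line; copy every other line unchanged.
while (<$in>)
{
    if ($_ =~ /^\@(\w*\-\w*\:\d*\:\w*\-\w*\:\d*\:\d*\:\d*\:\d*)\:/)
    {
        print $out "\@$1 2:N:0:\n";
    }
    else
    {
        print $out $_;
    }
}

close $in;
close $out;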


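The above text can be copied into a file (here called script.pl) and then invoked with the following command. The output is written to a new file named after the input, with the extension replaced by .fixed.fastq (assembly.fixed.fastq in this case):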
perl script.pl assembly.fastq
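
To make the effect of the substitution concrete, here is a small Python sketch of the same rewrite applied to a single line. The regular expression is a transcription of the one in the Perl script; the sample identifier is made up for illustration and is only my guess at what a PandaSeq identifier of this format looks like.

```python
import re

# Python transcription of the pattern in the Perl script above.
HEADER = re.compile(r"^@(\w*-\w*:\d*:\w*-\w*:\d*:\d*:\d*:\d*):")


def fix_header(line):
    """Return the rewritten header, or the line unchanged if it does not match."""
    match = HEADER.match(line)
    if match:
        return "@" + match.group(1) + " 2:N:0:\n"
    return line


# Made-up identifier (an assumption, not taken from real data).
sample = "@HWI-M00123:45:000000000-A1B2C:1:1101:15589:1332:\n"
print(fix_header(sample), end="")

# Sequence and quality lines do not match and pass through untouched.
print(fix_header("ACGTACGT\n"), end="")
```

Everything up to the last coordinate is kept, and whatever followed it is replaced with " 2:N:0:", which is the form the script assumes the index reads use.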

Note that these instructions presuppose that you have three .fastq files from your paired-end MiSeq run:
01 - Reads from one end of each amplicon
02 - Index reads
03 - Reads from the other end of each amplicon
These files are generated by default when making .fastq files with some Illumina software, but sometimes making all three files (notably the indexing read file) requires specifying certain parameters. As of version 1.6.0 of QIIME, the group of indexing reads must be entered as a separate file if these data are to be properly integrated into the QIIME workflow.

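One final issue that arises once reads have been assembled is that there are now fewer reads in the assembly file than there are in the index file, since not every read pair ends up in the assembly. This can be remedied by making a barcode file with only the entries associated with sequences in the PandaSeq assembled data set. I use the following two commands (after running the above Perl script) to take care of this issue. The first pulls out every fourth line of the fixed assembly file (the identifier lines) and strips the leading @; the second keeps only the index reads whose identifiers are in that list. Note that the 1~4 address is a GNU sed extension, so the sed command may need adjusting on systems with a different sed: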
sed -n '1~4'p assembly.fixed.fastq | sed 's/^@//g' > defs_in_assembly.txt
filter_fasta.py -f SampleID_NoIndex_L001_R2_001.fastq -o index_reads_filtered.fastq -s defs_in_assembly.txt
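
For readers who want to see what this filtering step amounts to, here is a minimal Python sketch of the idea. It is not how filter_fasta.py is implemented; the function name and the sample records are made up, and it assumes strict four-line FASTQ records and an exact match on the whole identifier line.

```python
def filter_index_reads(index_lines, wanted_ids):
    """Keep the 4-line FASTQ records whose header (without '@') is in wanted_ids."""
    wanted = set(wanted_ids)
    lines = [line.rstrip("\n") for line in index_lines]
    kept = []
    # Step through the file one 4-line record at a time.
    for start in range(0, len(lines) - 3, 4):
        header = lines[start]
        if header.startswith("@") and header[1:] in wanted:
            kept.extend(lines[start:start + 4])
    return kept


# Made-up index reads: three records, of which only two were assembled.
index_lines = [
    "@HWI-M1:1:000-AAA:1:1101:10:20 2:N:0:", "ACGTAC", "+", "IIIIII",
    "@HWI-M1:1:000-AAA:1:1101:11:21 2:N:0:", "TTGACA", "+", "IIIIII",
    "@HWI-M1:1:000-AAA:1:1101:12:22 2:N:0:", "GGCATT", "+", "IIIIII",
]
wanted_ids = [
    "HWI-M1:1:000-AAA:1:1101:10:20 2:N:0:",
    "HWI-M1:1:000-AAA:1:1101:12:22 2:N:0:",
]
for line in filter_index_reads(index_lines, wanted_ids):
    print(line)
```

The records come out in the order they appear in the index file, not in the order of the identifier list.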

I then do a final check to see if the entries in the index file and the assembly file are truly the same:
sed -n '1~4'p index_reads_filtered.fastq | sed 's/^@//g' > index_defs_filtered.txt
diff -s index_defs_filtered.txt defs_in_assembly.txt
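
The same check can be sketched in Python, which makes it explicit that the comparison is sensitive to order as well as content. The function name and the sample lines are made up for illustration, and four-line FASTQ records are assumed.

```python
def fastq_headers(lines):
    """Return every fourth line, starting at the first, without its leading '@'."""
    headers = []
    for line in lines[0::4]:
        line = line.rstrip("\n")
        headers.append(line[1:] if line.startswith("@") else line)
    return headers


# Made-up records: same identifiers in both files, different sequences.
assembly = [
    "@HWI-M1:1:000-AAA:1:1101:10:20 2:N:0:", "ACGTACGTAC", "+", "IIIIIIIIII",
    "@HWI-M1:1:000-AAA:1:1101:12:22 2:N:0:", "GGCATTGGCA", "+", "IIIIIIIIII",
]
index = [
    "@HWI-M1:1:000-AAA:1:1101:10:20 2:N:0:", "ACGTAC", "+", "IIIIII",
    "@HWI-M1:1:000-AAA:1:1101:12:22 2:N:0:", "GGCATT", "+", "IIIIII",
]

if fastq_headers(assembly) == fastq_headers(index):
    print("identifiers match, in the same order")
else:
    print("identifiers differ")
```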

If diff reports that the two files are identical, then the .fastq files should be ready to run through the split_libraries_fastq.py script to start the QIIME workflow.

Please let me know if you use the above Perl script or if you run into issues with any of this!

- Brendan


[Update - I just got back a data set from another facility in which the first part of the identifier for each sequence (the section before the first colon) was written in a slightly different format. To deal with this, the above 'if' line that comes after 'while' should read as follows:
if ($_ =~ /^\@(\w*\:\d*\:\w*\-\w*\:\d*\:\d*\:\d*\:\d*)\:/)
If you are not sure which format your identifiers take, it may be best to try the script as written above and then try it with this modification.]
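
As an alternative to trying the two patterns one after the other, a single pattern could accept both formats. The Python sketch below assumes that the two formats differ only in whether the section before the first colon contains a hyphen, which is what the two Perl patterns above imply; the sample identifiers are made up, and I have not tested this against real data from either facility.

```python
import re

# Assumption: the hyphenated prefix before the first colon is simply optional.
EITHER = re.compile(r"^@((?:\w*-)?\w*:\d*:\w*-\w*:\d*:\d*:\d*:\d*):")


def fix_header(line):
    """Rewrite an identifier line in either format; leave other lines alone."""
    match = EITHER.match(line)
    if match:
        return "@" + match.group(1) + " 2:N:0:\n"
    return line


# Made-up identifiers, one per format.
with_hyphen = "@HWI-M00123:45:000000000-A1B2C:1:1101:15589:1332:\n"
without_hyphen = "@M00123:45:000000000-A1B2C:1:1101:15589:1332:\n"
print(fix_header(with_hyphen), end="")
print(fix_header(without_hyphen), end="")
```

The part of the pattern in front of the rest, (?:\w*-)?, should carry over to the Perl 'if' line unchanged, since Perl supports the same non-capturing group syntax.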

Saturday, February 9, 2013

Lichens of GSMNP

A book just came out last week entitled "The Lichens and Allied Fungi of Great Smoky Mountains National Park." It documents the lichen diversity of GSMNP in a far more comprehensive way than has ever been done before. Here is the official summary from Amazon:

"Like the Great Smoky Mountains themselves, much about the lichens of the Smokies has remained shrouded in mystery. This book sheds considerable light on the diversity of these intriguing organisms in the Smokies, a diversity that is unmatched in any other American national park. Written by three of this country's foremost lichen specialists and based on their extensive field and herbarium studies, this book is a comprehensive summary of current knowledge of the lichen biota of Great Smoky Mountains National Park. Included in this treatment: revised and annotated checklist; comprehensive keys to all 804 known species of lichenized, lichenicolous, and allied fungi; extensive ecological notes on noteworthy discoveries; discussion of records for new and interesting taxa; formal description of 2 genera and 12 species new to science; color micrographs illustrating all new genera, and species distribution maps for selected species."

In this book, I co-authored a number of new taxonomic combinations based on morphological and molecular research that I have conducted. The book also includes several species that I have described in past works. This publication will serve as a great resource for researchers studying lichen diversity in eastern North America, since the southern Appalachians (as far as has been documented) seem to house more lichen diversity in a smaller area than any other part of the region.

- Brendan