Friday, February 25, 2011

The Licheniad

The Licheniad

by Sean Q. Beeching

Canto I

A lichen, one may theorize,
When on the future casts his eyes,
His dear descendants he descries.

Eternal life, so it appears,
And a youth that lasts a thousand years,
The lichen spurns as cause for tears.

He dreams, or she, perhaps I’ll say,
Of numerous, happy, progeny,
With whom it would, perforce, parté.

The truth to which one must attest,
Is that our lives may not be blest,
And nuclear winter may end the fest.

In which event the lichen too,
Will perish along with me and you,
Unless a lifeboat he can constue.

But how to effect the goal sublime?
To reproduce two souls entwined,
Involves a course most labyrinthine.

Too sadly he must bid ado,
To sex, I’m sorry, but it’s true,
The ordinary method won’t work for two.

By sex what here we represent,
Does not demand adult consent,
Simple meiosis is all that’s meant.

These poor dears, for aught we know,
Lack the genders and hence forgo,
What here to say would not be apropos.

And had they genders, why they’d be four,
Two for the fungus, for the alga two more,
As state of affairs one would deplore.

It would lead to confusion,
To mishap and exclusion,
And not, in fine, to the hoped for diffusion.

One half, if the better, I’ll not say,
May be engendered in the usual way,
What results, I’m afraid, is a lichen manqué.

Oft tales are told of the sailing spores,
Which travel the heavens beyond our shores
And by means unknown the lichen restores.

Yet how the bionts reunite,
On what rare moonlit starry night;
It remains conjecture, that secret rite.

Yet how the lichen reunites,
By what fantastic arcane rites
Remains unknown, they’re unseen sights.

(Betwixt the preceding tercets twain
Neither of the other could the advantage gain,
I’ve left them both as a short refrain.)

Though it may happen, it seems farfetched,
Belief must needs be sorely stretched,
To credit the procedure I have thus far sketched.

Instead the lichen puts his trust,
In structures far removed from lust;
They seem to us no more than dust.

Within the thallus, all unseen,
Assembles the lichen his breeding machine,
As in an ant hill he were the queen.

At length the surface of its skin,
Reveals the tumult deep within,
It writhes and wrinkles and waxes thin.

Upon its face begin to vent,
Depending on the creature’s bent,
Features odd but of small extent.

Their smallness is indeed a test
Of our student’s eyesight. They protest,
And think our labels a cruel jest.

They’re each quite different, we declare,
Which drives them all to black despair,
As at the mocking plants they stare.

And truth to tell not even we,
Are always sure which one we see,
Isidia, soredia, which could it be?

Regardless which these fly aloft,
Both symbionts combined but soft,
They fall to earth, not seldom but oft.

Back on the earth, behold, it sprouts,
It lives, it breathes, convention flouts,
And grows to manhood, or thereabouts.

Thus, dear readers, have you now heard,
A tale as marvelous but far less blurred,
Than that of the logos, the living word.

Should that comparison seem extreme;
By it am I seen to blaspheme:
It’s nothing, I assure you, but blown off steam.

Tuesday, February 22, 2011

Fast UniFrac for barcoded 454 16S amplicons

When analyzing large 16S sequence data sets from multiple samples to infer community patterns, one of the best analytical methods available is Fast UniFrac.  Barcoded 454 sequencing (a type of 'pyrosequencing') is quickly becoming the method of choice for generating the data for these types of analyses, and Fast UniFrac has been built especially to handle the large data sets generated using this method.  However, there are simple data management issues that can keep researchers from using this methodology.  One major hindrance with Fast UniFrac is that, if one wants to take advantage of the 16S reference tree that is built in to the program (something that becomes almost obligatory with sufficiently large data sets), it is necessary to put all of the sequence names in a specific format that reveals the sample of origin while preserving the individual sequence identifiers.  Below I describe two distinct procedures that I developed for assembling the input file for the first step of the Fast UniFrac pyrosequencing analysis procedure.  I have tested the first procedure on a data set of ~120,000 sequences (this required sending an email to request a higher Fast UniFrac quota, which is typically capped at 100,000 sequences) of ~500bp in length and the second procedure on a data set of ~40,000 sequences of the same approximate length.

Files needed:
Sequence files ('.fna'+'.qual' or '.sff' or '.fasta'+'.groups') with sequence names that are all the same length (standard for 454 data)
Oligos file with primers and barcodes for each sample (for use with Mothur) [http://www.mothur.org/wiki/Trim.seqs]

Special programs needed:
Mothur
A text editor that can perform search and replace on returns (e.g., TextWrangler for Macintosh or TextPad for Windows)
Microsoft Excel 2008 (previous versions max out at ~65000 rows and may do other unexpected things to large data sets)
BLAST (local)
PyCogent
Enthought Python Distribution (I used v6.3) (alternatively, you can download Python and Numpy, but the EPD has versions of these that play well together and should hopefully work with PyCogent as long as they are first in your path... getting Python, Numpy, and PyCogent to talk can be more complicated than it would seem)

Method #1 (no UNIX scripting required):

1) Take the '.fna' (fasta) and '.qual' (quality) files (or alternatively the '.sff' file) plus the manually-produced '.oligos' file for your data set and process them with Mothur using the 'trim.seqs' function [http://www.mothur.org/wiki/Trim.seqs].  This will produce a '.fasta' file with your sequences and a '.groups' file that shows which sequences come from which samples ('environments' in the language of UniFrac).  I recommend using the many functions of Mothur to further cull your data set; however, the amazing power of Mothur will be the subject of future posts.  If you do decide to change the composition of the '.fasta' file in any way (this is always necessary at least to some degree), you can use the Mothur 'list.seqs' function (performed on the finalized '.fasta' file) to generate an '.accnos' file, then follow that by the 'get.seqs' function (performed on the '.accnos' file and the original '.groups' file) to generate a finalized '.groups' file that correlates with the finalized '.fasta' file.

2) Open the '.groups' file in Microsoft Excel and edit the file to create a four-column spreadsheet: (A) '>' on every row; (B) sample names; (C) delimiter (I use '#') on every row, and (D) sequence names.  Highlight all four columns and sort ascending according to column D (sequence names).  This is done by going to 'Data' > 'Sort' > 'Column D', 'ascending'.  Save as 'file1.csv' (comma-delimited text).

3) Open the new 'file1.csv' file in a text editor and remove all commas (this can be easily done in TextWrangler for Mac by opening the file and going to 'Search' > 'Find', putting ',' in the 'Find:' box, typing nothing in the 'Replace:' box, and clicking 'Replace All').  Save as 'file2.txt' (plain text).

4) Open the '.fasta' file with a text editor (e.g., TextWrangler).  Remove all returns (e.g., Find: '\r', Replace: '' in TextWrangler with the 'Grep' box checked on the Find/Replace screen).  Replace all instances of '>' with a return and '>' (e.g., '\r>'; remember that 'Grep' must be checked if you are using TextWrangler).  Now the first line is empty; manually delete that line.  Save as 'file3.txt' (plain text).

5) Import 'file2.txt' into the first column of Microsoft Excel (column A).  On the same spreadsheet, import the names and sequences from file3.txt to the second and third columns (columns B and C) by clicking on cell B1 (column B, row 1) and importing 'file3.txt' as a text file with 'fixed width'; set the field width for the column break manually at the interface between the sequence name and the beginning of the nucleotide sequence (for my data, it was around the 15th place; if this doesn't work with the text import wizard, fixed-width delimitation can be performed under 'Text to Columns...' in the 'Data' menu).  Highlight columns B and C *only* (if you highlight column A as well, this will not work).  Go to 'Data' > 'Sort' > 'Column B', 'ascending'.  At this point, you should have rows that look something like: '>CLSt#GR5DBVW03HJKND' '>GR5DBVW03HJKND' 'TACGATCGATCGATCAGCATCGATCA...' where the columns are correlated with one another and reading across a row should show the same identifier in column A as in column B.  Be sure that this is the case throughout the spreadsheet.  Now delete column B (the one with the identifiers from the imported fasta file).  Save as 'file4.csv' (comma-delimited text).

6) Open 'file4.csv' with a text editor.  Replace all instances of ',,' with a single return (e.g., find ',,' and replace with '\r' in TextWrangler... remember to have the 'Grep' box checked).  Save as 'file5.fasta'.

7) Follow the instructions for "The BLAST to GreenGenes protocol" and 1-7 of the "Steps" in the Fast UniFrac tutorial (download raw pairwise distance matrices for further analyses), all found here:
http://bmf2.colorado.edu/fastunifrac/tutorial.psp

Method #2 (simpler overall, but requires assembling and running a customized UNIX shell script):

1) Take the '.fna' (fasta) and '.qual' (quality) files (or alternatively the '.sff' file) plus the manually-produced '.oligos' file for your data set and process them with Mothur using the 'trim.seqs' function [http://www.mothur.org/wiki/Trim.seqs].  This will produce a '.fasta' file with your sequences and a '.groups' file that shows which sequences come from which samples ('environments' in the language of UniFrac).  I recommend using the many functions of Mothur to further cull your data set; however, the amazing power of Mothur will be the subject of future posts.  If you do decide to change the composition of the '.fasta' file in any way (this is always necessary at least to some degree), you can use the Mothur 'list.seqs' function (performed on the finalized '.fasta' file) to generate an '.accnos' file, then follow that by the 'get.seqs' function (performed on the '.accnos' file and the original '.groups' file) to generate a finalized '.groups' file that correlates with the finalized '.fasta' file.

2) Create and run a shell script on the '.fasta' file that looks something like this:

#!/bin/bash
#$ -S /bin/bash 
#$ -cwd
#$ -o search_replace.log -j y


sed -i s/GMQ03P202B3SVL/MID08#GMQ03P202B3SVL/g file_name.fasta
sed -i s/GMQ03P202BTOMD/MID08#GMQ03P202BTOMD/g file_name.fasta
sed -i s/GMQ03P202CGYZW/MID08#GMQ03P202CGYZW/g file_name.fasta
sed -i s/GMQ03P202BT4E9/MID08#GMQ03P202BT4E9/g file_name.fasta
sed -i s/GMQ03P202BXELV/MID08#GMQ03P202BXELV/g file_name.fasta

It will have to contain a row for every sequence.  This sort of thing can be assembled in Microsoft Excel:
Column A: 'sed -i s/' all the way down the column
Column B: sequence IDs (from a Mothur '.groups' file)
Column C: backslashes all the way down the column
Column D: sample identifiers (correlated with the sequence IDs... easily done if the sequence IDs and sample identifiers are both taken from a Mothur '.groups' file and the order is preserved)
Column E: delimiter (e.g., #) all the way down the column
Column F: identical to column B
Column G: '/g file_name.fasta' all the way down the column
After that is put together, save it as comma-delimited text, open it with a text editor, remove all commas, add the first few lines manually to make it a working script, and run.

3) Follow the instructions for "The BLAST to GreenGenes protocol" and 1-7 of the "Steps" in the Fast UniFrac tutorial (download raw pairwise distance matrices for further analyses), all found here: http://bmf2.colorado.edu/fastunifrac/tutorial.psp


Once you have the raw pairwise distance matrices, they can be analyzed in R or another statistical package/program.  This will be the subject of a later post.  As part of the Fast UniFrac protocol (when using the GreenGeenes core backbone tree), it will be necessary to use PyCogent.  If you have trouble with PyCogent itself (or the command-line in general), please see my earlier post with some tricks and tips on getting that part of the Fast UniFrac 'BLAST-to-GreenGenes' protocol to work.  Hopefully the current post can help someone in constructing the initial input file for the first part of the Fast UniFrac procedure!

- Brendan

P.S. This procedure places '#' in the sequence names as the delimiter. This information will be needed during the PyCogent-dependent portion of the Fast UniFrac protocol. Here is an example of what I type in the command line once I have navigated to the folder where I have placed the 'create_unifrac_env_file_BLAST.py' python script and the 'blast_output...' file (the confusing thing with my nomenclature is that the Python input is named 'output' because it is the BLAST output, while the Python output is named 'input' because it becomes the Fast UniFrac input; the logic behind this is that the essence of this step is the transformation of the BLAST output file into the Fast UniFrac input file):
python create_unifrac_env_file_BLAST.py blast_output_allreplaced_ready4python.txt fastunifrac_input_file.txt #

Update: These instructions are now published as part of my dissertation (Hodkinson 2011) and an article in Environmental Microbiology (Hodkinson et al. 2012a); the supporting data/analysis/instruction files for the latter are available from the Dryad data repository (Hodkinson et al. 2012b). 

----------------------------------------------

References

The above instructions are published in the following sources:

Hodkinson, B. P. 2011. A phylogenetic, ecological, and functional characterization of non-photoautotrophic bacteria in the lichen microbiome. Doctoral Dissertation, Duke University, Durham, NC.
Download Dissertation (PDF file)

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2012a. Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Environmental Microbiology 14(1): 147-161.

Hodkinson, B. P., N. R. Gottel, C. W. Schadt, and F. Lutzoni. 2012b. Data from: Photoautotrophic symbiont and geography are major factors affecting highly structured and diverse bacterial communities in the lichen microbiome. Dryad Digital Repository doi:10.5061/dryad.t99b1.

----------------------------------------------

The development of the above protocols was supported in part by NSF (DEB-1011504) and the US Department of Energy.

Thursday, February 10, 2011

PyCogent for Fast UniFrac

As someone studying the composition of lichen-associated bacterial communities, I have generated several data sets of 16S rRNA gene sequences from bacteria that live in this specialized niche. Beyond the simple question of "who lives there?" we can start to use phylogenetic inferences to examine the ecology of this niche by comparing sets of 16S sequences from different communities and taking into account where the different members fall in a phylogeny. UniFrac is a tool that allows the integration of phylogenetic information into ecological comparative community analyses, and its hip new cousin Fast UniFrac is all the rage these days. But, alas, fully utilizing the special features of Fast UniFrac (such as mapping pyrosequencing reads to a reference phylogeny) requires PyCogent, the installation of which has given me much grief recently. 

PyCogent is a great Python-based toolkit that can be used for conducting a number of analyses on biological sequence data (DNA, RNA, proteins); it is billed as "making sense from sequence" (Knight et al. 2007).  There is a good guide to PyCogent known as the PyCogent Cookbook.  Some programs/packages/pipelines that depend on PyCogent include QIIME and Fast UniFrac (for the latter, PyCogent is required only if you have a large 16S data set that requires a guide tree).

I have had trouble getting the different versions of Python, NumPy, and PyCogent to communicate with one another through UNIX (on both CentOS and MacOSX... although all of the various versions of the different dependencies may have been an issue, since I do not own the machines and I run several versions of Python myself locally), but I ran through the simple 2-step protocol listed below on Windows XP and Windows 7
 and it worked very well for running the Python script associated with the Fast UniFrac 'BLAST-to-GreenGenes' protocol. This is a little odd since it is written that installation of PyCogent by itself is not supported for Windows... and the procedure that I outline below seems to be a pretty simple way to get it installed.

Installing and running PyCogent requires using the command line. If you would like to do this on a Windows machine and you are unfamiliar with the Windows command line, you can google tutorials on "MS-DOS" and/or "command prompt".  There is a decent introductory guide here. The instructions below are written in a broad, inclusive way so that they should work with a UNIX-based system as well (including Macintosh; if you are a Mac user and are unfamiliar with the command line, you can google something like "Mac OSX Terminal" or find a good beginners' tutorial here).  

Whatever type of system it is, the PATH variables must be set correctly so that the programs can find one another.  As long as you do not have previous versions of Python, NumPy, or PyCogent installed, Windows should automatically set the environmental variables so that this protocol will work without a hitch (Macintosh most likely will not set the variables automatically because it usually comes with a pre-installed Python that it will always want to use).  Click here to see a post that further addresses one of the issues with the wrong version of Python/NumPy getting in the way. 

Here is my simplistic protocol for getting PyCogent moving enough to run the Python script mentioned above (I should note that this protocol is not approved by the makers of PyCogent, since it may not produce a fully-functional package, but it does allow me to run the script):

1) Installing Python, NumPy, etc.:
Install the most recent version of the Enthought Python Distribution package (free for academics).

2) Installing PyCogent:
Download the most recent version of PyCogent ('.tgz' file).
Unzip the folder (using, e.g., WinRAR, WinZip, or 7-Zip; an automatic partial unzip might leave it as '.gz' but one of the previously mentioned programs will allow you to unzip it fully and you can drag the folder to your desktop if necessary).
In the command line, navigate to the PyCogent directory.
Type in the command line:
python setup.py install

There are some further notes on installation here and in the README, but please note that it was the fact that these instructions didn't quite get me to where I was going that inspired me to write this post. Still, they are likely to provide exactly what is needed for most situations.

Depending on the sort of jobs you need to run using PyCogent, a single computer may or may not have enough computing power.  I have an interest in PyCogent because I need it to run the aforementioned script that makes the Fast UniFrac '.env' input file (see the Fast UniFrac tutorial for more details on how this fits into the overall Fast UniFrac protocol).  A single computer processor has more than enough computing power to handle this job, but some of the more advanced QIIME functions will certainly require greater power for sufficiently large data sets.

Hopefully the notes here can make Fast UniFrac more accessible to more people (specifically, when the mapping of pyrosequencing reads to a reference tree is required), since the various errors that may occur with PyCogent, NumPy, Python, etc. can be difficult. If you wish to use PyCogent directly, you will probably have to be somewhat familiar with the Python programming language, although the cookbook has enough examples that one may be able to stumble through it naively (not that I would recommend it). If you're like me, and only use PyCogent so that you can map sequences to a reference tree for Fast UniFrac, then everything else you'll need to know can probably be found in the excellent Fast UniFrac tutorial. The Fast UniFrac 'BLAST-to-GreenGenes' procedure also requires a local installation of BLAST (installation instructions for PC, Mac, Linux, etc.). Making the initial input file for this specific type of Fast UniFrac analysis can require some creative thinking, and will be the subject of a future post.

- Brendan