Sunday, November 20, 2011

My New Supercomputing Center

Recently I applied for and received an XSEDE (Extreme Science and Engineering Discovery Environment) Startup Allocation Award (DEB-110024), allowing me to use the San Diego Supercomputer Center (SDSC) Trestles cluster. The cluster itself contains over 10,000 processor cores; you can read more about it here. While many different types of research are conducted on this cluster, my work focuses on ecological and evolutionary analyses of molecular sequence data. Previously I ran my computationally intensive analyses on the Duke Shared Cluster Resource (DSCR). However, since leaving Duke University a few months ago, I have been without a cluster. I am happy to say that I now have access to a cluster once more! Let the supercomputing resume!

- Brendan

Thursday, November 10, 2011

Outreach and training through YouTube videos

Over the past few months I have acted as the New York Botanical Garden's Workflow Coordinator for the NSF-funded "Digitization TCN Collaborative Research: North American Lichens and Bryophytes: Sensitive Indicators of Environmental Quality and Change" (EF-1115086).  This project is a collaboration between multiple institutions across North America and is aimed at cataloging label data from the vast majority of North American lichen and bryophyte specimens.  Recently, as part of this project, we at NYBG released two videos on YouTube.  The first acts as an introduction to the project for the general public and gives some of the rationale behind it:


The second is a training video that can be used by members of the partner institutions or others who are thinking about taking on similar projects:


As you will probably notice from the videos themselves, Charlie Zimmerman, the imager whom I have supervised for this project here at the Garden, did most of the writing, filming and editing... and we greatly appreciate all of his work!

Please enjoy the videos!

- Brendan

Sunday, November 6, 2011

Building linkage-probability-based RNA secondary structure models for phylogenetic inference

RNA secondary structure models are increasingly being integrated into likelihood-based phylogenetic inferences, but the dynamic structure of functional RNA molecules makes any single structural inference necessarily inaccurate. In this post I present an objective method for determining which elements of secondary structure are most stable based on the statistical significance of linkage probabilities between sites on a given RNA molecule. I briefly outline how this information can be integrated into a phylogenetic analysis by creating an input file that contains these statistically significant structural elements.

For some additional background on RNA secondary structure, see this previous post:
http://squamules.blogspot.com/2011/08/its-rna-secondary-structure.html

Functional RNA molecules include pairs of nucleotide sites that are linked to one another physically, resulting in specific secondary structures that define the shape of each molecule.  This linkage causes certain sites to evolve in tandem with their counterparts.  As such, the secondary structure of RNA molecules has been recognized for some time as a significant consideration in the inference of phylogenies from functional RNA-encoding genes (Kimura 1985, Tillier and Collins 1995).  Typically, RNA secondary structure is used to optimize multiple-sequence-alignment accuracy for functional RNA-encoding genes (Gutell et al. 1992, Kjer 1995, Lendemer and Hodkinson 2009).  However, in recent years, the use of secondary structure in modeling evolution for likelihood-based phylogenetic inferences has begun to gain popularity (Hodkinson and Lendemer in review, Savill et al. 2001, Telford et al. 2005).  This approach requires defining the pairs of linked sites on the encoded RNA molecule and treating these pairs as states that are separate from the standard independent nucleotide states (A, C, T, G) in a phylogenetic inference.  This method allows one to properly account for the interdependency of interacting nucleotides, since paired RNA nucleotides are no longer required to be treated as independent sites, leading to a more accurate approach for modeling sequence evolution.

Current protocols for integrating RNA secondary structure data into phylogenetic analyses require a single hypothetical structure to be used as an input.  Structures are typically inferred using algorithms that minimize free energy or use other thermodynamic considerations to produce the best single structural inference (Mathews and Turner 2006).  However, RNA secondary structure is dynamic, frequently changing in the cell as RNA catalyzes reactions and performs various cellular functions.  When RNA molecules encounter certain enzymes and cellular components, the thermodynamic rules that previously favored one structure might strongly favor another.  Additionally, different methods of structural inference are not always comparable, and small differences in algorithms can favor significantly different structures; the use of differing structural models in a phylogenetic context can affect both topology inference and the calculation of support (Ullrich et al. 2010).

These problems can largely be solved by removing statistically non-significant linkages from phylogenetic analyses, leaving only the most probable structural elements to be incorporated into downstream inferences.  The determination of which RNA secondary structural elements are supported with statistical significance is often overlooked and is certainly not a standard part of the current workflow for scientists integrating secondary structural data into phylogenetic analyses.

Since RNA secondary structure can serve as such a useful tool for revealing the evolutionary history of certain groups, it is essential that objective criteria be established for incorporating structural elements into phylogenetic inferences.  The simple method outlined here allows one (a) to evaluate the probability that each site on an RNA-encoding gene is linked to each other site and (b) to produce an 'elemental' secondary structure model for phylogenetic inference containing only the statistically-supported elements of the structure.

The UNAFold package provides a particularly useful set of tools for exploring various aspects of RNA secondary structure (Markham and Zuker 2008).  UNAFold's 'hybrid-ss.exe' yields a set of '.plot' files that give the probability of each base binding to each other base for all reasonable pairings.  After installing UNAFold and running 'hybrid-ss.exe' on a FASTA-formatted sequence, one can choose the '.plot' file with the number that most closely approximates the typical cellular temperature (in degrees Celsius) of the organism from which the sequence is derived.  This '.plot' file can be modified in Excel by sorting according to 'P(i,j)' values (the probability of pairing) and isolating only the rows for which 'P(i,j)' is above 0.95.  This stringent 95% pairing probability cut-off seems most easily justifiable; however, other cut-off values could potentially be used in the context of this method.
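For those who would rather script the filtering step than sort in Excel, here is a minimal Python sketch of the same idea.  It assumes the '.plot' file is whitespace-delimited text with three columns (i, j and 'P(i,j)') and an optional header line; check this against your own UNAFold output before relying on it.  The function name and the sample values below are invented purely for illustration.

```python
def significant_pairs(plot_text, cutoff=0.95):
    """Return (i, j, P) tuples whose pairing probability exceeds the cutoff.

    Assumes each data line holds three whitespace-separated fields:
    i, j and P(i,j). Lines that do not parse (e.g., a header) are skipped.
    """
    pairs = []
    for line in plot_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        try:
            i, j, p = int(fields[0]), int(fields[1]), float(fields[2])
        except ValueError:
            continue  # header or malformed line
        if p > cutoff:
            pairs.append((i, j, p))
    # Most probable pairings first, mirroring the sort done in Excel.
    pairs.sort(key=lambda t: t[2], reverse=True)
    return pairs


# Invented sample data in the assumed layout (not real UNAFold output).
SAMPLE = "i\tj\tP(i,j)\n2\t19\t0.973\n1\t20\t0.991\n3\t18\t0.62\n5\t12\t0.95\n"

if __name__ == "__main__":
    for i, j, p in significant_pairs(SAMPLE):
        print(i, j, p)
```

To use it on a real file, read the file's contents into a string (for example with open(path).read()) and pass that string in; a different cut-off can be supplied through the 'cutoff' argument.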

For integrating this type of data into a phylogenetic analysis (e.g., using RAxML 7.2.8; http://wwwkramer.in.tum.de/exelixis/software.html; Stamatakis 2006), the standard 'Vienna' dot-bracket notation is used (Hofacker et al. 1994).  Any standard secondary structure inference program can be used to create an initial structure that may serve as a template; for sites whose linkage is statistically non-significant, parentheses can be converted to periods using a standard text editor or secondary structure editing program (e.g., 4SALE; http://4sale.bioapps.biozentrum.uni-wuerzburg.de/; Seibel et al. 2006).  These procedures will produce a secondary structure model that includes only the statistically supported elements of structure.  When this 'elemental' secondary structure model is incorporated into phylogenetic analyses, it could serve to decrease the degree of uncertainty introduced into the standard secondary structure-based inferences.
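The parenthesis-to-period conversion can also be scripted.  The following is a rough Python sketch, not part of any of the programs mentioned above: it takes a template structure in dot-bracket notation plus a list of statistically supported (i, j) pairs (1-based positions), and keeps a pair of parentheses only when that pair appears in the supported list.  It assumes the template uses only '(', ')' and '.' (no pseudoknot brackets); the function name and the toy example are hypothetical.

```python
def elemental_structure(template, supported_pairs):
    """Mask unsupported pairs in a dot-bracket string with periods.

    template        -- dot-bracket string using only '(', ')' and '.'
    supported_pairs -- iterable of (i, j) positions, 1-based
    """
    supported = {(min(i, j), max(i, j)) for i, j in supported_pairs}
    result = ["."] * len(template)
    stack = []  # positions of currently unmatched '('
    for pos, ch in enumerate(template, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            opener = stack.pop()
            if (opener, pos) in supported:
                result[opener - 1] = "("
                result[pos - 1] = ")"
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return "".join(result)


if __name__ == "__main__":
    # Toy template with pairs (1,6) and (2,5); only (2,5) is supported.
    print(elemental_structure("((..))", [(2, 5)]))
```

Note that a supported pair that is absent from the template is ignored here; only pairings already present in the template structure can survive into the 'elemental' model.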

Future advances may allow intermediate linkage probabilities to be taken into account directly in the calculation of tree likelihoods.  However, it seems that certain theoretical hurdles remain to be overcome before this type of analysis becomes possible.  Meanwhile, a methodology like the one outlined here could be beneficial if one wishes to reduce the amount of chance introduced into phylogenetic analyses while still accounting for the fact that certain sites are inextricably linked.

- Brendan

---------------------------------------------------

References

Gutell, R. R., A. Power, G. Z. Hertz, E. J. Putz, and G. D. Stormo. 1992. Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Research 20(21): 5785-5795.

Hodkinson, B. P., and J. C. Lendemer. In review. Systematics of an enigmatic sterile crustose lichen.

Hofacker, I. L., W. Fontana, P. F. Stadler, S. Bonhoeffer, M. Tacker, and P. Schuster. 1994. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie / Chemical Monthly 125: 167-188.

Kimura, M. 1985. The role of compensatory neutral mutations in molecular evolution. Journal of Genetics 64(1): 7-19.

Kjer, K. M. 1995. Use of rRNA secondary structure in phylogenetic studies to identify homologous positions: an example of alignment and data presentation from frogs. Molecular Phylogenetics and Evolution 4: 314-330.

Lendemer, J. C., and B. P. Hodkinson. 2009. The wisdom of fools: new molecular and morphological insights into the North American apodetiate species of Cladonia. Opuscula Philolichenum 7: 79-100.

Markham, N., and M. Zuker. 2008. UNAFold: software for nucleic acid folding and hybridization. Methods in Molecular Biology 453: 3-31.

Mathews, D. H., and D. H. Turner. 2006. Prediction of RNA secondary structure by free energy minimization. Current Opinion in Structural Biology 16(3): 270-278.

Savill, N. J., D. C. Hoyle, and P. G. Higgs. 2001. RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum likelihood methods. Genetics 157: 399-411.

Seibel, P. N., T. Müller, T. Dandekar, J. Schultz, and M. Wolf. 2006. 4SALE - A tool for synchronous RNA sequence and secondary structure alignment and editing. BMC Bioinformatics 7: 498.

Stamatakis, A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688-2690.

Telford, M., M. Wise, and V. Gowri-Shankar. 2005. Consideration of RNA secondary structure significantly improves likelihood-based estimates of phylogeny: examples from the Bilateria. Molecular Biology and Evolution 22: 1129-1136.

Tillier, E. R. M., and R. A. Collins. 1995. Neighbor-joining and maximum likelihood with RNA sequences: addressing interdependence of sites. Molecular Biology and Evolution 12: 7-15.

Ullrich, B., K. Reinhold, O. Niehuis, and B. Misof. 2010. Secondary structure and phylogenetic analysis of the internal transcribed spacers 1 and 2 of bush crickets (Orthoptera: Tettigoniidae: Barbitistini). Journal of Zoological Systematics and Evolutionary Research 48(3): 219-228.

---------------------------------------------------

This article can be cited as:
Hodkinson, B. P. 2011. Building linkage-probability-based RNA secondary structure models for phylogenetic inference. Squamules Unlimited, New York. [Available at: http://squamules.blogspot.com/2011/11/building-linkage-probability-based-rna.html]