Indiana University

Contributing to the Annotation of the Loblolly Pine Genome

Project Leads: Keithanne Mockaitis and Le-Shin Wu

National Center for Genome Analysis Support (NCGAS), UITS Research Technologies, research made possible by Mason, Data Capacitor

High Performance File Systems, UITS Research Technologies

Figure 1. Loblolly pines in Mississippi. The loblolly pine is the most commercially important tree species in the US and the source of much paper manufactured here. It has the largest genome yet sequenced. Image source: Woodlot from Wikipedia.

A critical step in any genome sequencing project is discovering the genes contained within the genome. This step, called gene annotation, is particularly difficult. One approach is to sequence the RNA molecules found in the organism, assemble them into transcripts, and map those transcripts back onto the newly assembled genome. With help from NCGAS, this approach was applied to the loblolly pine, which is at the center of a major multi-site sequencing effort. A paper detailing this work, “Unique Features of the Loblolly Pine (Pinus taeda L.) Megagenome Revealed Through Sequence Annotation,” was recently published in Genetics: http://www.genetics.org/content/196/3/891.
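To make the idea of mapping transcripts back onto a genome concrete, here is a toy sketch in Python. It is not the pipeline used in the loblolly pine project. It assumes error-free sequences and exact matching, and the function name `map_transcript` and the parameter `min_exon` are hypothetical. Because a transcript contains only the expressed segments of a gene (exons), it lands on the genome as several separated blocks, and those blocks mark where the gene lies.

```python
from typing import List, Optional, Tuple


def map_transcript(genome: str, transcript: str,
                   min_exon: int = 4) -> Optional[List[Tuple[int, int]]]:
    """Greedily place a transcript on a genome using exact matches.

    Returns half-open (start, end) genome intervals, one per inferred
    exon, in transcript order, or None if the transcript cannot be
    placed. Toy assumptions: no sequencing errors, forward strand only,
    exons appear left to right, each exon is at least min_exon long.
    """
    exons: List[Tuple[int, int]] = []
    t_pos = 0  # next transcript position still to be placed
    g_pos = 0  # only search the genome to the right of the last exon

    while t_pos < len(transcript):
        seed = transcript[t_pos:t_pos + min_exon]
        if len(seed) < min_exon:
            return None  # leftover piece is too short to place reliably

        best_start, best_len = -1, 0
        start = genome.find(seed, g_pos)
        while start != -1:
            # Extend the seed match base by base while both sequences agree.
            length = min_exon
            while (t_pos + length < len(transcript)
                   and start + length < len(genome)
                   and genome[start + length] == transcript[t_pos + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
            start = genome.find(seed, start + 1)

        if best_start == -1:
            return None  # this part of the transcript is absent downstream

        exons.append((best_start, best_start + best_len))
        t_pos += best_len
        g_pos = best_start + best_len

    return exons


if __name__ == "__main__":
    # Two made-up exons separated by a made-up intron of G's.
    genome = "TTTT" + "ACGTACCA" + "GGGGGGGG" + "CTTGAAGC" + "TTTT"
    transcript = "ACGTACCA" + "CTTGAAGC"
    print(map_transcript(genome, transcript))
```

Real annotation work relies on dedicated spliced aligners that tolerate sequencing errors and resolve exact exon boundaries. The greedy extension here can place a boundary a base or two off when the intron happens to continue the match.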

The loblolly pine is the most economically important tree in the United States, and the source of most of the wood pulp used to produce paper products. A complete and annotated genome will be used by plant breeders to develop strains of the tree optimized for different growing conditions, or resistant to environmental or biological threats such as drought or disease. This project is particularly difficult because the loblolly pine genome is the largest plant genome yet sequenced and is seven times larger than the human genome.

NCGAS bioinformatician Le-Shin Wu, working in close partnership with Indiana University faculty member Keithanne Mockaitis, provided bioinformatic assistance in running de novo RNA-sequence assemblies, and technical support with the Mason cluster. NCGAS additionally provided computational resources specifically designed to support these sorts of compute jobs.


The National Center for Genome Analysis Support supports life science research on the national cyberinfrastructure, enabling the US biological research community to analyze, understand, and make use of the vast amount of genomic information now available. NCGAS focuses particularly on transcriptome- and genome-level assembly, phylogenetics, metagenomics/transcriptomics and community genomics. 

The High Performance File System group provides high-speed, disk-based storage of data for IU researchers. 

NSF GSS Codes:

Primary Field: Genetics 610 - Genome Sciences/Genomics

Secondary Field: Computer Science 401 - Computer Systems Analysis