BLAST on the Open Science Grid Provides a Time-Saving Alternative
Project Leads: Craig Stewart (NCGAS – Bloomington, IN), Miron Livny (OSG – Madison, WI)
National Center for Genome Analysis Support & High Throughput Computing, Science Community Tools Group, UITS Research Technologies
Funded by: National Science Foundation grants 1062432, 1148698
The Basic Local Alignment Search Tool (BLAST), an algorithm for comparing primary biological sequence information, is one of the most widely used tools in bioinformatics. The National Center for Genome Analysis Support (NCGAS) and the Indiana University (IU) High Throughput Computing (HTC) group have been experimenting with the Galaxy web-based user interface as a way to submit BLAST jobs to the Open Science Grid (OSG). The outcome is a method of deploying BLAST jobs that returns results more quickly than running them on the supercomputers at IU.
Galaxy is a scientific workflow platform that makes computational biology easier for research scientists who do not know computer programming. Galaxy at IU provides a web-based platform for data-intensive genome analysis research. It uses IU’s Mason cluster for compute services and the IU Data Capacitor for project storage, and is hosted on IU’s Quarry Gateway Web Services Hosting System. NCGAS has created Galaxy portals for IU investigators and for NSF-funded life science researchers nationally. These portals provide ready access to the full suite of genome assembly, annotation, alignment, and other applications, as well as the file transfer and transformation utilities necessary to build genome science workflows.
Today’s technologies for genome sequencing are faster and cheaper, and create more sequence data than ever before. With limited local computing resources, analyzing, understanding, and using these vast amounts of genomic information efficiently becomes a challenge. One solution is to split a single, sizeable analysis task into many independent, smaller tasks and distribute them across multiple computing resources to run in parallel. The combination of Galaxy’s easy-to-use interface and the splitting and distributing functionality of BLAST on OSG lets researchers get more done in less time.
The OSG can supply large numbers of central processing unit (CPU) hours by running many jobs simultaneously. Soichi Hayashi of HTC has been researching a way to run BLAST in parallel by splitting the target database into many chunks and searching them in a distributed high-throughput computing (DHTC) environment, namely the OSG. In turn, Carrie Ganote (NCGAS) has enabled OSG BLAST in IU’s Galaxy interface. Ganote says that the interface for running BLAST on OSG will provide an alternative to the National Center for Biotechnology Information (NCBI) BLAST servers, which are wonderful for small jobs and parameter tinkering but prohibitively slow for large jobs.
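The database-splitting idea can be illustrated with a minimal Python sketch. It is not the HTC group's implementation, which this summary does not describe. The function names (parse_fasta, split_database, merge_hits), the greedy size-balancing strategy, and the representation of a hit as a (subject ID, bit score) pair are all assumptions made for illustration. Hits are merged by bit score because bit scores do not depend on database size. E-values do, so a real merge would also have to account for the size of the full database.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]   # (FASTA header without '>', sequence)
Hit = Tuple[str, float]    # (subject ID, bit score); hypothetical hit format


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    parts: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_database(records: Iterable[Record], n_chunks: int) -> List[List[Record]]:
    """Distribute records over n_chunks so total residues per chunk are similar.

    Greedy heuristic: take sequences longest first and always add the next
    one to the chunk that currently holds the fewest residues.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks: List[List[Record]] = [[] for _ in range(n_chunks)]
    heap = [(0, i) for i in range(n_chunks)]  # (residues so far, chunk index)
    heapq.heapify(heap)
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        size, idx = heapq.heappop(heap)
        chunks[idx].append(rec)
        heapq.heappush(heap, (size + len(rec[1]), idx))
    return chunks


def merge_hits(per_chunk_hits: Iterable[Iterable[Hit]], top_k: int) -> List[Hit]:
    """Combine the hit lists returned by each chunk's job and keep the
    top_k hits with the highest bit scores."""
    all_hits = [hit for hits in per_chunk_hits for hit in hits]
    return heapq.nlargest(top_k, all_hits, key=lambda h: h[1])


if __name__ == "__main__":
    fasta = [">a", "ACGT", "ACGT", ">b", "AC", ">c", "ACGTAC", ">d", "A"]
    chunks = split_database(parse_fasta(fasta), n_chunks=2)
    for i, chunk in enumerate(chunks):
        print(i, [header for header, _ in chunk])
```

Each chunk would then be formatted as its own BLAST database and searched by a separate grid job, with the per-chunk results combined afterward. Balancing chunks by residue count rather than by number of sequences is a design choice intended to keep job run times roughly even.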
Moore, Greg (2014). BLAST on OSG provides a timesaving alternative for large-scale analysis. Web. Accessed 16 Oct 2014. Retrieved from http://www.opensciencegrid.org/blast-on-osg-provides-a-timesaving-alternative-for-large-scale-analysis-2
Hayashi, S., Gesing, S., Quick, R., Teige, S., Ganote, C., Wu, Le-S., Prout, E. (2014). Galaxy based BLAST submission to distributed national high throughput computing resources. Presentation. Presented at the International Symposium on Grids and Clouds (ISGC) 2014 March 23-28. Academia Sinica, Taipei, Taiwan. http://hdl.handle.net/2022/18609
The High Throughput Computing group supports the use of high-throughput computing by the IU and national research communities. HTC is primarily funded by the National Science Foundation.
The National Center for Genome Analysis Support enables the biological research community of the US to analyze, understand, and make use of the vast amount of genomic information now available. NCGAS focuses particularly on transcriptome- and genome-level assembly, phylogenetics, metagenomics/transcriptomics, and community genomics.
NSF GSS Codes:
Primary Field: Genetics (610) - Genome Sciences/Genomics
Secondary Field: Computer Science (401) - Computer Systems Analysis