Saturday 2:35 p.m.–3:05 p.m.
Data intensive biology in the cloud: instrumenting ALL the things
- Audience level:
Cloud computing offers some great opportunities for science, but most cloud computing platforms are I/O and memory limited, and hence are poor matches for data-intensive computing. After 4 years of research software development we are now instrumenting and benchmarking our analysis pipelines; numbers, lessons learned, and future plans will be discussed. Everything is open source.
The cloud provides great opportunities for a variety of important computational science challenges, including reproducible science, standardized computational workflows, comparative benchmarking, and focused optimization. It can also help be a disruptive force for the betterment of science by eliminating the need for large infrastructure investments and supporting exploratory computational science on previously challenging scales. However, most cloud computing use in science so far has focused on relatively mundane "pleasantly parallel" problems. Our lab has spent many moons addressing a large, non-parallelizable "big data/big graph" problem -- sequence assembly -- with a mixture of Python and C++, some fun new data structures and algorithms, and a lot of cloud computing. Most recently we have been working on open computational "protocols", worfklows, and pipelines for democritizing certain kinds of sequence analysis. As part of this work we are tackling issues of standardized test data sets to support comparative benchmarking, targeted optimization, reproducible science, and computational standardization in biology. In this talk I'll discuss our efforts to understand where our computational bottlenecks are, what kinds of optimization and parallelization efforts make sense financially, and how the cloud is enabling us to be usefully disruptive. As a bonus I'll talk about how the focus on pleasantly paralellizable tasks has warped everyone's brains and convinced them that engineering, not research, is really interesting.