PyCon 2016 in Portland, OR

Optimizing DNA jigsaw-puzzle solving by visualization

Samarth Rangavittal

Audience level:
Novice
Category:
Science

Description

To analyze DNA information, the letters produced by sequencing machines must be "assembled" into a superstring: the genome. Graph-based heuristics perform admirably, but they require extensive and expensive parameter tuning for each sample. We propose a Python-based converter for fast DNA-assembly visualization, which can be used to optimize assembly without re-running the code multiple times.

Abstract

Ever since the human genome was deciphered in 2001, DNA sequencing has been used for a number of applications, ranging from understanding human health and disease to collecting samples from different species in the wild to preserve biodiversity. Although the cost of sequencing a genome has fallen from $1 million to $1000 in just the last few years, the cost of data analysis remains high. The data usually takes the form of strings a few hundred letters long, which must be assembled into a long, continuous chromosome of a million to a billion letters. A number of programs based on graph traversal have been developed for this process, each with its own advantages and its own types of datasets to operate on. Since every organism's DNA is unique, the parameters for assembly need to be optimized on a genome-by-genome basis. Re-running the graph-build-and-traverse computation is expensive, especially on large datasets such as the human genome. One possible solution is to visualize the output of a single run of a DNA assembler. Visually comparing the connected components of the raw-data graph with the components of the final assembly graph has two advantages. First, it allows scientists to design better experiments, and second, it helps programmers tune their assemblers for each case without having to re-run the entire process. Here, I present new work from our group on visualizing assemblies from long-read data with a tool we implemented in Python, which uses open source libraries such as Biopython and pyfaidx for indexing and DNA sequence processing. Our program runs to completion within a few minutes on a desktop computer, even on large genomes, and can be accessed at: https://github.com/md5sam/Falcon2Fastg.git
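
To make the idea of comparing connected components concrete, here is a minimal sketch in plain Python. It is not code from Falcon2Fastg: the function name `connected_components`, the edge-list representation, and the toy read and contig names are all assumptions chosen for illustration. It treats each graph as a list of undirected edges and groups the nodes that are reachable from one another.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph.

    `edges` is an iterable of (node_a, node_b) pairs. The result is a list
    of components, each a sorted list of node names.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical toy data (not real assembler output):
# overlaps between raw reads, and links between assembled contigs.
raw_read_overlaps = [("r1", "r2"), ("r2", "r3"), ("r4", "r5"), ("r6", "r7")]
assembly_edges = [("c1", "c2"), ("c3", "c3")]

raw_components = connected_components(raw_read_overlaps)
assembly_components = connected_components(assembly_edges)

print(len(raw_components), "components in the raw-data graph")
print(len(assembly_components), "components in the assembly graph")
```

In this toy case the raw-data graph has three components and the assembly graph has two. A mismatch of this kind is the sort of signal one might inspect visually before deciding whether assembly parameters need adjusting.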