The “War on Cancer” was declared over 40 years ago. Despite tremendous advances in understanding cancer biology and developing cancer treatments, cancer remains a significant cause of suffering and death. We will describe Python-based data management and analysis tools and show how they have enabled a novel flow cytometry-based technology for studying disease biology, with the goal of improving cancer outcomes.
In the more than 40 years of the “War on Cancer”, significant effort has been applied to increasing our knowledge of cancer biology and translating this knowledge into treatment. Part of the challenge in turning these scientific and medical advances into effective therapies is the heterogeneity of the disease. Nodality is a biotechnology startup developing and applying a flow cytometry-based technology called Single Cell Network Profiling (SCNP) in the areas of oncology and autoimmunity to better understand disease biology. SCNP is used to characterize individual patients with the aim of selecting optimal, individualized treatment strategies. In addition, SCNP is applied to drug discovery and development.
To support this mission, Nodality has developed a Python-based data management and analysis pipeline that integrates a large stack of Python libraries. We will first describe the problem domain and the challenges associated with our highly complex, multidimensional biological data. We will then describe the technology stack and show how it has enabled and accelerated biological research, biomarker discovery, clinical studies, and clinical test development.
Nodality’s SCNP technology platform generates high dimensional quantitative information on cellular function in a high-throughput manner. Specifically, for every patient sample we collect the responses of individual cells, across multiple cell types, to a variety of molecular stimuli; these responses are measured on multiple readouts representing key biological signaling pathways. Concretely, this means we capture approximately 3,000,000 data points per patient, and each data point has multiple metadata elements associated with it. Our goal is to join this rich and deep data with clinical facts such as individual patient disease outcomes to develop actionable biological and clinical information.
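The shape of this data, and the join with clinical facts, can be illustrated with a minimal pandas sketch. It is not Nodality's actual pipeline: the column names, the toy values, the choice of a per-condition median as the summary statistic, and the helper functions summarize and join_with_clinical are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical long-format measurements: one row per cell, stimulus, and readout.
# All names and values here are illustrative, not real SCNP data.
measurements = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"],
        "cell_type": ["myeloid"] * 8,
        "stimulus": ["unstim", "unstim", "stimA", "stimA"] * 2,
        "readout": ["readout_1"] * 8,
        "intensity": [1.0, 3.0, 4.0, 8.0, 2.0, 2.0, 3.0, 3.0],
    }
)

# Hypothetical clinical facts: one row per patient.
clinical = pd.DataFrame(
    {"patient_id": ["P1", "P2"], "outcome": ["responder", "non_responder"]}
)


def summarize(cells: pd.DataFrame) -> pd.DataFrame:
    """Collapse per-cell rows to one median intensity per experimental condition."""
    keys = ["patient_id", "cell_type", "stimulus", "readout"]
    return cells.groupby(keys, as_index=False)["intensity"].median()


def join_with_clinical(summary: pd.DataFrame, facts: pd.DataFrame) -> pd.DataFrame:
    """Attach each patient's clinical facts to every summarized condition."""
    # validate raises if the clinical table has more than one row per patient.
    return summary.merge(facts, on="patient_id", how="left", validate="many_to_one")


summary = summarize(measurements)
joined = join_with_clinical(summary, clinical)
print(joined)
```

Keeping measurements in a long format, with metadata as ordinary columns, means the same groupby and merge calls apply regardless of how many cell types, stimuli, or readouts a study includes.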
The Python tools are used in all phases of studies and laboratory management, and by all members of study teams: computational scientists, biologists, and clinical staff.
We found Python to have the right mix of developer and data scientist productivity, a “batteries included” standard library, cross-platform capabilities, and a strong ecosystem. We will describe why we chose Python and how we have leveraged many libraries generously created by Python developers to increase the throughput of our workflow, with the aim of rapidly delivering high quality data, models, and data visualizations.
We have integrated numpy, scipy, SQLAlchemy, matplotlib, RPy2, pandas, Cython, and Django to solve the problems described above. We will describe the architecture for our tools, and, importantly, practical lessons learned in supporting high throughput, high dimensional experimentation. We will discuss our “in the trenches” experiences with and solutions to scaling to large data sets, developing customized analyses, and deploying tools to end users.
We will show how Python is concretely being used to advance biology and clinical science with the goal of improving patient outcomes. In this context we will provide practical information on using Python to manage, mine, and present complex high dimensional data.