Variant, a term once known only to researchers in the biological sciences, is now familiar to the general public. The rise of new SARS-CoV-2 variants carrying novel mutations has become a topic of concern during the COVID-19 pandemic. How do researchers identify these variants from genomic data? How can Python be used in this analysis? This talk will address these questions.
Mutations in any organism are usually identified by variant calling, an analysis step performed on Next Generation Sequencing (NGS) data. Variant calling writes its output in a specialized format, the Variant Call Format (VCF). A VCF file carries metadata along with information on thousands of mutations and is generally large. It is therefore challenging to extract information and identify mutations from such a file, especially when there are hundreds of samples. The Python package scikit-allel provides utilities for exploring large-scale genetic variation data stored in VCF files and helps identify important mutations in downstream analysis. This package depends on scipy, matplotlib, seaborn, pandas, scikit-learn, h5py and zarr. After the mutations are identified, the next step is to visualize them in a meaningful way. This task is relatively simple for a virus with a small genome such as SARS-CoV-2, but more complicated for eukaryotic organisms with multiple chromosomes, such as mouse or human. Another Python package, QMplot, is handy for visualizing thousands of mutations on each chromosome, making the extracted mutations easier for biologists to interpret. This package uses numpy, scipy, pandas and matplotlib.
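
To make the structure of a VCF file concrete, here is a minimal sketch that parses a tiny, made-up VCF using only the Python standard library. It is not how scikit-allel works internally, and the sample names, positions and genotypes are hypothetical. It only illustrates the three parts of the format (`##` metadata lines, the `#CHROM` header that names the samples, and one tab-separated record per mutation) and the kind of summary a downstream analysis might compute. For real files with thousands of records, a function such as scikit-allel's `allel.read_vcf` would be the more practical choice.

```python
import io
from collections import Counter

# Hypothetical two-sample VCF content, invented purely for illustration.
VCF_TEXT = """\
##fileformat=VCFv4.2
##source=example
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2
chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0
chr1\t2000\t.\tC\tCT\t40\tPASS\t.\tGT\t1/1\t0/1
chr2\t1500\t.\tG\tA\t60\tPASS\t.\tGT\t0/0\t0/1
"""


def parse_vcf(handle):
    """Split a VCF stream into metadata lines, sample names and records."""
    meta, samples, records = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Metadata line, e.g. "##fileformat=VCFv4.2".
            meta.append(line[2:])
        elif line.startswith("#CHROM"):
            # Header line: the sample names start at the tenth column.
            samples = line.split("\t")[9:]
        else:
            fields = line.split("\t")
            chrom, pos, _id, ref, alt = fields[:5]
            # Assumes GT is the first colon-separated key in FORMAT.
            genotypes = [f.split(":")[0] for f in fields[9:]]
            records.append({
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt.split(","),
                "genotypes": dict(zip(samples, genotypes)),
            })
    return meta, samples, records


def variant_type(record):
    """Label a record 'SNP' if REF and all ALT alleles are single bases."""
    alleles = [record["ref"]] + record["alt"]
    return "SNP" if all(len(a) == 1 for a in alleles) else "indel"


def carries_alt(genotype):
    """True if the genotype string contains any non-reference allele."""
    alleles = genotype.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


meta, samples, records = parse_vcf(io.StringIO(VCF_TEXT))

# Number of mutations recorded on each chromosome.
per_chrom = Counter(r["chrom"] for r in records)

# Number of mutations each sample carries in at least one copy.
carriers = {
    s: sum(carries_alt(r["genotypes"][s]) for r in records) for s in samples
}

print(per_chrom)
print([variant_type(r) for r in records])
print(carriers)
```

The per-chromosome counts computed here are the kind of summary that a chromosome-wise plot of mutations would display for an organism with multiple chromosomes.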