Developing Good Bioinformatics Habits — Extracting Exonic Variants from A Whole-Genome VCF

Beginning bioinformatics is a challenging task regardless of your prior experience.  There is a reason people refer to bioinformatics as a skill; like most skills, it requires a lot of practice.  Whether you interested in Next-Generation Sequencing (NGS), proteomics or text mining, chances are, pre-existing tools are available for each field.  NGS will serve as the example here, but the principles apply in any field.  This post is an account of my slow realization that bioinformatics skills are truly learned by watching professors, professionals, and Youtubers.

Bioinformatics tools on their own are easy enough to use.  Read the documentation, go to the command line and execute your desired function.  But this strategy goes against everything your being taught!  Efficiency and automation are lost by constantly running to the command line and typing similar tasks over and over.  So let’s go over a common problem in NGS data analysis pipelines and try to efficiently automate this process.

In general, we will follow these steps:

  1. Know the tools — Install the correct software if you haven’t already  (UCSC, tabix and bcftools)
  2. Work with the tools in the command line until you get desired results.
  3. Document your commands that execute properly (save the line in a text editor)
  4. Combine the tools into a script that can be executed to perform a trivial task.

An example problem

Before we start, let’s identify a problem and why it might be useful to save commands and scripts for this task.  If you work with NGS, results for mutation data are in stored in variant call format (VCF) files, developed to aid with the 1000 genomes project.  Your project is to compare mutation results of 500 individuals who had whole-genome sequencing performed.  To make life easier for us, the boss said we should start our investigation in the exonic regions.  The human exome consists the protein coding regions (about 1%) of the total genome, so as you can imagine there is a lot of extra mutation data in the VCF files that we can discard.  But how do we extract exonic regions from whole-genome files?  Then once we figure that out, are we going to have to type the long strings of commands for all 500 individuals?

Know The Tools

One step at a time…  First determine which tools we should use. The first tool only needs to be run one time for the project we are working on.

UCSC table browser is a powerful tool that can create custom .bed files.  These files allow investigators working with one gene, a set of genes, or in our case an entire exome, to obtain genomic coordinates for structurally important regions.  Following along with the link above and the red lines below, first make sure that “knownGene” is selected for table.  Next, select “BED” as output format and name the output file, make sure you select gzip if working with whole exome as the file will be large.  Select “get output” to move on to the next screen.

Red lines indicate the features that need to be selected.
Red lines indicate the features that need to be selected.

The next screen is straight-forward, simply select “Coding Exons” and then select “get BED“.

Remember to select "coding regions"
Remember to select “coding regions”

The resultant file will look like the image below.  bed files are required to have a minimum of chromosome number and genomic locations for start and stop.  Keep in mind that depending on downstream workflow you may need to toggle between “chr1” and “1” in the first column.  The file will probably go to your Downloads folder, save it in a folder that you will remember.

This is the first few lines of an example .bed file.
This is the first few lines of an example .bed file.

Now we have a .bed file that contains locations for every exon of every gene and several hundred VCF files containing millions of mutations!  Since these files are so large and cumbersome to work with, we will have to zip them using bgzip and index them with tabix, documentation here in the samtools distribution.  From here, a good VCF file to practice on would be the dbSNP VCF found here, called All.vcf.  Download it, and save it in folder where you will remember, preferably the same project folder where you save your .bed file.  The next step bcftools, which was developed to work with large VCF fies.   Some whole genome mutation VCF files have millions of mutations.

Work With the Tools & Document Your Commands

Working with the tools to get them to do what you want is an important step in automating processes.  Familiarize yourself with all the functions in the software and save the commands that execute successfully in a text file somewhere so you can come back to them for reference.

First we need to use tabix to index the file All.vcf.gz, which essentially allows software tools trying to access the file to find genomic locations quickly.  It’s pretty simple, let’s try it.  Remember my working directory (~/Biotechpulse) is where all the files I’m working on are located.


All.vcf.tbi is added with the above snippet.  Now that will allow us to work with bcftools to extract only the exonic regions from the whole genome dbsnp VCF.biotechpulse_code2

All_exon_only.vcf is the new file that contains a subset of all the variants from All.vcf.gz.   Now we have completed the task of extracting exonic variants from a whole genome VCF file.  Unfortunately, the time required to run these tools is too long to accomplish any project larger than just a few cases.

Combine the Tools

The final step in our task is to write a shell script that will automate this process.  You could write a script in any scripting language, but I will just reuse most of the command line strings for ease.  This is the most important step!  We have identified a problem that will reoccur on more than one occasion.  We familiarized ourselves with the software tools and documented the commands that executed properly.  Now if we want to perform this task again, lets set up our computer so we can easily do that.

First, create a folder for you shell scripts, or project specific like mine, all of my “NGS scripts” are in the same folder.  Next create the script below using a text editor and save it with name VCF_extract_exome and the file extension .sh if you are using linux shell.


Make the script executable (see red line):


To run the above script run the following from the command line:final_biotechpulse

The key to combining tools into usable scripts is to make the scripts scalable and dynamic.  An example of a dynamic script with respect to our example would be one that prompts the user for a different bed file.  This would be important if you are working with a particular gene set instead of an entire exome.  Most bioinformatics tools are developed with flexibility in mind and scripts that utilize those tools should be equally as flexible.

Thanks for reading, I hope these concepts help you develop a new way of thinking about solving bioinformatics problems!