Developing Good Bioinformatics Habits — Extracting Exonic Variants from A Whole-Genome VCF

Beginning bioinformatics is a challenging task regardless of your prior experience.  There is a reason people refer to bioinformatics as a skill; like most skills, it requires a lot of practice.  Whether you interested in Next-Generation Sequencing (NGS), proteomics or text mining, chances are, pre-existing tools are available for each field.  NGS will serve as the example here, but the principles apply in any field.  This post is an account of my slow realization that bioinformatics skills are truly learned by watching professors, professionals, and Youtubers.

Bioinformatics tools on their own are easy enough to use.  Read the documentation, go to the command line and execute your desired function.  But this strategy goes against everything your being taught!  Efficiency and automation are lost by constantly running to the command line and typing similar tasks over and over.  So let’s go over a common problem in NGS data analysis pipelines and try to efficiently automate this process.

In general, we will follow these steps:

  1. Know the tools — Install the correct software if you haven’t already  (UCSC, tabix and bcftools)
  2. Work with the tools in the command line until you get desired results.
  3. Document your commands that execute properly (save the line in a text editor)
  4. Combine the tools into a script that can be executed to perform a trivial task.

An example problem

Before we start, let’s identify a problem and why it might be useful to save commands and scripts for this task.  If you work with NGS, results for mutation data are in stored in variant call format (VCF) files, developed to aid with the 1000 genomes project.  Your project is to compare mutation results of 500 individuals who had whole-genome sequencing performed.  To make life easier for us, the boss said we should start our investigation in the exonic regions.  The human exome consists the protein coding regions (about 1%) of the total genome, so as you can imagine there is a lot of extra mutation data in the VCF files that we can discard.  But how do we extract exonic regions from whole-genome files?  Then once we figure that out, are we going to have to type the long strings of commands for all 500 individuals?

Know The Tools

One step at a time…  First determine which tools we should use. The first tool only needs to be run one time for the project we are working on.

UCSC table browser is a powerful tool that can create custom .bed files.  These files allow investigators working with one gene, a set of genes, or in our case an entire exome, to obtain genomic coordinates for structurally important regions.  Following along with the link above and the red lines below, first make sure that “knownGene” is selected for table.  Next, select “BED” as output format and name the output file, make sure you select gzip if working with whole exome as the file will be large.  Select “get output” to move on to the next screen.

Red lines indicate the features that need to be selected.
Red lines indicate the features that need to be selected.

The next screen is straight-forward, simply select “Coding Exons” and then select “get BED“.

Remember to select "coding regions"
Remember to select “coding regions”

The resultant file will look like the image below.  bed files are required to have a minimum of chromosome number and genomic locations for start and stop.  Keep in mind that depending on downstream workflow you may need to toggle between “chr1” and “1” in the first column.  The file will probably go to your Downloads folder, save it in a folder that you will remember.

This is the first few lines of an example .bed file.
This is the first few lines of an example .bed file.

Now we have a .bed file that contains locations for every exon of every gene and several hundred VCF files containing millions of mutations!  Since these files are so large and cumbersome to work with, we will have to zip them using bgzip and index them with tabix, documentation here in the samtools distribution.  From here, a good VCF file to practice on would be the dbSNP VCF found here, called All.vcf.  Download it, and save it in folder where you will remember, preferably the same project folder where you save your .bed file.  The next step bcftools, which was developed to work with large VCF fies.   Some whole genome mutation VCF files have millions of mutations.

Work With the Tools & Document Your Commands

Working with the tools to get them to do what you want is an important step in automating processes.  Familiarize yourself with all the functions in the software and save the commands that execute successfully in a text file somewhere so you can come back to them for reference.

First we need to use tabix to index the file All.vcf.gz, which essentially allows software tools trying to access the file to find genomic locations quickly.  It’s pretty simple, let’s try it.  Remember my working directory (~/Biotechpulse) is where all the files I’m working on are located.

biotechpulse_code1

All.vcf.tbi is added with the above snippet.  Now that will allow us to work with bcftools to extract only the exonic regions from the whole genome dbsnp VCF.biotechpulse_code2

All_exon_only.vcf is the new file that contains a subset of all the variants from All.vcf.gz.   Now we have completed the task of extracting exonic variants from a whole genome VCF file.  Unfortunately, the time required to run these tools is too long to accomplish any project larger than just a few cases.

Combine the Tools

The final step in our task is to write a shell script that will automate this process.  You could write a script in any scripting language, but I will just reuse most of the command line strings for ease.  This is the most important step!  We have identified a problem that will reoccur on more than one occasion.  We familiarized ourselves with the software tools and documented the commands that executed properly.  Now if we want to perform this task again, lets set up our computer so we can easily do that.

First, create a folder for you shell scripts, or project specific like mine, all of my “NGS scripts” are in the same folder.  Next create the script below using a text editor and save it with name VCF_extract_exome and the file extension .sh if you are using linux shell.

biotechpulse

Make the script executable (see red line):

biotechpulse_scriptexec

To run the above script run the following from the command line:final_biotechpulse

The key to combining tools into usable scripts is to make the scripts scalable and dynamic.  An example of a dynamic script with respect to our example would be one that prompts the user for a different bed file.  This would be important if you are working with a particular gene set instead of an entire exome.  Most bioinformatics tools are developed with flexibility in mind and scripts that utilize those tools should be equally as flexible.

Thanks for reading, I hope these concepts help you develop a new way of thinking about solving bioinformatics problems!

Kevin

Advertisements

Mapping Next Generation Sequencing Reads to a Reference Genome: A comparison of algorithms in BWA

BWA as a Software Tool for DNA Sequence Alignment/Mapping Reads

This page serves as a working tutorial for myself as well as others.  I created this while learning the steps in the process for aligning reads.  These are my preferred installations, naming conventions, etc…

Aligning .fastq files to a reference genome is the first step in the biological sequence analysis pipeline.  There are several software tools available that perform the alignment.   In my experience these tools are easier to use and typically intended to be installed on unix-like operating systems.  The tools are accessed via the command line, so familiarize yourself with the structure of passing arguments and options.

There are several methods for accomplishing alignment, the following are my preferred methods on an Ubutntu distribution.

I used an example single-end .fastq file retrieved from NCBI’s Sequence Read Archive, with 2,294,708 reads spanning 29 genes.

WAIT… BEFORE YOU START!!!

It is extremely important to make sure that the reference genome you choose is compatible with the tools you are going to be using.

For example, GATK recommends GRCh37, which you can get from their ftp site.  The other popular option for the human genome is hg19 which UCSC genome browser uses.  The biggest difference between the two is CRCh37 .fasta files have a “>chr1” in their header, whereas hg19 .fasta’s have “>1” in their header.  (NCBI has recently updated the newest genome assembly GRCh38 as of Dec 2013)

The human genome is large, about 3.1 GB, so it is packaged in separate files, separated by chromosome.  Best practice is to combine them using the concatenate command in Linux shell.  Make sure you concatenate them sorted karyotypically not alphanumerically.  For Example:

kevin@kevin:~$ cd /path/to/ref_fasta/
kevin@kevin:~/path/to/ref_fasta$ cat chr1.fa chr2.fa chr3.fa ... chrX.fa chrY.fa chrMT.fa > GRCh37_sorted.fa

This produces a reference genome in .fasta format.  Go ahead and delete the individual chromosome .fasta files, they are no longer needed.

BWA

BWA is an alignment software recommended by GATK.  The sourceforge page including installation and documentation is found here.

Before the reference .fasta can be used, it must be indexed for speed considerations.  The following will index your reference fasta using bwa.  Be sure to use “-a bwtsw” which is an (-a)lgorithm for handling large genomes.

kevin@kevin:~$ bwa index -a bwtsw path/to/ref_fasta/reference.fa

Now the reads contained in .fastq files can be mapped to the reference genome.  There are three different algorithms included, each with their own pros and cons.

  1. bwa aln — samse/sampe
  2. bwasw
  3. bwa mem

Unpaired (or single-end) .fastq with default settings for each is shown below.  The following will generate a .sam file (sequence alignment/map format)
kevin@kevin:~$ bwa aln ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example.fastq > bwa_example.sai
kevin@kevin:~$ bwa samse ~/path/to/ref_fasta/ref.fasta ~path/to/sai/bwa_example.sai ~/path/to/fastq/example.fastq > example_aln_samse.sam


kevin@kevin:~$ bwa mem ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example.fastq > example_mem.sam

kevin@kevin:~$ bwasw ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example.fastq > example_sw.sam

BWA has several settings for advanced users that help align indels, trim reads, and in general, increase efficiency of mapping reads.  I performed alignment on the same .fastq file using several algorithms while recording time and accuracy for comparison. The results are in the graph below.  bwa-mem and bwasw achieved adequate results using default parameters, hence they were only used once.  bwa-aln achieved sub-optimal results with default parameters, so I tried tweaking some settings.  The purpose was to show that bwa-aln requires careful fine tuning and is often involved.

-e [integer] allows for a gap extension of up to [integer] base pairs, which can increase sensitivity of finding deletions. (default -1, which disallows long gaps)

-E [integer] allows the user to define the penalty for extending the gap (default 4)

This graph describes the time required and accuracy of each algorithm in BWA.  The algorithms are listed along the X axis with the settings chosen in parentheses.  Time is on Y1, in seconds.  Accuracy is depicted on Y2 as % Reads that successfully mapped to the reference genome.  Notice that bwa-aln is slower and less accurate than the newer bwa-mem and bwasw.
This graph describes the time required and accuracy of each algorithm in BWA. The algorithms are listed along the X axis with the settings chosen in parentheses. Time is on Y1, in seconds. Accuracy is depicted on Y2 as % Reads that successfully mapped to the reference genome. Notice that bwa-aln is slower and less accurate than the newer bwa-mem and bwasw.

 

Paired-end reads must be mapped individually, separated by forward and reverse fastq; also a different alignment algorithm is used (aln/sampe).  I did not show results from paired-end reads.
kevin@kevin:~$ bwa aln ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example_1.fastq > bwa_example_1.sai
kevin@kevin:~$ bwa aln ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example_2.fastq > bwa_example_2.sai
kevin@kevin:~$ bwa sampe ~/path/to/ref_fasta/ref.fasta ~path/to/sai/bwa_example_1.sai ~path/to/sai/bwa_example_2.sai ~/path/to/fastq/example_1.fastq ~/path/to/fastq/example_2.fastq > example.sam

Samtools & Integrated Genomics Viewer

Samtools will convert .sam files that  were generated in the previous steps to .bam files.  This site gives step by step instructions for converting from sam to bam, sorting and indexing a .bam.

One method of visualizing the reads in a .bam file is the IGV from Broad.  I’m not going to go into details, but the idea is:

  1. Choose the same reference genome that you used to index your reference genome from the previous steps.
  2. Upload your .bam file (index file “.bai” must be in same folder as .bam that you upload)
  3. Type in the gene of interest in the search bar.
  4. Zoom in to see the reads, they aren’t visible at low power.
IGV view of sequence reads in MSH6 gene using bwa-aln-samse with default settings.  Notice there are only 12 deletions.
IGV view of sequence reads in MSH6 gene using bwa-aln-samse with default settings. Notice there are only 12 deletions.
IGV view of sequencing reads in MSH6 gene in the exact location as above.  The -e 5 -E 1 settings which allow for deletions to map to the reference genome easier.  Notice compared to the default settings (12) there are
IGV view of sequencing reads in MSH6 gene in the exact location as above. The -e 5 -E 1 settings allow for deletions to map to the reference genome easier. Notice compared to the default settings (12) there are now 18 reads with deletions.

Every computer is set up a little differently invoking different usage and calls to the software.  Comments and suggestions welcome!