Mapping Next Generation Sequencing Reads to a Reference Genome: A comparison of algorithms in BWA

BWA as a Software Tool for DNA Sequence Alignment/Mapping Reads

This page serves as a working tutorial for myself as well as others.  I created this while learning the steps in the process for aligning reads.  These are my preferred installations, naming conventions, etc…

Aligning .fastq files to a reference genome is the first step in the biological sequence analysis pipeline.  There are several software tools available that perform the alignment.   In my experience these tools are easier to use and typically intended to be installed on unix-like operating systems.  The tools are accessed via the command line, so familiarize yourself with the structure of passing arguments and options.

There are several methods for accomplishing alignment, the following are my preferred methods on an Ubutntu distribution.

I used an example single-end .fastq file retrieved from NCBI’s Sequence Read Archive, with 2,294,708 reads spanning 29 genes.

WAIT… BEFORE YOU START!!!

It is extremely important to make sure that the reference genome you choose is compatible with the tools you are going to be using.

For example, GATK recommends GRCh37, which you can get from their ftp site.  The other popular option for the human genome is hg19 which UCSC genome browser uses.  The biggest difference between the two is CRCh37 .fasta files have a “>chr1” in their header, whereas hg19 .fasta’s have “>1” in their header.  (NCBI has recently updated the newest genome assembly GRCh38 as of Dec 2013)

The human genome is large, about 3.1 GB, so it is packaged in separate files, separated by chromosome.  Best practice is to combine them using the concatenate command in Linux shell.  Make sure you concatenate them sorted karyotypically not alphanumerically.  For Example:

kevin@kevin:~$ cd /path/to/ref_fasta/
kevin@kevin:~/path/to/ref_fasta$ cat chr1.fa chr2.fa chr3.fa ... chrX.fa chrY.fa chrMT.fa > GRCh37_sorted.fa

This produces a reference genome in .fasta format.  Go ahead and delete the individual chromosome .fasta files, they are no longer needed.

BWA

BWA is an alignment software recommended by GATK.  The sourceforge page including installation and documentation is found here.

Before the reference .fasta can be used, it must be indexed for speed considerations.  The following will index your reference fasta using bwa.  Be sure to use “-a bwtsw” which is an (-a)lgorithm for handling large genomes.

kevin@kevin:~$ bwa index -a bwtsw path/to/ref_fasta/reference.fa

Now the reads contained in .fastq files can be mapped to the reference genome.  There are three different algorithms included, each with their own pros and cons.

  1. bwa aln — samse/sampe
  2. bwasw
  3. bwa mem

Unpaired (or single-end) .fastq with default settings for each is shown below.  The following will generate a .sam file (sequence alignment/map format)
kevin@kevin:~$ bwa aln ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example.fastq > bwa_example.sai
kevin@kevin:~$ bwa samse ~/path/to/ref_fasta/ref.fasta ~path/to/sai/bwa_example.sai ~/path/to/fastq/example.fastq > example_aln_samse.sam


kevin@kevin:~$ bwa mem ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example.fastq > example_mem.sam

kevin@kevin:~$ bwasw ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example.fastq > example_sw.sam

BWA has several settings for advanced users that help align indels, trim reads, and in general, increase efficiency of mapping reads.  I performed alignment on the same .fastq file using several algorithms while recording time and accuracy for comparison. The results are in the graph below.  bwa-mem and bwasw achieved adequate results using default parameters, hence they were only used once.  bwa-aln achieved sub-optimal results with default parameters, so I tried tweaking some settings.  The purpose was to show that bwa-aln requires careful fine tuning and is often involved.

-e [integer] allows for a gap extension of up to [integer] base pairs, which can increase sensitivity of finding deletions. (default -1, which disallows long gaps)

-E [integer] allows the user to define the penalty for extending the gap (default 4)

This graph describes the time required and accuracy of each algorithm in BWA.  The algorithms are listed along the X axis with the settings chosen in parentheses.  Time is on Y1, in seconds.  Accuracy is depicted on Y2 as % Reads that successfully mapped to the reference genome.  Notice that bwa-aln is slower and less accurate than the newer bwa-mem and bwasw.
This graph describes the time required and accuracy of each algorithm in BWA. The algorithms are listed along the X axis with the settings chosen in parentheses. Time is on Y1, in seconds. Accuracy is depicted on Y2 as % Reads that successfully mapped to the reference genome. Notice that bwa-aln is slower and less accurate than the newer bwa-mem and bwasw.

 

Paired-end reads must be mapped individually, separated by forward and reverse fastq; also a different alignment algorithm is used (aln/sampe).  I did not show results from paired-end reads.
kevin@kevin:~$ bwa aln ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example_1.fastq > bwa_example_1.sai
kevin@kevin:~$ bwa aln ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example_2.fastq > bwa_example_2.sai
kevin@kevin:~$ bwa sampe ~/path/to/ref_fasta/ref.fasta ~path/to/sai/bwa_example_1.sai ~path/to/sai/bwa_example_2.sai ~/path/to/fastq/example_1.fastq ~/path/to/fastq/example_2.fastq > example.sam

Samtools & Integrated Genomics Viewer

Samtools will convert .sam files that  were generated in the previous steps to .bam files.  This site gives step by step instructions for converting from sam to bam, sorting and indexing a .bam.

One method of visualizing the reads in a .bam file is the IGV from Broad.  I’m not going to go into details, but the idea is:

  1. Choose the same reference genome that you used to index your reference genome from the previous steps.
  2. Upload your .bam file (index file “.bai” must be in same folder as .bam that you upload)
  3. Type in the gene of interest in the search bar.
  4. Zoom in to see the reads, they aren’t visible at low power.
IGV view of sequence reads in MSH6 gene using bwa-aln-samse with default settings.  Notice there are only 12 deletions.
IGV view of sequence reads in MSH6 gene using bwa-aln-samse with default settings. Notice there are only 12 deletions.
IGV view of sequencing reads in MSH6 gene in the exact location as above.  The -e 5 -E 1 settings which allow for deletions to map to the reference genome easier.  Notice compared to the default settings (12) there are
IGV view of sequencing reads in MSH6 gene in the exact location as above. The -e 5 -E 1 settings allow for deletions to map to the reference genome easier. Notice compared to the default settings (12) there are now 18 reads with deletions.

Every computer is set up a little differently invoking different usage and calls to the software.  Comments and suggestions welcome!