My Journey Beyond the Novelty of Personal Genomics

Last holiday season, my brother gave me a 23andMe kit as a Christmas present. Personal genomics kits are excellent gift ideas… if you know what to expect.  There are several companies that offer commercial “direct-to-consumer” whole genome sequencing at a fair price.  Since I knew 23andMe provided its customers with raw data, I was excited to get started immediately.  I spent some time on the 23andMe website, reading instructions for submitting my spit sample, finding out how long it would take before I had my results, and most importantly looking at what type of results I should expect.  My personalized 23andMe dashboard prompted me to answer a seemingly endless supply of “research” questions ranging from, “do you snore?” to ” do you have family history of asthma?”.  Weeks had gone by since submitting the sample, with the waiting my excitement gradually subsided.  But when I got an email alert that the results were available, I couldn’t wait to get home, login to my 23andMe dashboard and solve all the genetic mysteries that were now unlocked.

23andMe pros & cons

Pro: Ancestry Data

Unfortunately, I was a little underwhelmed with my results.  23andMe provides ancestry data and how closely related you are to Neanderthals.  But I already knew that my mom was 100% Irish and my dad was Polish and I don’t really care how closely related I am to Neanderthals!  However, for people with a diverse and/or unknown ancestry, a tool like this would provide the opportunity for hours of self-guided internet research.

Countries of Ancestry tool developed by: Mike Macpherson, Brian Naughton, Marcela Miyazawa
Countries of Ancestry tool developed by: Mike Macpherson, Brian Naughton, Marcela Miyazawa

Pro: DNA Relatives

In addition to the Ancestry data, one of the primary features that 23andMe offers is a social “network” of relatives who are also registered and genotyped on 23andMe.  Users have the option of sharing their genome with other users upon request as well as searching for DNA relatives.  As it turns out, it was exciting to see that one of my first cousins was already genotyped and we share 11.5% of our DNA.  As any user will quickly see, there are hundreds of “3rd – 5th cousins” who have been genotyped, you might recognize a user’s last name, or that small town in Maine where you know you have family.  This feature has helped individuals find family members, which can be controversial; I’m sure there are hundreds of untold stories that don’t end quite as happily, but it’s is a risk you take when getting your genome sequenced.

Visual representation of the shared regions of chromosomes [1-5] that my 1st cousin and I share.
Visual representation of the shared regions of chromosomes [1-5] that my 1st cousin and I share. Tool by: Lawrence Hon

Con: No phenotype info

The biggest reason I was underwhelmed was because there was so much data that 23andMe was housing, they knew everything about me from answering those survey questions and on top of that, they had my genome!  With hundreds of thousands  customers and presumably most of them answering the survey questions, surely 23andMe could figure out how to tell us which groups of people had blue eyes, or who was at risk for disease, right?  It’s the worlds biggest science experiment and its all data driven!!!  Scattered throughout the 23andMe website, you see leftover indicators that they were planning to offer this information to their customers at some point.  But it’s not their fault, they lost a fight against the FDA in providing “health-related” genetic tests, in summary, direct-to-consumer genetics is difficult to navigate because when customers learn that they are in the marginally at risk group for a certain disease, it causes an uninformed overreaction.  In the two and half years since 23andMe has made progress by offering a direct to consumer genetic test for Bloom Syndrome.  Expecting parents can get this test now to help determine if their unborn child is at risk for the disease.  The best way to think of this is like the test is now offered “over-the-counter”, you don’t have to visit a clinician to order the test.  This is a very clean and clear scenario though, often times genetic testing requires a heavy dose of human interpretation of highly complex results, so it’s understandable that the FDA didn’t want 23andMe unleashing this data to its customers, the world just isn’t ready.

Pro: API and Raw Data

On the heels of my disappointment with lack of interpretation of the data, I was more impressed that 23andMe offered an application programming interface (API for short) which allows 3rd party developers to build applications that sync with 23andMe’s customer’s data.  There are massive independent efforts out there that will provide some form of interpretation of your 23andMe results.  Some examples are SNPedia, and opensnp as well as a list of tools at 23andyou.  Customers that visit these sites get redirected to the official 23andMe sign-in page.  There is a nice disclaimer that releases them from liability and informs the user that their data will be viewed.  You have to pay for most of these services and I’m a bioinformatician, so I should be able to work on my own data… but now I have the tools to do so.

Beyond the novelty

SNPs, snps, snps

23andMe offers user’s raw data in the format of a list of ~600,000 SNPs (single nucleotide polymorphisms).  Each SNP has a unique identifier and the result is a genotype, which is simply two letters (one from mom, one from dad) it looks like AA, for example.  Here’s the first few lines from my results, then imagine ~600k of these SNPs.

First few snps from my 23andMe raw result file.
First few snps from my 23andMe raw result file.

SNPs are what make each of us different, they help explain the variance between different populations.  In the 2000’s scientists learned that SNPs were very powerful in characterizing large groups of people.  By performing whole genome arrays or sequencing on thousands of people, they could capture some significant correlations in the data.  These types of studies, called genome-wide-association-studies or GWAS, are popular today because the accuracy of the test has increased and the cost has significantly decreased.  If you think about it, 23andMe is currently doing the largest GWAS to date, and they’re publishing studies with impact on inherited diseases.

A single GWAS can take months to years to perform.  It typically contains two groups of people, one group that presents with a specific trait of interest, and one control group that doesn’t have the trait of interest.  The researchers could be looking at any trait ranging from serious diseases such as diabetes to more routine traits like hair color.  The idea is to genotype the SNPs for the entire study population and determine if there are SNPs that are exclusive to one population over the other.  Correlation doesn’t always imply causation, but with enough statistical power, there may be enough evidence to investigate certain SNPs in more detail.
Figure showing the mechanics of typical GWAS studies.

Someone at the National Human Genome Research Institute (NHGRI) thought it would be a good idea to have all these splintered GWAS in one central data warehouse and publically accessible.  What a great idea, it spawned a new database called the GWAS catalog curated by NHGRI, EMBL & EBI.  Weekly, it extracts data from published GWAS studies, for a grand total of more than 100,000 high quality, highly significant SNPs.  The best part is the data is free! It’s also easily accessible in the exact same format as the 23andMe data!  The GWAS catalog can be similarly accessed using their API.  This allows programmers the flexibility to build complex search queries and get real-time results.

My application

The GWAS catalog was the perfect opportunity to take advantage of my 23andMe results.  The GWAS catalog has a function to search for SNP associations by trait or disease, shown below.  For my first search I wanted to keep it light-hearted, but not trivial either, so I looked up baldness to see what SNPs, if any, contributed to male-pattern baldness.  Sure enough, there were 13 SNPs from 4 separate studies.  Two of the top hits are in the Androgen Receptor gene (AR), which is well known to be implicated in hair loss.

SNP association results from my
SNP association results from my “baldness” query.

We haven’t arrived at my application yet, anyone can go to the GWAS catalog and type in a disease or trait…  But I wanted to take a minute to comment on the search results from the figure above.  Notice you can filter several factors, including p-value and odds ratio.  GWAS are inherently prone to false positive association results just because of sheer numbers.  Statistics tell us that low p-values, lets say below 0.05 significance, means that  a SNP is associated with a trait in in 5% of cases just due to random sampling error.  So the lower the p-value, the more likely it is not associated by random chance.  The odds ratio quantifies how strongly a characteristic (trait) is seen in one population over another.  From two figures up, the “cases” group is 1.7 times more likely to have a “C” SNP than the “control” group.  In summary odds ratios close to 1.0 are usually inconclusive, as it indicates a 50/50 split.  The higher the odds ratio, the more likely a SNP is to be exclusive in one group over the other.  Now we have a list of SNPs associated with baldness, and some numbers to quantify how strong those associations are, cool.

It’d be nice if I could take one of the 13 SNPs and see if I’m more likely to go bald or not.  Currently, if I want to look up my SNPs, I’d have to type in each SNP one by one as shown below, and 23andMe would return my genotype for that SNP (AA or CT etc).  This method, as you can imagine, is pretty laborious, even for 13 SNPs.  If I search for celiac disease in the GWAS catalog, there are 90 SNPs, and diabetes has > 900, I’m not typing all those in.

23andMe raw data homepage. You can search several SNPs at once by using search by gene, or you can search for a single SNP using the rsID.
23andMe raw data homepage. You can search several SNPs at once by using search by gene, or you can search for a single SNP using the rsID.

This is the point when I decided to design an application that would allow me to investigate my own traits by tapping into the power of the GWAS catalog and marrying that with my 23andMe raw data.  I wanted this all to be done in an automated fashion without typing in each SNP one-by-one.  My application, it’s really a script, can be found on my GitHub page; if you know what your doing feel to use it.

Workflow of the application
Workflow of the application

The application starts by accepting a search term, which, in my case, was baldness, but it can be any trait or disease that the GWAS catalog has curated.  The application uses the GWAS API to return a list of SNPs associated with that search term while recording the risk allele, p-value, odds ratio, and number of study participants.  The risk allele determines which allele (A,T,C or G) is more prevalent in the risk group.

Depending on your search term there is a fairly long list of SNPs.  I filter out SNPs using a method described here in the “associations” section.  Basically if a study had less than 1,000 participants the statistics aren’t as convincing, also I filter so that only SNPs with a p-value < 5 * 10e-8 are included.  This filter is programmatically enforced during the first step.

Now that I have a list of high quality SNPs from the GWAS, my application automatically pulls your 23andMe data only for these SNPs.  You are prompted by 23andMe with a popup which will allow my application to pull your data.  So the application pulls the list of GWAS SNPs from your 23andMe data in the form of rs123456: AA, rs987654: AT, etc.  This is your list of SNPs associated with your search term.  Finally, the application compares your genotype to the risk allele.  It tells you whether you are normal, heterozygous, or homozygous for each SNP.

My results from the
My results from the “baldness” query

My application makes it easy to quickly investigate many of your own SNPs all at once, and gives you relevant information from a trusted source, the GWAS catalog.  What my application does not do, is try to interpret the results.  GWAS studies are controversial to begin with.  Also, many diseases, including cancer, are too complex to put a finger on one SNP change that will have an impact on predicting whether or not someone will be at risk for a disease.

I did this application as a side project, I wanted to see what I could do with my own 23andMe genotype data and more importantly used the opportunity to expand my skill set.  My next steps in the project will be to get the application on a webpage so anyone can visit and use it freely as well as try to get the results in a more meaningful format.

As always, thanks for reading, I appreciate any comments!

Developing Good Bioinformatics Habits — Extracting Exonic Variants from A Whole-Genome VCF

Beginning bioinformatics is a challenging task regardless of your prior experience.  There is a reason people refer to bioinformatics as a skill; like most skills, it requires a lot of practice.  Whether you interested in Next-Generation Sequencing (NGS), proteomics or text mining, chances are, pre-existing tools are available for each field.  NGS will serve as the example here, but the principles apply in any field.  This post is an account of my slow realization that bioinformatics skills are truly learned by watching professors, professionals, and Youtubers.

Bioinformatics tools on their own are easy enough to use.  Read the documentation, go to the command line and execute your desired function.  But this strategy goes against everything your being taught!  Efficiency and automation are lost by constantly running to the command line and typing similar tasks over and over.  So let’s go over a common problem in NGS data analysis pipelines and try to efficiently automate this process.

In general, we will follow these steps:

  1. Know the tools — Install the correct software if you haven’t already  (UCSC, tabix and bcftools)
  2. Work with the tools in the command line until you get desired results.
  3. Document your commands that execute properly (save the line in a text editor)
  4. Combine the tools into a script that can be executed to perform a trivial task.

An example problem

Before we start, let’s identify a problem and why it might be useful to save commands and scripts for this task.  If you work with NGS, results for mutation data are in stored in variant call format (VCF) files, developed to aid with the 1000 genomes project.  Your project is to compare mutation results of 500 individuals who had whole-genome sequencing performed.  To make life easier for us, the boss said we should start our investigation in the exonic regions.  The human exome consists the protein coding regions (about 1%) of the total genome, so as you can imagine there is a lot of extra mutation data in the VCF files that we can discard.  But how do we extract exonic regions from whole-genome files?  Then once we figure that out, are we going to have to type the long strings of commands for all 500 individuals?

Know The Tools

One step at a time…  First determine which tools we should use. The first tool only needs to be run one time for the project we are working on.

UCSC table browser is a powerful tool that can create custom .bed files.  These files allow investigators working with one gene, a set of genes, or in our case an entire exome, to obtain genomic coordinates for structurally important regions.  Following along with the link above and the red lines below, first make sure that “knownGene” is selected for table.  Next, select “BED” as output format and name the output file, make sure you select gzip if working with whole exome as the file will be large.  Select “get output” to move on to the next screen.

Red lines indicate the features that need to be selected.
Red lines indicate the features that need to be selected.

The next screen is straight-forward, simply select “Coding Exons” and then select “get BED“.

Remember to select "coding regions"
Remember to select “coding regions”

The resultant file will look like the image below.  bed files are required to have a minimum of chromosome number and genomic locations for start and stop.  Keep in mind that depending on downstream workflow you may need to toggle between “chr1” and “1” in the first column.  The file will probably go to your Downloads folder, save it in a folder that you will remember.

This is the first few lines of an example .bed file.
This is the first few lines of an example .bed file.

Now we have a .bed file that contains locations for every exon of every gene and several hundred VCF files containing millions of mutations!  Since these files are so large and cumbersome to work with, we will have to zip them using bgzip and index them with tabix, documentation here in the samtools distribution.  From here, a good VCF file to practice on would be the dbSNP VCF found here, called All.vcf.  Download it, and save it in folder where you will remember, preferably the same project folder where you save your .bed file.  The next step bcftools, which was developed to work with large VCF fies.   Some whole genome mutation VCF files have millions of mutations.

Work With the Tools & Document Your Commands

Working with the tools to get them to do what you want is an important step in automating processes.  Familiarize yourself with all the functions in the software and save the commands that execute successfully in a text file somewhere so you can come back to them for reference.

First we need to use tabix to index the file All.vcf.gz, which essentially allows software tools trying to access the file to find genomic locations quickly.  It’s pretty simple, let’s try it.  Remember my working directory (~/Biotechpulse) is where all the files I’m working on are located.


All.vcf.tbi is added with the above snippet.  Now that will allow us to work with bcftools to extract only the exonic regions from the whole genome dbsnp VCF.biotechpulse_code2

All_exon_only.vcf is the new file that contains a subset of all the variants from All.vcf.gz.   Now we have completed the task of extracting exonic variants from a whole genome VCF file.  Unfortunately, the time required to run these tools is too long to accomplish any project larger than just a few cases.

Combine the Tools

The final step in our task is to write a shell script that will automate this process.  You could write a script in any scripting language, but I will just reuse most of the command line strings for ease.  This is the most important step!  We have identified a problem that will reoccur on more than one occasion.  We familiarized ourselves with the software tools and documented the commands that executed properly.  Now if we want to perform this task again, lets set up our computer so we can easily do that.

First, create a folder for you shell scripts, or project specific like mine, all of my “NGS scripts” are in the same folder.  Next create the script below using a text editor and save it with name VCF_extract_exome and the file extension .sh if you are using linux shell.


Make the script executable (see red line):


To run the above script run the following from the command line:final_biotechpulse

The key to combining tools into usable scripts is to make the scripts scalable and dynamic.  An example of a dynamic script with respect to our example would be one that prompts the user for a different bed file.  This would be important if you are working with a particular gene set instead of an entire exome.  Most bioinformatics tools are developed with flexibility in mind and scripts that utilize those tools should be equally as flexible.

Thanks for reading, I hope these concepts help you develop a new way of thinking about solving bioinformatics problems!


Roots of Equations

Quick & dirty python scripts for finding roots of equations:

The functions are hard-coded in these examples, it’s more for the algorithm implementation…

Bisection Method  code

This piece of code answers 5.12 in Chapra’s Numerical Methods for Engineers.  Useful for finding the max between the integrals [0,1] for the following function: f(x) = -2x^6 – 1.6x^4 +12x +1

Results: 13 iterations to get below error of <5%  ;  final xr: 0.90478

f (x) = −2x^6 − 1.6x^4 + 12x + 1
f (x) = −2x^6 − 1.6x^4 + 12x + 1

Newton-Raphson method code

This answers Chapra’s 6.4.  You have to run this as many times as you think there are roots, which is why it’s a good idea to graph the function first.  I circled my initial guesses on the graph below for the hard-coded example: f(x) = -1 + 5.5x -4x^2 + 0.5x^3

The actual results of the prediction are:

x0:  0   after 4 iterations xroot:  0.21433

x0:  1   after 5 iterations xroot: 1.47976

x0:  6   after 4 iterations xroot: 6.30589


f(x) = -1 + 5.5x -4x^2 + 0.5x^3
f(x) = -1 + 5.5x -4x^2 + 0.5x^3

Mapping Next Generation Sequencing Reads to a Reference Genome: A comparison of algorithms in BWA

BWA as a Software Tool for DNA Sequence Alignment/Mapping Reads

This page serves as a working tutorial for myself as well as others.  I created this while learning the steps in the process for aligning reads.  These are my preferred installations, naming conventions, etc…

Aligning .fastq files to a reference genome is the first step in the biological sequence analysis pipeline.  There are several software tools available that perform the alignment.   In my experience these tools are easier to use and typically intended to be installed on unix-like operating systems.  The tools are accessed via the command line, so familiarize yourself with the structure of passing arguments and options.

There are several methods for accomplishing alignment, the following are my preferred methods on an Ubutntu distribution.

I used an example single-end .fastq file retrieved from NCBI’s Sequence Read Archive, with 2,294,708 reads spanning 29 genes.


It is extremely important to make sure that the reference genome you choose is compatible with the tools you are going to be using.

For example, GATK recommends GRCh37, which you can get from their ftp site.  The other popular option for the human genome is hg19 which UCSC genome browser uses.  The biggest difference between the two is CRCh37 .fasta files have a “>chr1” in their header, whereas hg19 .fasta’s have “>1” in their header.  (NCBI has recently updated the newest genome assembly GRCh38 as of Dec 2013)

The human genome is large, about 3.1 GB, so it is packaged in separate files, separated by chromosome.  Best practice is to combine them using the concatenate command in Linux shell.  Make sure you concatenate them sorted karyotypically not alphanumerically.  For Example:

kevin@kevin:~$ cd /path/to/ref_fasta/
kevin@kevin:~/path/to/ref_fasta$ cat chr1.fa chr2.fa chr3.fa ... chrX.fa chrY.fa chrMT.fa > GRCh37_sorted.fa

This produces a reference genome in .fasta format.  Go ahead and delete the individual chromosome .fasta files, they are no longer needed.


BWA is an alignment software recommended by GATK.  The sourceforge page including installation and documentation is found here.

Before the reference .fasta can be used, it must be indexed for speed considerations.  The following will index your reference fasta using bwa.  Be sure to use “-a bwtsw” which is an (-a)lgorithm for handling large genomes.

kevin@kevin:~$ bwa index -a bwtsw path/to/ref_fasta/reference.fa

Now the reads contained in .fastq files can be mapped to the reference genome.  There are three different algorithms included, each with their own pros and cons.

  1. bwa aln — samse/sampe
  2. bwasw
  3. bwa mem

Unpaired (or single-end) .fastq with default settings for each is shown below.  The following will generate a .sam file (sequence alignment/map format)
kevin@kevin:~$ bwa aln ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example.fastq > bwa_example.sai
kevin@kevin:~$ bwa samse ~/path/to/ref_fasta/ref.fasta ~path/to/sai/bwa_example.sai ~/path/to/fastq/example.fastq > example_aln_samse.sam

kevin@kevin:~$ bwa mem ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example.fastq > example_mem.sam

kevin@kevin:~$ bwasw ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example.fastq > example_sw.sam

BWA has several settings for advanced users that help align indels, trim reads, and in general, increase efficiency of mapping reads.  I performed alignment on the same .fastq file using several algorithms while recording time and accuracy for comparison. The results are in the graph below.  bwa-mem and bwasw achieved adequate results using default parameters, hence they were only used once.  bwa-aln achieved sub-optimal results with default parameters, so I tried tweaking some settings.  The purpose was to show that bwa-aln requires careful fine tuning and is often involved.

-e [integer] allows for a gap extension of up to [integer] base pairs, which can increase sensitivity of finding deletions. (default -1, which disallows long gaps)

-E [integer] allows the user to define the penalty for extending the gap (default 4)

This graph describes the time required and accuracy of each algorithm in BWA.  The algorithms are listed along the X axis with the settings chosen in parentheses.  Time is on Y1, in seconds.  Accuracy is depicted on Y2 as % Reads that successfully mapped to the reference genome.  Notice that bwa-aln is slower and less accurate than the newer bwa-mem and bwasw.
This graph describes the time required and accuracy of each algorithm in BWA. The algorithms are listed along the X axis with the settings chosen in parentheses. Time is on Y1, in seconds. Accuracy is depicted on Y2 as % Reads that successfully mapped to the reference genome. Notice that bwa-aln is slower and less accurate than the newer bwa-mem and bwasw.


Paired-end reads must be mapped individually, separated by forward and reverse fastq; also a different alignment algorithm is used (aln/sampe).  I did not show results from paired-end reads.
kevin@kevin:~$ bwa aln ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example_1.fastq > bwa_example_1.sai
kevin@kevin:~$ bwa aln ~/path/to/ref_fasta/ref.fasta ~/path/to/fastq/example_2.fastq > bwa_example_2.sai
kevin@kevin:~$ bwa sampe ~/path/to/ref_fasta/ref.fasta ~path/to/sai/bwa_example_1.sai ~path/to/sai/bwa_example_2.sai ~/path/to/fastq/example_1.fastq ~/path/to/fastq/example_2.fastq > example.sam

Samtools & Integrated Genomics Viewer

Samtools will convert .sam files that  were generated in the previous steps to .bam files.  This site gives step by step instructions for converting from sam to bam, sorting and indexing a .bam.

One method of visualizing the reads in a .bam file is the IGV from Broad.  I’m not going to go into details, but the idea is:

  1. Choose the same reference genome that you used to index your reference genome from the previous steps.
  2. Upload your .bam file (index file “.bai” must be in same folder as .bam that you upload)
  3. Type in the gene of interest in the search bar.
  4. Zoom in to see the reads, they aren’t visible at low power.
IGV view of sequence reads in MSH6 gene using bwa-aln-samse with default settings.  Notice there are only 12 deletions.
IGV view of sequence reads in MSH6 gene using bwa-aln-samse with default settings. Notice there are only 12 deletions.
IGV view of sequencing reads in MSH6 gene in the exact location as above.  The -e 5 -E 1 settings which allow for deletions to map to the reference genome easier.  Notice compared to the default settings (12) there are
IGV view of sequencing reads in MSH6 gene in the exact location as above. The -e 5 -E 1 settings allow for deletions to map to the reference genome easier. Notice compared to the default settings (12) there are now 18 reads with deletions.

Every computer is set up a little differently invoking different usage and calls to the software.  Comments and suggestions welcome!

Becoming an Informed Consumer of Personalized Medicine

Patients often put trust in doctors’ to make the right decisions for them.  I recently re-watched Dr. Jon Cohen’s TEDMED talk and heard him accurately describe Americans’ passion and excitement for scalping the best deal on a Samsung or Vizio HDTV.  We know contrast ratios, pros & cons of LED vs LCD, and where to find it at the best price!  But, when it comes to your upcoming medical procedure, do you know your doctor’s complication rate for that procedure?  Are you the guinea pig for their trial run with new surgical equipment?  Dr. Cohen’s thoughts have inspired me to push the envelope further,

I’m going to put even more pressure on American healthcare consumers.  Personalized Medicine (PM) is the buzz word(s) in the healthcare industry today, which should have all of us jumping for joy!  It is bringing new treatment strategies, earlier disease detection and most importantly, risk assessment that we previously lacked.  To Dr. Cohen’s point though, how many consumers really know what personalized medicine is?  It isn’t tangible, you can’t go buy it off the shelf in a pharmacy and it isn’t a “personal” greeter when you walk through the door at the doctors’ office.  In order to understand the concept of PM, one must first understand “Big Data“.


Here’s an opportunity to tackle another intimidating concept that will DOMINATE the future of healthcare.  Pharmaceutical companies (the ones running clinical trials) are getting this right and they will serve as an adequate example of Big Data in action.  Let’s say they have 100 patients from all across America enrolled in a trial.  In order to enroll for a trial, you essentially have to give up an autobiography.  For each of those patients, they record height, weight, hair color, eye color and for some crazy reason…what the patients had for lunch today!  Stage 1 complete, we have all 100 patients with all their clinical data recorded in an excel file or a database.

Let’s also say the Pharma company sequenced two genes for all of the patients (not a small task).  Now we’ve added a lot more excel files to the database.  Here’s where the beauty and simplicity of big data exists.  A scientist performs pattern recognition, also known as analytics, on the data and finds the following:  The 30 people who ate cheeseburgers all have a mutation in Gene A, and the 20 people that ate chicken wings all have a mutation in gene B!  Bazinga!  We’ve found our cause of mutation for two genes (we think).  The bad news is that its more complicated than this simple example.

That example seemed pretty easy, but it wasn’t exactly Big Data.  They only studied 100 patients and only a couple genes.  Real life examples of Big Data in the pharma industry typically include hundreds or thousands of patients all with hundreds of data elements (vitals) and dozens or more genes sequenced.  Completing just one clinical trial is starting to sound like a monumental task, think of all the terabytes of data they are going through, which is why it takes years to complete.  Data elements/vitals that we are concerned with are typically harder to quantify than age, height and weight.  The emerging categories are things like dietary & sleeping habits and exposure to environmental carcinogens like smog or length of time in the sun every day.  The measurements from these categories and gene mutation data are all  wrapped up in many glorified 100,000 row excel tables… for one patient.  Now we’re talking Big Data.  Here’s where the value of bioinformatics shines through.  It’s someone’s job to…. no, no, no, not count the mutations row by row; to write a computer program that will count the mutations, then the program will count similarities across all the patients to try and glean some patterns in the data.  There are thousands of mutations in one person’s genome.  Because there is so much information contained in even one gene, personalized medicine is handcuffed to Big Data.  See my previous post about why it’s important for biologists and doctors to play nice with computer scientists…


We’ve fulfilled part of Dr. Cohen’s call, now we are more “informed” about Big Data, which is the engine that powers PM.  I used gene sequencing as the example above because PM relies heavily on gene sequencing.  Genes provide the instructions for protein production in the human body.  When you have a mutation in your gene it’s be like missing step number 7 when building your IKEA couch.  You just have to guess how to put the couch together and one of three things would happen.  1. You’re smart and the couch turns out fine.  2. The couch will be a little off, slumping in the corner.  3. Silly, you can’t put your couch together without step number 7, it never gets built!  The same thing happens in the human body with mutations.  We acquire mutations occasionally, sometimes they have no effect and sometimes they are deleterious to our proteins.  These mutations can cause cancer, make you have pointy ears, or give you immunity to malaria infection.

Advances in gene sequencing are making our genomes more accessible than ever.  Some of our genes are well annotated, like the BRCA genes for example.  Research has shown that a mutation in the BRCA gene predisposes an individual to breast and ovarian cancer: see Angelina Jolie’s inspiring story in the NY Times.  She went through with that procedure because they found a mutation in her BRCA gene, not because doctor’s found a tumor, not because she was feeling ill.  Whether you agree with her decision or not, you can’t deny the  importance of knowing what information is contained in your genes.

The personalized medicine utopia would be to have every gene sequenced on every human being.  We’re not even close, but we are making strides.  Let’s go back to our simple example about cheeseburgers and chicken wings.  What I didn’t tell you is that those 100 people enrolled in the clinical trial all complained about a rash on their arm.  They ALL were prescribed the SAME drug to clear it up.  Three months later we find out that the drug worked on everybody except the people that ate chicken wings, a mutation in gene B causes drug resistance.  If you extended this approach to real medicine, not my silly examples, some important discoveries can be made by using association mapping from individuals with similar gene signatures.  Because we’re taking the time and effort to record sometimes seemingly unnecessary data, we have the ability to make these associations.


We are poor consumers of “regular” medicine as Dr. Cohen reminded us, however we still have a chance to redeem ourselves.  Studies are published every day about characterizing sub-populations by these things called “snp”s (single nucleotide polymorphisms).  Humans naturally have variation in our genome, whether its your twin sister or some guy from an eastern European country on the metro, your genomes will still be different.  SNPs aren’t exactly mutations, but they do represent a difference in genes from one individual to another.  Because of Big Data and personalized medicine we are beginning to characterize entire populations of snps, which is very exciting.

People are out there getting parts of their genomes sequenced for $100 from companies like 23&me.  Customers get results that tell them what SNPs they have, which is how 23&me figures out your ancestry and crime scene investigators confirm identities.
SNPs represent natural variation in the human genome between unique individuals. SNPs are not mutations.

Certain ethnic populations carry similar SNPs .  Research is constantly being done on snps, papers are published daily that inform people what it means to carry a certain snp.  23&me customers can go see if they have that snp in that particular gene and determine whether or not the research means anything to them.  An example from a couple months ago identified a snp that causes people to have a longer ring finger than middle finger.  Here’s the link to their page outlining newly discovered snps every month.   At this point the results are trivial in the medical realm,  they revolve around ancestry, diet and lifestyle.  However, SNPs are equally as important to understand as genetic mutations as they can have similar consequences.  What if, one day, a study finds that all long ring finger SNP people are good candidates for a drug that cures diabetes.  Hopefully you have that SNP, but if you don’t get your genome sequenced, you may never know.

I’m not saying that everyone should go get there genome sequenced (I haven’t done it yet)!  However, thanks to advances in handling Big Data, research on genes, whole genomes, snps, and mutations are moving at light speed.  So many medical decisions are being made depending on the information in your genes.  It only takes a day to sequence one gene, and another day for your doctor to interpret the results and prescribe treatment.  The FDA is with academia and medical labs step for step on this movement.  Insurance companies are hiring experts to deal with the risk implications of genetic predispositions.  Personalized Medicine is here, it’s a hopeful, inspiring and exciting time to be a consumer of it!

Thanks for reading,


Disclaimer:  To keep with the theme of staying informed, I encourage you to simply google Big Data and Personalized Medicine to get more information.  This is one person’s conceptual approach, I am not the authority on either of these topics.

Image credit:
picture above
Steve Baker, @bakerture,

The New Kid! Bioinformatics as an Emerging Science

image credit:

It’s tough being the new kid.  They often have trouble finding where they fit within the pre-existing social community.  Over time though, their character shines through, reputation grows and they settle into their niche.  The young field of bioinformatics knows the feeling, straddling the line between biology and computer science.  Although most computational biologists (bioinformaticians) have a background in one of the two sciences, they still have trouble “fitting in” with either group.

Venn Diagram shows the overlap of two well established sciences to form a new field of study.
Venn Diagram shows the overlap of two well established sciences to form a new field of study.

Bioinformatics, also known as computational biology, is the study of biology by exploiting the immense computational power that is available today.  It was a science born out of NEED!  There are approximately 37 trillion cells in the human body.  Do we think a biologist sat around and counted those?  Of course not, that would literally take forever.  There are complex mathematical formulas based on existing data that help us arrive at that number.  Bioinformatics’ crowning achievement thus far occurred about a decade ago when the Human Genome Project was completed (April 2003).  The goal, when funded in 1990, was to sequence the entire human genome.  We learned that there are approximately 30,000 genes and 3.16 billion base pairs in one human genome.   The completion of the HGP ignited a curiosity in scientists all over the world and the birth of “Big Data” just occurred in biology.  A graph from the National Center for Biotechnology Information (NCBI) shows the growth of number of genes sequenced over time and the number of whole genomes sequenced (WGS) over time.  Hidden within those sequences are personalized cancer therapies, disease resistance/predisposition and the key to aging.  We just need computational biologists to unlock those secrets.

Why is Bioinformatics under-appreciated in the scientific community?  The answer is two-fold:  fear and the ability to communicate.  Biologists are scared to trust assumptions, computer models and mathematical formulas that computer scientists create.  Also, computer science programmers aren’t trained in biological methods so they do not understand the biology-speak.  This problem left a NEED for people who are trained as both biologists and computer scientists.

Two experiences from my bioinformatics graduate education highlight the problems perfectly.  I studied biology as an undergraduate, so that is where I feel the most comfortable. However, in bioinformatics, you have to be an expert in both biology AND computer science.  Last year, taking an introductory computer science course (deer in the headlights, complete amateur), the topic of  “global variables” came up and I didn’t understand it at first.  I remember asking a graduate student who taught lab section of the course to help explain and he laughed at me, saying that “if I didn’t understand this topic, I was in the wrong program”.  At first, I was discouraged, and I instantly understood why bioinformatics is so hard to “stick”.  As a biologist, you really have to grind through those introductory weeks, months, years… of learning to program.  The second experience made me feel ashamed of biologists.  In the introductory biology course, there were several computer scientists enrolled.  Every week it was a new question about why adenine base paired with thymine or what the difference between the five prime and three prime end of  a DNA strand was.  Stuff I could recite in my sleep.  I felt myself judging them, but quickly remembered the feeling I had when I was in “their” realm, across campus, in the computer lab.  So, I believe patience will be important for bioinformatics to continue to succeed and thrive as a science.

Bioinformatics is here to stay.  I propose a happy marriage moving forward, as science and technology grow together.  Computer scientists should embrace biologists and be willing to teach them the tricks of the trade, not frown at them when they don’t know how to write a block of code “recursively”.  On the other side of that same coin, biologists should show respect and guide computer scientists down the path of good experimental design, rather than pointing out “obvious” flaws in mathematical formulas.  Studying bioinformatics and computational biology has been one of the most rewarding, eye-opening endeavors in my life.  I look forward to the challenges ahead as well as the chance to educate the next generation of computational biologists.

If you’ve made it this far, you’ve read my FIRST BLOG EVER!  Thank you so much for reading, please leave comments, I’m interested to hear what you have to say.

Kevin Arvai

Use the comments link above the article to comment.