My Journey Beyond the Novelty of Personal Genomics

Last holiday season, my brother gave me a 23andMe kit as a Christmas present. Personal genomics kits are excellent gift ideas… if you know what to expect.  There are several companies that offer commercial “direct-to-consumer” whole genome sequencing at a fair price.  Since I knew 23andMe provided its customers with raw data, I was excited to get started immediately.  I spent some time on the 23andMe website, reading instructions for submitting my spit sample, finding out how long it would take before I had my results, and most importantly looking at what type of results I should expect.  My personalized 23andMe dashboard prompted me to answer a seemingly endless supply of “research” questions ranging from, “do you snore?” to ” do you have family history of asthma?”.  Weeks had gone by since submitting the sample, with the waiting my excitement gradually subsided.  But when I got an email alert that the results were available, I couldn’t wait to get home, login to my 23andMe dashboard and solve all the genetic mysteries that were now unlocked.

23andMe pros & cons

Pro: Ancestry Data

Unfortunately, I was a little underwhelmed with my results.  23andMe provides ancestry data and how closely related you are to Neanderthals.  But I already knew that my mom was 100% Irish and my dad was Polish and I don’t really care how closely related I am to Neanderthals!  However, for people with a diverse and/or unknown ancestry, a tool like this would provide the opportunity for hours of self-guided internet research.

Countries of Ancestry tool developed by: Mike Macpherson, Brian Naughton, Marcela Miyazawa
Countries of Ancestry tool developed by: Mike Macpherson, Brian Naughton, Marcela Miyazawa

Pro: DNA Relatives

In addition to the Ancestry data, one of the primary features that 23andMe offers is a social “network” of relatives who are also registered and genotyped on 23andMe.  Users have the option of sharing their genome with other users upon request as well as searching for DNA relatives.  As it turns out, it was exciting to see that one of my first cousins was already genotyped and we share 11.5% of our DNA.  As any user will quickly see, there are hundreds of “3rd – 5th cousins” who have been genotyped, you might recognize a user’s last name, or that small town in Maine where you know you have family.  This feature has helped individuals find family members, which can be controversial; I’m sure there are hundreds of untold stories that don’t end quite as happily, but it’s is a risk you take when getting your genome sequenced.

Visual representation of the shared regions of chromosomes [1-5] that my 1st cousin and I share.
Visual representation of the shared regions of chromosomes [1-5] that my 1st cousin and I share. Tool by: Lawrence Hon

Con: No phenotype info

The biggest reason I was underwhelmed was because there was so much data that 23andMe was housing, they knew everything about me from answering those survey questions and on top of that, they had my genome!  With hundreds of thousands  customers and presumably most of them answering the survey questions, surely 23andMe could figure out how to tell us which groups of people had blue eyes, or who was at risk for disease, right?  It’s the worlds biggest science experiment and its all data driven!!!  Scattered throughout the 23andMe website, you see leftover indicators that they were planning to offer this information to their customers at some point.  But it’s not their fault, they lost a fight against the FDA in providing “health-related” genetic tests, in summary, direct-to-consumer genetics is difficult to navigate because when customers learn that they are in the marginally at risk group for a certain disease, it causes an uninformed overreaction.  In the two and half years since 23andMe has made progress by offering a direct to consumer genetic test for Bloom Syndrome.  Expecting parents can get this test now to help determine if their unborn child is at risk for the disease.  The best way to think of this is like the test is now offered “over-the-counter”, you don’t have to visit a clinician to order the test.  This is a very clean and clear scenario though, often times genetic testing requires a heavy dose of human interpretation of highly complex results, so it’s understandable that the FDA didn’t want 23andMe unleashing this data to its customers, the world just isn’t ready.

Pro: API and Raw Data

On the heels of my disappointment with lack of interpretation of the data, I was more impressed that 23andMe offered an application programming interface (API for short) which allows 3rd party developers to build applications that sync with 23andMe’s customer’s data.  There are massive independent efforts out there that will provide some form of interpretation of your 23andMe results.  Some examples are SNPedia, and opensnp as well as a list of tools at 23andyou.  Customers that visit these sites get redirected to the official 23andMe sign-in page.  There is a nice disclaimer that releases them from liability and informs the user that their data will be viewed.  You have to pay for most of these services and I’m a bioinformatician, so I should be able to work on my own data… but now I have the tools to do so.

Beyond the novelty

SNPs, snps, snps

23andMe offers user’s raw data in the format of a list of ~600,000 SNPs (single nucleotide polymorphisms).  Each SNP has a unique identifier and the result is a genotype, which is simply two letters (one from mom, one from dad) it looks like AA, for example.  Here’s the first few lines from my results, then imagine ~600k of these SNPs.

First few snps from my 23andMe raw result file.
First few snps from my 23andMe raw result file.

SNPs are what make each of us different, they help explain the variance between different populations.  In the 2000’s scientists learned that SNPs were very powerful in characterizing large groups of people.  By performing whole genome arrays or sequencing on thousands of people, they could capture some significant correlations in the data.  These types of studies, called genome-wide-association-studies or GWAS, are popular today because the accuracy of the test has increased and the cost has significantly decreased.  If you think about it, 23andMe is currently doing the largest GWAS to date, and they’re publishing studies with impact on inherited diseases.

A single GWAS can take months to years to perform.  It typically contains two groups of people, one group that presents with a specific trait of interest, and one control group that doesn’t have the trait of interest.  The researchers could be looking at any trait ranging from serious diseases such as diabetes to more routine traits like hair color.  The idea is to genotype the SNPs for the entire study population and determine if there are SNPs that are exclusive to one population over the other.  Correlation doesn’t always imply causation, but with enough statistical power, there may be enough evidence to investigate certain SNPs in more detail.
Figure showing the mechanics of typical GWAS studies.

Someone at the National Human Genome Research Institute (NHGRI) thought it would be a good idea to have all these splintered GWAS in one central data warehouse and publically accessible.  What a great idea, it spawned a new database called the GWAS catalog curated by NHGRI, EMBL & EBI.  Weekly, it extracts data from published GWAS studies, for a grand total of more than 100,000 high quality, highly significant SNPs.  The best part is the data is free! It’s also easily accessible in the exact same format as the 23andMe data!  The GWAS catalog can be similarly accessed using their API.  This allows programmers the flexibility to build complex search queries and get real-time results.

My application

The GWAS catalog was the perfect opportunity to take advantage of my 23andMe results.  The GWAS catalog has a function to search for SNP associations by trait or disease, shown below.  For my first search I wanted to keep it light-hearted, but not trivial either, so I looked up baldness to see what SNPs, if any, contributed to male-pattern baldness.  Sure enough, there were 13 SNPs from 4 separate studies.  Two of the top hits are in the Androgen Receptor gene (AR), which is well known to be implicated in hair loss.

SNP association results from my
SNP association results from my “baldness” query.

We haven’t arrived at my application yet, anyone can go to the GWAS catalog and type in a disease or trait…  But I wanted to take a minute to comment on the search results from the figure above.  Notice you can filter several factors, including p-value and odds ratio.  GWAS are inherently prone to false positive association results just because of sheer numbers.  Statistics tell us that low p-values, lets say below 0.05 significance, means that  a SNP is associated with a trait in in 5% of cases just due to random sampling error.  So the lower the p-value, the more likely it is not associated by random chance.  The odds ratio quantifies how strongly a characteristic (trait) is seen in one population over another.  From two figures up, the “cases” group is 1.7 times more likely to have a “C” SNP than the “control” group.  In summary odds ratios close to 1.0 are usually inconclusive, as it indicates a 50/50 split.  The higher the odds ratio, the more likely a SNP is to be exclusive in one group over the other.  Now we have a list of SNPs associated with baldness, and some numbers to quantify how strong those associations are, cool.

It’d be nice if I could take one of the 13 SNPs and see if I’m more likely to go bald or not.  Currently, if I want to look up my SNPs, I’d have to type in each SNP one by one as shown below, and 23andMe would return my genotype for that SNP (AA or CT etc).  This method, as you can imagine, is pretty laborious, even for 13 SNPs.  If I search for celiac disease in the GWAS catalog, there are 90 SNPs, and diabetes has > 900, I’m not typing all those in.

23andMe raw data homepage. You can search several SNPs at once by using search by gene, or you can search for a single SNP using the rsID.
23andMe raw data homepage. You can search several SNPs at once by using search by gene, or you can search for a single SNP using the rsID.

This is the point when I decided to design an application that would allow me to investigate my own traits by tapping into the power of the GWAS catalog and marrying that with my 23andMe raw data.  I wanted this all to be done in an automated fashion without typing in each SNP one-by-one.  My application, it’s really a script, can be found on my GitHub page; if you know what your doing feel to use it.

Workflow of the application
Workflow of the application

The application starts by accepting a search term, which, in my case, was baldness, but it can be any trait or disease that the GWAS catalog has curated.  The application uses the GWAS API to return a list of SNPs associated with that search term while recording the risk allele, p-value, odds ratio, and number of study participants.  The risk allele determines which allele (A,T,C or G) is more prevalent in the risk group.

Depending on your search term there is a fairly long list of SNPs.  I filter out SNPs using a method described here in the “associations” section.  Basically if a study had less than 1,000 participants the statistics aren’t as convincing, also I filter so that only SNPs with a p-value < 5 * 10e-8 are included.  This filter is programmatically enforced during the first step.

Now that I have a list of high quality SNPs from the GWAS, my application automatically pulls your 23andMe data only for these SNPs.  You are prompted by 23andMe with a popup which will allow my application to pull your data.  So the application pulls the list of GWAS SNPs from your 23andMe data in the form of rs123456: AA, rs987654: AT, etc.  This is your list of SNPs associated with your search term.  Finally, the application compares your genotype to the risk allele.  It tells you whether you are normal, heterozygous, or homozygous for each SNP.

My results from the
My results from the “baldness” query

My application makes it easy to quickly investigate many of your own SNPs all at once, and gives you relevant information from a trusted source, the GWAS catalog.  What my application does not do, is try to interpret the results.  GWAS studies are controversial to begin with.  Also, many diseases, including cancer, are too complex to put a finger on one SNP change that will have an impact on predicting whether or not someone will be at risk for a disease.

I did this application as a side project, I wanted to see what I could do with my own 23andMe genotype data and more importantly used the opportunity to expand my skill set.  My next steps in the project will be to get the application on a webpage so anyone can visit and use it freely as well as try to get the results in a more meaningful format.

As always, thanks for reading, I appreciate any comments!


Becoming an Informed Consumer of Personalized Medicine

Patients often put trust in doctors’ to make the right decisions for them.  I recently re-watched Dr. Jon Cohen’s TEDMED talk and heard him accurately describe Americans’ passion and excitement for scalping the best deal on a Samsung or Vizio HDTV.  We know contrast ratios, pros & cons of LED vs LCD, and where to find it at the best price!  But, when it comes to your upcoming medical procedure, do you know your doctor’s complication rate for that procedure?  Are you the guinea pig for their trial run with new surgical equipment?  Dr. Cohen’s thoughts have inspired me to push the envelope further,

I’m going to put even more pressure on American healthcare consumers.  Personalized Medicine (PM) is the buzz word(s) in the healthcare industry today, which should have all of us jumping for joy!  It is bringing new treatment strategies, earlier disease detection and most importantly, risk assessment that we previously lacked.  To Dr. Cohen’s point though, how many consumers really know what personalized medicine is?  It isn’t tangible, you can’t go buy it off the shelf in a pharmacy and it isn’t a “personal” greeter when you walk through the door at the doctors’ office.  In order to understand the concept of PM, one must first understand “Big Data“.


Here’s an opportunity to tackle another intimidating concept that will DOMINATE the future of healthcare.  Pharmaceutical companies (the ones running clinical trials) are getting this right and they will serve as an adequate example of Big Data in action.  Let’s say they have 100 patients from all across America enrolled in a trial.  In order to enroll for a trial, you essentially have to give up an autobiography.  For each of those patients, they record height, weight, hair color, eye color and for some crazy reason…what the patients had for lunch today!  Stage 1 complete, we have all 100 patients with all their clinical data recorded in an excel file or a database.

Let’s also say the Pharma company sequenced two genes for all of the patients (not a small task).  Now we’ve added a lot more excel files to the database.  Here’s where the beauty and simplicity of big data exists.  A scientist performs pattern recognition, also known as analytics, on the data and finds the following:  The 30 people who ate cheeseburgers all have a mutation in Gene A, and the 20 people that ate chicken wings all have a mutation in gene B!  Bazinga!  We’ve found our cause of mutation for two genes (we think).  The bad news is that its more complicated than this simple example.

That example seemed pretty easy, but it wasn’t exactly Big Data.  They only studied 100 patients and only a couple genes.  Real life examples of Big Data in the pharma industry typically include hundreds or thousands of patients all with hundreds of data elements (vitals) and dozens or more genes sequenced.  Completing just one clinical trial is starting to sound like a monumental task, think of all the terabytes of data they are going through, which is why it takes years to complete.  Data elements/vitals that we are concerned with are typically harder to quantify than age, height and weight.  The emerging categories are things like dietary & sleeping habits and exposure to environmental carcinogens like smog or length of time in the sun every day.  The measurements from these categories and gene mutation data are all  wrapped up in many glorified 100,000 row excel tables… for one patient.  Now we’re talking Big Data.  Here’s where the value of bioinformatics shines through.  It’s someone’s job to…. no, no, no, not count the mutations row by row; to write a computer program that will count the mutations, then the program will count similarities across all the patients to try and glean some patterns in the data.  There are thousands of mutations in one person’s genome.  Because there is so much information contained in even one gene, personalized medicine is handcuffed to Big Data.  See my previous post about why it’s important for biologists and doctors to play nice with computer scientists…


We’ve fulfilled part of Dr. Cohen’s call, now we are more “informed” about Big Data, which is the engine that powers PM.  I used gene sequencing as the example above because PM relies heavily on gene sequencing.  Genes provide the instructions for protein production in the human body.  When you have a mutation in your gene it’s be like missing step number 7 when building your IKEA couch.  You just have to guess how to put the couch together and one of three things would happen.  1. You’re smart and the couch turns out fine.  2. The couch will be a little off, slumping in the corner.  3. Silly, you can’t put your couch together without step number 7, it never gets built!  The same thing happens in the human body with mutations.  We acquire mutations occasionally, sometimes they have no effect and sometimes they are deleterious to our proteins.  These mutations can cause cancer, make you have pointy ears, or give you immunity to malaria infection.

Advances in gene sequencing are making our genomes more accessible than ever.  Some of our genes are well annotated, like the BRCA genes for example.  Research has shown that a mutation in the BRCA gene predisposes an individual to breast and ovarian cancer: see Angelina Jolie’s inspiring story in the NY Times.  She went through with that procedure because they found a mutation in her BRCA gene, not because doctor’s found a tumor, not because she was feeling ill.  Whether you agree with her decision or not, you can’t deny the  importance of knowing what information is contained in your genes.

The personalized medicine utopia would be to have every gene sequenced on every human being.  We’re not even close, but we are making strides.  Let’s go back to our simple example about cheeseburgers and chicken wings.  What I didn’t tell you is that those 100 people enrolled in the clinical trial all complained about a rash on their arm.  They ALL were prescribed the SAME drug to clear it up.  Three months later we find out that the drug worked on everybody except the people that ate chicken wings, a mutation in gene B causes drug resistance.  If you extended this approach to real medicine, not my silly examples, some important discoveries can be made by using association mapping from individuals with similar gene signatures.  Because we’re taking the time and effort to record sometimes seemingly unnecessary data, we have the ability to make these associations.


We are poor consumers of “regular” medicine as Dr. Cohen reminded us, however we still have a chance to redeem ourselves.  Studies are published every day about characterizing sub-populations by these things called “snp”s (single nucleotide polymorphisms).  Humans naturally have variation in our genome, whether its your twin sister or some guy from an eastern European country on the metro, your genomes will still be different.  SNPs aren’t exactly mutations, but they do represent a difference in genes from one individual to another.  Because of Big Data and personalized medicine we are beginning to characterize entire populations of snps, which is very exciting.

People are out there getting parts of their genomes sequenced for $100 from companies like 23&me.  Customers get results that tell them what SNPs they have, which is how 23&me figures out your ancestry and crime scene investigators confirm identities.
SNPs represent natural variation in the human genome between unique individuals. SNPs are not mutations.

Certain ethnic populations carry similar SNPs .  Research is constantly being done on snps, papers are published daily that inform people what it means to carry a certain snp.  23&me customers can go see if they have that snp in that particular gene and determine whether or not the research means anything to them.  An example from a couple months ago identified a snp that causes people to have a longer ring finger than middle finger.  Here’s the link to their page outlining newly discovered snps every month.   At this point the results are trivial in the medical realm,  they revolve around ancestry, diet and lifestyle.  However, SNPs are equally as important to understand as genetic mutations as they can have similar consequences.  What if, one day, a study finds that all long ring finger SNP people are good candidates for a drug that cures diabetes.  Hopefully you have that SNP, but if you don’t get your genome sequenced, you may never know.

I’m not saying that everyone should go get there genome sequenced (I haven’t done it yet)!  However, thanks to advances in handling Big Data, research on genes, whole genomes, snps, and mutations are moving at light speed.  So many medical decisions are being made depending on the information in your genes.  It only takes a day to sequence one gene, and another day for your doctor to interpret the results and prescribe treatment.  The FDA is with academia and medical labs step for step on this movement.  Insurance companies are hiring experts to deal with the risk implications of genetic predispositions.  Personalized Medicine is here, it’s a hopeful, inspiring and exciting time to be a consumer of it!

Thanks for reading,


Disclaimer:  To keep with the theme of staying informed, I encourage you to simply google Big Data and Personalized Medicine to get more information.  This is one person’s conceptual approach, I am not the authority on either of these topics.

Image credit:
picture above
Steve Baker, @bakerture,