Last holiday season, my brother gave me a 23andMe kit as a Christmas present. Personal genomics kits are excellent gift ideas… if you know what to expect. There are several companies that offer commercial “direct-to-consumer” genetic testing at a fair price. Since I knew 23andMe provided its customers with raw data, I was excited to get started immediately. I spent some time on the 23andMe website, reading instructions for submitting my spit sample, finding out how long it would take before I had my results, and most importantly looking at what type of results I should expect. My personalized 23andMe dashboard prompted me to answer a seemingly endless supply of “research” questions ranging from “do you snore?” to “do you have a family history of asthma?”. Weeks went by after I submitted the sample, and with the waiting my excitement gradually subsided. But when I got an email alert that the results were available, I couldn’t wait to get home, log in to my 23andMe dashboard and solve all the genetic mysteries that were now unlocked.
23andMe pros & cons
Pro: Ancestry Data
Unfortunately, I was a little underwhelmed with my results. 23andMe provides ancestry data and how closely related you are to Neanderthals. But I already knew that my mom was 100% Irish and my dad was Polish and I don’t really care how closely related I am to Neanderthals! However, for people with a diverse and/or unknown ancestry, a tool like this would provide the opportunity for hours of self-guided internet research.
Pro: DNA Relatives
In addition to the Ancestry data, one of the primary features that 23andMe offers is a social “network” of relatives who are also registered and genotyped on 23andMe. Users have the option of sharing their genome with other users upon request as well as searching for DNA relatives. As it turns out, it was exciting to see that one of my first cousins was already genotyped and that we share 11.5% of our DNA. As any user will quickly see, there are hundreds of “3rd – 5th cousins” who have been genotyped. You might recognize a user’s last name, or that small town in Maine where you know you have family. This feature has helped individuals find family members, which can be controversial; I’m sure there are hundreds of untold stories that don’t end quite as happily, but it’s a risk you take when getting genotyped.
Con: No phenotype info
The biggest reason I was underwhelmed was that 23andMe was housing so much data. They knew everything about me from those survey questions, and on top of that, they had my genome! With hundreds of thousands of customers, and presumably most of them answering the survey questions, surely 23andMe could figure out how to tell us which groups of people had blue eyes, or who was at risk for disease, right? It’s the world’s biggest science experiment and it’s all data driven!!! Scattered throughout the 23andMe website, you see leftover indicators that they were planning to offer this information to their customers at some point. But it’s not their fault; they lost a fight against the FDA over providing “health-related” genetic tests. In summary, direct-to-consumer genetics is difficult to navigate because when customers learn that they are in the marginally at-risk group for a certain disease, it can cause an uninformed overreaction. In the two and a half years since, 23andMe has made progress by offering a direct-to-consumer genetic test for Bloom syndrome. Expecting parents can get this test now to help determine if their unborn child is at risk for the disease. The best way to think of this is that the test is now offered “over-the-counter”: you don’t have to visit a clinician to order it. This is a very clean and clear scenario though. Oftentimes genetic testing requires a heavy dose of human interpretation of highly complex results, so it’s understandable that the FDA didn’t want 23andMe unleashing this data on its customers; the world just isn’t ready.
Pro: API and Raw Data
On the heels of my disappointment with the lack of interpretation of the data, I was more impressed that 23andMe offered an application programming interface (API for short), which allows 3rd party developers to build applications that sync with 23andMe customers’ data. There are massive independent efforts out there that will provide some form of interpretation of your 23andMe results. Some examples are SNPedia and opensnp, as well as a list of tools at 23andyou. Customers that visit these sites get redirected to the official 23andMe sign-in page. There is a nice disclaimer that releases 23andMe from liability and informs the user that their data will be viewed by the third party. You have to pay for most of these services, and I’m a bioinformatician, so I should be able to work on my own data… and now I have the tools to do so.
Beyond the novelty
SNPs, snps, snps
23andMe offers users’ raw data in the format of a list of ~600,000 SNPs (single nucleotide polymorphisms). Each SNP has a unique identifier, and the result is a genotype, which is simply two letters (one from mom, one from dad); it looks like AA, for example. Here are the first few lines from my results; now imagine ~600k of these SNPs.
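For readers who want to work with the download directly, here is a minimal sketch of loading it into a dictionary. It assumes the export is a tab-separated text file with `#` comment lines followed by rsid, chromosome, position and genotype columns, and that uncalled SNPs appear as `--`. The rsids and genotypes in `SAMPLE` are made-up placeholders, not my results, and `load_genotypes` is a name chosen for this illustration.

```python
import csv
import io

# Made-up sample in the assumed layout of a raw-data export:
# '#' comment lines, then rsid, chromosome, position, genotype (tab-separated).
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs0000001\t1\t1000\tAA\n"
    "rs0000002\t1\t2000\tCT\n"
    "rs0000003\tX\t3000\t--\n"
)


def load_genotypes(handle):
    """Return {rsid: genotype}, skipping comment lines and no-calls ('--')."""
    genotypes = {}
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.reader(data_lines, delimiter="\t"):
        if len(row) != 4:
            continue  # ignore blank or malformed lines
        rsid, _chromosome, _position, genotype = row
        if "-" in genotype:
            continue  # the chip could not call this SNP
        genotypes[rsid] = genotype
    return genotypes


genotypes = load_genotypes(io.StringIO(SAMPLE))
print(genotypes)
```

With a real download you would pass an open file handle instead of the `io.StringIO` wrapper.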
SNPs are part of what makes each of us different, and they help explain the variance between different populations. In the 2000s scientists learned that SNPs were very powerful in characterizing large groups of people. By performing whole genome arrays or sequencing on thousands of people, they could capture some significant correlations in the data. These types of studies, called genome-wide association studies or GWAS, are popular today because the accuracy of the test has increased and the cost has significantly decreased. If you think about it, 23andMe is currently doing the largest GWAS to date, and they’re publishing studies with impact on inherited diseases.
A single GWAS can take months to years to perform. It typically contains two groups of people: one group that presents with a specific trait of interest, and one control group that doesn’t have the trait of interest. The researchers could be looking at any trait, ranging from serious diseases such as diabetes to more routine traits like hair color. The idea is to genotype the SNPs for the entire study population and determine if there are SNPs that are noticeably more common in one group than in the other. Correlation doesn’t always imply causation, but with enough statistical power, there may be enough evidence to investigate certain SNPs in more detail.
Someone at the National Human Genome Research Institute (NHGRI) thought it would be a good idea to have all these splintered GWAS in one central data warehouse and publicly accessible. What a great idea! It spawned a new database called the GWAS catalog, curated by NHGRI and EMBL-EBI. Weekly, it extracts data from published GWAS, for a grand total of more than 100,000 high quality, highly significant SNPs. The best part is the data is free! It’s also easily accessible in the exact same format as the 23andMe data! The GWAS catalog can similarly be accessed using its API. This gives programmers the flexibility to build complex search queries and get real-time results.
The GWAS catalog was the perfect opportunity to take advantage of my 23andMe results. The GWAS catalog has a function to search for SNP associations by trait or disease, shown below. For my first search I wanted to keep it light-hearted, but not trivial either, so I looked up baldness to see what SNPs, if any, contributed to male-pattern baldness. Sure enough, there were 13 SNPs from 4 separate studies. Two of the top hits are in the Androgen Receptor gene (AR), which is well known to be implicated in hair loss.
We haven’t arrived at my application yet; anyone can go to the GWAS catalog and type in a disease or trait. But I wanted to take a minute to comment on the search results from the figure above. Notice you can filter on several factors, including p-value and odds ratio. GWAS are inherently prone to false positive associations just because of the sheer number of SNPs tested. The p-value is the probability of seeing an association at least this strong purely through random sampling error when the SNP has no real effect. At a significance threshold of, let’s say, 0.05, about 5% of SNPs with no real effect would still pass by chance, which adds up quickly when hundreds of thousands of SNPs are tested. So the lower the p-value, the harder the association is to explain by random chance. The odds ratio quantifies how strongly a characteristic (trait) is seen in one population over another. From two figures up, the odds of carrying a “C” allele are 1.7 times higher in the “cases” group than in the “control” group. In summary, odds ratios close to 1.0 are usually inconclusive, since a value of 1.0 means the allele is equally common in both groups. The higher the odds ratio, the more strongly a SNP is enriched in one group over the other. Now we have a list of SNPs associated with baldness, and some numbers to quantify how strong those associations are, cool.
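To make those two numbers concrete, here is a small sketch of the arithmetic. The allele counts are hypothetical, picked only so the odds ratio lands near the 1.7 mentioned above, and the figure of one million tests is an assumed round number. It is the usual back-of-the-envelope explanation for the 5 × 10^-8 genome-wide cutoff, not something taken from the GWAS catalog itself.

```python
def odds_ratio(case_risk, case_other, control_risk, control_other):
    """Odds of carrying the risk allele in cases divided by the odds in controls."""
    return (case_risk / case_other) / (control_risk / control_other)


# Hypothetical allele counts: 460 "C" vs 540 other alleles in cases,
# 334 "C" vs 666 other alleles in controls.
example_or = odds_ratio(460, 540, 334, 666)
print(f"odds ratio: {example_or:.2f}")

# Why a 0.05 cutoff is too loose for a GWAS: with an assumed one million
# tested SNPs and no real effects at all, this many would pass by chance.
n_tests = 1_000_000
expected_false_positives = 0.05 * n_tests
print(f"expected false positives at p < 0.05: {expected_false_positives:.0f}")

# A Bonferroni-style correction divides the cutoff by the number of tests.
genome_wide_cutoff = 0.05 / n_tests
print(f"corrected cutoff: {genome_wide_cutoff:.0e}")
```

An odds ratio of exactly 1.0 falls out of the same function whenever the allele is equally common in both groups, for example `odds_ratio(50, 50, 50, 50)`.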
It’d be nice if I could take one of the 13 SNPs and see if I’m more likely to go bald or not. Currently, if I want to look up my SNPs, I’d have to type in each SNP one by one as shown below, and 23andMe would return my genotype for that SNP (AA or CT etc). This method, as you can imagine, is pretty laborious, even for 13 SNPs. If I search for celiac disease in the GWAS catalog, there are 90 SNPs, and diabetes has more than 900. I’m not typing all those in.
This is the point when I decided to design an application that would allow me to investigate my own traits by tapping into the power of the GWAS catalog and marrying that with my 23andMe raw data. I wanted this all to be done in an automated fashion without typing in each SNP one-by-one. My application (it’s really a script) can be found on my GitHub page; if you know what you’re doing, feel free to use it.
The application starts by accepting a search term, which, in my case, was baldness, but it can be any trait or disease that the GWAS catalog has curated. The application uses the GWAS API to return a list of SNPs associated with that search term while recording the risk allele, p-value, odds ratio, and number of study participants. The risk allele is the allele (A, T, C or G) that is more prevalent in the risk group.
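As a rough illustration of what gets recorded for each hit, here is one way to hold those fields in Python. This is a sketch rather than the code in my script: the `Association` class and `parse_risk_allele` helper are names invented for this example, the values are placeholders, and it assumes the risk allele arrives as a label of the form `rsid-allele`.

```python
from dataclasses import dataclass


@dataclass
class Association:
    """One SNP-trait association, holding the fields described above."""
    rsid: str
    risk_allele: str
    p_value: float
    odds_ratio: float
    participants: int


def parse_risk_allele(label):
    """Split a label such as 'rs0000001-C' into ('rs0000001', 'C')."""
    rsid, _, allele = label.partition("-")
    return rsid, allele.upper()


# Placeholder values, not a real catalog entry.
rsid, allele = parse_risk_allele("rs0000001-C")
hit = Association(rsid, allele, p_value=2e-12, odds_ratio=1.7, participants=5000)
print(hit)
```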
Depending on your search term, the list of SNPs can be fairly long. I filter out SNPs using a method described here in the “associations” section. Basically, if a study had fewer than 1,000 participants, the statistics aren’t as convincing. I also filter so that only SNPs with a p-value < 5 × 10^-8 (5e-8) are included. This filter is programmatically enforced during the first step.
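The two rules above boil down to a couple of comparisons. Here is a minimal sketch using plain dictionaries with made-up values; the constant and function names are chosen for this example and are not necessarily what the script uses.

```python
MIN_PARTICIPANTS = 1000  # studies smaller than this are dropped
MAX_P_VALUE = 5e-8       # genome-wide significance cutoff


def passes_filter(association):
    """Keep only associations from large studies with very small p-values."""
    return (
        association["participants"] >= MIN_PARTICIPANTS
        and association["p_value"] < MAX_P_VALUE
    )


# Made-up records: only the first one satisfies both rules.
associations = [
    {"rsid": "rs0000001", "p_value": 2e-12, "participants": 5000},
    {"rsid": "rs0000002", "p_value": 3e-6, "participants": 8000},  # p-value too high
    {"rsid": "rs0000003", "p_value": 1e-9, "participants": 400},   # study too small
]

kept = [a["rsid"] for a in associations if passes_filter(a)]
print(kept)  # ['rs0000001']
```

Note that in Python `5e-8` is the right literal for 5 × 10^-8; writing `5 * 10e-8` would give 5 × 10^-7, a cutoff ten times looser.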
Now that I have a list of high quality SNPs from the GWAS catalog, my application automatically pulls your 23andMe data only for these SNPs. You are prompted by 23andMe with a popup asking you to allow my application to pull your data. The application then pulls the GWAS SNPs from your 23andMe data in the form of rs123456: AA, rs987654: AT, etc. This is your list of genotypes for the SNPs associated with your search term. Finally, the application compares your genotype to the risk allele. It tells you whether you are normal (no copies of the risk allele), heterozygous (one copy), or homozygous (two copies) for each SNP.
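The comparison step is simple enough to sketch in a few lines. This is an illustration under stated assumptions, not the exact code from my script: it assumes two-letter genotypes, assumes the risk allele and the genotype are reported on the same DNA strand (a real tool would need to check this), and uses made-up rsids. `classify` is a name chosen for this example.

```python
def classify(genotype, risk_allele):
    """Label a two-letter genotype by how many copies of the risk allele it has."""
    copies = genotype.upper().count(risk_allele.upper())
    if copies == 0:
        return "normal"
    if copies == 1:
        return "heterozygous"
    return "homozygous"


# Made-up genotypes and risk alleles, keyed by placeholder rsids.
my_genotypes = {"rs0000001": "AA", "rs0000002": "CT", "rs0000003": "GG"}
risk_alleles = {"rs0000001": "C", "rs0000002": "C", "rs0000003": "G"}

report = {
    rsid: classify(my_genotypes[rsid], allele)
    for rsid, allele in risk_alleles.items()
    if rsid in my_genotypes  # skip SNPs that are not on the chip
}
print(report)
```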
My application makes it easy to quickly investigate many of your own SNPs all at once, and gives you relevant information from a trusted source, the GWAS catalog. What my application does not do is try to interpret the results. GWAS are controversial to begin with. Also, many diseases, including cancer, are too complex to put a finger on one SNP change that will predict whether or not someone is at risk for a disease.
I built this application as a side project. I wanted to see what I could do with my own 23andMe genotype data and, more importantly, used the opportunity to expand my skill set. My next steps in the project will be to get the application on a webpage so anyone can visit and use it freely, as well as to try to get the results in a more meaningful format.
As always, thanks for reading, I appreciate any comments!