
arv — a fast 23andMe parser for Python
======================================
|travis-status| |versions| |license| |pypi|

Arv (Norwegian; "heritage" or "inheritance") is a Python module for parsing raw
23andMe genome files. It lets you look up SNPs by their RSIDs.

.. code:: python

    from arv import load, unphased_match as match

    genome = load("genome.txt")

    print("You are a {gender} with {color} eyes and {complexion} skin.".format(
        gender="man" if genome.y_chromosome else "woman",
        complexion="light" if genome["rs1426654"] == "AA" else "dark",
        color=match(genome["rs12913832"], {"AA": "brown",
                                           "AG": "brown or green",
                                           "GG": "blue"})))

For my genome, this little program produces::

    You are a man with blue eyes and light skin.
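
The ``unphased_match`` call above looks up a genotype in a table of outcomes.
As a rough illustration of what "unphased" matching could mean, the sketch
below treats the two alleles as unordered, so ``"GA"`` finds the ``"AG"``
entry. This is an assumption for illustration only, not arv's actual
implementation; the helper name ``unphased_lookup`` is made up.

.. code:: python

    def unphased_lookup(genotype, table):
        """Look up a genotype in `table`, ignoring allele order.

        Hypothetical helper for illustration; not part of arv.
        """
        if genotype is None:
            return table.get(None)
        key = str(genotype)
        # Try the genotype as given, then with the alleles swapped.
        for candidate in (key, key[::-1]):
            if candidate in table:
                return table[candidate]
        # Fall back to the None entry, if the table has one.
        return table.get(None)


    eye_color = {"AA": "brown", "AG": "brown or green", "GG": "blue"}
    print(unphased_lookup("GA", eye_color))  # brown or green

A ``None`` key in the table can act as a fallback for genotypes that are
missing or not listed.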

The parser is very fast: it is written in finely tuned C++ and exposed via
Cython. On a 2013 Xeon machine I've tested, it parses a 24 MB file into a hash
table in about 78 ms. The newer 23andMe files are smaller, and parse in a mere
62 ms!
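
To make the idea of "parsing into a hash table" concrete, here is a
pure-Python sketch that reads 23andMe-style lines into a dictionary and
reports a best-of-N timing. It is not arv's C++ parser and will be far slower.
It assumes the usual tab-separated layout (RSID, chromosome, position,
genotype) with ``#`` comment lines, and Python 3 for ``time.perf_counter``.
The names ``parse_lines`` and ``best_time`` are made up for illustration.

.. code:: python

    import time


    def parse_lines(lines):
        """Parse lines into {rsid: (chromosome, position, genotype)}."""
        snps = {}
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and header comments
            rsid, chromosome, position, genotype = line.split("\t")
            # Chromosome stays a string here, since it may be X, Y or MT.
            snps[rsid] = (chromosome, int(position), genotype)
        return snps


    def best_time(func, repeat=5):
        """Return the fastest wall-clock time of `repeat` calls to func()."""
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            func()
            best = min(best, time.perf_counter() - start)
        return best


    sample = [
        "# rsid\tchromosome\tposition\tgenotype",
        "rs123\t7\t24966446\tAA",
        "rs0000000\tX\t1000\tCT",  # made-up row, for illustration only
    ]
    snps = parse_lines(sample)
    print(snps["rs123"])
    print("best of 5: %.6f s" % best_time(lambda: parse_lines(sample)))

Taking the best of several runs, rather than the average, reduces noise from
other processes on the machine.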

Works with Python 2.7+ and 3+. Installable with pip!

.. code:: bash

    $ pip install --upgrade arv

See below for software requirements.

Important disclaimer
====================

It's very important to tell you that I, the author of arv, am merely a
hobbyist! I am a professional software developer, but not a geneticist,
biologist, medical doctor or anything like that.

Because of that, this software may not only look weird to people in the field,
it may also contain serious errors. If you find any problem whatsoever, please
submit a GitHub issue.

This is a slightly modified version of what I wrote for the original software
called "dna-traits", and the same goes for this software:

In addition to the GPL v3 licensing terms, and given that this code deals with
health related issues, I want to stress that the provided code most likely
contains errors, or invalid genome reports. Results from this code must be
interpreted as HIGHLY SPECULATIVE and may even be downright INCORRECT. Always
consult an expert (medical doctor, geneticist, etc.) for guidance.

I take NO RESPONSIBILITY whatsoever for any consequences of using this code,
including but not limited to loss of life, money, spouses, self-esteem and so
on. Use at YOUR OWN RISK.

The intended use is for casual, educational purposes. If this code is used for
research purposes, please cross-check key results with other software: The
parser code may contain serious errors, for example.

An interesting story about the research part: I once released a pretty good
Mersenne Twister PRNG for C++ that ended up being used in research. Turned out
the engine had bugs, and by the time I had fixed them, a poor researcher had
already produced results with it (hopefully not published; I don't know). The
guy had to go back and fix his stuff, and I felt terribly bad about it.

So beware!

Installation
============

The recommended way is to install from PyPI.

.. code:: bash

    $ pip install arv

This will most likely build Arv from source. The package will automatically
install Cython, but it doesn't check if you have a C++11 compiler. Furthermore,
it passes some additional compilation flags that are specific to clang/gcc.
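
Since the build does not check for a compiler itself, you could run a quick
check before installing. The sketch below is only a suggestion and is not part
of arv's setup: it asks a few commonly named compilers to syntax-check a tiny
C++11 program. The function name ``has_cxx11_compiler`` and the candidate list
are assumptions; ``-std=c++11`` and ``-fsyntax-only`` are gcc/clang flags.

.. code:: python

    import os
    import shutil
    import subprocess
    import tempfile


    def has_cxx11_compiler(candidates=("c++", "g++", "clang++")):
        """Return True if any candidate compiler accepts -std=c++11."""
        # `auto` type deduction is a C++11 feature.
        source = "int main() { auto answer = 42; return answer - 42; }\n"
        for name in candidates:
            compiler = shutil.which(name)
            if compiler is None:
                continue  # not installed, or not on PATH
            with tempfile.TemporaryDirectory() as tmp:
                path = os.path.join(tmp, "check.cpp")
                with open(path, "w") as handle:
                    handle.write(source)
                # Only check the syntax; no object file is produced.
                result = subprocess.run(
                    [compiler, "-std=c++11", "-fsyntax-only", path],
                    stdout=subprocess.DEVNULL,
                    stderr=subprocess.DEVNULL,
                )
                if result.returncode == 0:
                    return True
        return False


    print("C++11 compiler found:", has_cxx11_compiler())

If this prints ``False``, installing a recent gcc or clang first is likely to
save a failed build.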

If you have problems running ``pip install arv``, please open an issue on
GitHub with as much detail as possible (``g++/clang++ --version``, ``uname
-a``, ``python --version`` and so on).

If you set the environment variable ``ARV_DEBUG``, it will build with full
warnings and debug symbols.

You can also install it locally through ``setup.py``. The following builds and
tests, but does not install, arv:

.. code:: bash

    $ python setup.py test

If you set the environment variable ``ARV_BENCHMARK`` to a genome filename and
run the tests, it will perform a short benchmark, reporting the best parsing
time on it. You can also set ``ARV_BENCHMARK_COUNT=<number>`` to change how
many times it should parse the given file.

Usage
=====

First you need to dump the raw genome file from 23andMe. You'll find it under
the raw genome browser, where you can download the file. You may have to unzip
it first: The parser works on the pure text files.

Then you load the genome in Python with

.. code:: python

    >>> genome = arv.load("filename.txt")
    >>> genome
    <Genome: SNPs=960613, name=filename.txt>

To see if there are any Y-chromosomes present in the genome,

.. code:: python

    >>> genome.y_chromosome
    True

The genome provides a ``dict``-like interface. To get a given SNP, just enter
the RSID.

.. code:: python

    >>> snp = genome["rs123"]
    >>> snp
    <SNP: chromosome=7 position=24966446 genotype=AA>
    >>> snp.chromosome
    7
    >>> snp.position
    24966446
    >>> snp.genotype
    <Genotype AA>

The ``Genotype`` object can be converted to a string with ``str``, but it also
allows rich comparisons with strings directly:

.. code:: python

    >>> snp.genotype == "AA"
    True

You can get its complement with the ``~`` operator.

.. code:: python

    >>> type(snp.genotype)
    <class 'arv.Genotype'>
    >>> ~snp.genotype
    <Genotype TT>

The complement is important due to each SNP's orientation. All of 23andMe's
SNPs are oriented towards the positive ("plus") strand, based on the `GRCh37
<https://www.ncbi.nlm.nih.gov/grc/human>`_ reference human genome assembly
build. But some SNPs on SNPedia are given with the `minus orientation
<http://snpedia.com/index.php/Orientation>`_.

For example, to determine if the human in question is likely lactose tolerant
or not, we can look at `rs4988235 <http://snpedia.com/index.php/Rs4988235>`_.
SNPedia reports its Stabilized orientation to be minus, so we need to use the
complement:

.. code:: python

    >>> genome["rs4988235"].genotype
    <Genotype AA>
    >>> ~genome["rs4988235"].genotype
    <Genotype TT>

By reading a few `GWAS
<https://en.wikipedia.org/wiki/Genome-wide_association_study>`_ research
papers, we can build a rule to determine a human's likelihood for lactose
tolerance:

.. code:: python

    >>> arv.unphased_match(~genome["rs4988235"].genotype, {
    ...     "TT": "Likely lactose tolerant",
    ...     "TC": "Likely lactose tolerant",
    ...     "CC": "Likely lactose intolerant",
    ...     None: "Unable to determine (genotype not present)"})
    'Likely lactose tolerant'

Note that reading GWAS papers can be a bit tricky for hobbyists. If you are a
hobbyist, be sure to spend some time reading the paper closely, looking up SNPs
in places like `SNPedia <http://snpedia.com>`_, `dbSNP
<https://www.ncbi.nlm.nih.gov/projects/SNP/>`_ and `OpenSNP
<https://opensnp.org/genotypes>`_. Finally, have fun, but be extremely careful
about drawing conclusions from your results.

Command line interface
======================

You can also invoke ``arv`` from the command line:

.. code:: bash

    $ python -m arv --help

For example, you can drop into a Python REPL like so:

.. code:: bash

    $ python -m arv --repl genome.txt
    genome.txt ... 960614 SNPs, male
    Type genome to see the parsed 23andMe raw genome file
    >>> genome
    <Genome: SNPs=960614, name=genome.txt>
    >>> genome["rs123"]
    <SNP: chromosome=7 position=24966446 genotype=<Genotype AA>>

If you specify several files, you can access them through the variable
``genomes``.

The example at the top of this document can be run with ``--example``:

.. code:: bash

    $ python -m arv --example genome.txt
    genome.txt ... 960614 SNPs, male

    genome.txt ... A man with blue eyes and light skin

License
=======

Copyright 2017 Christian Stigen Larsen

Distributed under the GNU GPL v3 or later. See the file COPYING for the full
license text. This software makes use of open source software; see LICENSES for
details.

.. |travis-status| image:: https://travis-ci.org/cslarsen/arv.svg?branch=master
    :alt: Travis build status
    :scale: 100%
    :target: https://travis-ci.org/cslarsen/arv

.. |license| image:: https://img.shields.io/badge/license-GPL%20v3%2B-blue.svg
    :target: http://www.gnu.org/licenses/old-licenses/gpl-3.en.html
    :alt: Project License

.. |versions| image:: https://img.shields.io/badge/python-2%2B%2C%203%2B-blue.svg
    :target: https://pypi.python.org/pypi/arv/
    :alt: Supported Python versions

.. |pypi| image:: https://badge.fury.io/py/arv.svg
    :target: https://badge.fury.io/py/arv
