hpo-downloader

Python package to download HPO annotations and mapping to Uniprot ID and AC and CAFA4 IDs.

How do I install this package?

As usual, just install it using pip:

pip install hpo_downloader

Tests Coverage

Since different coverage tools sometimes report slightly different results, here are three of them:

Pipeline

The package pipeline is illustrated in the following image:

[Image: Pipeline]

Preprocessing

For the pre-processing, you have to retrieve the Uniprot mapping files by asking the Uniprot team directly, since each mapping is around 17GB. Save each file within this repository at the path "mapping/{month}/idmapping.dat.gz".

A cache of the pre-processing results is available within the Python package, so there is no need to retrieve the original files unless you need to fully reproduce the pipeline.

For each release, we have to retrieve the "GeneID" and the human uniprot_IDs, and we can do so using zgrep.

zgrep "GeneID" mapping/{month}/idmapping.dat.gz > mapping/{month}/gene_id.tsv
zgrep "HUMAN" mapping/{month}/idmapping.dat.gz > mapping/{month}/human_ids.tsv
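
Where zgrep is not available, the same filtering can be approximated in Python. The following is a minimal sketch, not part of the package: the helper name filter_lines is hypothetical, and it does a plain substring match, which is equivalent to zgrep only for literal patterns such as "GeneID" or "HUMAN".

```python
import gzip
from pathlib import Path


def filter_lines(source: Path, needle: str, target: Path) -> int:
    """Copy every line of a gzipped text file that contains `needle` into `target`.

    Hypothetical helper (not part of hpo_downloader).
    Returns the number of lines written.
    """
    written = 0
    # Stream the archive line by line so the ~17GB file never sits in memory.
    with gzip.open(source, "rt", encoding="utf-8") as src, \
            open(target, "w", encoding="utf-8") as dst:
        for line in src:
            if needle in line:
                dst.write(line)
                written += 1
    return written


# Example call, assuming the directory layout described above:
# filter_lines(
#     Path("mapping/november/idmapping.dat.gz"),
#     "GeneID",
#     Path("mapping/november/gene_id.tsv"),
# )
```

Streaming keeps memory use flat regardless of the archive size, at the cost of being slower than zgrep on large files.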

Now we have to map Uniprot IDs to GeneIDs by joining the two files on the Uniprot ACs. The mapping is not one-to-one, so the same AC can appear in more than one row. We can use the package function non_unique_mapping.

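from hpo_downloader.utils import non_unique_mapping
import pandas as pd

# Release to process; it must match the directory name used for the mapping files.
month = "november"

# Rows of idmapping.dat that mention GeneID: keep the AC (column 0) and the GeneID (column 2).
gene_id = pd.read_csv(
    f"mapping/{month}/gene_id.tsv",
    sep="\t",
    header=None,
    usecols=[0, 2]
)
gene_id.columns = ["uniprot_ac", "gene_id"]

# Rows of idmapping.dat that mention HUMAN: keep the AC (column 0) and the Uniprot ID (column 2).
human_id = pd.read_csv(
    f"mapping/{month}/human_ids.tsv",
    sep="\t",
    header=None,
    usecols=[0, 2]
)
human_id.columns = ["uniprot_ac", "uniprot_id"]

# Join the two tables on the AC and store the result as the package cache for this month.
non_unique_mapping(gene_id, human_id, "uniprot_ac").to_csv(
    f"hpo_downloader/uniprot/data/{month}.tsv.gz",
    sep="\t",
    index=False
)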
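
To make the join step more concrete, here is a minimal sketch of what a non-unique mapping on a shared column could look like in plain pandas. It is an assumption about the behaviour, not the package's actual implementation: the function name non_unique_mapping_sketch and the toy accessions below are hypothetical.

```python
import pandas as pd


def non_unique_mapping_sketch(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.DataFrame:
    """Inner-join two frames on `on`, keeping every matching pair of rows.

    Hypothetical stand-in for hpo_downloader.utils.non_unique_mapping;
    the real function may behave differently.
    """
    merged = left.merge(right, on=on, how="inner")
    return merged.drop_duplicates().reset_index(drop=True)


# Toy data (made up): accession A1 maps to two GeneIDs, C3 has no GeneID at all.
gene_id = pd.DataFrame({
    "uniprot_ac": ["A1", "A1", "B2"],
    "gene_id": [10, 11, 20],
})
human_id = pd.DataFrame({
    "uniprot_ac": ["A1", "B2", "C3"],
    "uniprot_id": ["AAA_HUMAN", "BBB_HUMAN", "CCC_HUMAN"],
})

merged = non_unique_mapping_sketch(gene_id, human_id, "uniprot_ac")
print(merged)
```

An inner join keeps both rows for A1, since it has two GeneIDs, and drops C3, which has no GeneID. That is the sense in which the mapping is not one-to-one.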

Package usage examples

To generate the complete mapping (optionally filtering only for Uniprot IDs within CAFA4) proceed as follows:

from hpo_downloader import mapping

my_mapping = mapping(
    month="november"
)

my_mapping_cafa_only = mapping(
    month="november",
    cafa_only=True
)

The obtained pandas DataFrames look as follows:

HPO mappings: October, November, December

gene_id	hpo_id	uniprot_ac	uniprot_id
8192	HP:0004322	Q16740	CLPP_HUMAN
8192	HP:0001250	Q16740	CLPP_HUMAN
8192	HP:0000786	Q16740	CLPP_HUMAN
8192	HP:0000007	Q16740	CLPP_HUMAN
8192	HP:0000252	Q16740	CLPP_HUMAN

HPO mappings (CAFA4 only): October (CAFA only), November (CAFA only), December (CAFA only)

cafa4_id	uniprot_id	gene_id	hpo_id	uniprot_ac
T96060000002	1433E_HUMAN	7531	HP:0000960	P62258
T96060000002	1433E_HUMAN	7531	HP:0001539	P62258
T96060000002	1433E_HUMAN	7531	HP:0002119	P62258
T96060000002	1433E_HUMAN	7531	HP:0002120	P62258
T96060000002	1433E_HUMAN	7531	HP:0000463	P62258
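
Since each row pairs one protein with one HPO term, a common next step is to collect the terms per protein. The sketch below rebuilds the five rows of the first table above by hand, so it runs without the package; on a real mapping the same groupby applies to the returned DataFrame.

```python
import pandas as pd

# The five sample rows shown in the first table above.
rows = [
    (8192, "HP:0004322", "Q16740", "CLPP_HUMAN"),
    (8192, "HP:0001250", "Q16740", "CLPP_HUMAN"),
    (8192, "HP:0000786", "Q16740", "CLPP_HUMAN"),
    (8192, "HP:0000007", "Q16740", "CLPP_HUMAN"),
    (8192, "HP:0000252", "Q16740", "CLPP_HUMAN"),
]
sample = pd.DataFrame(rows, columns=["gene_id", "hpo_id", "uniprot_ac", "uniprot_id"])

# One list of HPO terms per Uniprot ID, in the order the rows appear.
terms_per_protein = sample.groupby("uniprot_id")["hpo_id"].apply(list).to_dict()
print(terms_per_protein)
```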

Author notes

HPO missing GeneID mappings

Depending on the release, 54 or 55 GeneID to Uniprot ID mappings are currently missing in Uniprot. I have already reported this to the Uniprot team and will update the package accordingly if anything can be done about them.

Month	HPO unique missed samples	HPO unique missed percentage	HPO total missed samples	HPO total missed percentage
October	54	1.26%	3076	1.86%
November	55	1.28%	3162	1.91%
December	55	1.28%	3162	1.91%
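
For readers who want to recompute this kind of figure on their own data, the sketch below shows one way to count missed identifiers. It assumes that the "unique" columns are relative to the distinct GeneIDs and the "total" columns to the annotation rows; that reading of the table, the function name missed_stats, and the toy data are all assumptions rather than the author's code.

```python
import pandas as pd


def missed_stats(annotations: pd.DataFrame, mapped: pd.DataFrame, key: str) -> dict:
    """Count annotation rows whose `key` value has no counterpart in `mapped`.

    Hypothetical helper (not part of hpo_downloader).
    """
    missed = ~annotations[key].isin(mapped[key])
    unique_total = annotations[key].nunique()
    unique_missed = annotations.loc[missed, key].nunique()
    return {
        "unique_missed": unique_missed,
        "unique_missed_pct": 100 * unique_missed / unique_total,
        "total_missed": int(missed.sum()),
        "total_missed_pct": 100 * float(missed.mean()),
    }


# Toy data (made up): GeneID 3 has three annotations but no Uniprot mapping.
annotations = pd.DataFrame({"gene_id": [1, 1, 2, 3, 3, 3, 4, 4]})
mapped = pd.DataFrame({"gene_id": [1, 2, 4]})

stats = missed_stats(annotations, mapped, "gene_id")
print(stats)
```

With this toy input, one of four distinct GeneIDs is missed (25%), accounting for three of eight annotation rows (37.5%).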

HPO phenotype ID to CAFA4 Uniprot_IDs missed mappings

A considerable percentage (around 79%) of the human Uniprot IDs used in CAFA4 cannot be mapped to HPO phenotype IDs.

Month	CAFA4 unique missed samples	CAFA4 unique missed percentage	CAFA4 total missed samples	CAFA4 total missed percentage
October	16182	79.21%	16182	79.21%
November	16184	79.22%	16184	79.22%
December	16187	79.23%	16187	79.23%

Release history

Version	Released
1.1.0 (this version)	Jan 18, 2020
1.0.1	Jan 14, 2020
1.0.0	Jan 13, 2020

Download files

Source Distribution

hpo_downloader-1.1.0.tar.gz (8.1 kB), uploaded Jan 18, 2020

Hashes for hpo_downloader-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d7abdaab2609d734752410d93b7e53051f51b0681aae1566823c2170c59471ec`
MD5	`e307d728c7995cf2471cd9ad77b36d27`
BLAKE2b-256	`0e5bcc559525f2b85526904261f9c93b3b2b656f02d532b8a47bdc40540a835f`