Project description

Genomic Origin Through Taxonomic CHAllenge (GOTTCHA)

GOTTCHA is an application of a novel, gene-independent and signature-based metagenomic taxonomic profiling method with significantly smaller false discovery rates (FDR) that is laptop deployable. Our algorithm was tested and validated on twenty synthetic and mock datasets ranging in community composition and complexity, was applied successfully to data generated from spiked environmental and clinical samples, and robustly demonstrates superior performance compared with other available tools.

GOTTCHAv2 is currently under development in BETA stage. Pre-built databases for v1 are incompatible with v2.

DEPENDENCIES

GOTTCHA2 profiler is written in Python3 and leverage minimap2 to map reads to signature sequences. In order to run GOTTCHA2 correctly, your system requires to have following dependencies installed correctly. The YAML file for Conda environment can be found in environment.yml.

Python 3.4+
minimap2 2.1+
pandas
samtools

QUICK START

Download or git clone GOTTCHA2 from this repository and cd to the cloned directory.
Run pip install .
Download the latest version of the GOTTCHA2 database. (This step may take some time)
```
 https://ref-db.edgebioinformatics.org/gottcha2/RefSeq-r220/
```

Run GOTTCHA2:

 $ gottcha2.py -d RefSeq-r220_BAVxH-cg/gottcha_db.species.fna -t 8 -i <FASTQ>
 
 OR
 
 $ gottcha2 profile -d RefSeq-r220_BAVxH-cg/gottcha_db.species.fna -t 8 -i <FASTQ>

RESULT

GOTTCHA2 can output the profiling results in either CSV, TSV or BIOM format.

summary (.tsv or .csv) - A summary of profiling results (10 columns) in taxonomic ranks breakdown
full (.tsv or .csv) - A full profiling results including unfiltered profiling results and additional columns
lineage (.lineage.tsv or .lineage.tsv) - output lineage and abundance of the profiled taxon per line
extract (.extract[TAXID].fastq) - Extracted reads for a specific taxon.

Summary and full reports

A full GOTTCHA2 report has 22 columns in tab-delimited format. The summary report is a brief version that has the first 10 columns and qualified taxonomies. The report lists profiling results at taxonomic rank breakdown from superkingdom to strain, following by other information listed below. The rollup depth of coverage (ROLLUP_DOC) is used to calculate relative abundance (column 10) by default, as well as other relative abundance calculations (column 19-21).

COLUMN	NAME	DESCRIPTION	NOTE
1	LEVEL	Taxonomic rank
2	NAME	Taxonomic name
3	TAXID	Taxonomic ID
4	READ_COUNT	Number of mapped reads
5	TOTAL_BP_MAPPED	Total bases of mapped reads
6	TOTAL_BP_MISMATCH	Total mismatch bases of mapped reads
7	LINEAR_LENGTH	Number of non-overlapping bases covering the signatures
8	LINEAR_DOC	Linear depth-of-coverage	= TOTAL_BP_MAPPED / LINEAR_LENGTH
9	ROLLUP_DOC	Rollup depth-of-coverage	= ∑ DOC of sub-level
10	REL_ABUNDANCE	Relative abundance (normalized abundance)	= ABUNDANCE / ∑ ABUNDANCE of given level
11	LINEAR_COV	Proportion of covered signatures to total signatures of mapped organism(s)	= LINEAR_LENGTH / SIG_LENGTH_TOL
12	LINEAR_COV_MAPPED_SIG	Proportion of covered signatures to mapped signatures	= LINEAR_LENGTH / SIG_LENGTH_MAPPED
13	BEST_LINEAR_COV	Best linear coverage of corresponding taxons
14	DOC	Average depth-of-coverage	= TOTAL_BP_MAPPED / SIG_LENGTH_TOL
15	BEST_DOC	Best DOC of corresponding taxons
16	SIG_LENGTH_TOL	Length of all signatures in mapped organism(s)
17	SIG_LENGTH_MAPPED	Length of signatures in mapped signature fragment(s)
18	ABUNDANCE	abundance of the taxon	value of either ROLLUP_DOC, READ_COUNT or TOTAL_BP_MAPPED
19	ZSCORE	Estimated Z-score of depth of coverage of the mapped region
20	NOTE	Only note the reason for being filtered out

USAGE

usage: gottcha2.py [-h] [-i [FASTQ] [[FASTQ] ...]] [-s [SAMFILE]]
                   [-d [MINIMAP2_INDEX]] [-l [LEVEL]] [-ti [FILE]] [-np]
                   [-pm <INT>] [-e [TAXID]] [-fm [STR]] [-r [FIELD]]
                   [-t <INT>] [-o [DIR]] [-p <STR>] [-xm <STR>] [-mc <FLOAT>]
                   [-mr <INT>] [-ml <INT>] [-mz <FLOAT>] [-nc] [-c] [-v]
                   [--silent] [--debug]

Genomic Origin Through Taxonomic CHAllenge (GOTTCHA) is an annotation-
independent and signature-based metagenomic taxonomic profiling tool that has
significantly smaller FDR than other profiling tools. This program is a
wrapper to map input reads to pre-computed signature databases using minimap2
and/or to profile mapped reads in SAM format. (VERSION: 2.1.5 BETA)

optional arguments:
  -h, --help            show this help message and exit
  -i [FASTQ] [[FASTQ] ...], --input [FASTQ] [[FASTQ] ...]
                        Input one or multiple FASTQ/FASTA file(s). Use space
                        to separate multiple input files.
  -s [SAMFILE], --sam [SAMFILE]
                        Specify the input SAM file. Use '-' for standard
                        input.
  -d [MINIMAP2_INDEX], --database [MINIMAP2_INDEX]
                        The path of signature database. The database can be in
                        FASTA format or minimap2 index (5 files).
  -l [LEVEL], --dbLevel [LEVEL]
                        Specify the taxonomic level of the input database. You
                        can choose one rank from "superkingdom", "phylum",
                        "class", "order", "family", "genus", "species" and
                        "strain". The value will be auto-detected if the input
                        database ended with levels (e.g. GOTTCHA_db.species).
  -ti [FILE], --taxInfo [FILE]
                        Specify the path of taxonomy information file
                        (taxonomy.tsv). GOTTCHA2 will try to locate this file
                        when user doesn't specify a path. If '--database'
                        option is used, the program will try to find this file
                        in the directory of specified database. If not, the
                        'database' directory under the location of gottcha.py
                        will be used as default.
  -np, --nanopore       Adjust options for Nanopore reads. The 'mismatch'
                        option will be ignored. [-xm map-ont -mr 1]
  -pm <INT>, --mismatch <INT>
                        Mismatch penalty for the aligner. [default: 10]
  -e [TAXID], --extract [TAXID]
                        Extract reads mapping to a specific TAXID. [default:
                        None]
  -fm [STR], --format [STR]
                        Format of the results; available options include tsv,
                        csv or biom. [default: tsv]
  -r [FIELD], --relAbu [FIELD]
                        The field will be used to calculate relative
                        abundance. You can specify one of the following
                        fields: "LINEAR_LENGTH", "TOTAL_BP_MAPPED",
                        "READ_COUNT" and "LINEAR_DOC". [default: ROLLUP_DOC]
  -t <INT>, --threads <INT>
                        Number of threads [default: 1]
  -o [DIR], --outdir [DIR]
                        Output directory [default: .]
  -p <STR>, --prefix <STR>
                        Prefix of the output file [default:
                        <INPUT_FILE_PREFIX>]
  -xm <STR>, --presetx <STR>
                        The preset option (-x) for minimap2. Default value
                        'sr' for short reads. [default: sr]
  -mc <FLOAT>, --minCov <FLOAT>
                        Minimum linear coverage to be considered valid in
                        abundance calculation [default: 0.005]
  -mr <INT>, --minReads <INT>
                        Minimum number of reads to be considered valid in
                        abundance calculation [default: 3]
  -ml <INT>, --minLen <INT>
                        Minimum unique length to be considered valid in
                        abundance calculation [default: 60]
  -mz <FLOAT>, --maxZscore <FLOAT>
                        Maximum estimated zscore of depths of mapped region
                        [default: 10]
  -nc, --noCutoff       Remove all cutoffs. This option is equivalent to use
                        [-mc 0 -mr 0 -ml 0].
  -c, --stdout          Write on standard output.
  -v, --version         Print version number.
  --silent              Disable all messages.
  --debug               Debug mode. Provide verbose running messages and keep
                        all temporary files.

usage: pull_database.py [-h] [-u URL] [-r RANK]

This script will pull the latest version of the Gottcha2 database.

optional arguments:
  -h, --help            show this help message and exit
  -u URL, --url URL     specify a URL to pull from (will override the default)
  -r RANK, --rank RANK  taxonomic rank of the database (superkingdom, phylum, class, order, famiily, genus, species)

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

2.1.8.55

Feb 9, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

GOTTCHA2-2.1.8.55.tar.gz (25.6 kB view hashes)

Uploaded Feb 9, 2024 Source

Built Distribution

GOTTCHA2-2.1.8.55-py3-none-any.whl (28.2 kB view hashes)

Uploaded Feb 9, 2024 Python 3

Hashes for GOTTCHA2-2.1.8.55.tar.gz

Hashes for GOTTCHA2-2.1.8.55.tar.gz
Algorithm	Hash digest
SHA256	`3d1c3124e2d658a007fc12e4c401314480dff978316bf07846e33919b357c20c`
MD5	`33cf7382bd689f29953aea531263b822`
BLAKE2b-256	`87dd1944dbf6bd0d188a9bac1b957d6e5b47765c179fb2f1477b4292e53698a6`

Hashes for GOTTCHA2-2.1.8.55-py3-none-any.whl

Hashes for GOTTCHA2-2.1.8.55-py3-none-any.whl
Algorithm	Hash digest
SHA256	`338455cc13f4f20d6aed2eab9bfbe2f134a612ec7c55ed4eba41757b2b614c27`
MD5	`6afa05d3a22fb25d6b8895a2635f8a07`
BLAKE2b-256	`83246abb5e2ec95afa5c2f8925c786de7ea3daf59cfbb6f41939a7f1b0de3cb5`