From a BAM, convert each readgroup to a json/tsv object needed to create a GDC Read Group node.

gdc-readgroups

Purpose

This package extracts the Read Group header lines from a BAM file and converts the metadata they contain to a json or tsv file, with appropriate values applied for creating a Read Group node in the NCI's Genomic Data Commons (GDC). Alternatively, it can run with no input and output a template, which may be edited to create a submission to the GDC.
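For readers unfamiliar with read group headers, the sketch below shows what a Read Group line looks like and how its tags can be split into key/value pairs. It works on SAM-style header text (tab-separated `@RG` lines with `TAG:VALUE` fields) and is not the package's own implementation. The function names and the example header values are hypothetical.

```python
def parse_read_group_line(line):
    """Parse one '@RG' header line into a dict of tag -> value."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not a read group header line: %r" % line)
    read_group = {}
    for field in fields[1:]:
        # Each field is TAG:VALUE; only the first ':' separates the two,
        # so values such as timestamps keep their own colons.
        tag, _, value = field.partition(":")
        read_group[tag] = value
    return read_group


def read_groups_from_header(header_text):
    """Collect every @RG record found in SAM-style header text."""
    return [
        parse_read_group_line(line)
        for line in header_text.splitlines()
        if line.startswith("@RG\t")
    ]


# Hypothetical header used only for illustration.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:rg1\tSM:sample1\tPL:ILLUMINA\tDT:2016-01-01T00:00:00\n"
    "@RG\tID:rg2\tSM:sample1\tPL:ILLUMINA\n"
)
parsed_groups = read_groups_from_header(example_header)
for group in parsed_groups:
    print(group["ID"], group.get("PL"))
```

How these tags map onto GDC Read Group fields is decided by gdc-readgroups itself and is not shown here.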

The generated file may contain some fields marked REQUIRED<type>, which indicates these fields could not be generated from the supplied BAM file. In this case, the user must apply their own desired values to the generated json. The <type> must be as indicated in the generated json file. For details, see the column Acceptable Types or Values at the GDC Data Dictionary Viewer.

Other fields are optional, and are marked OPTIONAL<type>. If these fields could not be generated from the supplied BAM file, they may be filled in as appropriate or removed.
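Before submitting, it can help to list which values still carry a REQUIRED<type> or OPTIONAL<type> marker. The following is a minimal sketch that walks any parsed json structure and reports such values. The exact layout of the generated file is not assumed; the record and the `<enumeration>` type string in the example are hypothetical.

```python
import json


def find_placeholders(node, path=""):
    """Return (path, marker) pairs for values still marked REQUIRED<...> or OPTIONAL<...>."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = "%s.%s" % (path, key) if path else key
            hits.extend(find_placeholders(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_placeholders(value, "%s[%d]" % (path, index)))
    elif isinstance(node, str) and node.startswith(("REQUIRED<", "OPTIONAL<")):
        hits.append((path, node))
    return hits


# Hypothetical record; real output from gdc-readgroups may be laid out differently.
example = json.loads(
    '[{"read_group_name": "rg1",'
    ' "library_strategy": "REQUIRED<enumeration>",'
    ' "instrument_model": "OPTIONAL<enumeration>"}]'
)
unresolved = find_placeholders(example)
for location, marker in unresolved:
    print(location, marker)
```

A value that still starts with REQUIRED< needs to be filled in by hand; one that starts with OPTIONAL< may be filled in or removed.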

Note

The tool will only run on complete (untruncated) BAM files whose filenames end with the suffix .bam.

If the BAM is truncated, the error

    OSError: no BGZF EOF marker; file may be truncated

will be generated, and no json will be produced.
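A truncated file can be detected before running the tool by checking for the BGZF end-of-file marker, the fixed 28-byte empty block that the SAM/BAM specification places at the end of a complete BAM. The sketch below is an independent pre-check, not part of gdc-readgroups. The function name is hypothetical, and the demo files are byte-level stand-ins rather than real BAMs.

```python
import os
import tempfile

# The 28-byte empty BGZF block that terminates a complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF EOF marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF


# Demo with stand-in files: one ends with the marker, one does not.
with tempfile.TemporaryDirectory() as tmp:
    complete = os.path.join(tmp, "complete.bam")
    truncated = os.path.join(tmp, "truncated.bam")
    with open(complete, "wb") as handle:
        handle.write(b"payload" + BGZF_EOF)
    with open(truncated, "wb") as handle:
        handle.write(b"payload")
    complete_ok = has_bgzf_eof(complete)
    truncated_ok = has_bgzf_eof(truncated)

print(complete_ok, truncated_ok)
```

This only confirms that the marker is present. It does not validate the rest of the file.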

Installation

There are two ways to install gdc-readgroups:

pip install from PyPI

gdc-readgroups may be used as a pip-installed Python package.

If you would like to install the package as root, for all users, run

sudo pip install gdc-readgroups

If you would like to install the package only for a local user, run

pip install gdc-readgroups --user

Build a Docker Image

The github repository for this package contains a Dockerfile, which may be used to build an image containing the package and all prerequisites. There are two ways to build the image.

  1. Using docker directly.

    wget https://raw.githubusercontent.com/NCI-GDC/gdc-readgroups/master/Dockerfile
    docker build -t gdc-readgroups .
    
  2. Using cwltool to build an image, and then run it, in one command.

    In this case the CWL tool will expect a BAM input and produce a json output. To install the reference CWL engine, run

    pip install cwltool --user
    

    Then to build the gdc-readgroups Docker Image and run the Container, run

    wget https://raw.githubusercontent.com/NCI-GDC/gdc-readgroups/master/Dockerfile
    wget https://raw.githubusercontent.com/NCI-GDC/gdc-readgroups/master/gdc-readgroups.cwl
    cwltool gdc-readgroups.cwl --INPUT <your bam file>
    

    The above command will only build the Docker Image if it does not exist on the system. After the build is performed once, the image will remain on your system, and the next cwltool run will skip the build step.

Usage

gdc-readgroups has two main modes: bam-mode and template-mode.

bam-mode

In bam-mode, a path to a BAM file must be supplied as input. By default, bam-mode will output a json file, but optionally may output a tsv file.

The command to run the pip-installed package is

gdc-readgroups bam-mode --bam_path <your bam file>

The generated json will be placed in the current working directory and have a filename of <bam basename>.json. Any error messages will be written to stdout.

To output a tsv file, run

gdc-readgroups bam-mode --bam_path <your bam file> --output-format tsv

The generated tsv file will be placed in the current working directory and have a filename of <bam basename>.tsv.
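When bam-mode is driven from a script, the command line can be assembled programmatically. This is a minimal sketch that uses only the subcommand and flags shown above; the helper names are hypothetical. Whether the tool returns a non-zero exit status on failure is an assumption here, so `check=True` may not catch every error.

```python
import subprocess


def build_bam_mode_command(bam_path, output_format=None):
    """Build the argument list for a bam-mode run."""
    command = ["gdc-readgroups", "bam-mode", "--bam_path", bam_path]
    if output_format is not None:
        # The text above documents "tsv"; json is the default when omitted.
        command += ["--output-format", output_format]
    return command


def run_bam_mode(bam_path, output_format=None):
    """Run gdc-readgroups; output lands in the current working directory."""
    # Assumption: a failed run exits non-zero, which check=True turns
    # into subprocess.CalledProcessError.
    return subprocess.run(build_bam_mode_command(bam_path, output_format), check=True)


print(build_bam_mode_command("sample.bam", "tsv"))
```

Passing the arguments as a list avoids shell quoting problems with BAM paths that contain spaces.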

template-mode

In template-mode, no input is supplied, and two empty records are output within one file, either in json or tsv format.

To generate a json template, run

gdc-readgroups template-mode

The output will be placed in the current working directory and have a filename of gdc_readgroups.json

To generate a tsv template, run

gdc-readgroups template-mode --output-format tsv

The output will be placed in the current working directory and have a filename of gdc_readgroups.tsv
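Once a template has been generated, its records can be loaded for editing with the standard library. The sketch below reads either format based on the file suffix. The function name is hypothetical, and the column names in the demo file are made up for illustration; the real template's fields come from gdc-readgroups.

```python
import csv
import json
import os
import tempfile


def load_records(path):
    """Load records from a .json or .tsv file produced by the tool."""
    if path.endswith(".json"):
        with open(path) as handle:
            return json.load(handle)
    if path.endswith(".tsv"):
        with open(path, newline="") as handle:
            return [dict(row) for row in csv.DictReader(handle, delimiter="\t")]
    raise ValueError("expected a .json or .tsv file: %r" % path)


# Demo with a stand-in tsv; the columns here are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "gdc_readgroups.tsv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("type\tsubmitter_id\nread_group\tREQUIRED<string>\n")
    records = load_records(demo_path)

print(records)
```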

This version

0.4

Download files

Source Distribution

gdc_readgroups-0.4.tar.gz (9.7 kB)

Uploaded Source

Built Distribution

gdc_readgroups-0.4-py2.py3-none-any.whl (23.6 kB)

Uploaded Python 2 Python 3
