Command-line tool for parsing NGS reads by multiple fuzzy regex operations

itermae

See the concept here and tutorial here.

itermae is a command-line utility that recognizes patterns in input sequences and generates outputs from the recognized groups. It applies fuzzy regular expression operations to (primarily) DNA sequences for DNA barcode/tag/UMI parsing, sequence- and quality-based filtering, and general output rearrangement.

itermae diagram

itermae reads and writes FASTQ, FASTA, text-file, and SAM (tab-delimited) files, using Biopython sequence records to represent slices and to read and write these formats. Pattern matching uses the regex library, and the tool is designed to run in command-line pipes from tools like GNU parallel to permit lightweight parallelization.

Its usage might look something like this:

zcat seq_data.fastqz | itermae --config my_config.yml -v > output.sam

or

zcat seq_data.fastqz \
    | parallel --quote --pipe -l 4 --keep-order -N 10000 \
        itermae --config my_config.yml -v > output.sam

with a my_config.yml file that may look something like this:

matches:
    - use: input
      pattern: NNNNNGTCCTCGAGGTCTCTNNNNNNNNNNNNNNNNNNNNCGTACGCTGCAGGTC
      marking: aaaaaBBBBBBBBBBBBBBBccccccccccccccccccccDDDDDDDDDDDDDDD
      marked_groups:
          a:
              name: sampleIndex
              repeat: 5
          B:
              allowed_errors: 2
          c:
              name: barcode
              repeat_min: 18
              repeat_max: 22
          D:
              allowed_insertions: 1
              allowed_deletions: 2
              allowed_substitutions: 2
output_list:
    -   name: 'barcode'
        description: 'description+" sample="+sampleIndex'
        seq: 'barcode'
        filter: 'statistics.median(barcode.quality) >= 35'
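In the config above, each character of the marking line labels the pattern character directly above it, and each run of identical marking characters forms one group. The following is a minimal sketch of that alignment only, not itermae's own implementation; the function name split_by_marking is hypothetical, and the strings are rebuilt from the example config.

```python
from itertools import groupby


def split_by_marking(pattern, marking):
    """Pair each run of identical marking characters with its pattern slice.

    Illustration only: this is a hypothetical helper, not part of itermae.
    """
    if len(pattern) != len(marking):
        raise ValueError("pattern and marking must be the same length")
    segments = []
    position = 0
    for mark, run in groupby(marking):
        length = len(list(run))
        segments.append((mark, pattern[position:position + length]))
        position += length
    return segments


# Strings rebuilt to mirror the example config above.
pattern = "N" * 5 + "GTCCTCGAGGTCTCT" + "N" * 20 + "CGTACGCTGCAGGTC"
marking = "a" * 5 + "B" * 15 + "c" * 20 + "D" * 15

for mark, segment in split_by_marking(pattern, marking):
    print(mark, len(segment), segment)
```

Read this way, groups a and c cover the N stretches that are captured as sampleIndex and barcode, while B and D cover the fixed flanking sequences that are allowed some errors.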

Availability, installation

Options:

  1. Use pip to install itermae:

    python3 -m pip install itermae

  2. You can clone this repo and install it locally. Dependencies are listed in requirements.txt, so python3 -m pip install -r requirements.txt will install them.

  3. You can use Singularity to pull and run a Singularity image of itermae.py, where everything is already installed. This is the recommended usage.

    This image is built with a few other tools, like g/mawk, perl, and parallel, to make command line munging easier.

Usage

itermae is envisioned for use in a pipeline where you have just received your DNA sequencing FASTQ reads and want to parse them. The recommended interface is the YAML config file, as demonstrated in the tutorial and described in the configuration details. You can also use a command-line argument interface, as described in the examples.

I recommend you test this on small batches of data, then put it behind GNU parallel and feed the whole FASTQ file in on standard input via zcat. This parallelizes with a small memory footprint, and you can then write the output to disk (or stream it into another tool).
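One way to make a small test batch is to copy the first few records out of the compressed file. The sketch below assumes gzip compression and four-line FASTQ records (as the -l 4 in the parallel example suggests); the function names and the output path are hypothetical.

```python
import gzip
from itertools import islice


def take_fastq_records(lines, n_records):
    """Return an iterator over the lines of the first n_records records.

    Assumes every FASTQ record is exactly four lines long.
    """
    return islice(lines, 4 * n_records)


def write_fastq_subset(source_path, dest_path, n_records=1000):
    """Copy the first n_records records from a gzipped FASTQ to a text file."""
    with gzip.open(source_path, "rt") as source, open(dest_path, "w") as dest:
        dest.writelines(take_fastq_records(source, n_records))
```

For example, write_fastq_subset("seq_data.fastqz", "subset.fastq", 1000) would produce a small file that can be piped into itermae with cat in place of zcat while you tune the config.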

Thanks

Again, the tool is built upon the excellent work behind the regex library, Biopython, and GNU parallel.

Development, helping

Issues and advice are welcome as issues on the GitLab repo. Complaints are especially welcome.

For development, see the documentation as rendered from docstrings.

A set of tests is written with the pytest module and can be run from inside the cloned repo with make test. See make help for more options, such as building, installing, and uploading.

There's also a bash script in profiling_tests that generates longer runs for profiling with cProfile and snakeviz, but it is out of date. A to-do item is to reconfigure it and retest for speed.
