Skip to main content

Software package to filter variants, SNPs and INDELs, that are present in heterozygous form in phased genomes.

Project description

License: MIT

PHASEfilter

PHASEfilter is a software package to filter variants, SNPs and INDELs, that are present in heterozygous form in phased genomes. It is an easily implementable tool that provides a simple approach to detect and filter heterozygous SNPs and INDELs in diploid species based on a phased genome assembly.

Installation

This installation is oriented for Linux distributions.

Install directly

$ pip3 install PHASEfilter

Install with virtualenv

$ virtualenv PHASEfilter --python=python3 --prompt "(PHASEfilter 1.1.0) "
$ . PHASEfilter/bin/activate
(phasefilter 1.1.0) $ pip install PHASEfilter

## install all Software dependencies of PHASEfilter 
(phasefilter 1.1.0) $ cd PHASEfilter/bin/
(phasefilter 1.1.0) $ ./install_phasefilter_dependencies.sh

Install with conda

$ wget https://raw.githubusercontent.com/ibigen/PHASEfilter/main/conda/conda_phasefilter_env.yml -O conda_phasefilter_env.yml
$ conda env create -f conda_phasefilter_env.yml
$ conda activate PHASEfilter

The follow software must be available in your computer:

All software available

Filter variants in phased genomes

This software that can identify heterozygosity positions between two phased references. The software starts by aligning pairs of diploid chromosomes, based on Minimap2 aligner. With synchronization done it is possible to identify the position of a variation, in both pair of chromosomes, allowing variants to be removed if they meets some established criterias. To classify variants it is necessary to pass two VCF files, one for each reference phase. After that, the PHASEfilter will go through the variants called in reference A and check if there are any homologous in the variants called in reference B. For each variant called in the reference A it can happen three situations: 1) both references, for the position in analysis, are equal and the variant is valid; 2) the position is heterozygous in the references and the variant reflects it, so the variant is removed; 3) the position is heterozygous in the references and the variant is homozygous. It goes to the valid variants file but it also go to the Loss Of Heterozygous (LOH) file. The variant file in analysis it is always the one passed in parameter '--vcf1'.

$ phasefilter --help
## You can can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ phasefilter --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --vcf1 temp_raw_data/T1_Fluc_7A_snps.vcf.gz --vcf2 temp_raw_data/T1_Fluc_7B_snps.vcf.gz --out output_dir

## you can use chain if exists
$ phasefilter --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --vcf1 temp_raw_data/T1_Fluc_7A_snps.vcf.gz --vcf2 temp_raw_data/T1_Fluc_7B_snps.vcf.gz --out output_dir --chain_A_B temp_raw_data/Assembly22_hapA_To_Assembly22_hapB.over.chain --chain_B_A temp_raw_data/Assembly22_hapB_To_Assembly22_hapA.over.chain

Eighth possible files will be created after the commands ends. The outputs are from refrence A (ref1) to reference B (ref2), and from reference B (ref2) to reference A (ref1).

  • report_[A]to[B].txt - has the statistics about the analysis;

  • valid_[A]to[B].vcf.gz - has all variants that are not heterozygous between two references;

  • removed_[A]to[B].vcf.gz - has all heterozygous variants;

  • LOH_[A]to[B].vcf.gz - has all variants that are loss of heterozygous between two references. This variants are also in 'out_file.vcf.gz' file.

  • report_[B]to[A].txt - has the statistics about the analysis from ;

  • valid_[B]to[A].vcf.gz - has all variants that are not heterozygous between two references;

  • removed_[B]to[A].vcf.gz - has all heterozygous variants;

  • LOH_[B]to[A].vcf.gz - has all variants that are loss of heterozygous between two references. This variants are also in 'out_file.vcf.gz' file.

Headings description in report files:

  • Heterozygous (Removed) Heterozygous identified and they go the re remove_[YYY]_to[XXX].vcf.gz file
  • Keep alleles Alleles present in valid_[YYY]_to[XXX].vcf.gz file
  • LOH alleles Loss of Heterozygous They are in valid_[YYY]to[XXX].vcf.gz and LOH[YYY]_to[XXX].vcf.gz file.
  • Other than SNP Other variants thar are not SNPs and INDELs and they go to valid_[YYY]_to[XXX].vcf.gz file
  • Don't have hit position Variants that don’t have position in hit (ref B) genome and they go to valid_[YYY]_to[XXX].vcf.gz file
  • Could Not Fetch VCF Record on Hit Variants that are present in source file but not in hit VCF file. They go to valid_[YYY]_to[XXX].vcf.gz file
  • Total alleles All the alleles present in the source vcf file. Analyzed alleles.
  • Total Alleles new Source VCF Total alleles that are in valid_[YYY]_to[XXX].vcf.gz file
  • Method Alignment method.
  • Alignment % Percentage of alignment.

Note: You can can copy some example data to test the commands. {: .note}

Filter variants in phased genomes but only one direction

This tool do as the same of the previous script but only analysis from Reference A (ref1) to Reference B (ref2)

$ phasefilter_single --help
$ phasefilter_single --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --vcf1 A-M_S4_chrA_filtered_snps.vcf.gz --vcf2 A-M_S4_chrB_filtered_snps.vcf.gz --out_vcf out_result.vcf.gz

## with chain
$ phasefilter_single --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --vcf1 A-M_S4_chrA_filtered_snps.vcf.gz --vcf2 A-M_S4_chrB_filtered_snps.vcf.gz --out_vcf out_result.vcf.gz --chain temp_raw_data/Assembly22_hapA_To_Assembly22_hapB.over.chain

Synchronize annotation genomes

Synchronize annotations genomes adapting the annotations that are in reference 1 to the reference 2, adding the tags 'StartHit' and 'EndHit' to the result file. In VCF type files only add 'StartHit' tag in Info. The annotations (input file need to be in VCF or GFF3 and belong to the reference 1.

$ synchronize_genomes --help
## You can can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ synchronize_genomes --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --gff temp_raw_data/T1_Fluc_7A_snps.gff3.gz --out T1_Fluc_7A_snps.sync.gff3.gz
$ synchronize_genomes --ref1 S288C_reference.fna --ref2 S01.assembly.final.fa --gff S288C_reference.gff3 --out result.gff3 --pass_chr chrmt
$ synchronize_genomes --ref1 S288C_reference.fna --ref2 S01.assembly.final.fa --vcf S288C_reference.vcf.gz --out result.vcf.gz

Make alignments

Obtain the percentage of the minimap2 alignment between chromosomes and create an output in ClustalX format.

$ make_alignment --help
## You can can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ make_alignment --ref1 temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --ref2 temp_raw_data/Ca22chr7B_C_albicans_SC5314.fasta --out report.txt
$ make_alignment --ref1 Ca22chr1A_C_albicans_SC5314.fasta --ref2 Ca22chr1B_C_albicans_SC5314.fasta --out report.txt --pass_chr chrmt --out_alignment syncronizationSacharo

Reference Statistics

With this application it is possible to obtain the number of nucleotides by chromosome.

$ reference_statistics --help
## You can can copy some example data to test the commands
$ copy_raw_data_example_phasefilter --out temp_raw_data
$ reference_statistics --ref temp_raw_data/Ca22chr7A_C_albicans_SC5314.fasta --out report_stats.txt
$ reference_statistics --ref Ca22chr1A_C_albicans_SC5314.fasta.gz --out retport.txt

Copy some example data to test all tools

It is possible to copy some example data and test the tools available

$ copy_raw_data_example_phasefilter --help
$ copy_raw_data_example_phasefilter --out temp_dir

Documentation

PHASEfilter documentation is available in ReadTheDocs: PHASEfilter

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

PHASEfilter-1.1.0.tar.gz (61.7 kB view hashes)

Uploaded Source

Built Distribution

PHASEfilter-1.1.0-py3-none-any.whl (5.3 MB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page