RNA benchmarking tools and utilities.

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

RnaBench

A Comprehensive Library For In Silico RNA Modelling

Install

with Anaconda

conda env create -f environment.yml

Activate the environment with

conda activate RnaBench

For reproducing all plots of the main paper, you can run

./reproduce_all.sh

However, this call will take alot of time. We recommend a cheaper version, that reproduces the plots of the main paper with a limited amount of baselines, run

./reproduce_minimal.sh

The install and reproduction scripts were tested on a linux-64 platform.

with pip

Pre-requisites

Ensure you have Python version 3.9 installed. You can check your Python version with:

python --version

f you don't have Python 3.9, you can download it from the official Python website.

Step 1: Install `RnaBench` via pip

For infernal, choose one of the following methods:

Debian package:

If you are using Debian:

sudo apt-get install infernal infernal-doc

Homebrew Science package:

If you're using Homebrew:

brew tap brewsci/bio 
brew install infernal

Step 2: Install `RnaBench` via pip

Option A: From PyPI

pip install RnaBench

Option B: Directly from GitHub

To get the latest version directly from the repository:

pip install git+https://github.com/Rungetf/RnaBench.git

General Usage

We provide a simple API. All you need to do is to define a function that wraps your model's predictions. An example script for an RNA Folding model might look like this

import RnaBench

benchmark = RnaBench.RnaFoldingBenchmark()

model = YourModel()

def prediction_wrapper(task, *args, **kwargs):
    predicted_pairs = model(task.sequence)
    return predicted_pairs

metrics = benchmark(prediction_wrapper, *args, **kwargs)
print(metrics)

Examples

We provide simple examples with our baselines algorithms. All examples can be found in the examples directory.

You can e.g. run one of our RNA folding examples with

python -m examples.rna_folding_examples.<example file without .py suffix>

Data

We provide a bunch of different datasets. If you are only running the benchmarks, data will be downloaded on the fly as needed. However, you can download all datasets by running

python -m RnaBench.download

Our datasets are pandas dataframes. If you would like to build your own data with RnaBench, consider at least the following columns in the initial datasets:

'sequence': <List[str]> The sequence as a list of string characters
'pairs': <List[Tuple[int, int, int]]> A list of pairs, provided as triples of (pairing position 1, pairing position 2, level of nesting)
'Id': <str/int> The Id of the sample
'has_pk': Boolean if the sample contains pseudoknots
'has_multiplet': Bolean if the sample contains base multiplets
'has_nc': Boolean if the sample contains non-canonical base pairs
'length': The length of the sequence
'gc_content': The GC-nucleotide ration of the sequence
'origin': Where does this sample come from? Can be used to define datasets, e.g. TS0.

External Sources

The data processing pipeline for our datasets is also available. You can remove sequence similarity, blast the training set for homologs with test samples and query covariance models to further remove redundancy between training and test data. However, there are external libraries required to achieve complete functionality. To install all external sources, run

./install_external_algorithms.sh

Data Processing Pipeline

We provide config examples for our data pipeline at RnaBench/lib/data/data_configs. You can get a test run of the pipeline with

python -m RnaBench.lib.data.build_dataset --config_path RnaBench/lib/data/data_configs/reproduce_inter_family_data_config.yml

However, this run might take some minutes.

Building Covariance Models from Custom Data

python -m RnaBench.lib.data.build_cm --df_path <path to dataframe>

Biophysical Model data

Our pipeline includes a flag for sampling sequences from Rfam covariance models. The Rfam.cm and its preparation are part download script.

python -m RnaBench.download

We provide an example config file for the Biophysical model pipeline at RnaBench/lib/data/data_configs/biophysical_example_config.yml.

3D data

We use the data from RNAsolo for our 3D RNA data pipeline and provide a parser for the provided mmCIF structure files using Biopython. However, RNAsolo is updated very regularly to follow a new BGSU version.

We note: Whenever RNAsolo updates to a new BGSU version, our download of 3D data might not work out-of-the-box. You can change the BGSU version for the download script in the RnaBench/download.py script. you can look up the BGSU version of RNAsolo in the bottom left corner here:

https://rnasolo.cs.put.poznan.pl/archive

However, we also provide PDB's download script at RnaBench/lib/data/threedee that you can also use for downloading specific PDB IDs. Afterwards, you can run your model with RnaBench's dataset for 3D data as well and a torch dataloader. Moreover, we provide the representatives of the equivalence-classes for a resolution threshold at 1.5 angstrom in the data directory. An example script for the usage of our 3D data pipeline, including RMSD computation looks as follows

from torch.utils.data import DataLoader
from Bio.SVDSuperimposer import SVDSuperimposer

from RnaBench.lib.data.threedee.data import Rna3dDataset

mmcif_dir = "<path to directory containing mmCIF files>"

rna_3d_dataset = Rna3dDataset(mmcif_dir, device='cpu')
sup = SVDSuperimposer()
data_iterator = DataLoader(rna_3d_dataset, batch_size=64)

for i_batch, sampled_batch in enumerate(data_iterator):
    for b, length in enumerate(sampled_batch["sequence_length"].detach().cpu().numpy()):
        rna_sequence = [rna_3d_dataset.nucleotide_itos[i] for i in sampled_batch['sequence'][b, :length].detach().cpu().numpy()]
    for b, length in enumerate(sampled_batch["length"].detach().cpu().numpy()):
        true_x = sampled_batch['x_coordinate'][b, :length].detach().cpu().numpy()
        true_y = sampled_batch['y_coordinate'][b, :length].detach().cpu().numpy()
        true_z = sampled_batch['z_coordinate'][b, :length].detach().cpu().numpy()

        # do some prediction to get x, y, z coordinates for each atom
        x, y, z = model(rna_sequence)

        true_coords = np.stack([true_x, true_y, true_z], axis=1)
        pred_coords = np.stack([x, y, z], axis=1)

        sup.set(true_coords, pred_coords)
        sup.run()
        rmsd = sup.get_rms()
        print('RMSD:', np.round(rmsd, 4))

Baselines

Our baselines are either implemented within RnaBench, or are part of the installation of external algorithms.

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.1.2

Oct 31, 2023

0.1

Oct 31, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

RnaBench-0.1.2-py3-none-any.whl (156.9 kB view hashes)

Uploaded Oct 31, 2023 Python 3

Hashes for RnaBench-0.1.2-py3-none-any.whl

Hashes for RnaBench-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`87741668dc89b19a46d9044b996c0a3a67c1e6edf0b33fe7b5357b509c2810d5`
MD5	`a7c4737e0397d5c4788a5274ac08e629`
BLAKE2b-256	`42238b3959a48a27a517775ddeeeb85c319cc8fbe1ba8c9324de1bc651a66ced`

RnaBench 0.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Classifiers

Project description

RnaBench

Install

with Anaconda

with pip

Pre-requisites

Step 1: Install `RnaBench` via pip

Step 2: Install `RnaBench` via pip

Option A: From PyPI

Option B: Directly from GitHub

General Usage

Examples

Data

External Sources

Data Processing Pipeline

Building Covariance Models from Custom Data

Biophysical Model data

3D data

Baselines

Project details

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distributions

Built Distribution

RnaBench 0.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Classifiers

Project description

RnaBench

Install

with Anaconda

with pip

Pre-requisites

Step 1: Install RnaBench via pip

Step 2: Install RnaBench via pip

Option A: From PyPI

Option B: Directly from GitHub

General Usage

Examples

Data

External Sources

Data Processing Pipeline

Building Covariance Models from Custom Data

Biophysical Model data

3D data

Baselines

Project details

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distributions

Built Distribution

Step 1: Install `RnaBench` via pip

Step 2: Install `RnaBench` via pip