
Simple python package to generate and cache both random and chromosomal holdouts with arbitrary depth.

How do I install this package?

As usual, just install it using pip:

pip install holdouts_generator

Tests Coverage

Since coverage tools sometimes report slightly different results, here are three of them: Coveralls, SonarCloud and Code Climate.

Generating random holdouts

Suppose you want to generate 3 layers of holdouts, with test sizes of 0.3, 0.2 and 0.1 and with 5, 3 and 2 holdouts per layer, respectively:

import pandas as pd
from holdouts_generator import holdouts_generator, random_holdouts

dataset = pd.read_csv("path/to/my/dataset.csv")
generator = holdouts_generator(
    dataset,
    holdouts=random_holdouts(
        [0.3, 0.2, 0.1],  # test size of each layer
        [5, 3, 2]  # number of holdouts in each layer
    )
)

for (training, testing), inner_holdouts in generator():
    for (inner_train, inner_test), small_holdouts in inner_holdouts():
        for (small_train, small_test), _ in small_holdouts():
            pass  # do what you need :)
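To make the shape of this nested iteration concrete, here is a minimal sketch that uses only the standard library. It is not how holdouts_generator is implemented: the function nested_random_holdouts and its parameters are hypothetical names chosen for illustration, and returning None at the deepest layer is a choice of this sketch.

```python
import random


def nested_random_holdouts(rows, test_sizes, quantities, seed=42):
    """Return a callable yielding ((train, test), inner) for each holdout.

    Hypothetical helper for illustration only: `inner` is another such
    callable built on the training rows, or None at the deepest layer.
    """
    def generator():
        rng = random.Random(seed)
        for _ in range(quantities[0]):
            shuffled = rows[:]
            rng.shuffle(shuffled)
            n_test = round(len(shuffled) * test_sizes[0])
            test, train = shuffled[:n_test], shuffled[n_test:]
            inner = None
            if len(test_sizes) > 1:
                # The next layer only ever sees this holdout's training rows.
                inner = nested_random_holdouts(
                    train, test_sizes[1:], quantities[1:],
                    seed=rng.randrange(2**32),
                )
            yield (train, test), inner
    return generator


rows = list(range(100))
outer = nested_random_holdouts(rows, [0.3, 0.2], [5, 3])
for (training, testing), inner_holdouts in outer():
    for (inner_train, inner_test), _ in inner_holdouts():
        # Each inner holdout re-splits the outer training rows.
        assert len(inner_train) + len(inner_test) == len(training)
```

Because every inner layer is built from the outer training rows alone, the outer test rows never leak into any inner split.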

Generating balanced random holdouts

Suppose you want to generate 3 layers of holdouts, as above, but now you want to enforce the same class proportions in every split. In this setup, it is of fundamental importance to pass the classes as the last positional argument, right after the dataset.

import pandas as pd
from holdouts_generator import holdouts_generator, balanced_random_holdouts

dataset = pd.read_csv("path/to/my/dataset.csv")
classes = pd.read_csv("path/to/my/classes.csv")
generator = holdouts_generator(
    dataset, classes,  # the classes must come last
    holdouts=balanced_random_holdouts(
        [0.3, 0.2, 0.1],  # test size of each layer
        [5, 3, 2]  # number of holdouts in each layer
    )
)

for (training, testing), inner_holdouts in generator():
    for (inner_train, inner_test), small_holdouts in inner_holdouts():
        for (small_train, small_test), _ in small_holdouts():
            pass  # do what you need :)
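As an illustration of what "the same proportions for each class" means, the following standard-library sketch splits row indices class by class. The helper balanced_split is a hypothetical name and is not part of holdouts_generator; the package may balance its splits differently.

```python
import random
from collections import defaultdict


def balanced_split(labels, test_size, seed=42):
    """Split row indices so each class sends about `test_size` to the test set.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        # Shuffle and cut each class separately to preserve its proportion.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["positive"] * 20 + ["negative"] * 80
train, test = balanced_split(labels, test_size=0.3)
# Both classes contribute 30% of their rows to the test set.
print(sum(labels[i] == "positive" for i in test))  # 6
print(sum(labels[i] == "negative" for i in test))  # 24
```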

Generating chromosomal holdouts

Suppose you want to generate 2 layers of holdouts: two outer holdouts, on chromosomes 17 and 18, each with 3 inner holdouts on the other of the two (18 or 17), 20 and 21:

import pandas as pd
from holdouts_generator import holdouts_generator, chromosomal_holdouts

dataset = pd.read_csv("path/to/my/genomic_dataset.csv")
generator = holdouts_generator(
    dataset,
    holdouts=chromosomal_holdouts([
        # (outer chromosomes, [(inner chromosomes, deeper holdouts or None), ...])
        ([17], [([18], None), ([20], None), ([21], None)]),
        ([18], [([17], None), ([20], None), ([21], None)])
    ])
)

for (training, testing), inner_holdouts in generator():
    for (inner_train, inner_test), _ in inner_holdouts():
        pass  # do what you need :)
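The sketch below illustrates the idea of a chromosome-based split with plain Python. It rests on two assumptions that this page does not confirm: that the listed chromosomes form the test set, and that each row carries its chromosome under a "chrom" key. The helper chromosomal_split is a hypothetical name, not part of holdouts_generator.

```python
def chromosomal_split(rows, test_chromosomes):
    """Send rows on the listed chromosomes to the test set, the rest to train.

    Hypothetical helper; assumes each row is a dict with a "chrom" key.
    """
    held_out = set(test_chromosomes)
    train = [row for row in rows if row["chrom"] not in held_out]
    test = [row for row in rows if row["chrom"] in held_out]
    return train, test


rows = [
    {"chrom": chrom, "value": index}
    for index, chrom in enumerate([17, 18, 20, 21, 17, 20])
]

# Outer holdout on chromosome 17, then an inner holdout on chromosome 18.
training, testing = chromosomal_split(rows, [17])
inner_train, inner_test = chromosomal_split(training, [18])
```

Unlike a random split, the result is fully determined by the chromosome lists, so no seed is involved.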

Generating cached holdouts

To generate cached holdouts, import cached_holdouts_generator instead of holdouts_generator. Everything else stays basically the same, except that each iteration also yields the holdout's cache key, which you can use for storing the results.

import pandas as pd
from holdouts_generator import cached_holdouts_generator, balanced_random_holdouts

dataset = pd.read_csv("path/to/my/dataset.csv")
classes = pd.read_csv("path/to/my/classes.csv")
generator = cached_holdouts_generator(
    dataset, classes,
    holdouts=balanced_random_holdouts(
        [0.3, 0.2],  # test size of each layer
        [5, 3]  # number of holdouts in each layer
    )
)

# Each iteration also yields the cache key of the holdout
for (training, testing), key, inner_holdouts in generator():
    for (inner_train, inner_test), inner_key, small_holdouts in inner_holdouts():
        pass  # do what you need :)
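One way to use the key is to name a result file after it, so that a holdout that was already evaluated can be skipped on the next run. The sketch below assumes the key is a string that is safe to use as a file name, which this page does not state. The helpers save_metrics and read_metrics are hypothetical names and not part of holdouts_generator, and the metric value is a placeholder.

```python
import json
import os
import tempfile


def save_metrics(results_directory, key, metrics):
    """Write `metrics` as JSON to a file named after the holdout key."""
    os.makedirs(results_directory, exist_ok=True)
    path = os.path.join(results_directory, f"{key}.json")
    with open(path, "w") as f:
        json.dump(metrics, f)
    return path


def read_metrics(results_directory, key):
    """Return the metrics stored for `key`, or None if there are none."""
    path = os.path.join(results_directory, f"{key}.json")
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


results_directory = tempfile.mkdtemp()
key = "example-holdout-key"  # placeholder; real keys come from the generator
if read_metrics(results_directory, key) is None:
    # Train and evaluate here, then store what you measured.
    save_metrics(results_directory, key, {"accuracy": 0.9})  # placeholder value
```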

Clearing the holdouts cache

Just run the method clear_cache:

from holdouts_generator import clear_cache

clear_cache(
    cache_dir=".holdouts" # This is the default cache directory
)

Clearing the invalid holdouts

Sometimes, when moving holdouts around or running parallel processes on clusters whose machines have different specifications, some holdouts can be created twice, overwriting the original cache.

In this unlikely scenario, the holdouts are marked as tampered. To delete them, use the following:

from holdouts_generator import clear_invalid_cache

clear_invalid_cache(
    cache_dir=".holdouts" # This is the default cache directory
)

Clearing the invalid results

Just as you can get invalid holdouts, you can also get invalid results that map to invalid holdouts. For this reason, there is a method to delete these results:

from holdouts_generator import clear_invalid_results

clear_invalid_results(
    results_directory="results", # This is the default results directory
    cache_dir=".holdouts" # This is the default cache directory
)
