
A package for machine learning with tabular data


TabML: a Machine Learning pipeline for tabular data


Introduction

This is an active project that aims to create a general machine learning framework for working with tabular data.

Key features:

  • One of the most important tasks in working with tabular data is handling feature extraction. TabML allows users to define each feature in isolation, without worrying about other features. This helps reduce code conflicts when multiple team members are developing different features at the same time. In addition, if one feature needs to be updated, unrelated features can be left untouched, so the computation cost is relatively small compared with re-running a pipeline that regenerates every feature.

  • Parameters are specified in a config file. This config file is automatically saved into an experiment folder after each training run for reproducibility.

  • Supports multiple ML packages for tabular data.
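The first key feature, updating one feature without touching unrelated ones, can be illustrated with a small sketch. This is not TabML's implementation: the dependency map, the feature names, and the helper `features_to_recompute` are all hypothetical, and the sketch only assumes that each feature declares which other features it is computed from.

```python
from collections import deque

# Hypothetical dependency map: feature name -> features it is computed from.
# These names are illustrative and are not taken from TabML.
DEPENDENCIES = {
    "age": [],
    "income": [],
    "age_bucket": ["age"],
    "income_per_age_bucket": ["income", "age_bucket"],
    "is_senior": ["age_bucket"],
}


def features_to_recompute(changed, dependencies):
    """Return the changed feature plus every feature that depends on it,
    directly or transitively. All other features can be left untouched."""
    # Invert the map: feature -> features that use it as an input.
    dependents = {name: [] for name in dependencies}
    for name, inputs in dependencies.items():
        for parent in inputs:
            dependents[parent].append(name)

    # Breadth-first walk over the inverted map, starting at the changed feature.
    affected = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


print(sorted(features_to_recompute("age_bucket", DEPENDENCIES)))
print(sorted(features_to_recompute("income", DEPENDENCIES)))
```

In this toy graph, changing `income` only affects `income` and `income_per_age_bucket`; `age`, `age_bucket`, and `is_senior` do not need to be regenerated.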

Installation

pip install tabml

Main components


In the TRAINING step:

  1. The FeatureManager class is responsible for loading raw data and engineering it into relevant features for model training and analysis. If a feature requires a fit step (e.g. imputation), the fitted parameters are stored for later use in the transform step. One such use is the serving step, where only the transform step runs. Each project has one feature_manager.py file that specifies how each feature is computed (example). The computation order and the feature dependencies are specified in a yaml config file (example).

  2. The DataLoader loads training and validation data for model training and analysis. In a typical project, tabml already takes care of this class; users only need to specify its configuration in the pipeline config file (example). That file must specify the features and the label used for training. In addition, a set of boolean features serves as conditions for selecting training and validation data: only rows that meet all training (or validation) conditions are selected.

  3. The ModelWrapper class defines the model, how to train it, and the methods for loading the model and making predictions.

  4. The ModelAnalysis analyzes the model on different metrics along user-defined dimensions. Analyzing metrics on different slices of data can reveal whether the trained model is biased toward some feature value, or whether there is a slice of data on which model performance could be improved.
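Two ideas from the list above, selecting rows through boolean condition features and measuring a metric per slice of data, can be sketched in plain Python. This is only an illustration under assumed names: the columns (`is_train`, `is_adult`, `gender`, `label`, `pred`) and the helpers `select_rows` and `accuracy_by_slice` are hypothetical and are not part of TabML's API.

```python
from collections import defaultdict

# Toy rows; the column names are made up for this illustration.
ROWS = [
    {"is_train": True, "is_adult": True, "gender": "F", "label": 1, "pred": 1},
    {"is_train": True, "is_adult": True, "gender": "M", "label": 0, "pred": 1},
    {"is_train": True, "is_adult": False, "gender": "F", "label": 0, "pred": 0},
    {"is_train": False, "is_adult": True, "gender": "M", "label": 1, "pred": 1},
    {"is_train": True, "is_adult": True, "gender": "F", "label": 0, "pred": 0},
]


def select_rows(rows, conditions):
    """Keep only rows where every boolean condition column is True."""
    return [row for row in rows if all(row[name] for name in conditions)]


def accuracy_by_slice(rows, dimension):
    """Compute accuracy separately for each value of `dimension`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        key = row[dimension]
        total[key] += 1
        correct[key] += int(row["label"] == row["pred"])
    return {key: correct[key] / total[key] for key in total}


# A row is selected only if it satisfies all listed conditions.
train_rows = select_rows(ROWS, ["is_train", "is_adult"])
print(len(train_rows))
print(accuracy_by_slice(train_rows, "gender"))
```

A large gap between slices, as between the two `gender` values in this toy data, is the kind of signal that per-slice analysis is meant to surface.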

In the SERVING step, raw data is fed into the fitted FeatureManager to get the transformed features that the trained model can use. The model then makes predictions on the transformed features.
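The split between fitting at training time and transforming at serving time can be sketched with a tiny imputer. The `MeanImputer` class below is a hypothetical stand-in written for this illustration, not a TabML class; it only assumes that missing values are represented as `None`.

```python
from statistics import mean


class MeanImputer:
    """Hypothetical feature transformer: fills missing values (None) with
    the mean learned during fit. Not TabML's actual class."""

    def __init__(self):
        self.fill_value = None

    def fit(self, values):
        # Learn the fill value from the observed (non-missing) training values.
        observed = [v for v in values if v is not None]
        self.fill_value = mean(observed)
        return self

    def transform(self, values):
        # Reuse the stored fill value; nothing is re-learned here.
        if self.fill_value is None:
            raise RuntimeError("fit must be called before transform")
        return [self.fill_value if v is None else v for v in values]


# Training: fit on the training data, then transform it.
imputer = MeanImputer().fit([20, None, 40, 30])
train_feature = imputer.transform([20, None, 40, 30])

# Serving: only transform, reusing the parameter stored during training.
serving_feature = imputer.transform([None, 55])
print(train_feature, serving_feature)
```

The serving data never influences the fill value, which keeps the features seen at prediction time consistent with those seen during training.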

Examples

Please check the examples folder for several example projects. For each project:

python feature_manager.py  # to generate features
python pipelines.py  # to train the model

You can change some parameters in the config file and then run python pipelines.py again.

In most projects, users only need to focus their efforts on designing features. The feature dependencies are defined in a yaml config file and the feature implementations are stored in feature_manager.py.

Setup for development

Add path to this repo

Add the following lines to your shell config file (~/.bashrc, ~/.zshrc or any shell config file of your choice):

export TABML=<local_path_to_this_git_repo>
alias 2tabml='cd $TABML; source bashrc; source tabml_env/bin/activate; python3 setup.py install'

Create the environment

cd $TABML
python3 -m venv tabml_env
source tabml_env/bin/activate
pip3 install -r requirements.txt

Set up pre-commit to auto-format code when creating a git commit:

pre-commit install

Check that everything is working

Run the tests:

2tabml
python3 -m pytest ./tests ./examples

Author's notes

How to release a new version

  1. Increase the version in setup.py, as in this PR example.

  2. Generate tar file:

python setup.py sdist

  3. Upload tar file:

twine upload dist/tabml-x.x.xx.tar.gz

Common errors

  1. SHAP

SHAP might not work on macOS if the Xcode version is below 13; try upgrading to Xcode 13. Related issue.

  2. LightGBM

pip install lightgbm might not work on macOS; try following the official installation guide for Mac.


If you find a bug or want to request a feature, feel free to create an issue. Any Pull Request would be much appreciated.
