A lightweight database manager using HDF5
BAMBOOST
Bamboost is a Python library built for data management using
the HDF5 file format.
bamboost stands for a lightweight shelf which will boost your efficiency and which
will totally break if you load it heavily. Just kidding, bamboo can fully carry pandas.
🐼🐼🐼🐼
Documentation
Installation
Install the latest release from the Package repository:
pip install bamboost
:warning: If your system runs into problems installing mpi4py, make sure the Python header files are installed. Search for the package your platform needs (something like python3-dev, libpython3.8-dev, etc.).
Install the package in editable mode for more flexibility, e.g. if you plan to make changes yourself:
git clone git@gitlab.com:cmbm-ethz/bamboost.git
cd bamboost
pip install -e .
:warning: The option -e installs a project in editable mode from a local path. This way, you won't need to reinstall when pulling a new version or changing something in the package.
h5py with parallel support
For MPI support, h5py must be installed with parallel support. Otherwise, each
process writes one after the other, which takes forever. The default installation on
Euler is not enough.
It's simple, do the following:
export CC=mpicc
export HDF5_MPI="ON"
pip install --force-reinstall --no-deps --no-binary=h5py h5py
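To check whether the rebuilt h5py really has parallel support, one option is to inspect its build configuration. The sketch below relies on h5py.get_config().mpi; the helper name h5py_has_mpi is made up for illustration, and it returns None when h5py cannot be imported at all.

```python
import importlib


def h5py_has_mpi():
    """Return True/False for an MPI-enabled h5py build, None if h5py is missing."""
    try:
        h5py = importlib.import_module("h5py")
    except ImportError:
        return None
    # get_config().mpi is True only when h5py was compiled against parallel HDF5
    return bool(h5py.get_config().mpi)


print("h5py parallel support:", h5py_has_mpi())
```

If this prints False after the reinstall, the build most likely did not pick up the MPI compiler set through CC.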
Requirements
Python > 3.7
(if your version is too low, it's very likely only because of type hints. Please report it and we can remove or change them)
bamboost
depends on the following packages:
numpy
pandas
h5py
mpi4py
Usage
Manager
The main object of bamboost is the Manager. It manages the database located in the directory
specified during construction. It can display the parametric space, create new simulations, remove simulations,
and select a specific simulation based on its uid or on conditions on its parameters.
Every database that is created is assigned a unique identifier (UID).
from bamboost import Manager
db = Manager('path/to/db')
pandas.DataFrame
is used to display the database. The dataframe is convenient and fast to filter or sort your entries:
db.df
An entry (from now on called simulation) within a database can be viewed, retrieved and modified with the Simulation object.
To get the Simulation object, access it with its identifier or location (index) in the dataframe:
sim = db['uid']
sim = db[index]
sim = db.sim('uid')
All simulations can be returned as a (sorted) list. The argument select
can be used to
filter the simulations.
sims = db.sims() # returns all
sims = db.sims(select=(db.df.eps==1)) # returns all where eps is 1
sims = db.sims(sort='parameter1', reverse=False) # returns all, sorted by parameter1
:warning: Note that this creates objects for every simulation and the sorting is not optimized. Using pandas to select and sort is much faster. Check their documentation for how to manipulate pandas dataframes.
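A minimal sketch of doing the selection and sorting in pandas first. The DataFrame below is only a stand-in for db.df, and the column names (id, eps, parameter1) are assumptions for illustration.

```python
import pandas as pd

# Stand-in for db.df; the column names "id", "eps" and "parameter1" are assumed.
df = pd.DataFrame(
    {
        "id": ["a1", "b2", "c3", "d4"],
        "eps": [1, 2, 1, 1],
        "parameter1": [0.3, 0.1, 0.2, 0.05],
    }
)

# A boolean mask selects the rows, sort_values orders them by a parameter.
selected = df[df["eps"] == 1].sort_values("parameter1")

# Only the identifiers of the matching rows are kept.
uids = selected["id"].tolist()
print(uids)
```

The resulting uids could then be passed to db[uid] one at a time, so that Simulation objects are only created for the entries that are actually needed.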
Database index
Every database created will be assigned a unique identifier (UID).
The database path is stored with the UID in an index maintained at ~/.config/bamboost
in your home directory. If the path is not known, bamboost
will try to find it on your disk (you can add paths to search in ~/.config/bamboost/known_paths.json).
You can obtain a Manager object of any database from anywhere with its UID. In notebooks, key completion will show you all known databases:
db = Manager.fromUID['UID']
The unique id makes referring to data safe. The full identifier of a simulation is considered to be '(database id):(simulation id)'. It is encouraged to use the identifiers (instead of the path) to link from one simulation to a different one.
# add a link to a different simulation (e.g. the mesh)
sim.links['mesh_to_use'] = 'DATABASE-ID:simulation-id'
# the full id of a simulation is accessible as such
uid = sim.get_full_uid()
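If a script needs the two parts of a full identifier separately, splitting on the first colon is enough. The helper below is hypothetical (it is not part of bamboost) and assumes that the database id itself contains no colon.

```python
from typing import Tuple


def split_full_uid(full_uid: str) -> Tuple[str, str]:
    """Split 'DATABASE-ID:simulation-id' into (database id, simulation id)."""
    db_id, sep, sim_id = full_uid.partition(":")
    if not sep or not db_id or not sim_id:
        raise ValueError(f"expected 'database-id:simulation-id', got {full_uid!r}")
    return db_id, sim_id


db_id, sim_id = split_full_uid("DATABASE-ID:simulation-id")
print(db_id, sim_id)
```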
Write data
You can use bamboost
to write simulation or experimental data.
Use the Manager
to create a new simulation (or access an existing one).
Say you have (or want to create) a database at data_path.
The code sample below shows the main functionality.
from bamboost import Manager
db = Manager(data_path)
params = {...} # dictionary of parameters (can have nested dictionaries)
writer = db.create_simulation(parameters=params)
writer.copy_file('path/to/file/') # copy a file which is needed to the database folder (e.g. executable, module list, etc.)
writer.change_note('This run is part of a series in which I investigate the effect of worms on apples')
# Use context manager (with block) and the file will be tagged 'running', 'finished', 'failed' automatically
with writer:
    writer.add_metadata() # adds time and other metadata
    writer.add_mesh(coordinates, connectivity) # Add a mesh, default mesh is named 'mesh'.
    writer.add_mesh(coordinates, connectivity, mesh_name='interface') # add a second mesh for e.g. the interface
    # loop through your time data and write
    for t in times:
        writer.add_field('field_data_1', array, time=t)
        writer.add_field('field_data_2', array, time=t, mesh='interface')
        writer.add_global_field('kinetic_energy', some_number)
        writer.finish_step() # this increases the step counter
If you have an existing dataset, e.g. because you created the simulation beforehand and it already holds the input parameters, do the following. You will need to pass the path and the uid to the script (best use argparse).
from bamboost import SimulationWriter
with SimulationWriter(path, uid) as writer:
    # Do anything
    ...
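One way to pass the path and the uid to such a script with argparse is sketched below. The option names --path and --uid are an arbitrary choice for illustration, not something bamboost prescribes.

```python
import argparse


def parse_args(argv=None):
    """Parse the database path and simulation uid from the command line."""
    parser = argparse.ArgumentParser(
        description="Write results into an existing simulation."
    )
    parser.add_argument("--path", required=True, help="path to the database directory")
    parser.add_argument("--uid", required=True, help="uid of the existing simulation")
    return parser.parse_args(argv)


# Passing an explicit list mimics: python script.py --path path/to/db --uid my-uid
args = parse_args(["--path", "path/to/db", "--uid", "my-uid"])
print(args.path, args.uid)
```

In the real script, parse_args() would be called without arguments, and args.path and args.uid would be handed to SimulationWriter.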
Userdata (Data not related to time and/or space)
The above functionality should be used for ordered data, such as time series of spatial data related to a mesh.
For anything else, there is the userdata category. You can use it to store (almost) anything in the simulation file, structured however you like. This is also useful to store computed values during postprocessing or plotting.
Internally, Userdata is an object handling a specific group ('/userdata') of the HDF5 file. To show the content of the group, display the object:
sim.userdata
You can create a subgroup, which will return a self-similar object for the new group (e.g. '/userdata/plots'):
plot_grp = sim.userdata.require_group('plots')
Writing something to the file (group) is as easy as:
sim.userdata['avg_T'] = 34.56256
sim.userdata['traction_profile'] = np.array([...])
And reading:
# read avg_T
sim.userdata['avg_T']
# read dataset traction_profile
sim.userdata['traction_profile']
# note that this returns a Dataset object. To actually read the array, you need to slice it
sim.userdata['traction_profile'][:]
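The difference between a Dataset object and the array it holds comes from h5py itself. The sketch below shows it with plain h5py on an in-memory file; it does not use bamboost, the file name is arbitrary, and bamboost's Userdata object may treat scalar values differently from what h5py does here.

```python
import h5py
import numpy as np

# In-memory HDF5 file: driver="core" with backing_store=False writes nothing to disk.
with h5py.File("userdata_demo.h5", "w", driver="core", backing_store=False) as f:
    userdata = f.require_group("userdata")
    plots = userdata.require_group("plots")      # creates the subgroup if missing
    plots_name = plots.name

    userdata["avg_T"] = 34.56256
    userdata["traction_profile"] = np.linspace(0.0, 1.0, 5)

    dataset = userdata["traction_profile"]       # h5py.Dataset, nothing read yet
    is_dataset = isinstance(dataset, h5py.Dataset)
    profile = dataset[:]                         # slicing reads into a numpy array
    avg_T = userdata["avg_T"][()]                # scalar datasets are read with [()]

print(plots_name, is_dataset, profile, avg_T)
```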
Read data
The key purpose is convenient access to data. I recommend an interactive session (notebooks).
Display database
from bamboost import Manager
db = Manager(data_path)
To display the database with its parametric space simply input
db.df
Select a simulation of your dataset. sim
will be a SimulationReader
object.
sim = db[index]
sim = db[uid]
sims = db.sims((db.df.param1==2) & (db.df.param2>0), sort='param2') # will return list of all matching, sorted by param2
Show data stored: Display content of the data, userdata, globals groups:
sim.data
sim.userdata
sim.globals
This displays the stored fields and their sizes.
sim.data.info
Access a mesh: Directly access a tuple where [0] is the coordinates, [1] is the connectivity.
coords, conn = sim.mesh # default mesh
coords, conn = sim.get_mesh(mesh_name=...)
You can get a mesh object the following way.
mesh1 = sim.meshes['mesh1']
mesh1.coordinates # gives coordinates
mesh1.connectivity # gives connectivity
mesh1.get_tuple() # gives both the above
Access field data:
sim.data
acts as an accessor for all field data.
field1 = sim.data['field1']
field1[:], field1[0, :] # slice the dataset and you get numpy arrays (time, *spatial)
field1.at_step(-1) # similar for access of one step
field1.mesh # returns the linked mesh object (see above)
field1.msh # returns a tuple of the mesh (coordinates, connectivity)
field1.coordinates, field1.connectivity # direct access to the linked mesh's coords and conn arrays
field1.times # returns timesteps of data
field1.shape # shape of data
field1.dtype # data type of data
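Since sliced field data arrives as numpy arrays with time as the first axis, ordinary numpy indexing applies. The array below is a stand-in for field1[:]; its shape (steps, nodes, components) is an assumption for illustration.

```python
import numpy as np

# Stand-in for field1[:], laid out as (time, *spatial) = (steps, nodes, components).
n_steps, n_nodes, dim = 4, 3, 2
field = np.arange(n_steps * n_nodes * dim, dtype=float).reshape(n_steps, n_nodes, dim)

last_step = field[-1]               # all nodes at the last step, shape (3, 2)
x_over_time = field[:, :, 0]        # first component at every step, shape (4, 3)
node0_history = field[:, 0, :]      # history of the first node, shape (4, 2)

print(last_step.shape, x_over_time.shape, node0_history.shape)
```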
Access global data:
sim.globals
kinetic_energy = sim.globals.kinetic_energy
Open file: All methods internally open the HDF5 file and make sure that it is closed again. Sometimes it's useful to keep the file open (e.g. to directly change something in the file manually). To do so, you are encouraged to use the following.
:warning: Do not open the file in write mode ('w') as this truncates the file.
with sim.open(mode='r+') as file:
    # do anything
    # in here, you can still use all functions of bamboost; they will not close
    # the file while you have it open manually
    ...
Job management
You can use bamboost to create Euler jobs and to submit them.
import os
from bamboost import Manager
db = Manager(data_path)
params = {...} # dictionary of parameters (can have nested dictionaries)
sim = db.create_simulation(parameters=params)
sim.copy_file('path/to/postprocess.py') # copy a file which is needed to the database folder (e.g. executable, module list, etc.)
sim.copy_file('path/to/cpp_script')
sim.change_note('This run is part of a series in which I investigate the effect of worms on apples')
# commands to execute in batch job
commands = []
commands.append('./cpp_script')
commands.append(f'mpirun python {os.path.join(sim.path, "postprocess.py")}') # e.g. to write the output to the database from cpp output
sim.create_batch_script(commands, ntasks=4, time=..., mem_per_cpu=..., euler=True)
sim.submit() # submits the job using slurm (works only in jupyterhub sessions on Euler)
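The command list can be assembled and inspected without bamboost before it is handed to create_batch_script. In the sketch below, sim_path is a stand-in for sim.path, and the use of shlex.quote is an added precaution for paths that contain spaces, not something bamboost requires.

```python
import os
import shlex

# Stand-in for sim.path; the directory name is made up for illustration.
sim_path = "path/to/db/sim-uid"
script = os.path.join(sim_path, "postprocess.py")

commands = [
    "./cpp_script",
    # shlex.quote only adds quotes when the path contains shell-special characters
    f"mpirun python {shlex.quote(script)}",
]

print("\n".join(commands))
```

Building the path in a separate variable also avoids nesting the same quote character inside an f-string, which is a syntax error before Python 3.12.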
To be continued...
Feature requests / Issues
Please open issues on gitlab: cmbm/bamboost