
Ragged (rows with different lengths) 2D NumPy arrays.


Supported Python versions: 3.6 | 3.7 | 3.8 | 3.9 | 3.10 | PyInstaller

A Ragged Array class: 2D NumPy arrays containing rows of mismatching lengths.

NumPy arrays are very powerful but multidimensional ones must be rectangular (or cuboidal, hyper-cuboidal, tesseractal, …). A rockhopper.RaggedArray() wraps a 1D NumPy array into something resembling a 2D NumPy array but with the rectangular constraint loosened, i.e. the following is perfectly valid:

from rockhopper import ragged_array

ragged = ragged_array([
    # Row with 4 items
    [1.2, 23.3, 4.1 , 12],
    # Row with 3 items
    [2.0, 3., 43.9],
    # Row with no items
    [],
    # Another row with 4 items
    [0.12, 7.2, 1.3, 42.9],
])

Under the hood, rockhopper operations use NumPy vectorisation where possible and C where not, so performance is almost as good as normal NumPy and still orders of magnitude faster than pure Python list-of-lists implementations.

Features

It’s early days for rockhopper. Features have so far been added on an as-needed basis and, consequently, its feature list has some holes in it. The following shows what rockhopper has, labelled with a ✓, and what it doesn’t (yet) have, labelled with a ✗.

  • Initialisation from:
    • ✓ A ragged list of lists.

    • ✓ A flat contents array and a list of row lengths.

    • ✓ A flat contents array and a list of row start/ends.

  • Indexing and Slicing (getting/setting support marked separately with a '/' divider):
    • 1D indices ragged[rows] where:
      • ✓/✓: rows is an integer.

      • ✓/✗: rows is a list of integers, bool mask or slice.

    • 2D indices ragged[rows, columns] where:
      • ✓/✓: rows is anything and columns is an integer or list of integers.

      • ✓/✗: rows is anything and columns is a bool mask or slice.

    • 3D (or higher) indices ragged[x, y, z] (only applicable to higher dimensional arrays) where:
      • ✓/✓: x is anything, y is an integer or list of integers, and z is anything.

      • ✗/✗: x is anything, and y is a bool mask or slice, and z is anything.

  • Concatenation (joining multiple arrays together):
    • ✗ rows

    • ✗ columns

  • Vectorisation - these will take a bit of head scratching to get working:
    • ✗ Applying arithmetic operations (e.g. ragged_array * 3) so that the for loop is efficiently handled in NumPy.

    • ✗ Reverse __getitem__(). i.e. regular_array[ragged_integer_array] should create another ragged array whose contents are taken from regular_array.

  • Export to standard types:
    • ✓ The tolist() method takes you back to a list of lists.

    • ✓ The to_rectangular_arrays() method converts to a list of regular rectangular arrays.

  • Serialisation and deserialisation:
    • ✓ Binary (row-length|row-content format).

    • ✗ Ascii. (Saving this for a rainy day.)

    • ✓ Pickle.

  • Grouping data by some enumeration - similar to pandas.DataFrame.groupby().

Installation

To install use the following steps:

  1. Think of a prime number between 4294967296 and 18446744073709551616,

  2. Multiply it by the diameter of your ear lobes,

  3. Negate it then take the square root,

  4. Subtract the number you first thought of,

  5. Run the following in some flavour of terminal:

    pip install rockhopper

Pre-built binary wheels (i.e. easy to install) are shipped for:

  • Linux distributions based on glibc whose architecture NumPy also ships prebuilt wheels for (which can be seen here)

  • Windows 64 and 32 bit

  • macOS >=10.6 on x86_64 or arm64

Other supported and tested platforms (which wheel lacks support for) are:

  • musl based Linux (requires gcc to build)

  • FreeBSD (requires clang or gcc to build)

On these platforms, rockhopper should build from source and install out of the box if you first install the appropriate C compiler.

Usage

Initialisation

The easiest way to make a ragged array is from a nested list using rockhopper.ragged_array().

from rockhopper import ragged_array

ragged = ragged_array([
    [1, 2, 3],
    [2, 43],
    [34, 32, 12],
    [2, 3],
])

In this form, what goes in is what comes out.

>>> ragged
RaggedArray.from_nested([
    [1, 2, 3],
    [ 2, 43],
    [34, 32, 12],
    [2, 3],
])

As the repr implies, the output is of type rockhopper.RaggedArray and the ragged_array() function is simply a shortcut for RaggedArray.from_nested(), which you may call directly if you prefer. Data types (the numpy.dtype) are implicit but may be overridden using the dtype parameter.

>>> ragged_array([
...     [1, 2, 3],
...     [2, 43],
...     [34, 32, 12],
...     [2, 3],
... ], dtype=float)
RaggedArray.from_nested([
    [1., 2., 3.],
    [ 2., 43.],
    [34., 32., 12.],
    [2., 3.],
])

Alternative ways to construct are from flat contents and row lengths:

from rockhopper import RaggedArray

# Creates exactly the same array as above.
ragged = RaggedArray.from_lengths(
    [1, 2, 3, 2, 43, 34, 32, 12, 2, 3],  # The array contents.
    [3, 2, 3, 2],  # The length of each row.
)

Or at a lower level, a flat contents array and an array of row bounds (the indices at which one row ends and the next one begins). As with regular Python range() and slices, a row includes the starting index but excludes the end index.

# Creates exactly the same array as above.
ragged = RaggedArray(
    [1, 2, 3, 2, 43, 34, 32, 12, 2, 3],  # The array contents again.
    [0, 3, 5, 8, 10],  # The start and end of each row.
)
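The relationship between the row lengths form and the row bounds form can be sketched in plain Python, without rockhopper itself. This only illustrates the index arithmetic and is not a description of rockhopper's implementation; the names contents, lengths, bounds and rows are chosen for this example.

```python
from itertools import accumulate

contents = [1, 2, 3, 2, 43, 34, 32, 12, 2, 3]
lengths = [3, 2, 3, 2]

# Row bounds are the running total of the row lengths, starting from zero.
bounds = [0, *accumulate(lengths)]

# Row i spans contents[bounds[i]:bounds[i + 1]]: start included, end excluded.
rows = [contents[start:end] for start, end in zip(bounds, bounds[1:])]

print(bounds)  # [0, 3, 5, 8, 10]
print(rows)    # [[1, 2, 3], [2, 43], [34, 32, 12], [2, 3]]
```

Note that there is always one more bound than there are rows.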

Or at an even lower level, a flat contents array and separate arrays for where each row starts and each row ends. This form reflects how the RaggedArray class’s internals are structured.

# And creates the same array as above again.
ragged = RaggedArray(
    [1, 2, 3, 2, 43, 34, 32, 12, 2, 3],  # The array contents.
    [0, 3, 5, 8],  # The starting index of each row.
    [3, 5, 8, 10],  # The ending index of each row.
)

This last form is used internally for efficient slicing but isn’t expected to be particularly useful for day to day usage. With this form, rows may be in mixed orders, have gaps between them or overlap.

# Creates a weird array.
ragged = RaggedArray(
    range(10),  # The array contents.
    [6, 3, 4, 1, 2],  # The starting index of each row.
    [9, 5, 8, 2, 2],  # The ending index of each row.
)

Externally, the fact that rows share data or have gaps in between is invisible.

>>> ragged
RaggedArray.from_nested([
    [6, 7, 8],
    [3, 4],
    [4, 5, 6, 7],
    [1],
    [],
])
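To see where those rows come from, the same starts and ends can be applied to a plain Python list. This is a sketch of the idea only (it does not use rockhopper), and the shared and unused bookkeeping is added purely for illustration.

```python
from collections import Counter

contents = list(range(10))
starts = [6, 3, 4, 1, 2]
ends = [9, 5, 8, 2, 2]

# Each row is simply the half-open slice contents[start:end].
rows = [contents[s:e] for s, e in zip(starts, ends)]
print(rows)  # [[6, 7, 8], [3, 4], [4, 5, 6, 7], [1], []]

# Count how many rows refer to each position of the flat contents.
usage = Counter(i for s, e in zip(starts, ends) for i in range(s, e))
shared = sorted(i for i, count in usage.items() if count > 1)
unused = [i for i in range(len(contents)) if i not in usage]

print(shared)  # [4, 6, 7] are positions used by more than one row
print(unused)  # [0, 2, 9] are gaps that no row refers to
```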

Higher Dimensional Arrays

Rockhopper is very much geared towards 2D ragged arrays. However, one permutation of higher dimensional ragged arrays is allowed: a ragged array’s rows can be multidimensional rather than 1D arrays.

Construction works more or less as you’d expect. The following shows 3 different ways to create the same multidimensional ragged array.

import numpy as np
from rockhopper import ragged_array, RaggedArray

# Construct from nested lists.
from_nested = ragged_array([
    [[0,  1], [2, 3]],
    [[4, 5]],
    [[6, 7], [8, 9], [10, 11]],
    [[12, 13]],
])

# Construction from flat contents and either ...
flat = np.array([
    [0,  1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13]
])
# ... row lengths, ...
from_lengths = RaggedArray.from_lengths(flat, [2, 1, 3, 1])
# ... or row bounds.
from_bounds = RaggedArray(flat, [0, 2, 3, 6, 7])

Structured Arrays

Ragged arrays may also use a structured data type. For this, explicitly setting the dtype parameter is mandatory when using the ragged_array() constructor. Otherwise NumPy will cast everything to one compatible type (usually str).

ragged = ragged_array([
    [("abc", 3), ("efg", 5)],
    [("hij", 1)],
    [("klm", 13), ("nop", 99), ("qrs", 32)],
], dtype=[("foo", str, 3), ("bar", int)])

However, this feature is only half-formed because ragged["foo"] requires internal support for strided flat arrays (which rockhopper currently lacks).

Indexing and Slicing

Most forms of __getitem__() and __setitem__() (i.e. ragged[x] and ragged[x] = y) are supported and mirror the semantics of NumPy indexing.

There are a few general rules of thumb for what isn’t supported:

  • When a get operation returns another ragged array, the corresponding set operation is not implemented. This would require implementing vectorisation to work.

  • If a 2D index ragged[x, y] gives another ragged array, then neither getting nor setting is supported for >2D indices which start with said 2D index ragged[x, y, z]. This would require internal support for letting ragged.flat be strided.

  • Ragged arrays cannot be used as indices. arr[ragged] will fail regardless of whether arr is ragged or not.

  • Under no circumstances will writing to a ragged array be allowed to change its overall length or the length of one of its rows.

In all cases except where indicated otherwise, indexing returns original data - not copies. If you later write to either the ragged array itself or a slice taken from it, then the other will change too.

1D indices

Indexing will all be shown by examples. Here is an unimaginative ragged array to play with.

from rockhopper import ragged_array

ragged = ragged_array([
    [1, 2, 3, 4],
    [5, 6],
    [7, 8, 9],
    [10, 11, 12, 13],
])

1D indexing with individual integers gives single rows as regular arrays.

>>> ragged[2]
array([7, 8, 9])
>>> ragged[3]
array([10, 11, 12, 13])

But indexing with a slice, integer array or bool mask gives another ragged array.

>>> ragged[::2]
RaggedArray.from_nested([
    [1, 2, 3, 4],
    [7, 8, 9],
])
>>> ragged[[2, -1]]
RaggedArray.from_nested([
    [7, 8, 9],
    [10, 11, 12, 13],
])

This is true even if all rows happen to be the same length.
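One way to picture why row selection can return another ragged array cheaply is that only the per-row start and end indices need to be re-selected while the flat contents stay untouched. The following plain Python sketch assumes the separate starts/ends layout described under Initialisation; take_rows and to_nested are hypothetical helpers for this example, not rockhopper functions.

```python
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
starts = [0, 4, 6, 9]
ends = [4, 6, 9, 13]


def take_rows(starts, ends, index):
    """Select rows with a slice or a list of (possibly negative) row numbers."""
    if isinstance(index, slice):
        return starts[index], ends[index]
    return [starts[i] for i in index], [ends[i] for i in index]


def to_nested(flat, starts, ends):
    """Expand a flat list plus row starts/ends into a list of lists."""
    return [flat[s:e] for s, e in zip(starts, ends)]


every_other = take_rows(starts, ends, slice(None, None, 2))  # like ragged[::2]
picked = take_rows(starts, ends, [2, -1])                    # like ragged[[2, -1]]

print(to_nested(flat, *every_other))  # [[1, 2, 3, 4], [7, 8, 9]]
print(to_nested(flat, *picked))       # [[7, 8, 9], [10, 11, 12, 13]]
```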

2D indices

2D indexing ragged[rows, columns] gives individual cells. Arrays of indices, slices and bool masks may also be used instead of single numbers. Using the same boring ragged array as above:

# Individual indices.
>>> ragged[0, 0], ragged[0, 1], ragged[0, 2]
(1, 2, 3)

# Arrays of indices.
>>> ragged[0, [0, 1, -1]]
array([1, 2, 4])
>>> ragged[0, [[1, 2], [0, 2]]]
array([[2, 3],
       [1, 3]])
>>> ragged[[0, 3, 2], [2, 3, 1]]
array([ 3, 13,  8])

# Slices as row numbers (including the null slice [:]).
>>> ragged[:, 0]
array([ 1,  5,  7, 10])
>>> ragged[2:, -1]
array([ 9, 13])

# Again, multiple column numbers may be given.
# The following gets the first and last element from each row.
>>> ragged[:, [0, -1]]
array([[ 1,  4],
       [ 5,  6],
       [ 7,  9],
       [10, 13]])

# If the second index is a slice or bool mask, the output is a ragged array.
# Even if each row is of the same length.
>>> ragged[:, :2]
RaggedArray.from_nested([
    [1, 2],
    [5, 6],
    [7, 8],
    [10, 11],
])

If the second index is not a slice then the output of __getitem__() is a copy and does not share memory with the parent ragged array.
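As a rough mental model of ragged[:, column] with an integer column, each row's start (or end, for negative columns) is offset by the column number to find a position in the flat contents. The sketch below is plain Python and only illustrates that arithmetic; the column() helper is hypothetical, and raising IndexError for a row that is too short is an assumption made for this example rather than documented rockhopper behaviour.

```python
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
bounds = [0, 4, 6, 9, 13]


def column(flat, bounds, col):
    """Pick item number col from every row, counting from the end if negative."""
    out = []
    for start, end in zip(bounds, bounds[1:]):
        length = end - start
        if not -length <= col < length:
            raise IndexError(f"column {col} is out of range for a row of length {length}")
        # Non-negative columns count from the row's start, negative from its end.
        out.append(flat[(start if col >= 0 else end) + col])
    return out


print(column(flat, bounds, 0))   # [1, 5, 7, 10]
print(column(flat, bounds, -1))  # [4, 6, 9, 13]
```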

3D (or higher) indices

Higher Dimensional Arrays can be sliced using 3 indices (or more).

Using another uninspiring enumeration example - this time a 3D array:

ragged = ragged_array([
    [[ 0,  1,  2], [ 3,  4,  5]],
    [[ 6,  7,  8], [ 9, 10, 11]],
    [[12, 13, 14], [15, 16, 17], [18, 19, 20]],
    [[21, 22, 23]],
])

3D arrays follow the same indexing rules as 2D arrays except that each cell is actually another array.

>>> ragged[0, 1]
array([3, 4, 5])

And a triplet of indices are used to access individual elements.

>>> ragged[2, 0, 1]
13

Export to standard types

No matter how many features I cram in to make ragged arrays more interchangeable with normal ones, you’ll probably want to get back into regular array territory at the first opportunity. rockhopper comes with a few ways to do so.

First, let us create a ragged array to export:

from rockhopper import ragged_array
ragged = ragged_array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8],
    [9, 10],
    [11, 12, 13],
])

To list of lists

The tolist() method converts back to nested lists (like those used to build the array in the first place).

>>> ragged.tolist()
[[1, 2, 3], [4, 5, 6], [7, 8], [9, 10], [11, 12, 13]]

To list of homogeneous arrays

When a ragged array is either not very ragged (row lengths are mostly the same) or not ragged at all (rows are all the same length), it’s often helpful to split it on rows of differing lengths, giving a sequence of standard rectangular arrays which can be for loop-ed over. Do this with the to_rectangular_arrays() method.

>>> ragged.to_rectangular_arrays()
[array([[1, 2, 3],
        [4, 5, 6]]),
 array([[ 7,  8],
        [ 9, 10]]),
 array([[11, 12, 13]])]

In the somewhat unlikely event that you don’t care about the order the rows appear in, set the reorder option to allow it to presort the rows into ascending lengths so as to minimize fragmentation.

>>> sort_args, arrays = ragged.to_rectangular_arrays(reorder=True)
# The numpy.argsort() arguments are returned in case you want them.
>>> sort_args
array([2, 3, 0, 1, 4])
# By sorting, only 2 arrays are needed rather than 3.
>>> arrays
[array([[ 7,  8],
        [ 9, 10]]),
 array([[ 1,  2,  3],
        [ 4,  5,  6],
        [11, 12, 13]])]
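The splitting rule described above can be sketched with nested lists and itertools.groupby(). This is an illustration of the idea rather than rockhopper's own code; split_rectangular is a hypothetical helper, and the stable sort by row length stands in for the reorder option.

```python
from itertools import groupby

nested = [[1, 2, 3], [4, 5, 6], [7, 8], [9, 10], [11, 12, 13]]


def split_rectangular(rows):
    """Group consecutive rows of equal length into rectangular blocks."""
    return [list(block) for _, block in groupby(rows, key=len)]


blocks = split_rectangular(nested)
print(len(blocks))  # 3

# A stable sort of the row numbers by row length brings equal lengths together.
order = sorted(range(len(nested)), key=lambda i: len(nested[i]))
sorted_blocks = split_rectangular([nested[i] for i in order])

print(order)               # [2, 3, 0, 1, 4]
print(len(sorted_blocks))  # 2
```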

Serialisation and Deserialisation

Ragged arrays may be converted to bytes and back again, and those bytes can be written to or read from files.

Binary

Currently rockhopper knows of exactly one binary format: The highly typical, but hopelessly un-NumPy-friendly:

row-length | row-content | row-length | row-content

binary form often found in 3D graphics where row-length may be any unsigned integer type of either byteorder, row-content may be of any data type or byteorder, and there are no delimiters or metadata anywhere.

For this format RaggedArray() provides a loads() method for reading and a dumps() method for writing.

Some examples:

# Write using:
#  - Row contents: The current data type (ragged.dtype) and endian.
#  - Row lengths: ``numpy.intc`` native endian
# Note that the output is a memoryview() which is generally interchangeable
# with bytes(). This may still be written to a file with the usual
# ``fh.write()``.
dumped = ragged.dumps()

# Read back using:
#  - Row contents: The same dtype used to write it
#  - Row lengths: ``numpy.intc`` native endian
ragged, bytes_consumed = RaggedArray.loads(dumped, ragged.dtype)

# Write then read using:
#  - Row contents: Big endian 8-byte floats
#  - Row lengths: Little endian 2-byte unsigned integers
dumped = ragged.astype(">f8").dumps(ldtype="<u2")
ragged, bytes_consumed = RaggedArray.loads(dumped, ">f8", ldtype="<u2")

By default, loads() will keep adding rows until it hits the end of the byte array that it’s parsing. The bytes_consumed (a count of how many bytes from dumped were used) will therefore always satisfy bytes_consumed == len(dumped).

Some file formats contain a serialised ragged array embedded inside a larger file but don’t specify how many bytes belong to the ragged array and how many belong to whatever comes afterwards. Instead they specify how many rows there should be. To read such data use the rows keyword argument.

# Read a 20 row ragged array of floats from a long ``bytes()`` object called
# **blob**. Will raise an error if it runs out of data.
ragged, bytes_consumed = RaggedArray.loads(blob, "f8", rows=20)

# ``bytes_consumed`` indicates where the ragged array stopped.
rest_of_blob = blob[bytes_consumed:]
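For readers unfamiliar with the row-length | row-content layout, here is a minimal pure Python sketch of it using the standard struct module. It is not rockhopper's implementation: dump_rows() and load_rows() are hypothetical helpers, and the formats "<H" (little endian 2-byte unsigned lengths) and ">d" (big endian 8-byte floats) are chosen to mirror the second example above.

```python
import struct


def dump_rows(rows, length_format="<H", item_format=">d"):
    """Serialise rows as length, items, length, items, ... with no delimiters."""
    out = bytearray()
    for row in rows:
        out += struct.pack(length_format, len(row))
        for item in row:
            out += struct.pack(item_format, item)
    return bytes(out)


def load_rows(data, length_format="<H", item_format=">d", rows=None):
    """Parse rows until the data ends, or until `rows` rows have been read."""
    length_size = struct.calcsize(length_format)
    item_size = struct.calcsize(item_format)
    parsed, offset = [], 0
    while True:
        done = (offset >= len(data)) if rows is None else (len(parsed) >= rows)
        if done:
            break
        # unpack_from() raises struct.error if the data runs out early.
        (length,) = struct.unpack_from(length_format, data, offset)
        offset += length_size
        parsed.append([
            struct.unpack_from(item_format, data, offset + i * item_size)[0]
            for i in range(length)
        ])
        offset += length * item_size
    return parsed, offset


blob = dump_rows([[1.0, 2.0, 3.0], [], [4.5]]) + b"trailing"
parsed, bytes_consumed = load_rows(blob, rows=3)
print(parsed)                 # [[1.0, 2.0, 3.0], [], [4.5]]
print(blob[bytes_consumed:])  # b'trailing'
```

Because the format carries no metadata, the reader has to be told both formats and, when other data follows, the number of rows.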

Pickle

If you don’t need other programs to be able to read the output then bog-standard pickle works too.

>>> import pickle
>>> arr = ragged_array([
...    ["cake", "biscuits"],
...    ["socks"],
...    ["orange", "lemon", "pineapple"],
... ])
>>> pickle.loads(pickle.dumps(arr))
RaggedArray.from_nested([
    ["cake", "biscuits"],
    ["socks"],
    ["orange", "lemon", "pineapple"],
])

Grouping

Arbitrary data may be grouped by some group enumeration into a ragged array so that each data element appears on the row of its group number.

For example, to group the people in the following array…

people = np.array([
    ("Bob", 1),
    ("Bill", 2),
    ("Ben", 0),
    ("Biff", 1),
    ("Barnebas", 0),
    ("Bubulous", 1),
    ("Bofflodor", 2),
], dtype=[("name", str, 20), ("group number", int)])

… by the group number field use:

>>> from rockhopper import RaggedArray
>>> RaggedArray.group_by(people, people["group number"])
RaggedArray.from_nested([
    [('Ben', 0), ('Barnebas', 0)],
    [('Bob', 1), ('Biff', 1), ('Bubulous', 1)],
    [('Bill', 2), ('Bofflodor', 2)],
])

As you can hopefully see,

  • all the names given a group number 0 appear in row 0,

  • all the names given a group number 1 appear in row 1,

  • and all the names given a group number 2 appear in row 2.

At this point you probably no longer care about the group number field, in which case, group only the name field:

>>> RaggedArray.group_by(people["name"], people["group number"])
RaggedArray.from_nested([
    ['Ben', 'Barnebas'],
    ['Bob', 'Biff', 'Bubulous'],
    ['Bill', 'Bofflodor'],
])
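The grouping rule itself is simple enough to sketch with plain lists: item i goes onto the row given by its group number. The following is only an illustration of those semantics (the function below is a stand-in written for this example, and rockhopper's own group_by() may well work differently internally).

```python
names = ["Bob", "Bill", "Ben", "Biff", "Barnebas", "Bubulous", "Bofflodor"]
group_numbers = [1, 2, 0, 1, 0, 1, 2]


def group_by(data, ids):
    """Place each item of data on the row named by its entry in ids."""
    grouped = [[] for _ in range(max(ids, default=-1) + 1)]
    for item, group in zip(data, ids):
        grouped[group].append(item)
    return grouped


print(group_by(names, group_numbers))
# [['Ben', 'Barnebas'], ['Bob', 'Biff', 'Bubulous'], ['Bill', 'Bofflodor']]
```

Within each row, items keep the order in which they appeared in the input.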

Enumerating classes

The above assumes that the parameter you wish to group by is just an enumeration. If this is not the case, and you’re not already sick of software written by me, then you may use a hirola.HashTable() to efficiently enumerate the parameter to group by.

For example, to group this list of animals by their animal class:

animals = np.array([
    ("cow", "mammal"),
    ("moose", "mammal"),
    ("centipede", "insect"),
    ("robin", "bird"),
    ("spider", "insect"),
    ("whale", "mammal"),
    ("woodpecker", "bird"),
], dtype=[("name", str, 15), ("class", str, 15)])

Use something like:

>>> from hirola import HashTable
>>> animal_classes = HashTable(len(animals), animals.dtype["class"])
>>> enum = animal_classes.add(animals["class"])

>>> RaggedArray.group_by(animals["name"], enum)
RaggedArray.from_nested([
    ['cow', 'moose', 'whale'],
    ['centipede', 'spider'],
    ['robin', 'woodpecker'],
])
>>> animal_classes.keys
array(['mammal', 'insect', 'bird'], dtype='<U15')
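If pulling in another dependency is not an option, the enumeration step can be approximated with a plain dict, which numbers each distinct key in order of first appearance. This is a slower pure Python stand-in written for illustration, not a description of how hirola works.

```python
names = ["cow", "moose", "centipede", "robin", "spider", "whale", "woodpecker"]
classes = ["mammal", "mammal", "insect", "bird", "insect", "mammal", "bird"]

# Give each distinct class the next free integer the first time it is seen.
table = {}
enum = [table.setdefault(key, len(table)) for key in classes]
keys = list(table)

print(enum)  # [0, 0, 1, 2, 1, 0, 2]
print(keys)  # ['mammal', 'insect', 'bird']

# The enumeration can then drive the grouping.
grouped = [[] for _ in keys]
for name, group in zip(names, enum):
    grouped[group].append(name)

print(grouped)
# [['cow', 'moose', 'whale'], ['centipede', 'spider'], ['robin', 'woodpecker']]
```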
