OpenNMT Tokenizer as TensorFlow Operations

Project description

OpenNMT Tokenizer TensorFlow Ops

DISCLAIMER: This package is not published by the OpenNMT authors.
Full credit for OpenNMT Tokenizer and OpenNMT-tf goes to their respective authors.

This project aims to wrap OpenNMT Tokenizer into TensorFlow Ops.

It is primarily intended as an addition to the OpenNMT-tf framework, removing the need to apply tokenization and/or detokenization outside of a serving environment (e.g. TensorFlow Serving).

Compatibility

TensorFlow 2.1, 2.2
OpenNMT-tf >= 2.6.0 for usage in conjunction with OpenNMT-tf

Installation

Prerequisites:

A Linux environment (manylinux2014 eligible)
Python 3.5, 3.6, 3.7 or 3.8
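The requirements listed so far can be checked before installing. The following is a minimal sketch, not part of the package: check_environment is a hypothetical helper, and it only compares the operating system name, the Python version and (optionally) a TensorFlow version string against the lists above. It does not test manylinux2014 eligibility.

```python
import platform
import sys

# Values taken from the Compatibility and Prerequisites lists above.
SUPPORTED_PYTHON = {(3, 5), (3, 6), (3, 7), (3, 8)}
SUPPORTED_TF = {(2, 1), (2, 2)}


def check_environment(system, python_version, tf_version=None):
    """Return a list of problems found (empty if none).

    Hypothetical helper written for illustration; it is not provided
    by tensorflow-onmttok-ops.
    """
    problems = []
    if system != "Linux":
        problems.append("unsupported OS: %s" % system)
    major_minor = tuple(python_version[:2])
    if major_minor not in SUPPORTED_PYTHON:
        problems.append("unsupported Python: %d.%d" % major_minor)
    if tf_version is not None:
        # Keep only "major.minor", e.g. "2.2.0" -> (2, 2).
        tf_major_minor = tuple(int(p) for p in tf_version.split(".")[:2])
        if tf_major_minor not in SUPPORTED_TF:
            problems.append("unsupported TensorFlow: %s" % tf_version)
    return problems


if __name__ == "__main__":
    # Check the interpreter running this script; the TensorFlow check
    # is skipped here because no version string is passed.
    for problem in check_environment(platform.system(), sys.version_info):
        print(problem)
```

An empty result means none of the checked requirements is violated; pass tf.__version__ as the third argument if TensorFlow is already installed.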

Install the package with pip:

pip install tensorflow-onmttok-ops

Usage

Available Tokenizer options

Most of the OpenNMT Tokenizer options are available.
However, providing BPE or SentencePiece models is not supported and, by extension, neither is setting the tokenizer mode to none.

You therefore cannot use the following options:

bpe_model_path
sp_model_path
sp_nbest_size
sp_alpha
vocabulary_path
vocabulary_threshold
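To catch an unsupported option early, the list above can be turned into a small guard that runs before the ops are called. This is only a sketch: validate_options and UNSUPPORTED_OPTIONS are hypothetical names introduced here, not part of the package, and the guard knows nothing about the options beyond the restrictions stated in this section.

```python
# Options listed above as unsupported by the ops.
UNSUPPORTED_OPTIONS = frozenset([
    "bpe_model_path",
    "sp_model_path",
    "sp_nbest_size",
    "sp_alpha",
    "vocabulary_path",
    "vocabulary_threshold",
])


def validate_options(mode, **options):
    """Return the options (including mode) as a dict, or raise ValueError.

    Hypothetical helper written for illustration; it is not provided
    by tensorflow-onmttok-ops.
    """
    if mode == "none":
        raise ValueError("tokenizer mode 'none' is not supported")
    rejected = sorted(UNSUPPORTED_OPTIONS.intersection(options))
    if rejected:
        raise ValueError("unsupported options: " + ", ".join(rejected))
    return dict(options, mode=mode)
```

For example, validate_options("conservative", case_feature=True) returns the options unchanged, while passing bpe_model_path raises a ValueError.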

Note: Tokenizer options are set at graph construction time and are stored as constants, so they cannot be changed afterwards.

Tokenization

import tensorflow_onmttok as tf_onmttok

tokens = tf_onmttok.tokenize(["Hello, how are you?"], mode="conservative")

Detokenization

import tensorflow_onmttok as tf_onmttok

text = tf_onmttok.detokenize(["How", "are", "you", "?"], mode="space")

With OpenNMT-tf

Using the ops with OpenNMT-tf is straightforward.
This package ships with a built-in tokenizer that makes use of the ops.

Before training your model, register the tokenizer as follows:

from tensorflow_onmttok import register_opennmt_in_graph_tokenizer

register_opennmt_in_graph_tokenizer()

See the complete example

Now that the tokenizer is registered, you can use the OpenNMTInGraphTokenizer class instead of OpenNMTTokenizer in your tokenization configuration files, e.g.:
```
type: OpenNMTInGraphTokenizer
params:
  mode: conservative
  case_feature: true
```
That's it! You can now train your model as usual. Your exported model will expect a text input instead of tokens and length.

Note: Tokenization resources will not be exported to the assets.extra directory.
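Because the exported model takes raw text, a client only has to send sentences. The sketch below builds a request body in the columnar "inputs" form of the TensorFlow Serving REST predict API. The input name "text", the model name and the host and port are assumptions made for illustration; check the signature of your own exported model (for example with saved_model_cli) for the actual input name.

```python
import json


def build_predict_request(sentences, input_name="text"):
    """Serialize raw sentences as a TensorFlow Serving REST predict body.

    `input_name` is an assumption; use the name found in the signature
    of your exported model.
    """
    return json.dumps({"inputs": {input_name: list(sentences)}})


# Hypothetical model name and local REST endpoint, for illustration only.
MODEL_NAME = "my_model"
URL = "http://localhost:8501/v1/models/%s:predict" % MODEL_NAME

body = build_predict_request(["Hello, how are you?"])
print(URL)
print(body)
```

The body can then be sent with any HTTP client as a POST request to the URL above.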

Build TF Serving with these Ops

This guide shows how to build TensorFlow Serving with these ops.

Prerequisites

You have already cloned the TF Serving >= 2.1.0 repository, and have all tools installed for building it
You have installed CMake 3.1.0 or newer

Building

Add the Ops sources

First, download the release of your choice.

Inside the TF Serving sources folder, create a directory named tensorflow_serving/custom_ops and copy the tensorflow_onmttok directory into it.

$ cd <tf_serving_sources>
$ mkdir tensorflow_serving/custom_ops
$ cp -r <op_sources>/tensorflow_onmttok tensorflow_serving/custom_ops

Reference the Ops

Edit tensorflow_serving/model_servers/BUILD to reference the Ops build target:

SUPPORTED_TENSORFLOW_OPS = [
    ...
    "//tensorflow_serving/custom_ops/tensorflow_onmttok:onmttok_ops"
]

Build OpenNMT Tokenizer from sources

The last step is to build a static version of the OpenNMT Tokenizer library.
This repository provides a shell script that will build it with CMake.

$ cd <op_sources>
$ chmod +x build_tokenizer.sh && ./build_tokenizer.sh

Note: Pass sudo as an argument to the build_tokenizer.sh script to run the make install command with sudo.

Build TensorFlow Serving

You can now build TensorFlow Serving as usual.

Release history

0.4.0 (this version): Aug 21, 2020
0.3.0: Feb 18, 2020
0.2.1: Feb 17, 2020
0.2.0: Feb 17, 2020
0.1.1: Feb 14, 2020

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

tensorflow_onmttok_ops-0.4.0-cp38-cp38-manylinux2014_x86_64.whl (144.8 kB), uploaded Aug 21, 2020, CPython 3.8
tensorflow_onmttok_ops-0.4.0-cp37-cp37m-manylinux2014_x86_64.whl (144.8 kB), uploaded Aug 21, 2020, CPython 3.7m
tensorflow_onmttok_ops-0.4.0-cp36-cp36m-manylinux2014_x86_64.whl (144.8 kB), uploaded Aug 21, 2020, CPython 3.6m
tensorflow_onmttok_ops-0.4.0-cp35-cp35m-manylinux2014_x86_64.whl (144.8 kB), uploaded Aug 21, 2020, CPython 3.5m

Hashes for tensorflow_onmttok_ops-0.4.0-cp38-cp38-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`75eb8962f0af155244724c64e1dd48e985abe27dd773fd2762d782bcccdfdde8`
MD5	`ecedc11d1438f9799a6205b76fe4c427`
BLAKE2b-256	`19685818031172da3dce2558be7dbca0863b92de4c3151edf8a8f3dc81df4836`

Hashes for tensorflow_onmttok_ops-0.4.0-cp37-cp37m-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`fc9dc0a31d9a9786bd869246c87d5efdfe0abd84ef5afacd395d97a38d566fd8`
MD5	`93db0ecdc824b2b9cae71f2a4651661b`
BLAKE2b-256	`708608c8768f449aed80983641d30e3dd93d7e220dab315b1a2b6ce17a870bbf`

Hashes for tensorflow_onmttok_ops-0.4.0-cp36-cp36m-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`1c32c2d23e17a48fb7338359a42919da80f201c4dd305a6e804a4563d0457012`
MD5	`8bca973f4bc33264c15aa81b7d945000`
BLAKE2b-256	`ee263c07030c7adb4cd33d1162c5f7a18b0ff431f409cbb81c6c218d29d3f1a8`

Hashes for tensorflow_onmttok_ops-0.4.0-cp35-cp35m-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`3aa028aab720c7dde021e89394a7c15be9655e7992ce38122a0eb3a750aeea37`
MD5	`5ebdb34ad6469b967b0b168098305d9f`
BLAKE2b-256	`eaed8fed6a5c4ed31c1dd32fe8a70b5814a7ad70ff6eeb078c3e633f27a9bbfd`