word2vec-stream
Process your massive word2vec binary model file as a readable stream of records.
Purpose
Word2vec models are typically distributed as massive binary files (for instance, the standard GoogleNews set is several gigs once unzipped). In some cases, you may wish to process these models and persist all or part of their contents to a database or other store, without paying the considerable memory cost of loading the whole file at once.
This tiny library is merely a handy function that parses the binary format and offers a readable stream of objects containing the word and the value (vector array).
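For readers unfamiliar with the binary format, here is a minimal sketch of how it can be parsed. This is not this library's implementation: it reads from an in-memory Buffer instead of a stream, and it assumes the common layout of a text header (`<vocabSize> <dimensions>\n`) followed by one record per word (`<word> `, then `<dimensions>` little-endian float32 values, optionally followed by a newline). The names `parseWord2vecBuffer` and `encodeRecord` are hypothetical helpers made up for this illustration.

```javascript
// Sketch of the word2vec binary layout, parsed from an in-memory Buffer.
// Assumed layout: "<vocabSize> <dimensions>\n", then per record:
// "<word> " followed by <dimensions> little-endian float32 values,
// optionally followed by "\n".
function parseWord2vecBuffer(buf) {
  const headerEnd = buf.indexOf(0x0a); // first newline ends the header
  const [vocabSize, dimensions] = buf
    .toString('utf8', 0, headerEnd)
    .trim()
    .split(' ')
    .map(Number);

  const records = [];
  let offset = headerEnd + 1;

  for (let i = 0; i < vocabSize; i++) {
    // Some writers emit a newline after each vector; skip it if present.
    while (buf[offset] === 0x0a) offset++;

    const wordEnd = buf.indexOf(0x20, offset); // a space terminates the word
    const word = buf.toString('utf8', offset, wordEnd);
    offset = wordEnd + 1;

    // Read exactly `dimensions` floats, 4 bytes each.
    const values = [];
    for (let d = 0; d < dimensions; d++) {
      values.push(buf.readFloatLE(offset));
      offset += 4;
    }
    records.push({ word, values });
  }
  return records;
}

// Hypothetical helper: encode one record in the same assumed layout.
function encodeRecord(word, values) {
  const floats = Buffer.alloc(values.length * 4);
  values.forEach((v, i) => floats.writeFloatLE(v, i * 4));
  return Buffer.concat([Buffer.from(word + ' '), floats, Buffer.from('\n')]);
}

// A tiny two-word, three-dimension sample to exercise the parser.
const sample = Buffer.concat([
  Buffer.from('2 3\n'),
  encodeRecord('runs', [0.5, -0.25, 1]),
  encodeRecord('walks', [0.125, 2, -4]),
]);

const parsed = parseWord2vecBuffer(sample);
console.log(parsed);
```

A streaming parser has to do the same work while also handling records that straddle chunk boundaries, which is why it pays to have that wrapped up in a library.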
Usage
The function exported by word2vec-stream returns a promise, which resolves to a readable stream:
const word2vecStream = require('word2vec-stream');
A single word object looks like this:
{
  word: 'runs',
  values: [
    -0.03380169719457626,
    0.05194384977221489,
    -0.03704818710684776,
    0.016614392399787903,
    0.0660756304860115,
    0.030364234000444412,
    -0.028072593733668327,
    -0.16270646452903748,
    -0.038575947284698486,
    0.12756797671318054,
    // ... as many floats as vector dimensions here
  ]
}
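Since the point of streaming is to persist records without holding the whole model in memory, here is a rough sketch of one way to consume such a stream in batches. It works on any object-mode readable, so it uses a stand-in stream rather than a real model file. The names `persistInBatches`, `saveBatch`, and `fakeRecordStream` are hypothetical and not part of this library; only the `{ word, values }` record shape comes from the description above.

```javascript
const { Readable } = require('stream');

// Collects records into fixed-size batches and hands each batch to
// saveBatch (a hypothetical persistence callback that may return a promise).
// The stream is paused while a full batch is being saved.
function persistInBatches(stream, batchSize, saveBatch) {
  return new Promise((resolve, reject) => {
    let batch = [];
    let total = 0;
    let inFlight = Promise.resolve(); // chain of pending saves, in order

    const flush = () => {
      const pending = batch;
      batch = [];
      total += pending.length;
      inFlight = inFlight.then(() => saveBatch(pending));
      return inFlight;
    };

    stream.on('data', (record) => {
      batch.push(record);
      if (batch.length >= batchSize) {
        stream.pause();
        flush().then(() => stream.resume(), reject);
      }
    });
    stream.on('end', () => {
      if (batch.length) flush(); // save the final partial batch
      inFlight.then(() => resolve(total), reject);
    });
    stream.on('error', reject);
  });
}

// Stand-in for the real stream: emits `count` records of the same shape.
function fakeRecordStream(count) {
  const stream = new Readable({ objectMode: true, read() {} });
  for (let i = 0; i < count; i++) {
    stream.push({ word: 'word' + i, values: [i, i + 0.5] });
  }
  stream.push(null); // signal end of stream
  return stream;
}

const saved = [];
const result = persistInBatches(fakeRecordStream(5), 2, (batch) => {
  saved.push(batch.map((r) => r.word)); // pretend this is a database write
});
result.then((total) => console.log(total, saved));
```

In real use, the stand-in would be replaced by the stream that the promise resolves to, and `saveBatch` by an actual bulk insert. Event handlers are used instead of `for await` so the sketch stays within what node v8 supports.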
Alternatively, examine and run the demo.js file for a quick example (it dumps records to the console). The included tests also demonstrate basic invocation.
$ node demo.js
Compatibility
This library targets node v8, though it may work on somewhat earlier versions; some necessary parts of the stream.Readable API may not be supported in older versions.
Acknowledgements
Thanks to node-word2vec for illustrating the basic syntax of parsing the binary format in node.