Extract keywords from a given text and assign a root key to each cluster of keywords

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

KeyStem

KeyStem is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases that are most similar to a document.

Corresponding medium post can be found here.

1. About the Project
2. Getting Started
2.1. Installation
2.2. Basic Usage
2.3. Max Sum Distance
2.4. Maximal Marginal Relevance
2.5. Embedding Models

1. About the Project

Although many methods for keyword generation already exist (e.g., Rake, YAKE!, TF-IDF), I wanted to create a very basic but powerful method for extracting keywords and keyphrases. This is where KeyStem comes in: it uses BERT embeddings and simple cosine similarity to find the sub-phrases in a document that are most similar to the document itself.

First, document embeddings are extracted with BERT to get a document-level representation. Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity to find the words/phrases that are the most similar to the document. The most similar words could then be identified as the words that best describe the entire document.
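The three steps above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not KeyStem's implementation: `toy_embed` is a hypothetical stand-in that uses bag-of-words counts instead of BERT embeddings, and the names `candidate_ngrams` and `rank_candidates` are assumptions made for this example.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a BERT encoder: a bag-of-words count vector, NOT a real embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def candidate_ngrams(text, ngram_range=(1, 2)):
    """All distinct word n-grams whose length falls inside ngram_range (inclusive)."""
    words = re.findall(r"[a-z]+", text.lower())
    low, high = ngram_range
    grams = [" ".join(words[i:i + n])
             for n in range(low, high + 1)
             for i in range(len(words) - n + 1)]
    return list(dict.fromkeys(grams))  # de-duplicate, keep first-seen order


def rank_candidates(doc, ngram_range=(1, 2), top_n=5, embed=toy_embed):
    """Score every candidate against the whole document and keep the best top_n."""
    doc_vec = embed(doc)  # step 1: document-level representation
    scored = [(c, round(cosine(embed(c), doc_vec), 4))  # steps 2 and 3
              for c in candidate_ngrams(doc, ngram_range)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


top = rank_candidates("Supervised learning uses labeled data for learning", top_n=3)
print(top)
```

Replacing `toy_embed` with a function that returns real sentence embeddings would, under the same assumptions, give the semantic ranking described above rather than a purely lexical one.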

KeyStem is by no means unique and was created as a quick and easy method for extracting keywords and keyphrases. Although there are many great papers and solutions out there that use BERT embeddings (e.g., 1, 2, 3), I could not find a BERT-based solution that did not have to be trained from scratch and could be used by beginners (correct me if I'm wrong!). Thus, the goal was a pip install keystem and at most 3 lines of code in usage.

2. Getting Started

2.1. Installation

KeyStem can be installed from PyPI using pip:

pip install keystem

2.2. Basic Usage

The most minimal example for extracting keywords is shown below:

from keystem import KeyStem

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal).
         A supervised learning algorithm analyzes the training data and produces an inferred function,
         which can be used for mapping new examples. An optimal scenario will allow for the
         algorithm to correctly determine the class labels for unseen instances. This requires
         the learning algorithm to generalize from the training data to unseen situations in a
         'reasonable' way (see inductive bias).
      """
ks_model = KeyStem()
keywords = ks_model.get_keygroups(doc)

Use keyphrase_ngram_range to control the length of the resulting keywords/keyphrases:

>>> ks_model.get_keygroups(doc, keyphrase_ngram_range=(1, 1), stop_words=None)

{'index': {0: 0, 2: 1, 26: 15, 28: 16, 20: 11}, 'keywords': {0: ('supervised learning', 0.7096), 2: ('supervised', 0.6735), 26: ('supervised learning', 0.613), 28: ('supervised', 0.6125), 20: ('supervised', 0.5554)}, 'features': {0: 'supervised learning', 2: 'supervised', 26: 'supervised learning', 28: 'supervised', 20: 'supervised'}, 'cluster': {0: 0.0, 2: 0.0, 26: 0.0, 28: 0.0, 20: 0.0}, 'score': {0: 0.7096, 2: 0.6735, 26: 0.613, 28: 0.6125, 20: 0.5554}, 'label': {0: 'supervised learning', 2: 'supervised learning', 26: 'supervised learning', 28: 'supervised learning', 20: 'supervised learning'}}

To extract keyphrases, simply set keyphrase_ngram_range to (1, 2) or higher depending on the number of words you would like in the resulting keyphrases:

>>> ks_model.get_keygroups(doc, keyphrase_ngram_range=(1, 2), stop_words=None)

{'index': {0: 0, 2: 1, 26: 15, 28: 16, 20: 11}, 'keywords': {0: ('supervised learning', 0.7096), 2: ('supervised', 0.6735), 26: ('supervised learning', 0.613), 28: ('supervised', 0.6125), 20: ('supervised', 0.5554)}, 'features': {0: 'supervised learning', 2: 'supervised', 26: 'supervised learning', 28: 'supervised', 20: 'supervised'}, 'cluster': {0: 0.0, 2: 0.0, 26: 0.0, 28: 0.0, 20: 0.0}, 'score': {0: 0.7096, 2: 0.6735, 26: 0.613, 28: 0.6125, 20: 0.5554}, 'label': {0: 'supervised learning', 2: 'supervised learning', 26: 'supervised learning', 28: 'supervised learning', 20: 'supervised learning'}}
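The returned dictionary is column-oriented: each top-level key ('keywords', 'label', and so on) maps a row id to a value. Assuming that layout holds in general, the keywords can be collected under their root label as in the sketch below. The helper name `group_by_label` is hypothetical and not part of KeyStem; the sample data copies only the 'keywords' and 'label' entries from the output above.

```python
from collections import defaultdict

# 'keywords' and 'label' entries copied from the example output above.
result = {
    "keywords": {
        0: ("supervised learning", 0.7096),
        2: ("supervised", 0.6735),
        26: ("supervised learning", 0.613),
        28: ("supervised", 0.6125),
        20: ("supervised", 0.5554),
    },
    "label": {
        0: "supervised learning",
        2: "supervised learning",
        26: "supervised learning",
        28: "supervised learning",
        20: "supervised learning",
    },
}


def group_by_label(result):
    """Map each root label to the (keyword, score) pairs assigned to it."""
    groups = defaultdict(list)
    for row_id, label in result["label"].items():
        groups[label].append(result["keywords"][row_id])
    return dict(groups)


groups = group_by_label(result)
for label, members in groups.items():
    print(label, "->", members)
```

In this example every row carries the same label, so all five keywords end up in a single group.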

2.4. Maximal Marginal Relevance

To diversify the results, we can use Maximal Marginal Relevance (MMR), which is also based on cosine similarity, to create keywords/keyphrases. The results with high diversity:

The results with low diversity:
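As a library-independent illustration of how the diversity setting changes the selection, here is a minimal sketch of the generic MMR step in plain Python. It is not KeyStem's code: the function name `mmr`, the `diversity` parameter, and the two-dimensional toy vectors are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(doc_vec, cand_vecs, candidates, top_n=2, diversity=0.5):
    """Pick top_n candidates, trading document relevance against redundancy."""
    doc_sim = [cosine(v, doc_vec) for v in cand_vecs]
    # Start with the candidate that is most similar to the document.
    selected = [max(range(len(candidates)), key=doc_sim.__getitem__)]
    remaining = [i for i in range(len(candidates)) if i != selected[0]]
    while remaining and len(selected) < top_n:
        def mmr_score(i):
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * doc_sim[i] - diversity * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]


# Toy data: "b" is a near-duplicate of "a"; "c" is less relevant but different.
doc_vec = [1.0, 1.0]
candidates = ["a", "b", "c"]
cand_vecs = [[1.0, 0.8], [1.0, 0.7], [0.2, 1.0]]

low = mmr(doc_vec, cand_vecs, candidates, diversity=0.1)
high = mmr(doc_vec, cand_vecs, candidates, diversity=0.9)
print("low diversity:", low)
print("high diversity:", high)
```

With a low diversity value the near-duplicate "b" is kept because relevance dominates; with a high value the redundancy penalty pushes the selection toward "c".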

2.5. Embedding Models

KeyStem supports many embedding models that can be used to embed the documents and words:

Sentence-Transformers
Flair
spaCy
Gensim
USE

Click here for a full overview of all supported embedding models.

Sentence-Transformers
You can select any model from sentence-transformers here and pass it through KeyStem with model:

from keystem import KeyStem
kw_model = KeyStem(model='all-MiniLM-L6-v2')

Or select a SentenceTransformer model with your own parameters:

from keystem import KeyStem
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyStem(model=sentence_model)

Flair
Flair allows you to choose almost any embedding model that is publicly available. Flair can be used as follows:

from keystem import KeyStem
from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
ks_model = KeyStem(model=roberta)

You can select any 🤗 transformers model here.

Citation

To cite KeyStem in your work, please use the following bibtex reference:

@misc{grootendorst2020keybert,
  author       = {Naga Kiran},
  title        = {KeyStem: Minimal keyword extraction with BERT and grouping to the stem of key.},
  year         = 2023,
  publisher    = {caspai},
  version      = {v0.0.1},
  url          = {http://caspai.in/}
}

References

Below are several resources that were used in creating KeyStem. More importantly, they are excellent resources for building keyword extraction models:

Github Repos:

MMR: The selection of keywords/keyphrases was modeled after:

https://github.com/swisscom/ai-research-keyphrase-extraction

Project details

Release history

This version

1.0.5

Jun 29, 2023

1.0.4

May 26, 2023

1.0.3

May 26, 2023

1.0.2

May 26, 2023

1.0.1

May 19, 2023

1.0.0

May 19, 2023

Download files

Source Distribution

keystem-1.0.5.tar.gz (7.6 kB)

Uploaded Jun 29, 2023 Source

Built Distribution

keystem-1.0.5-py3-none-any.whl (8.3 kB)

Uploaded Jun 29, 2023 Python 3

Hashes for keystem-1.0.5.tar.gz
Algorithm	Hash digest
SHA256	`e8d82f1365754e6f8dcbf00ef8d2e1f2bdb26bbbc51785f887f15d09cbdb292c`
MD5	`82aaa1ef45cafa5478eea35e316f0098`
BLAKE2b-256	`9daa3df0e0fe553dae544699d01cacad0697a3e7edb69eed92d55f6151fd4c52`

Hashes for keystem-1.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6b106a06e640727a6f32ebbc5c8a39224176749d0aa78f19bd0bbd91f22fa010`
MD5	`e6d15ec5280450db240e2af44d2cc60b`
BLAKE2b-256	`48ed8fd20bcbb95667164e3ae3fc854384b74f6b08461508ed0d535301ef40b2`