SeSG is a tool developed to help Systematic Literature Review researchers, specifically at the step of building a search string.

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Natural Language
- English
Programming Language
- Python :: 3
- Python :: 3.10

Project description

sesg

Repository for the SeSG (Search String Generator) Python package.

SeSG is a tool developed to help Systematic Literature Review researchers, specifically at the step of building a search string.

Installation

You can install with pip, poetry, or any other package manager:

pip install sesg

poetry add sesg

Usage

For a more extensive example, please refer to this repository.

Generating a search string

from dataclasses import dataclass
from random import sample

from sesg.search_string import (
    SimilarWordsFinder,
    create_enrichment_text,
    generate_search_string,
    set_pub_year_boundaries,
)
from sesg.topic_extraction import create_docs, extract_topics_with_bertopic
from transformers import BertForMaskedLM, BertTokenizer


@dataclass
class Study:
    title: str
    abstract: str
    keywords: str


GS: list[Study] = []
QGS: list[Study] = sample(GS, len(GS) // 3)


def main():
    docs = create_docs(
        [
            {
                "title": s.title,
                "abstract": s.abstract,
                "keywords": s.keywords,
            }
            for s in QGS
        ]
    )

    enrichment_text = create_enrichment_text(
        [
            {
                "title": s.title,
                "abstract": s.abstract,
            }
            for s in QGS
        ]
    )

    similar_words_finder = SimilarWordsFinder(
        enrichment_text=enrichment_text,
        bert_model=BertForMaskedLM.from_pretrained("bert-base-uncased"),
        bert_tokenizer=BertTokenizer.from_pretrained("bert-base-uncased"),
    )

    topics = extract_topics_with_bertopic(
        docs,
        kmeans_n_clusters=2,
        umap_n_neighbors=5,
    )

    search_string = generate_search_string(
        topics,
        n_words_per_topic=5,
        n_similar_words_per_word=1,
        similar_words_finder=similar_words_finder,
    )

    search_string = f"TITLE-ABS-KEY({search_string})"
    search_string = set_pub_year_boundaries(search_string, min_year=2010, max_year=2020)

    print(search_string)
    # Example output, shown for its shape only (it comes from a run with different year boundaries):  # noqa: E501
    # TITLE-ABS-KEY((("antipatterns") AND ("detection" OR "management") AND ("bdtex") AND ("approach" OR "algorithm") AND ("smurf")) OR (("code" OR "pattern") AND ("detection" OR "management") AND ("design" OR "software") AND ("software" OR "computer") AND ("learning" OR "translation"))) AND PUBYEAR > 1999 AND PUBYEAR < 2018  # noqa: E501


if __name__ == "__main__":
    main()
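
The sample output above suggests how the pieces fit together: similar words are joined with OR inside a group, the groups of one topic are joined with AND, and the topics are joined with OR. The sketch below reproduces that shape in plain Python. It is not sesg's implementation: `join_topic`, `join_topics`, and `add_year_bounds` are hypothetical helpers, and the exclusive `PUBYEAR` bounds are an assumption based only on the form of the sample string.

```python
# Illustrative only: these helpers are hypothetical and are not part of sesg.
Topic = list[list[str]]  # each inner list holds a word followed by its similar words


def join_topic(topic: Topic) -> str:
    """AND together the word groups of one topic; OR the words inside a group."""
    groups = []
    for words in topic:
        quoted = " OR ".join(f'"{w}"' for w in words)
        groups.append(f"({quoted})")
    return "(" + " AND ".join(groups) + ")"


def join_topics(topics: list[Topic]) -> str:
    """OR together the per-topic expressions."""
    return " OR ".join(join_topic(t) for t in topics)


def add_year_bounds(string: str, min_year: int, max_year: int) -> str:
    """Append PUBYEAR bounds (assumed exclusive, mirroring the sample output)."""
    return f"{string} AND PUBYEAR > {min_year} AND PUBYEAR < {max_year}"


topics = [
    [["antipatterns"], ["detection", "management"], ["smurf"]],
    [["code", "pattern"], ["learning", "translation"]],
]
query = add_year_bounds(f"TITLE-ABS-KEY({join_topics(topics)})", 2010, 2020)
print(query)
```

Keeping the joining logic separate from the year bounds makes it easy to wrap the same expression in a different field code or to drop the year filter entirely.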

Assessing the quality of a search string

import trio
from sesg.evaluation import EvaluationFactory, Study
from sesg.scopus import InvalidStringError, Page, ScopusClient


API_KEYS: list[str] = []

GS: list[Study] = []
QGS: list[Study] = []


async def main():
    string = 'TITLE-ABS-KEY("machine learning" AND "code smell") AND PUBYEAR > 2010 AND PUBYEAR < 2020'  # noqa: E501
    evaluation_factory = EvaluationFactory(gs=GS, qgs=QGS)

    client = ScopusClient(API_KEYS)

    entries: list[Page.Entry] = []
    try:
        async for page in client.search(string):
            entries.extend(page.entries)

    except InvalidStringError:
        # Scopus rejected the query, so there is nothing to evaluate.
        print("Invalid string")
        return

    evaluation = evaluation_factory.evaluate([e.title for e in entries])

    print(evaluation.start_set_recall)
    # 0.7


if __name__ == "__main__":
    trio.run(main)
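
To give an intuition for the recall figure printed above, here is a minimal sketch of a recall computation over titles. It is an assumption about the general idea, not sesg's actual evaluation code: `normalize`, `is_same_title`, and `recall` are hypothetical names, the standard library's `difflib` stands in for whatever matching sesg uses, and the 0.9 threshold is arbitrary.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(title.lower().split())


def is_same_title(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two titles as the same study when they are nearly identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def recall(gs_titles: list[str], result_titles: list[str]) -> float:
    """Fraction of gold-standard titles that appear among the search results."""
    if not gs_titles:
        return 0.0
    found = sum(
        any(is_same_title(gs, r) for r in result_titles) for gs in gs_titles
    )
    return found / len(gs_titles)


gs_titles = ["Detecting Code Smells with Machine Learning", "A Survey on Antipatterns"]
result_titles = ["detecting code smells with  machine learning", "Unrelated Paper"]
print(recall(gs_titles, result_titles))
# 0.5
```

Fuzzy matching matters here because titles returned by a search engine often differ from the curated ones in casing, spacing, or punctuation.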

Credits

This project is a continuation of Leo Fuchs' work. Most of my work on this project consisted of refactoring the codebase, adding tests, improving the documentation, and optimizing performance, along with adding some new features.

Highlights

Below you can find the major improvements over the original project:

- Added BERTopic as a topic extraction strategy.
- Improved snowballing performance by 100x~120x (thanks to rapidfuzz and multiprocessing).
- Improved Scopus search performance by 30x~40x (thanks to httpx and Eduardo Mendes' help).
- Improved search string generation performance by ~1.5x (thanks to a caching system).
- Improved code quality by adopting linting and formatting tools, and added type hints to catch errors before runtime.
- Added tests to prevent bugs when refactoring or adding new features.
- Added docs to help users and contributors.
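
The text above does not describe how the caching system works. As a rough illustration of why caching can help when the same word shows up in several topics, the sketch below memoizes an expensive lookup with `functools.lru_cache`. The function `find_similar_words` is a hypothetical stand-in for a model call and is not sesg's API.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive lookup actually runs


@lru_cache(maxsize=None)
def find_similar_words(word: str) -> tuple[str, ...]:
    """Stand-in for an expensive model call (hypothetical, not sesg's API)."""
    global calls
    calls += 1
    return (word, word + "s")


for word in ["code", "smell", "code", "code"]:
    find_similar_words(word)

print(calls)  # 2: the repeated "code" lookups were served from the cache
```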

Contributing

You can contribute in many ways, such as creating issues and submitting pull requests. If you wish to contribute code, please read the contributor guide.

License

This project is licensed under the terms of the GPL-3.0-only license.

Release history

Version	Release date
0.0.59 (this version)	Jul 28, 2023
0.0.58	Jul 25, 2023
0.0.57	Jul 1, 2023
0.0.56	Jul 1, 2023
0.0.55	Jun 29, 2023
0.0.54	Jun 28, 2023
0.0.53	Jun 28, 2023
0.0.52	Jun 28, 2023
0.0.51	Jun 27, 2023
0.0.50	Jun 21, 2023
0.0.49	Jun 21, 2023
0.0.48	May 31, 2023
0.0.47	May 31, 2023
0.0.46	May 30, 2023
0.0.45	May 30, 2023
0.0.44	May 29, 2023
0.0.43	May 28, 2023
0.0.42	May 28, 2023
0.0.41	May 27, 2023
0.0.40	May 23, 2023
0.0.39	May 23, 2023
0.0.37	May 19, 2023
0.0.36	May 19, 2023
0.0.35	May 19, 2023
0.0.34	May 18, 2023
0.0.33	May 18, 2023
0.0.31	May 14, 2023
0.0.30	May 13, 2023
0.0.29	May 12, 2023
0.0.28	May 8, 2023
0.0.27	May 2, 2023
0.0.26	Apr 26, 2023
0.0.25	Apr 25, 2023
0.0.24	Apr 25, 2023
0.0.23	Apr 25, 2023
0.0.22	Apr 25, 2023
0.0.21	Apr 25, 2023
0.0.20	Apr 25, 2023
0.0.19	Apr 25, 2023
0.0.18	Apr 24, 2023
0.0.17	Apr 24, 2023
0.0.16	Apr 24, 2023
0.0.15	Apr 22, 2023
0.0.14	Apr 22, 2023
0.0.13	Apr 21, 2023
0.0.12	Apr 21, 2023
0.0.11	Apr 20, 2023
0.0.10	Apr 20, 2023
0.0.9	Apr 19, 2023
0.0.8	Apr 19, 2023
0.0.7	Apr 19, 2023
0.0.6	Apr 18, 2023
0.0.5	Apr 18, 2023
0.0.4	Apr 18, 2023
0.0.3	Apr 18, 2023
0.0.2	Apr 17, 2023
0.0.1	Apr 17, 2023

Download files

Source Distribution

sesg-0.0.59.tar.gz (37.5 kB)

Uploaded Jul 28, 2023 Source

Built Distribution

sesg-0.0.59-py3-none-any.whl (41.0 kB)

Uploaded Jul 28, 2023 Python 3

Hashes for sesg-0.0.59.tar.gz

Algorithm	Hash digest
SHA256	`e4c5a4c989b9bea9996d9819a2ce4581ff9b9638212842f252db03b4f9e349c0`
MD5	`56afdb59466ec8d2f1e8656ccdba8b40`
BLAKE2b-256	`4f540ceb8a4bb61296433092c075c5a0e939be428a95924530fda043a2eaef27`

Hashes for sesg-0.0.59-py3-none-any.whl

Algorithm	Hash digest
SHA256	`f7937f51d04a82d8bc9d0dc9c90555e3efcf996a58a54b330bcdf2d4c94d3244`
MD5	`35c3b016efc95bb6365395911689407a`
BLAKE2b-256	`58457f7bbf3719e7b5aa68ab59f6de9819a89dede0b5a1176cca482477f8e0c9`