Extract Biber features from a document parsed and annotated by spaCy.

The pybiber package aggregates the lexicogrammatical and functional features described by Biber (1988) and widely used for text-type, register, and genre classification tasks.

The package uses spaCy part-of-speech tagging and dependency parsing to summarize and aggregate patterns.

Because feature extraction builds on the outputs of probabilistic taggers, the accuracy of the resulting counts depends on the accuracy of those models. Texts with irregular spellings, non-normative punctuation, etc. will therefore likely produce unreliable outputs unless the taggers are tuned specifically for those purposes.

See the documentation for a description of the package’s full functionality.

See pseudobibeR for the R implementation.

Installation

You can install the released version of pybiber from PyPI:

pip install pybiber

Install a spaCy model:

python -m spacy download en_core_web_sm

Usage

To use the pybiber package, you must first import spaCy and load a model instance. You will also need a corpus: a polars DataFrame with a doc_id column and a text column. This follows the convention for readtext and corpus processing using quanteda in R.

import spacy
import pybiber as pb
from pybiber.data import micusp_mini
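The imported micusp_mini is a sample corpus. For your own texts, the two-column corpus shape described above could be built roughly as follows. This is a minimal sketch, not part of pybiber: the helper names corpus_columns and to_corpus and the sample document ids are hypothetical, and it assumes polars is installed.

```python
def corpus_columns(docs):
    """Split a {doc_id: text} mapping into doc_id and text columns."""
    return {"doc_id": list(docs.keys()), "text": list(docs.values())}


def to_corpus(docs):
    """Build a polars DataFrame with a doc_id column and a text column."""
    import polars as pl  # imported here so corpus_columns works without polars

    return pl.DataFrame(corpus_columns(docs))


# Hypothetical usage; the ids and texts are placeholders:
# corpus = to_corpus({
#     "essay_01.txt": "The results were analyzed carefully.",
#     "essay_02.txt": "I think we should try again tomorrow.",
# })
```

A DataFrame built this way could presumably be passed to spacy_parse in place of micusp_mini, provided each doc_id is unique.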

The pybiber package requires a model that will carry out part-of-speech tagging and dependency parsing.

nlp = spacy.load("en_core_web_sm", disable=["ner"])
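Because the model must provide both part-of-speech tagging and dependency parsing, it can be worth checking a pipeline before parsing a large corpus. The sketch below is one way to do that using spaCy's nlp.pipe_names attribute. The component names "tagger" and "parser" are assumed to match those in spaCy's English pipelines, and missing_components is a hypothetical helper, not a pybiber function.

```python
# Component names as used by spaCy's English pipelines (an assumption
# for other models, which may name their components differently).
REQUIRED_COMPONENTS = ("tagger", "parser")


def missing_components(pipe_names):
    """Return the required components that are absent from a pipeline."""
    return [name for name in REQUIRED_COMPONENTS if name not in pipe_names]


# Usage with a loaded model (requires spaCy and the model to be installed):
# missing = missing_components(nlp.pipe_names)
# if missing:
#     raise ValueError(f"Model lacks required components: {missing}")
```

Disabling "ner" as above leaves the tagger and parser in place, so the check should pass for that pipeline.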

To process the corpus, use spacy_parse. Processing the micusp_mini corpus should take between 20 and 30 seconds.

df_spacy = pb.spacy_parse(micusp_mini, nlp)

After parsing the corpus, features can then be aggregated using biber.

df_biber = pb.biber(df_spacy)

License

Code licensed under Apache License 2.0. See LICENSE file.
