Extract Biber features from a document parsed and annotated by spaCy.
Project description
The pybiber package aggregates the lexicogrammatical and functional features described by Biber (1988) and widely used for text-type, register, and genre classification tasks.
The package uses spaCy part-of-speech tagging and dependency parsing to summarize and aggregate patterns.
Because feature extraction builds on the outputs of probabilistic taggers, the accuracy of the resulting counts depends on the accuracy of those models. Thus, texts with irregular spellings, non-normative punctuation, etc., will likely produce unreliable outputs unless the taggers are tuned specifically for those purposes.
See the documentation for a description of the package’s full functionality.
See pseudobibeR for the R implementation.
Installation
You can install the released version of pybiber from PyPI:
pip install pybiber
Install a spaCy model:
python -m spacy download en_core_web_sm
Usage
To use the pybiber package, you must first import spaCy and initialize an instance of a model. You will also need to create a corpus. The biber function expects a polars DataFrame with a doc_id column and a text column. This follows the conventions used by readtext and for corpus processing with quanteda in R.
import spacy
import pybiber as pb
from pybiber.data import micusp_mini
The pybiber package requires a model that will carry out part-of-speech tagging and dependency parsing.
nlp = spacy.load("en_core_web_sm", disable=["ner"])
To process the corpus, use spacy_parse. Processing the micusp_mini corpus should take between 20 and 30 seconds.
df_spacy = pb.spacy_parse(micusp_mini, nlp)
After parsing the corpus, features can then be aggregated using biber.
df_biber = pb.biber(df_spacy)
License
Code licensed under Apache License 2.0. See LICENSE file.