Literary Language Toolkit (LLTK): corpora, models, and tools for the digital humanities

Literary Language Toolkit (LLTK)

Corpora, models, and tools for the study of complex language.

Quickstart

See this notebook for a more interactive quickstart (run the code here on Binder).

Install

Open a terminal, Jupyter, or Colab notebook and type:

pip install -qU lltk-dh

# or, for the very latest version:
#pip install -qU git+https://github.com/quadrismegistus/lltk
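
To confirm the install worked, a small check like the one below can help. It is only a sketch that uses the standard library's `importlib.metadata`; the helper name `installed_version` is made up for this example and is not part of LLTK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="lltk-dh"):
    """Return the installed version string of a distribution, or None if absent.

    `installed_version` is a hypothetical helper, not an LLTK function.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
print(found if found else "lltk-dh is not installed in this environment")
```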

Show available corpora:

lltk show

Or, within Python, display them as markdown:

import lltk
lltk.show()

Play with corpora

See below for available corpora.

# Load/install a corpus
import lltk
corpus = lltk.load('ECCO_TCP')           # load the corpus by name or ID

# Metadata
meta = corpus.meta                       # metadata as data frame
smpl = meta.query('1770<year<1830')      # easy query access         

# Data
mfw = corpus.mfw()                       # get the 10K most frequent words as a list
dtm = corpus.dtm()                       # get a document-term matrix as a pandas dataframe
dtm = corpus.dtm(tfidf=True)             # get DTM as tf-idf
mdw = corpus.mdw('gender')               # get most distinctive words for a metadata group
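
The `meta.query('1770<year<1830')` line above relies on pandas' chained-comparison syntax. The sketch below shows the same query on a toy data frame so it can be tried without downloading a corpus. The column names and values are invented and only stand in for `corpus.meta`.

```python
import pandas as pd

# Toy stand-in for corpus.meta; ids, authors, and years are invented.
meta = pd.DataFrame(
    {
        "id": ["t1", "t2", "t3", "t4"],
        "author": ["A", "B", "C", "D"],
        "year": [1765, 1794, 1818, 1847],
    }
)

# Chained comparison: both bounds are exclusive.
smpl = meta.query("1770<year<1830")
sample_ids = smpl["id"].tolist()
print(sample_ids)
```

Because both bounds are strict, a text dated exactly 1770 or 1830 would be left out; use `<=` if those years should be included.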

Play with texts

# accessing text objs
texts = corpus.texts()                   # get a list of corpus's text objects
texts_smpl = corpus.texts(smpl)          # text objects from df/list of ids 
texts_rad = corpus.au.Radcliffe          # hit "tab" after typing e.g. "Rad" to autocomplete 
text = corpus.t                          # get a random text object from corpus

# metadata access
text_meta = text.meta                    # get text metadata as dictionary
author = text.author                     # get common metadata as attributes    
title = text.title
year = text.year
dec = text.decade                        # a few inferred attributes too
dec_str = text.decade_str                # '1890-1900' rather than 1890

# data access
txt = text.txt                           # get plain text as string
xml = text.xml                           # get xml as string

# simple nlp
words = text.words                       # get list of words (excl. punctuation)
sents = text.sents                       # get list of sentences
counts = text.counts                     # get word counts as dictionary (from JSON if saved)

# other nlp
tnltk = text.nltk                        # get nltk Text object
tblob = text.blob                        # get TextBlob object
tstanza = text.stanza                    # get list of stanza objects (one per para)
tspacy = text.spacy                      # get list of spacy objects (one per para)
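
For readers who want to see what the inferred and "simple nlp" attributes amount to, here is a rough, LLTK-independent sketch. The helper names (`decade_of`, `decade_label`, `simple_words`) and the tokenizing regex are assumptions made for illustration; LLTK's own tokenizer and decade logic may differ.

```python
import re
from collections import Counter


def decade_of(year):
    """Floor a year to the start of its decade, e.g. 1894 -> 1890."""
    return year // 10 * 10


def decade_label(year):
    """Render the decade as a range string, e.g. 1894 -> '1890-1900'."""
    start = decade_of(year)
    return f"{start}-{start + 10}"


def simple_words(txt):
    """Lowercased word tokens; punctuation is dropped (assumed tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", txt.lower())


txt = "The castle was dark. The night, too, was dark!"
words = simple_words(txt)   # list of words, no punctuation
counts = Counter(words)     # word -> number of occurrences

print(decade_of(1894), decade_label(1894))
print(words)
print(counts.most_common(3))
```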

Available corpora

LLTK has built-in functionality for the following corpora. Some (🌞) are freely downloadable from the links below or through the LLTK interface. Others (☂) require that you first obtain the raw data through an institutional or other subscription. Some corpora are mixed: part of the data is open under fair research use (e.g. metadata, freqs) and part is closed (e.g. txt, xml, raw).

name	desc	license	metadata	freqs	txt	xml	raw
ARTFL	American and French Research on the Treasury of the French Language	Academic	☂️	☂️
BPO	British Periodicals Online	Commercial	☂️				☂️
CLMET	Corpus of Late Modern English Texts	Academic	🌞	🌞	☂️	☂️
COCA	Corpus of Contemporary American English	Commercial	☂️	☂️	☂️		☂️
COHA	Corpus of Historical American English	Commercial	☂️	☂️	☂️		☂️
Chadwyck	Chadwyck-Healey Fiction Collections	Mixed	🌞	🌞	☂️	☂️	☂️
ChadwyckDrama	Chadwyck-Healey Drama Collections	Mixed	☂️	☂️	☂️	☂️	☂️
ChadwyckPoetry	Chadwyck-Healey Poetry Collections	Mixed	☂️	☂️	☂️	☂️	☂️
Chicago	U of Chicago Corpus of C20 Novels	Academic	🌞	🌞	☂️
DTA	Deutsches Text Archiv	Free	🌞	🌞	🌞	🌞	🌞
DialNarr	Dialogue and Narration separated in Chadwyck-Healey Novels	Academic	🌞	🌞	☂️
ECCO	Eighteenth Century Collections Online	Commercial	☂️	☂️	☂️	☂️	☂️
ECCO_TCP	ECCO (Text Creation Partnership)	Free	🌞	🌞	🌞	🌞	🌞
EEBO_TCP	Early English Books Online (curated by the Text Creation Partnership)	Free	🌞	🌞	🌞	🌞
ESTC	English Short Title Catalogue	Academic	☂️
EnglishDialogues	A Corpus of English Dialogues, 1560-1760	Academic	🌞	🌞		🌞
EvansTCP	Early American Fiction	Free	🌞	🌞	🌞	🌞	🌞
GaleAmericanFiction	Gale American Fiction, 1774-1920	Academic	🌞	🌞	☂️		☂️
GildedAge	U.S. Fiction of the Gilded Age	Academic	🌞	🌞	🌞
HathiBio	Biographies from Hathi Trust	Academic	🌞	🌞
HathiEngLit	Fiction, drama, verse word frequencies from Hathi Trust	Academic	🌞	🌞
HathiEssays	Hathi Trust volumes with "essay(s)" in title	Academic	🌞	🌞
HathiLetters	Hathi Trust volumes with "letter(s)" in title	Academic	🌞	🌞
HathiNovels	Hathi Trust volumes with "novel(s)" in title	Academic	🌞	🌞
HathiProclamations	Hathi Trust volumes with "proclamation(s)" in title	Academic	🌞	🌞
HathiSermons	Hathi Trust volumes with "sermon(s)" in title	Academic	🌞	🌞
HathiStories	Hathi Trust volumes with "story/stories" in title	Academic	🌞	🌞
HathiTales	Hathi Trust volumes with "tale(s)" in title	Academic	🌞	🌞
HathiTreatises	Hathi Trust volumes with "treatise(s)" in title	Academic	🌞	🌞
InternetArchive	19th Century Novels, curated by the U of Illinois and hosted on the Internet Archive	Free	🌞	🌞	🌞
LitLab	Literary Lab Corpus of 18th and 19th Century Novels	Academic	🌞	🌞	☂️
MarkMark	Mark Algee-Hewitt's and Mark McGurl's 20th Century Corpus	Academic	🌞	🌞	☂️
OldBailey	Old Bailey Online	Free	🌞	🌞	🌞	🌞
RavenGarside	Raven & Garside's Bibliography of English Novels, 1770-1830	Academic	☂️
SOTU	State of the Union Addresses	Free	🌞	🌞	🌞
Sellers	19th Century Texts compiled by Jordan Sellers	Free	🌞	🌞	🌞
SemanticCohort	Corpus used in "Semantic Cohort Method" (2012)	Free	🌞
Spectator	The Spectator (1711-1714)	Free	🌞	🌞	🌞
TedJDH	Corpus used in "Emergence of Literary Diction" (2012)	Free	🌞	🌞	🌞
TxtLab	A multilingual dataset of 450 novels	Free	🌞	🌞	🌞		🌞

Documentation

Incomplete for now. See this sample notebook for some examples.

New corpus

Import a corpus into LLTK:

# Use the "import" command with these options:
#   -path_txt       a folder of txt files (use -path_xml for xml)
#   -path_metadata  a metadata csv/tsv/xls about those txt files
#   -col_fn         .txt/.xml filename col in metadata (use -col_id if no ext)
lltk import \
  -path_txt mycorpus/txts \
  -path_metadata mycorpus/meta.xls \
  -col_fn filename
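
If you are assembling a corpus from scratch, the folder layout that the import command expects can be prepared with a few lines of Python. This is a sketch under assumptions: the records are invented, the metadata is written as `meta.csv` rather than the `meta.xls` used above, and everything is created in a temporary directory so no real files are touched.

```python
import csv
import tempfile
from pathlib import Path

# Build the layout in a throwaway directory so nothing real is overwritten.
root = Path(tempfile.mkdtemp()) / "mycorpus"
txt_dir = root / "txts"
txt_dir.mkdir(parents=True)

# Invented sample records; `filename` is the column that -col_fn would name.
records = [
    {"filename": "text_0001.txt", "author": "Author One", "title": "First Title", "year": 1794},
    {"filename": "text_0002.txt", "author": "Author Two", "title": "Second Title", "year": 1813},
]

# One plain-text file per record.
for rec in records:
    (txt_dir / rec["filename"]).write_text("Placeholder text.\n", encoding="utf-8")

# One metadata row per text file.
meta_path = root / "meta.csv"
with meta_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)

# Sanity check before importing: every filename in the metadata exists on disk.
with meta_path.open(newline="", encoding="utf-8") as f:
    missing = [
        row["filename"]
        for row in csv.DictReader(f)
        if not (txt_dir / row["filename"]).exists()
    ]
print("missing files:", missing)
```

Running the same check against your real metadata before `lltk import` is a cheap way to catch filename typos early.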

Or create a new one:

lltk create

Most frequent words

corpus.mfw_df(
    n=None,                            # Number of top words overall
    by_ntext=False,                    # Count number of documents not number of words
    by_fpm=False,                      # Count by within-text relative sums
    min_count=None,                    # Minimum count of word

    yearbin=None,                      # Average relative counts across `yearbin` periods
    col_group='period',                # Which column to periodize on
    n_by_period=None,                  # Number of top words per period
    keep_periods=True,                 # Keep periods in output dataframe
    n_agg='median',                    # How to aggregate across periods
    min_periods=None,                  # minimum number of periods a word must appear in

    excl_stopwords=False,              # Exclude stopwords (at `PATH_TO_ENGLISH_STOPWORDS`)
    excl_top=0,                        # Exclude words ranked 1:`excl_top`
    valtype='fpm',                     # valtype to compute top words by
    **attrs
)
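
The counting options above are easier to compare on a toy example. The sketch below is not LLTK's implementation. It contrasts raw word counts, document counts (the idea behind `by_ntext`), and summed within-text relative frequencies (the idea behind `by_fpm`). Reading "fpm" as "frequency per million words" is an assumption, and the `top` helper is made up for this example.

```python
from collections import Counter

# Invented toy corpus: text id -> list of tokens.
texts = {
    "t1": "the ghost and the castle".split(),
    "t2": "the ghost the ghost the ghost".split(),
    "t3": "a castle by a lake".split(),
}

total_counts = Counter()  # every occurrence counts once
doc_counts = Counter()    # a word counts once per text it appears in
fpm_sums = Counter()      # sum of within-text shares, scaled per million words

for words in texts.values():
    counts = Counter(words)
    total_counts.update(counts)
    doc_counts.update(counts.keys())
    for word, c in counts.items():
        fpm_sums[word] += c / len(words) * 1_000_000


def top(counter, n):
    """Top-n words by value; ties are broken alphabetically for a stable order."""
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n]]


print(top(total_counts, 3))
print(top(doc_counts, 3))
print(top(fpm_sums, 3))
```

The three rankings can disagree: a word repeated many times in one long text scores high on raw counts but low on document counts.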

Document term matrix

corpus.dtm(
    words=[],                          # words to use in DTM
    n=25000,                           # if not `words`, how many mfw?
    texts=None,                        # set texts to use explicitly, otherwise use all
    tf=False,                          # return term frequencies, not counts
    tfidf=False,                       # return tfidf, not counts
    meta=False,                        # include metadata (e.g. ["gender","nation"])
    **mfw_attrs,                       # all other attributes passed to self.mfw()
)
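
To make the `tf` and `tfidf` options concrete, here is a small standard-library sketch of a document-term matrix in its three forms. It uses one common tf-idf variant (term frequency times `log(N / df)`); whether LLTK uses this exact formula is not stated here, so treat it as an illustration only.

```python
import math
from collections import Counter

# Invented toy corpus: text id -> list of tokens.
docs = {
    "t1": "the ghost and the castle".split(),
    "t2": "the ghost the ghost".split(),
    "t3": "a castle by the lake".split(),
}

vocab = sorted({w for words in docs.values() for w in words})
counts = {d: Counter(words) for d, words in docs.items()}

# Counts: one row per document, one column per vocabulary word.
dtm = {d: [counts[d][w] for w in vocab] for d in docs}

# tf: counts divided by the document's length.
tf = {d: [c / len(docs[d]) for c in row] for d, row in dtm.items()}

# idf: log of (number of documents / number of documents containing the word).
n_docs = len(docs)
df = {w: sum(1 for d in docs if counts[d][w] > 0) for w in vocab}
idf = {w: math.log(n_docs / df[w]) for w in vocab}

# tf-idf: words that occur in every document get weight 0.
tfidf = {d: [t * idf[w] for t, w in zip(row, vocab)] for d, row in tf.items()}

print(vocab)
for d in docs:
    print(d, dtm[d], [round(x, 3) for x in tfidf[d]])
```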

Most distinctive words

corpus.mdw(
    groupby,                           # metadata categorical variable to group by
    words=[],                          # explicitly set words to use
    texts=None,                        # explicitly set texts to use
    tfidf=True,                        # use tfidf as mdw calculation
    keep_null_cols=False,              # remove texts which do not have `groupby` set
    remove_zeros=True,                 # remove rows summing to zero
    agg='median',                      # aggregate by `agg`
)
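
The general idea of "most distinctive words" can be sketched as follows: compute a per-text score for each word, aggregate it within each metadata group (here with the median, echoing `agg='median'`), and rank words by how far one group's aggregate exceeds the others'. This is a deliberately simplified stand-in that uses relative frequencies rather than tf-idf, not a description of LLTK's actual calculation. The data and the `distinctive` helper are invented.

```python
from collections import Counter, defaultdict
from statistics import median

# Invented toy data: (group label, tokens).
texts = [
    ("F", "castle ghost castle night".split()),
    ("F", "castle veil night night".split()),
    ("M", "ship sea ship night".split()),
    ("M", "sea storm ship night".split()),
]

vocab = sorted({w for _, words in texts for w in words})

# Per-text relative frequency of each word, collected by group.
by_group = defaultdict(lambda: defaultdict(list))
for group, words in texts:
    counts = Counter(words)
    for w in vocab:
        by_group[group][w].append(counts[w] / len(words))

# Aggregate within each group using the median.
group_median = {
    g: {w: median(vals) for w, vals in per_word.items()}
    for g, per_word in by_group.items()
}


def distinctive(group, n=2):
    """Words whose group median most exceeds the highest median of any other group."""
    others = [g for g in group_median if g != group]
    scores = {
        w: group_median[group][w] - max(group_median[o][w] for o in others)
        for w in vocab
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:n]]


print("F:", distinctive("F"))
print("M:", distinctive("M"))
```

Note that "night", which is common in both groups, does not rank as distinctive for either one.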

Release history

0.5.15: Dec 2, 2021 (this version)
0.5.14: Oct 27, 2021
0.5.13: Jul 13, 2021
0.5.12: Jun 9, 2021
0.5.11: May 4, 2021
0.5.10: Apr 28, 2021
0.5.9: Apr 28, 2021
0.5.8: Mar 31, 2021
0.5.7: Mar 31, 2021
0.5.6: Mar 31, 2021
0.5.5: Mar 30, 2021
0.5.4: Mar 21, 2021
0.5.3: Mar 21, 2021
0.5.2: Mar 21, 2021
0.5.1: Mar 21, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lltk-dh-0.5.15.tar.gz (45.5 MB)

Uploaded Dec 2, 2021 Source

Hashes for lltk-dh-0.5.15.tar.gz

Algorithm	Hash digest
SHA256	`fe78cf42bc381bd6fd0c0b49dc3a832ef5ab73b631fb39024739a510388db574`
MD5	`b002faf2ee591b1e278e14908d77317b`
BLAKE2b-256	`3c1c9ed58c01fd634fb74ba1a70b991e0c59904c041e192ca1d708aea342866e`