API Reference

polonaexplorer.explorer module

Explorer for Polona corpus.

Finds files and text regions containing target words and generates dataframes for use in the plotter module.

class polonaexplorer.explorer.PolonaExplorer(targetwords: list, data_path: str, out_path: str, metadata_file_path: str, part: str = 'region')

Bases: object

Explore the polona2 corpus in METS/MODS format.

generate_dataframe() → Path

Generate dataframe of all found page texts.

Uses the metadata to add publication date, title, place and other fields for the periodicals found. ID is the original identifier from the polona2 archive. Fragments contains the text passages that the original archive identified as matching.

get_file_stats() → None

Generate target words usage corpora.

Generates one file (part=page) listing all files that contain at least one target word, or two files (part=region) holding the number of found words and the extracted text regions that contain at least one of the target words.
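A minimal usage sketch for the class above, based only on the signatures listed here. The helper names explorer_kwargs and run_explorer are hypothetical, as are the check on part (this reference mentions only 'page' and 'region') and the order in which the two methods are called.

```python
from pathlib import Path


def explorer_kwargs(targetwords, data_path, out_path, metadata_file_path, part="region"):
    """Bundle the documented constructor arguments into a dict."""
    # Only 'page' and 'region' are mentioned in this reference; rejecting
    # other values is an assumption of this sketch, not documented behaviour.
    if part not in ("page", "region"):
        raise ValueError(f"unexpected part: {part!r}")
    return {
        "targetwords": list(targetwords),
        "data_path": str(data_path),
        "out_path": str(out_path),
        "metadata_file_path": str(metadata_file_path),
        "part": part,
    }


def run_explorer(kwargs: dict) -> Path:
    """Run both documented methods; the call order is an assumption."""
    # Imported lazily so the helper above works without the package installed.
    from polonaexplorer.explorer import PolonaExplorer

    explorer = PolonaExplorer(**kwargs)
    explorer.get_file_stats()  # returns None per the signature
    return explorer.generate_dataframe()  # returns a Path per the signature
```

With placeholder words and paths, a call might look like run_explorer(explorer_kwargs(["kolej", "telegraf"], "data/polona2", "out", "data/metadata_file")). Every value in that call is illustrative.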

polonaexplorer.plotter module

Generates topic model maps for text containing target words.

class polonaexplorer.plotter.Plotter(data_path: str, out_path: str, year_range: tuple[int, int], embedding_model: str = 'google/embeddinggemma-300m', topicname_llm_model: str = 'llama4:scout')

Bases: object

Run topic model and plotting.

fit_model() → DataFrame

Fit BERTopic to text data.

get_topic_names(docs_topics: DataFrame, min_size=100) → None

Use Ollama to generate topic names.

plot(load: bool = False) → None

Run embedding, topic modelling and plotting.

plot_map() → None

Generate a datamap plot of the topic model.
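A minimal sketch of driving the Plotter, again using only the signatures listed here. The helper names plotter_kwargs and run_plotter are hypothetical, and requiring the first year to be no later than the second is an assumption of this sketch. The default model names are copied from the constructor signature. The effect of the load flag is not described in this reference, so the sketch passes it through unchanged.

```python
def plotter_kwargs(
    data_path,
    out_path,
    year_range,
    embedding_model="google/embeddinggemma-300m",
    topicname_llm_model="llama4:scout",
):
    """Bundle the documented constructor arguments into a dict."""
    start, end = year_range
    # Assumption of this sketch: the range is ordered (start <= end).
    if start > end:
        raise ValueError(f"year_range is reversed: {year_range!r}")
    return {
        "data_path": str(data_path),
        "out_path": str(out_path),
        "year_range": (int(start), int(end)),
        "embedding_model": embedding_model,
        "topicname_llm_model": topicname_llm_model,
    }


def run_plotter(kwargs: dict, load: bool = False) -> None:
    """Construct a Plotter and call plot(), the documented entry point."""
    # Imported lazily so the helper above works without the package installed.
    from polonaexplorer.plotter import Plotter

    plotter = Plotter(**kwargs)
    plotter.plot(load=load)  # runs embedding, topic modelling and plotting
```

A call such as run_plotter(plotter_kwargs("out", "plots", (1900, 1939))) would be one way to use it. The paths and years are placeholders, and pointing data_path at the explorer's output directory is a guess based on the module description.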