polonaexplorer is a Python library designed to explore the Polona corpus - a Polish historical newspaper collection. It provides functionality to search for specific words within the corpus and extract relevant text data for further analysis or visualization.

Example visualization of topic modeling results

Key Features

  • Search for target words in the Polona corpus

  • Extract text regions or complete pages containing target words

  • Generate structured dataframes for analysis

  • Perform topic modeling using BERTopic

  • Create interactive visualizations of topic distributions

  • Support for LLM-based topic naming

Getting Started

Installation

Install the package using pip:

pip install polonaexplorer

Or for development:

pip install -e .

Usage

The typical workflow has two stages: first, use PolonaExplorer to find and extract text from the corpus, then use Plotter to perform topic modeling and visualization.

1. Initialize the explorer

Create a PolonaExplorer instance with the target words you want to search for. The explorer automatically generates all morphological forms of each word using the Morfeusz2 Polish morphological analyzer:

from polonaexplorer.explorer import PolonaExplorer

explorer = PolonaExplorer(
    targetwords=["naród", "wolność"],
    data_path="/path/to/polona2/corpus",
    out_path="/path/to/output",
    metadata_file_path="/path/to/metadata.json",
    part="region",  # or "page"
)

The part parameter controls the extraction mode:

  • "page": collects full page texts that contain at least one target word. Produces a CSV file listing matching file paths.

  • "region": extracts individual text regions (from PAGE XML) around each target word, plus per-word frequency statistics. Produces two JSON files (word_stats.json and word_surroundings.json).

2. Generate the file list and word statistics

Scan the corpus for all occurrences of the target words:

explorer.get_file_stats()

Depending on the part mode, this creates:

  • part="page": files_with_target_words.csv in the output directory.

  • part="region": word_stats.json (word frequencies per file) and word_surroundings.json (extracted text regions) in the output directory.

The search runs in parallel across all available CPU cores.

3. Generate the merged dataframe

Combine the extracted texts with publication metadata (date, title, publisher, etc.):

result_path = explorer.generate_dataframe()

This produces a JSON Lines file (e.g. polona_matching_text_region.json) that merges the text data with the metadata from the archive.

4. Visualize with topic modeling

Pass the generated dataframe to the Plotter for topic modeling and interactive visualization:

from polonaexplorer.plotter import Plotter

plotter = Plotter(
    data_path="/path/to/output/polona_matching_text_region.json",
    out_path="/path/to/output",
    year_range=(1890, 1920),
)
plotter.plot()

This runs the full pipeline: embedding generation, UMAP + HDBSCAN clustering, BERTopic fitting, LLM-based topic labeling (via a local Ollama instance), and produces an interactive HTML datamap.

Indices and tables