.. toctree::
   :maxdepth: 2
   :caption: Contents:

   readme
   modules
   coverage

polonaexplorer is a Python library designed to explore the Polona corpus - a
Polish historical newspaper collection. It provides functionality to search for
specific words within the corpus and extract relevant text data for further
analysis or visualization.

.. image:: _static/polonaexplorer_example.png
   :alt: Example visualization of topic modeling results
   :align: center

Key Features
------------

* Search for target words in the Polona corpus
* Extract text regions or complete pages containing target words
* Generate structured dataframes for analysis
* Perform topic modeling using BERTopic
* Create interactive visualizations of topic distributions
* Support for LLM-based topic naming

Getting Started
---------------

Installation
~~~~~~~~~~~~

Install the package using pip::

    pip install polonaexplorer

Or for development::

    pip install -e .

Usage
~~~~~

The typical workflow has two stages: first, use
:class:`~polonaexplorer.explorer.PolonaExplorer` to find and extract text from
the corpus, then use :class:`~polonaexplorer.plotter.Plotter` to perform topic
modeling and visualization.

**1. Initialize the explorer**

Create a :class:`~polonaexplorer.explorer.PolonaExplorer` instance with the
target words you want to search for. The explorer automatically generates all
morphological forms of each word using the Morfeusz2 Polish morphological
analyzer::

    from polonaexplorer.explorer import PolonaExplorer

    explorer = PolonaExplorer(
        targetwords=["naród", "wolność"],
        data_path="/path/to/polona2/corpus",
        out_path="/path/to/output",
        metadata_file_path="/path/to/metadata.json",
        part="region",  # or "page"
    )

The ``part`` parameter controls the extraction mode:

- ``"page"``: collects full page texts that contain at least one target word.
  Produces a CSV file listing matching file paths.
- ``"region"``: extracts individual text regions (from PAGE XML) around each
  target word, plus per-word frequency statistics. Produces two JSON files
  (``word_stats.json`` and ``word_surroundings.json``).

**2. Generate the file list and word statistics**

Scan the corpus for all occurrences of the target words::

    explorer.get_file_stats()

Depending on the ``part`` mode, this creates:

- ``part="page"``: ``files_with_target_words.csv`` in the output directory.
- ``part="region"``: ``word_stats.json`` (word frequencies per file) and
  ``word_surroundings.json`` (extracted text regions) in the output directory.

The search runs in parallel across all available CPU cores.

**3. Generate the merged dataframe**

Combine the extracted texts with publication metadata (date, title, publisher,
etc.)::

    result_path = explorer.generate_dataframe()

This produces a JSON Lines file (e.g. ``polona_matching_text_region.json``)
that merges the text data with the metadata from the archive.

**4. Visualize with topic modeling**

Pass the generated dataframe to the :class:`~polonaexplorer.plotter.Plotter`
for topic modeling and interactive visualization::

    from polonaexplorer.plotter import Plotter

    plotter = Plotter(
        data_path="/path/to/output/polona_matching_text_region.json",
        out_path="/path/to/output",
        year_range=(1890, 1920),
    )
    plotter.plot()

This runs the full pipeline: embedding generation, UMAP + HDBSCAN clustering,
BERTopic fitting, LLM-based topic labeling (via a local Ollama instance), and
produces an interactive HTML datamap.

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`