Polona Explorer
Polona Explorer is a Python project for analyzing historical Polish periodicals, focusing on the concept of oil in the Polish press from 1853 to 1918.
Dataset
This project is associated with the dataset “Petroleum and Press – The Concept of Oil in Polish Periodical Press, 1853–1918” published on Zenodo.
About the Dataset
The dataset consists of Polish-language periodicals that mention petroleum published between 1 January 1853 and 31 December 1918. The periodicals were downloaded together with their metadata from the Polona aggregator in May 2023 in PDF format, based on a keyword search. These PDF files contained OCR’d digital images of physical documents held in archives and libraries and are all in the public domain.
The dataset was processed to extract the articles that specifically pertain to petroleum: layout recognition and segmentation were run on the files, followed by OCR, and the results were formatted as METS/MODS.
Access the Dataset
Zenodo Record: https://zenodo.org/records/18591713
The full dataset contains about 17,000 zipped newspapers in METS/MODS format, with a total file size of approximately 480 GB. A random sample of 75 zipped Polish weekly newspapers in the public domain is also available.
Dataset Features
Time Range: 1853-1918
Language: Polish
Format: METS/MODS with OCR
Processing Tools: OCR-D and Eynollah
Source: Polona aggregator
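As a rough illustration of working with the METS/MODS format listed above, the sketch below opens one zipped newspaper and collects the MODS titles from any METS file it contains. It uses only the Python standard library. The function name read_mods_titles is hypothetical, and the internal layout of the archives (file names, where the METS file sits) is an assumption, so the code simply inspects every XML member. Only the METS and MODS namespace URIs are standard.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Standard namespace URIs for METS and MODS.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}


def read_mods_titles(zip_source):
    """Return {member_name: [titles]} for each METS file in the archive.

    Hypothetical helper: `zip_source` is a path or a binary file object.
    The archive layout is assumed, so every .xml member is checked.
    """
    titles = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue  # skip members that are not well-formed XML
            # Keep only documents whose root element is <mets:mets>.
            if root.tag != f"{{{NS['mets']}}}mets":
                continue
            titles[name] = [
                el.text.strip()
                for el in root.iterfind(".//mods:title", NS)
                if el.text and el.text.strip()
            ]
    return titles
```

Calling read_mods_titles("/path/to/newspaper.zip") would return a dictionary keyed by the METS file name inside the archive. The path here is a placeholder, not a file shipped with the dataset.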
Project Structure
polonaexplorer/
├── src/ # Source code
├── tests/ # Test files
├── example/ # Example usage
├── docs/ # Documentation
├── README.md # This file
└── pyproject.toml # Project configuration
Installation
Install the package in development mode:
pip install -e .
Usage
See the example directory for usage examples.
Basic usage example:
from polonaexplorer.explorer import PolonaExplorer
# Initialize explorer with target words
explorer = PolonaExplorer(
    targetwords=["petrol", "oil"],
    data_path="/path/to/polona/corpus",
    out_path="/path/to/output"
)
# Search for words and extract text
explorer.get_file_stats()
result_path = explorer.generate_dataframe()
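To give a feel for what searching a text for target words involves, here is a minimal standalone sketch that counts whole-word, case-insensitive matches in a string. It is not the package's implementation: count_target_words is a hypothetical helper, and how PolonaExplorer actually matches words is not documented here.

```python
import re
from collections import Counter


def count_target_words(text, target_words):
    """Count case-insensitive whole-word occurrences of each target word.

    Hypothetical helper for illustration only; it does not reproduce
    the matching logic used by PolonaExplorer.
    """
    counts = Counter()
    for word in target_words:
        # \b restricts matches to whole words, so "petrol" does not
        # match inside "petroleum".
        pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
        counts[word] = len(pattern.findall(text))
    return counts
```

Whole-word matching is a deliberate simplification. For an inflected language such as Polish, a real search would probably need stemming or lemmatization to catch the different forms of a word.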
Development
For development setup:
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run coverage
coverage run -m pytest
coverage report
Documentation
Documentation is built using Sphinx:
# Install documentation dependencies
pip install -e ".[docs]"
# Build documentation
make -C docs html
License
This project is licensed under the MIT License; see the LICENSE file for details.
Citation
If you use this dataset or code in your research, please cite:
Kaye, A., & Vogl, M. (2026). Petroleum and Press – The Concept of Oil in Polish Periodical Press, 1853–1918 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.18591713