Master's Thesis

Status: in progress

Last update: 18/03/2026

Progress: Early phase

Objective

Document thesis progress, methodological choices, and writing milestones.

Done

Downloaded a sub-corpus from the general press.
Built a Gallica PDF scraping pipeline (mostly scientific journals).
Wrote a first OCR script with Tesseract and column reconstruction.
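
As an illustration of what column reconstruction involves, here is a minimal sketch. It is not the script from the transcription repository. It assumes word boxes are already available with the left, top, width, height, and text fields that Tesseract reports in its TSV output. The names Word, find_columns, and reading_order, and the min_gutter and line_tolerance thresholds, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR word box (hypothetical container; fields mirror Tesseract TSV columns)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def find_columns(words, min_gutter=40):
    """Return (start, end) x-ranges separated by empty gutters of at least min_gutter pixels."""
    spans = sorted((w.left, w.left + w.width) for w in words)
    columns = []
    for start, end in spans:
        if columns and start - columns[-1][1] < min_gutter:
            # Overlaps or nearly touches the current column: extend it.
            columns[-1][1] = max(columns[-1][1], end)
        else:
            # Gap is wide enough to count as a gutter: start a new column.
            columns.append([start, end])
    return [tuple(c) for c in columns]


def join_line(line_words):
    """Join the words of one text line from left to right."""
    return " ".join(w.text for w in sorted(line_words, key=lambda w: w.left))


def reading_order(words, min_gutter=40, line_tolerance=10):
    """Return text lines column by column, left to right, top to bottom."""
    lines = []
    for start, end in find_columns(words, min_gutter):
        in_column = sorted(
            (w for w in words if start <= w.left <= end),
            key=lambda w: (w.top, w.left),
        )
        current, current_top = [], None
        for w in in_column:
            # A vertical jump larger than the tolerance closes the current line.
            if current and abs(w.top - current_top) > line_tolerance:
                lines.append(join_line(current))
                current = []
            if not current:
                current_top = w.top
            current.append(w)
        if current:
            lines.append(join_line(current))
    return lines


if __name__ == "__main__":
    page = [
        Word("Le", 10, 100, 20, 12), Word("sol", 40, 101, 30, 12),
        Word("argileux", 10, 120, 60, 12),
        Word("La", 300, 100, 20, 12), Word("vigne", 330, 99, 50, 12),
        Word("pousse", 300, 120, 60, 12),
    ]
    for line in reading_order(page):
        print(line)
```

A known limitation of this sketch: a headline that spans several columns bridges the gutter and merges them into one, so real pages would likely need per-block handling.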

Current state & Next steps

Finish scraping, corpus structuring, and OCR processing.
Finalize corpus scope.
Active repositories: scraping_pdf and transcription.
Still to do: thesis outline, historiography, technical bibliography, problem statement, and writing.
Structuring: move from OCR-processed sub-corpora to article-level units.

Progress log

26/02/2026: created the project page and initial structure.
18/03/2026: finished the PDF scraping pipeline, which uses the Gallica APIs and Selenium, and launched the scripts in the background. The scientific sub-corpus now includes around 50 journals from diverse fields (agriculture, medicine, industry, biology, etc.).

Source code · Source code 2
