Master's Thesis
Objective
Document thesis progress, methodological choices, and writing milestones.
Done
- Downloaded a sub-corpus from the general press
- Built the Gallica PDF scraping pipeline (mostly scientific journals)
- Wrote a first OCR script with Tesseract and column reconstruction
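The column reconstruction step listed above can be illustrated with a minimal sketch. It is not the project's actual script: it assumes OCR output is available as word boxes (text, left, top, width in pixels), and the names `Word`, `split_columns`, `reading_order`, and the thresholds `min_gap` and `line_tol` are hypothetical choices made for illustration. The idea is to split the page wherever a wide vertical band contains no word, then read each column top to bottom.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """One OCR word box (hypothetical structure, pixel coordinates)."""
    text: str
    left: int
    top: int
    width: int


def split_columns(words: List[Word], min_gap: int = 40) -> List[List[Word]]:
    """Group words into columns, left to right.

    A new column starts when a word begins at least `min_gap` pixels
    to the right of everything seen so far (assumed threshold).
    """
    if not words:
        return []
    ordered = sorted(words, key=lambda w: w.left)
    columns = [[ordered[0]]]
    right_edge = ordered[0].left + ordered[0].width
    for word in ordered[1:]:
        if word.left - right_edge >= min_gap:
            columns.append([])
        columns[-1].append(word)
        right_edge = max(right_edge, word.left + word.width)
    return columns


def reading_order(words: List[Word], min_gap: int = 40, line_tol: int = 10) -> List[str]:
    """Return text lines column by column, each column read top to bottom.

    Words whose `top` differs by at most `line_tol` pixels from the first
    word of the current line are treated as the same line (assumed tolerance).
    """
    lines: List[str] = []
    for column in split_columns(words, min_gap):
        by_position = sorted(column, key=lambda w: (w.top, w.left))
        current = [by_position[0]]
        for word in by_position[1:]:
            if abs(word.top - current[0].top) <= line_tol:
                current.append(word)
            else:
                current.sort(key=lambda w: w.left)
                lines.append(" ".join(w.text for w in current))
                current = [word]
        current.sort(key=lambda w: w.left)
        lines.append(" ".join(w.text for w in current))
    return lines


# Toy two-column page: without column handling, the lines would interleave.
page = [
    Word("La", 10, 100, 30), Word("science", 50, 102, 80),
    Word("Le", 300, 100, 30), Word("champ", 340, 101, 60),
    Word("avance", 10, 140, 70), Word("pousse", 300, 140, 70),
]
print(reading_order(page))
```

In practice the word boxes could come from `pytesseract.image_to_data` with `output_type=Output.DICT`, which returns `text`, `left`, `top`, and `width` lists among others. Real journal pages would likely need thresholds tuned per title, plus handling for headlines that span several columns.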
Current state & Next steps
- Finish scraping, corpus structuring, and OCR processing.
- Finalize corpus scope.
- Active repositories: scraping_pdf and transcription.
- Writing work still to do: outline, historiography, technical bibliography, problem statement, and drafting.
- Structuring: move from OCR-processed sub-corpora to article-level units.
Progress log
- 26/02/2026: created the project page and initial structure.
- 18/03/2026: finished the PDF scraping pipeline, which uses the Gallica APIs and Selenium, and launched the scripts in the background. The scientific sub-corpus now includes around 50 journals from diverse fields (agriculture, medicine, industry, biology, etc.).
Comments