Definition of the Problem
The goal of this project is to develop an open-source handwritten text recognition (HTR) and optical character recognition (OCR) workflow tool that streamlines the transcription and translation of a large collection of historical documents from the Native Bound Unbound (NBU) Archive of Indigenous Slavery database. Using modern machine learning techniques, the tool will automate the transcription and translation of these materials, making them more accessible for research and analysis. In addition to transcription, the project will integrate paleographic analysis to extract key data such as people, places, and dates. The tool will also incorporate a linguistic analysis component to identify recurring patterns in language use and writing style. These insights will help uncover the strategic ways language was used historically, such as how writers encoded meaning, negotiated identity, or signaled power relationships within the text.
Purpose
The transcription and translation of historical manuscripts are traditionally slow and labor-intensive processes. By applying deep learning and natural language processing (NLP) techniques, this project seeks to greatly reduce the time and effort required for these tasks. The completed workflow will be made freely available to researchers at Native Bound Unbound, where I currently intern, to accelerate their archival work and enable deeper linguistic, historical, and cultural analysis.
Plan of Action
Over the next few weeks, I will collect a sample set of handwritten documents from the NBU database and preprocess them for analysis. In parallel, to build intuition for the structure of the documents themselves, I will start transcription work on the FromThePage platform. I will then build a baseline OCR and handwriting recognition model in Python, using tools such as Tesseract (through a wrapper such as pytesseract) and EasyOCR, and test its accuracy on various document types.
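As a rough illustration of the baseline and accuracy-testing steps, here is a minimal sketch, not the project's actual pipeline. It assumes the pytesseract wrapper and Pillow are installed alongside the Tesseract engine, and it uses character error rate (CER) as the accuracy measure, which is my assumption about a reasonable metric rather than something decided in the plan above. The function names, the `eng` language code, and the sample strings are hypothetical placeholders.

```python
from pathlib import Path


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def ocr_page(image_path: Path, lang: str = "eng") -> str:
    """Run Tesseract on one page image (hypothetical helper).

    Imports are local so the scoring functions above work without OCR
    dependencies. The "eng" default is an assumption; the language of
    the NBU documents is not specified here.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


if __name__ == "__main__":
    # Placeholder strings standing in for a hand transcription and OCR output.
    reference = "baptized on the fourth day of May"
    hypothesis = "baptized on the fourth dav of Mav"
    print(f"CER: {character_error_rate(reference, hypothesis):.3f}")
```

Scoring each document type separately against a hand transcription (for example, one made during the FromThePage work) would show where a baseline struggles most before any model is fine-tuned.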