My research often involves transcribing handwritten deeds, wills and other historical documents. It’s a painstaking task, made even more difficult when the documents, or the scanned images of them, are faded, damaged or written in poor penmanship.
I’m finding artificial intelligence (AI) applications increasingly helpful in other areas. See especially the articles “Lafayette’s 1825 Visit to Northern Virginia” and “History of Virginia’s Culpeper Basin.” So I decided to test AI in transcribing a handwritten 1866 deed with Transkribus optical character recognition (OCR).
Transkribus is an online application that uses OCR and AI algorithms to transcribe handwritten characters into digital text. It faces the same challenges I do in dealing with faded, poorly written and otherwise difficult-to-read documents. Transkribus offers free credits to process a limited number of pages, and paid plans for higher use.
The deed tested divides land in the Gainesville area of Prince William County among the heirs of Judge John Webb Tyler of Warrenton, who died in 1862. I downloaded it from the county court’s Self Service Historical Records portal. The scanned images of the deed’s three pages are clear and the handwriting mostly readable.
As a first-time Transkribus user, I found the interface challenging, so I turned to my favorite AI search assistant, Perplexity, to coach me through my test. I asked: “in Transkribus how to get started processing pdfs online with new account and first document?” It gave me the steps to take, a key one being to choose an OCR model to process the document. I then asked: “what models are good for 19th century American english?” Perplexity suggested three models and offered some tips for choosing among them. I chose the B2022 English Model M4.
Each of the three images was processed in under one minute. I was very disappointed in the output, with its mix of line numbers and many misspelled and run-together words. It was much less intelligible to me than the images of the handwritten deed. I opened the images on my computer and started cleaning up the text word by word. By the fourth line, I realized I could create a transcription from scratch more quickly than I could edit the OCR output.
Then I remembered the article that inspired me to try Transkribus, though I have since forgotten where I read it. The author reported a second step using another AI application, ChatGPT, to clean up OCR-created text. So I turned to Google Gemini Pro and copied the OCR’d text from Transkribus into a Google Docs file. My one fix was editing misspellings of “Tyler” in the first lines. I gave Gemini this prompt: “Review and edit the text to fix the numerous misspellings resulting from OCR. Ignore the sequential line numbers separating the lines of text and remove blank lines.”
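For readers who would rather remove the line numbers and blank lines before handing the text to an AI editor, here is a minimal Python sketch of that cleanup. It is not something I ran in my test. It assumes the line numbers sit on lines of their own between the lines of text, and the function name and sample text are made up for illustration.

```python
import re


def strip_ocr_line_numbers(raw: str) -> str:
    """Drop blank lines and lines that contain only a number.

    Assumption: the OCR export puts each sequential line number on its
    own line. A real line of the document that is only a number (a year
    written alone, for example) would be dropped too, so check the result.
    """
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Skip blank lines and number-only lines.
        if not stripped or re.fullmatch(r"\d+", stripped):
            continue
        kept.append(stripped)
    return "\n".join(kept)


# Hypothetical sample text, not taken from the deed.
sample = "1\nfirst line of text\n\n2\nsecond line of text\n"
print(strip_ocr_line_numbers(sample))
```

This only tidies the layout. Fixing the misspelled and run-together words still takes the AI cleanup step or a manual edit.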
The result astonished me: a near-perfect (97.3%) transcript of the deed, with just 33 errors and omissions in the transcript’s 1,214 words. One hallucination occurred: Brentsville, not mentioned in the deed, was added to a list of three other communities.
A Second Test
After reading this Reddit post, I decided to try another AI OCR service. The post reports tests and comparisons of several handwriting OCR services. Handwriting OCR seemed the most promising based on the author’s evaluations. Like Transkribus it is an online service. Five pages can be processed for free; after that paid plans are available. The steps for uploading and processing documents are simple and easily understood without outside help. I uploaded the same three-page deed as in the first test. It processed quickly and the output text was available to download in several file formats.
Comparing the Two Tests
Comparing the two tests, the accuracy rates were: Handwriting OCR (92%); Transkribus output cleaned up by Gemini Pro (97%). However, the Handwriting OCR text was immediately usable, without the many run-together words and non-words Transkribus produced. It was legible, and it was easy to check for accuracy and edit by sight-comparing it with the deed page images.
I used my two remaining free pages at Handwriting OCR to OCR Judge Tyler’s will. I thought its two-column page format and handwriting style would be more challenging than the deed and likely produce less accurate results. Instead, the output accuracy was better, 96 percent; just 33 of the 789 words required correction.
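The accuracy figures in this article are simple word-level percentages: the share of words that needed no correction. A minimal Python sketch of the arithmetic, using the counts reported above (the function name is my own for illustration):

```python
def word_accuracy(errors: int, total_words: int) -> float:
    """Return the percentage of words that needed no correction."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * (total_words - errors) / total_words


# Deed via Transkribus plus Gemini Pro: 33 errors in 1,214 words.
print(f"{word_accuracy(33, 1214):.1f}%")  # 97.3%

# Will via Handwriting OCR: 33 corrections in 789 words.
print(f"{word_accuracy(33, 789):.1f}%")   # 95.8%, which rounds to 96 percent
```

Counting errors by word treats a one-letter misspelling and a missing word the same way, so it is a rough measure rather than a precise one.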
Based on my tests of two handwriting OCR services, I will use this AI tool regularly to process handwritten historical documents faster and with less effort.
Have you tried any of the AI OCR services? What is your experience?