I like Transkribus a lot and it is extremely helpful for getting quick transcriptions, especially when the models are trained well. It will never get to 100%, so manual intervention is always needed. And Transkribus is a really, really well-thought-out piece of software, even though its heavy dependence on Java makes it slow, especially on 50+ page documents.
However, I never liked their move from a research project to a commercial model. Their signup comes with plenty of credits for an individual who just wants to edit their family documents, but I still think it should be a bit more lenient for personal use.
Thankfully there is eScriptorium. Even if it is still in early development, it is a more user-friendly alternative to Transkribus.
https://gitlab.com/scripta/escriptorium
1. Indeed, a publicly funded European research project turned commercial software (closed source?) that expects institutions to pay annual fees. Hmm. I understand OSS still needs contributors and has ongoing maintenance costs, but couldn’t there be a more efficient way? It had a very German academic feel to me (no offense intended, and indeed it’s an endeavor started at four German universities).
2. The blog post almost reads like a nineties description of the value of IT. (A fun read and perspective, though! This is the positivist approach to history that underpins many interpretive histories of the future. Great and underestimated work.) The whole point of computer and user continuously augmenting each other falls somewhat short when the author describes how impressive, but fallible, both students and computers are. Along the lines of “okay, the output is x1000 and of pretty good quality, but it’s not professional academic quality”. When I think of chess, or poker: computers have given people /new ways of studying/ even before they apply what they learn. I think that point is still missed here. The software should point out mistakes by the students in training, while it learns from their additional input. That is the virtuous cycle of continuous improvement (a rough sketch of such a loop follows after this list).
And 3. Things like the scientific R and Python ecosystems, or like Stan, have shown me the power of creating open source tools for others to use. Like Andrew Gelman, who has remarked multiple times that he never could have expected the use cases Stan has now. (There are Bayesian sports scientists now..!) Teach people and give them tools, but don’t dictate the entire workflow. Please let outsiders have a chance of swapping models, doing proofs of concept, etc.
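To make point 2 a bit more concrete, here is a minimal, purely illustrative Python sketch of that review loop: the model transcribes, flags its least confident lines for the student to check, and is "fine-tuned" on the corrections before the next pass. The ToyRecognizer class, the confidence numbers, and the line names are all invented for the example; this is not the Transkribus or eScriptorium API.

    import random
    from dataclasses import dataclass


    @dataclass
    class Line:
        image_id: str          # which scanned text line this is
        prediction: str = ""   # model output
        confidence: float = 0.0


    class ToyRecognizer:
        """Stand-in for an HTR model; only the interface matters here."""

        def __init__(self):
            self.corrections = {}  # image_id -> human-corrected text

        def transcribe(self, line):
            # A real model would run inference; here confidence simply grows
            # with the amount of human feedback the model has absorbed.
            line.prediction = self.corrections.get(line.image_id, "<guess for %s>" % line.image_id)
            line.confidence = min(0.99, 0.6 + 0.02 * len(self.corrections) + random.uniform(0.0, 0.1))
            return line

        def fine_tune(self, fixes):
            # A real model would update its weights on the corrected lines.
            self.corrections.update(fixes)


    def review_round(model, lines, threshold=0.85):
        """One turn of the cycle: transcribe, flag doubtful lines, learn from fixes."""
        transcribed = [model.transcribe(line) for line in lines]
        doubtful = [line for line in transcribed if line.confidence < threshold]
        # In practice the student types the correct text for each flagged line;
        # we fake that step so the example runs end to end.
        fixes = {line.image_id: "<human correction for %s>" % line.image_id for line in doubtful}
        model.fine_tune(fixes)
        return len(doubtful)


    if __name__ == "__main__":
        model = ToyRecognizer()
        page = [Line("page1_line%02d" % i) for i in range(20)]
        for round_no in range(1, 4):
            flagged = review_round(model, page)
            print("round %d: %d lines sent back to the student" % (round_no, flagged))

The point is only the shape of the loop: students get targeted feedback on the lines the model is unsure about, and the model gets exactly the training data it most needs.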
The Roma Tre University has a research project named In Codice Ratio. One of its objectives is to use AI and OCR to transcribe the whole Vatican Secret Archives, one of the biggest collections of manuscripts, some of them more than a thousand years old.
The fact that older manuscript collections will have multiple texts in a single codex or scroll, with little annotation indicating the existence of these further texts, makes even generating a list of what’s in the collection a challenge. This gives me some hope that a complete catalog of the Vatican Secret¹ Archives might exist in my lifetime.
⸻
1. “Secret,” here, does not mean what you might assume, but rather refers to the fact that the archive belongs to the Pope personally and not some department of the curia.
"Based on an input of human transcriptions – a few dozen pages will suffice – the computer develops a reading model that can be more precise, and that will certainly be quicker, than handing the transcription work out to humans. "
I really like the idea of per-document OCR models.
So much of Europe's culture and history is locked in Latin books that have been scanned into archives but not successfully OCRed. Progress on this front would be nice! (As someone ignorant of both European and Chinese paleography, I get the impression that China's historical text digitization projects are far ahead of Europe's - I was able to download 200k Classical Chinese poems like it's nothing, in a nicely formatted JSON archive on GitHub, for instance.)
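The quoted workflow above (develop a reading model from a few dozen hand-transcribed pages of one specific document) maps naturally onto fine-tuning a small CTC line recognizer. The sketch below is a generic, hypothetical illustration in PyTorch, not the approach the article's project actually uses; in practice one would more likely fine-tune an existing HTR stack such as kraken, the recognition engine behind eScriptorium, rather than hand-roll the loop.

    import torch
    import torch.nn as nn


    class TinyLineRecognizer(nn.Module):
        """Minimal CNN + BiLSTM + CTC head over grayscale text-line images."""

        def __init__(self, n_chars, height=48):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.rnn = nn.LSTM(64 * (height // 4), 128, bidirectional=True, batch_first=True)
            self.head = nn.Linear(256, n_chars + 1)  # +1 for the CTC blank symbol

        def forward(self, x):                 # x: (batch, 1, height, width)
            feats = self.conv(x)              # (batch, 64, height/4, width/4)
            b, c, h, w = feats.shape
            feats = feats.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one step per image column
            out, _ = self.rnn(feats)
            return self.head(out)             # (batch, width/4, n_chars + 1)


    def fine_tune(model, line_pairs, epochs=10, lr=1e-4):
        """line_pairs: list of (image_tensor, target_indices) built from the few
        dozen hand-transcribed pages of *this* document; targets use 1..n_chars,
        since index 0 is reserved for the CTC blank."""
        ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):
            for img, target in line_pairs:
                logits = model(img.unsqueeze(0))                    # (1, T, C)
                log_probs = logits.log_softmax(-1).transpose(0, 1)  # (T, 1, C), as CTCLoss expects
                input_len = torch.tensor([log_probs.size(0)])
                target_len = torch.tensor([target.numel()])
                loss = ctc(log_probs, target.unsqueeze(0), input_len, target_len)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return model

Nothing here is specific to the article's project; it only illustrates why a few dozen corrected pages can be enough to specialize a small recognizer to one scribe's hand or one document's script.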
I have a friend who does transcription, and she says that she can recognize the handwriting of people like Thomas Aquinas and other Medieval figures. She works for a research group that has microfilms from a large cache of documents kept in a vault somewhere in France.
For those who don't know, Thomas Aquinas has a reputation for tricky-to-read handwriting https://www.reddit.com/r/latin/comments/mbbjcw/this_is_said_... (I remember trying to find some description of his shorthand/writing style, but couldn't find anything - people love talking about how hard it is to read but never try to explain how to read it...), but more generally medieval manuscripts use a large number of shorthand-like abbreviations which throw up lots of problems for naive OCRing/reading, even if you know the language in question.
The code hasn't been released (yet) but you can find some preliminary results here: http://www.inf.uniroma3.it/db/icr/preliminary-results.html
Weird how the brain jumps to "term I know" rather than actually reading a word.