eviks · 7 months ago
For a case study, it would be nice if the case were actually studied…

> had unusually legible handwriting, but even “easy” early modern paleography like this is still the sort of thing that requires days or weeks of training to get the hang of.

Why would you need weeks of training to use some OCR tool? There's no comparison to any alternatives actually in use anywhere in the article. And testing only "unusually legible" handwriting isn't that relevant to the… usual cases

> This is basically perfect,

I've counted at least five errors on the first line; how is this anywhere close to perfection?

Same with translation: first, is this an obscure text with no existing translation to compare the accuracy against, instead of relying on your own poor knowledge? Second, what about existing tools?

> which I hadn’t considered as being relevant to understanding a specific early modern map, but which, on reflection, actually are (the Peter Burke book on the Renaissance sense of the past).

How?

> Does this replace the actual reading required? Not at all.

With seemingly irrelevant books like the previous one, yes, it does; the poor student has a rather limited time budget

benbreen · 7 months ago
I agree, I probably should've gone into more detail on the actual case studies and implications. I may write this up as a more academic article at some point so I have space to do that.

To your point about OCR: I think you'll find that the existing OCR tools will not know where to begin with the 18th century Mexican medical text in the second case study. If you can find one that is able to transcribe that lettering, please do let me know because it would be incredibly useful.

Speaking entirely for myself here, a pretty significant part of what professional historians do is take a ton of photos of hard-to-read archival documents, then slowly puzzle them out after the fact - not with any OCR tool (because none that I'm aware of is good enough to deal with difficult paleography) but the old-fashioned way: by printing them out, finding individual letters or words that are readable, and going from there. It's tedious work, and it requires at least a few days of training to get the hang of.

If anyone wants to get a sense of what this paleography actually looks like, this is something I wrote about back in 2013 when I was in grad school - https://resobscura.blogspot.com/2013/07/why-does-s-look-like...

For those looking for a specific example of an intermediate-difficulty manuscript in English, that post shows a manuscript of the John Donne poem "The Triple Fool," which gives a sense of a typical 17th century paleography challenge that GPT-4o is able to transcribe (and which, as far as I know, OCR tools can't handle - though please correct me if I'm wrong). The "Sea surgeon" manuscript below it is what I would consider advanced-intermediate, and it is around the point where GPT-4o, and probably most PhD students in history, gets completely lost.

re: basically perfect, the errors I see are entirely typos that don't change the meaning (descritto instead of descritta, and the like). So yes, not perfect, but nothing that would impact a historical researcher. In terms of existing tools for translation, the state of the art I was aware of before LLMs was Google Translate, and I think anyone who tries both on the same text can see which works better.

re: "irrelevant books," there's really no way to make an objective statement about what's relevant and what's not until you actually read something rather than an AI summary. For that reason, in my own work, this is very much about augmenting rather than replacing human labor. The main work begins after this sort of LLM-augmented research. It isn't replaced by it in any way.

eviks · 7 months ago
> To your point about OCR: I think you'll find that the existing OCR tools will not know where to begin with the 18th century Mexican medical text in the second case study. If you can find one that is able to transcribe that lettering, please do let me know because it would be incredibly useful.

My point about OCR is that you haven't done any comparison, and you're now making the same mistake of claiming without any evidence. The most basic option, built into Google Translate, does "know where to begin"; it doesn't even make the "physical" mistake, though it makes others. It also knows where to begin with the image from your second post, although it seems to do worse there. And it's not the state of the art, and I don't know what the state of the art for Spanish is either, but again, that wasn't my point. You don't have a care-free option: to catch that "physical" mistake you'd still need to read the source, which means you still need those days/weeks of training

> none of them that I'm aware of are good enough to deal with difficult paleography

And you haven't demonstrated anything re. difficult paleography for the LLMs in your article either!

> entirely typos which don't change the meaning

First, you'd need to actually demonstrate that, and that would require the full accounting you haven't done (and no, I don't plan to do it either). This could be a typo in a name or a year, which is bound to have some impact on a historical researcher? He'd search for a misspelled name and find nothing, while there could've been an interesting connection in some other text?

> translation, the state of the art that I was aware of before LLMs is Google Translate, and I think anyone who tries both on the same text can see which works better there.

Yes, do try it, for example in DeepL, and see that it's no worse

> no way to make an objective statement about what's relevant and what's not until you actually read something rather than an AI summary

Sure, but presumably you had done that before making the claim of relevance "on reflection"? So why demand this "human labor" of the students?

carschno · 7 months ago
I wanted to say this but could not express it as well. I think what your points also reveal is the biggest success factor of ChatGPT: it can do many things that specialised tools have long done (better), but many ChatGPT users never knew about those tools.

I do understand that a mere user of, e.g., OCR tooling doesn't perform a systematic evaluation of the available tools, although that would be the scientific way to choose one. For a researcher, however, the lack of knowledge about the tooling ecosystem seems concerning.

simonw · 7 months ago
Full quote:

> Granted, Monte had unusually legible handwriting, but even “easy” early modern paleography like this is still the sort of thing that requires days or weeks of training to get the hang of.

He isn't talking about weeks of training to learn to use OCR software, he means weeks of training to learn to read that handwriting without any assistance from software at all.

eviks · 7 months ago
And how would that change anything? If you needed to learn to read the handwriting before, despite having OCR available, why would this new tool let you skip that learning?

Or, to get back to my original comment, if it's OK to be illiterate now, why would you need weeks to learn to use an alternative OCR tool?

pjc50 · 7 months ago
Do you know any OCR tools that work on early modern English handwriting?
conjectures · 7 months ago
I used to work for a historical records org. As of 10 years back, "OCR" for such work meant getting humans to transcribe it. So whatever the limitations of genai, my prior is against there being a perfectly good old-fashioned OCR solution to the 'obscure historical handwriting' problem.
carschno · 7 months ago
I would start here: https://www.transkribus.org/

Experts in the field might know more specialized tools, or how to train an actually better Transkribus model, which doesn't require deep technical knowledge.

simonw · 7 months ago
I'd love to read way more stuff like this. There are plenty of people writing about LLMs from a computer science point of view, but I'm much more interested in hearing from people in fields like this one (academic history) who are vigorously exploring applications of these tools.
dr_dshiv · 7 months ago
I’m working with Neo-Latin texts at the Ritman Library of Hermetic Philosophy in Amsterdam (aka Embassy of the Free Mind).

Most of the library is untranslated Latin. I have a book that was recently professionally translated but has not yet been published. I'd like to benchmark LLMs against this work by having experts rate their preference for the human translation vs. the LLM's, at the paragraph level.

I’m also interested in a workflow that can enable much more rapid LLM transcriptions and translations — whereby experts might only need to evaluate randomized pages to create a known error rate that can be improved over time. This can be contrasted to a perfect critical edition.

And, on this topic, just yesterday I tried and failed to find English translations of key works by Gustav Fechner, an early German psychologist. This isn't obscure work: he invented the median and created the field of "empirical aesthetics." A quick translation of some of his writing with Claude immediately revealed the concept I was looking for. Luckily, I had a German speaker around to validate the translation…

LLMs will have a huge impact on humanities scholarship; we need methods and evals.

benbreen · 7 months ago
Thank you! Have been a big fan of your writing on LLMs over the past couple years. One thing I have been encouraged by over this period is that there are some interesting interdisciplinary conversations starting to happen. Ethan Mollick has been doing a good job as a bridge between people working in different academic fields, IMO.
grobbyy · 7 months ago
A basic problem is that they're trained on the Internet and take on all its biases. Ask any of them who proposed edX to MIT or who wrote the platform: you'll get back official PR. Look at a primary source (e.g. public git history or private email records) and you'll get the factual story.

The tendency to reaffirm popular beliefs would make current LLMs almost useless for actual historical work, which often involves sifting fact from fiction.

dmix · 7 months ago
Couldn't LLMs cite primary sources much the same way a textbook or Wikipedia does? That's how you circumvent the biases in textbook and Wikipedia summaries.
bandrami · 7 months ago
They can, but they also hallucinate non-existent references:

https://journals.sagepub.com/doi/10.1177/05694345231218454

simonw · 7 months ago
A raw LLM is a bad tool for citations, because you can't guarantee that their model weights will contain accurate enough information to be citable.

Instead, you should find the primary sources through other means and then paste them into the LLMs to help translate/evaluate/etc, which is what this author is doing.

Almondsetat · 7 months ago
Circumventing the bias would mean providing a uniform sampling of the primary sources, which is not guaranteed to happen
dartos · 7 months ago
This is a showcase of exactly what LLMs are good at.

Handwriting recognition, a classic neural network application, and surfacing information and ideas, however flawed, that one may not have had themselves.

This is really cool. This is AI augmenting human capabilities.

BeefWellington · 7 months ago
Good read on what someone in a specific field considers to have been achieved (rightly or wrongly). It does lead me to wonder how many of these old manuscripts and their translations are in the training set. That may limit its abilities against any random sample that isn't included.

Then again, maybe not; OCR is one of the most worked-on problems, so the quality of its parsing of characters into text perhaps shouldn't be so surprising.

Off topic: it's wild to me that in 2025 sites like substack don't apply `prefers-color-scheme` logic to all their blogs.

satisfice · 7 months ago
The intractable problem here is that "LLMs are good historians" is a nearly useless heuristic.

I'm not a historian. I don't speak old Spanish. I am not a domain expert at all. I can't do what the author of this post can do: expertly review the work of an LLM in his field.

My expertise is in software testing, and I can report that LLMs sometimes have reasonable testing ideas— but that doesn’t mean they are safe and effective when used for that purpose by an amateur.

Despite what the author writes, I cannot use an LLM to get good information about history.

simonw · 7 months ago
This is similar to the problem with some of the things people have been doing with o1 and o3. I've seen people share "PhD level" results from them... but if I don't have a PhD in that subject myself, it's almost impossible for me to evaluate their output and spot whether it makes sense or not.

I get a ton of value out of LLMs as a programmer partly because I have 20+ years of programming experience, so it's trivial for me to spot when they are doing "good" work as opposed to making dumb mistakes.

I can't credibly evaluate their higher level output in other disciplines at all.

xigency · 7 months ago
This raises the question: is this wave of LLM AI anything more than a fancy mirror? They're certainly very good at agreeing with people and following along but, as many have noted, not really useful when acting on their own.
amelius · 7 months ago
You __can__ get good information from an LLM; you just have to backtrack every once in a while when the information turns out to be false.
userbinator · 7 months ago
> you just have to backtrack every once in a while when the information turns out to be false

The problem is, how do you know? I've seen developers go completely off-course just from bad search engine results; one admitted he felt something wasn't right but kept going because he didn't know better. Now imagine him being told by a very confident but incorrect LLM, and you can see how hazardous that'll be.

"You don't know what you don't know."

mvdtnz · 7 months ago
And therein lies the problem - if you're not already an expert, there's no way to tell when the right moment to backtrack has arrived.
nithril · 7 months ago
That is the exact definition of a useful heuristic: "good enough"
jolmg · 7 months ago
> explicación poética

> There are, again, a couple errors here: it should be “explicación phisica” [physical explanation] not “poetic explanation” in the first line, for instance.

The image seems to say "phicica" (with a "c"), but that's not Spanish. "ph" isn't even a thing in Spanish. "Physical" is "física", at least today; I don't know about the 1700s. So, if you try to make sense of it by assuming that a nonsense word means you misread rather than that the writer "miswrote", I can see why it guesses it might say "poética", even though that makes less sense semantically.

benbreen · 7 months ago
Author here - I agree that my read may not be correct either. It's tough to make out. Although keep in mind that "ph" is used in Latin and Greek (or at least in transliterations of Greek into the Roman alphabet), so in an early modern medical context (i.e. one in which the reader is assumed to know Latin, regardless of the language being used) "ph" is still a plausible start to a word. Early modern spelling in general is famously variable - it's common to see an author spell the same word two different ways in the same text.
jolmg · 7 months ago
> So, if you try to make sense of it in such a way that you assume a nonsense word is you misreading

> I agree that my read may not be correct either

Just in case, by "you", I meant from the POV of the AI, not you the author.

That's interesting to know about "ph". I didn't know it was present in Latin, and I wonder if that's also the case with Spanish.

throwup238 · 7 months ago
> After all (he said, pleadingly) consciousness really is an irreducible interior fortress that refuses to be pinned down by the numeric lens (really, it is!)

I love this line and the “flattening of human complexity into numbers” quote above it. It sums up perfectly how I feel about the whole LLM to AGI hype/debate (even though he’s talking about consciousness).

Everyone who develops a model has to jump through the benchmark hoop that we all use to measure progress, yet we don't have anything approaching a rigorous definition of intelligence. Researchers are chasing benchmarks, but it doesn't feel like we're getting any closer to true intelligence, just flattening its expression into next-token prediction (aka everything is a vector).

voidhorse · 7 months ago
Yeah, precisely. Ever since the "brain as computer" metaphor was born in the 50s and 60s, the chief line of attack in the effort to make "intelligent" machines has been to narrow what we mean by intelligence further and further, until we can divest it of any dependence on humanist notions. We have "intelligent" machines today more as a byproduct of lowering the bar for what constitutes intelligence than of actually producing anything remotely capable of the same ingenuity as the average human being.
afthonos · 7 months ago
I find this take strange. My observation has been the opposite. We used to say it would take human intelligence to play chess. Then Deep Blue came up and we said, no, not like that. Then it was go. Then AlphaGo came up and we said no, not like that. Along the way, it was recognizing images. And then AlexNet came along, and we said no, not like that. Then it was creating art, and then LLMs came along, and we said no, not like that.

I agree a narrowing has happened. But the narrowing is to move us closer to saying "if it's not implemented in a brain, located inside a skull, in a body that was developed by DNA-coded cells replicating in a controlled manner over a period of years, it's not really AI."

There's an emotional attachment to intelligence being what makes us human that causes people to lose their minds when machines approach our intelligence. Machines aren't humans. If we value humanity, we should recognize that distinction—even as machines become intelligent and even sentient.

And we should definitely think twice, or, you know, many many many many more times, before building intelligent machines. But I don't think pretending we're not doing that right now is helpful.