With 1000 rows and 100 samples and markdown-kv, I got these scores:
- gpt-4.1-nano: 52%
- gpt-4.1-mini: 72%
- gpt-4.1: 93%
- gpt-5: 100%
I was so surprised by gpt-5 getting 100% that I ran it again with 1000 samples. It got 999 correct, and one wrong.
To reproduce it yourself, clone the repo, add a .env file with OPENAI_API_KEY, `uv sync`, and then run:
```
uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv --model openai/gpt-5 --limit 100
```
Update: Also, number of rows makes a massive difference, unsurprisingly; at 100 rows, gpt-4.1-nano scores 95%+ for both markdown-kv and csv. Both model and record count seem to matter a lot more than format.
> accuracy: 60%

Not to mention that the least poorly performing format is probably the stupidest way to encode tabular data, beating even XML. But I guess that’s the new normal because we’re trying to shoehorn conversational AI models into every use case rather than, say, training finetunes that are better at particular tasks. (Yes, of course you can’t train finetunes when the model is a proprietary black box on someone else’s computer.) Something about hammers and nails…
To explain the 60% a bit more: with small amounts of input data, the accuracy is near 100%. As you increase the size of the input data, the accuracy gradually decreases.
For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test.
Thanks for your work on this! It's a very legit problem domain for LLMs to optimize for. I've produced a comprehensive eval based on your post and run it against 30 models, each tasked with recalling specific data from 500 rows in different tabular formats. Have a look at the results here: https://weval.org/analysis/table-format-sensitivity__combine...
As you can see it's near 100% recall across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing basic prompt adherence ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini etc.
Based on these limited tests, here are the leaderboards on formats FWIW:
So, the biggest takeaway really is: use the best model you can reasonably afford, and then format will matter less. The cheapest 100%-coverage models are Gemini 2.5 Flash and Deepseek Chat V3.1.
And if you have no control over the model, then use CSV or a Markdown table.
> As you increase the size of the input data, the accuracy gradually decreases.
Interesting.
Regarding your section "Limitations and Areas for Further Study", what I'd be curious about for future work would be:
- changing the order of the data on each table type
- changing the order of the questions
I'm curious to know whether the failures are the same each time, whether they change depending on location, and whether there's a bias.
Is it always a specific question? Is it always a specific value? Is it always question #x (or around question #x)? Does it tend towards x or y on types of questions?
Thank you for including the tokens needed for each test.
It looks to me that the most concise way of representing each of these tables was CSV, followed by a standard markdown table. The token count appears to be 1/2 or 1/3 of the other options. For experiments not in mice (GPT-4.1-nano) but in larger models, or with a larger context beyond the data table itself, my guess is that preserving context might be higher value than the greater LLM legibility of Markdown-KV.
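To sanity-check the token overhead locally, here's a small sketch, assuming the tiktoken package, its o200k_base encoding, and one of the post's example records (exact counts will differ from the post's totals, but the ordering should hold):

```
import tiktoken

record = {
    "id": 1, "name": "Charlie A0", "age": 56, "city": "New York",
    "department": "Operations", "salary": 67896,
    "years_experience": 7, "project_count": 1,
}

as_csv = ",".join(str(v) for v in record.values())  # one CSV data row (the header is amortised over many rows)
as_markdown_kv = "## Record 1\n" + "\n".join(f"{k}: {v}" for k, v in record.items())

enc = tiktoken.get_encoding("o200k_base")  # tokenizer family used by recent OpenAI models
for label, text in (("csv", as_csv), ("markdown-kv", as_markdown_kv)):
    print(f"{label}: {len(enc.encode(text))} tokens")
```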
The best performing isn't markdown tables, it's markdown key/value pairs:
```
## Record 1
id: 1
name: Charlie A0
age: 56
city: New York
department: Operations
salary: 67896
years_experience: 7
project_count: 1
```
Which makes sense to me because the problem with formats like CSV and regular markdown tables is that it is too easy for the model to mistakenly associate a value in a row with the wrong header.
Explicit key/value formats like this or YAML or JSON objects make that a lot less likely.
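For comparison, here's a quick sketch of the same record as explicit key/value structures, assuming PyYAML for the YAML half; in both, every value sits immediately next to its key:

```
import json

import yaml  # assumes PyYAML is installed

record = {
    "id": 1, "name": "Charlie A0", "age": 56, "city": "New York",
    "department": "Operations", "salary": 67896,
    "years_experience": 7, "project_count": 1,
}

print(json.dumps(record, indent=2))             # key next to value, at the cost of braces and quotes
print(yaml.safe_dump(record, sort_keys=False))  # same property, with less punctuation overhead
```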
I was looking for the frontier curve where they tested their benchmark across different models, since this sort of behavior is highly sensitive to parameter count, architecture, training, and fine-tuning. It’s a practically useful question, so I was really disappointed that a) they didn’t publish their code so you could test it yourself, and b) they didn’t do even a cursory examination of other models and sizes.
This should be higher. While the research question is interesting, the sample size makes the conclusion highly suspect. I'd like to see more research on this.
The test really needed to be run on multiple data sizes (50, 100, 500, 1000, 5000). The more token-efficient formats would probably eventually overtake the token-heavy ones due to context pollution. All this test really says is what performs best for one particular model at one particular context length.
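No one seems to have published that sweep yet. Here's a rough, self-contained sketch of what it could look like, assuming the openai Python package, a synthetic table, and a made-up salary-lookup question (the post's actual harness uses Inspect, so this is only an approximation of its setup):

```
import csv
import io
import random

from openai import OpenAI  # assumes the official openai package and OPENAI_API_KEY in the env

client = OpenAI()

def make_rows(n: int) -> list[dict]:
    # Synthetic stand-in for the post's employee table.
    rng = random.Random(0)
    return [
        {"id": i, "name": f"Person {i}", "salary": rng.randint(40_000, 120_000)}
        for i in range(1, n + 1)
    ]

def to_csv(rows: list[dict]) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_markdown_kv(rows: list[dict]) -> str:
    return "\n\n".join(
        "\n".join([f"## Record {r['id']}"] + [f"{k}: {v}" for k, v in r.items()])
        for r in rows
    )

def accuracy(rows: list[dict], table: str, samples: int = 20, model: str = "gpt-4.1-nano") -> float:
    rng = random.Random(1)
    correct = 0
    for row in rng.sample(rows, samples):
        prompt = f"{table}\n\nWhat is the salary of {row['name']}? Return just the number."
        reply = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        correct += str(row["salary"]) in (reply.choices[0].message.content or "")
    return correct / samples

for n_rows in (50, 100, 500, 1000, 5000):
    rows = make_rows(n_rows)
    for fmt, render in (("csv", to_csv), ("markdown-kv", to_markdown_kv)):
        print(n_rows, fmt, accuracy(rows, render(rows)))
```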
> We only tested OpenAI’s GPT-4.1 nano.

Interesting. Curious to reproduce across models, I made a comprehensive eval based on your post and ran it against 30 models, each tasked with recalling specific data from 500 rows in different tabular formats. Have a look at the results here: https://weval.org/analysis/table-format-sensitivity__combine...
As you can see it's near 100% recall across all formats for a good chunk of frontier models, with a few (curiously, mostly Claude) failing at basic prompt adherence ("Return just the number") but still returning the right answers. The major failures are from Mistral Medium, Llama Maverick, Llama 3 70b Instruct, Mistral Nemo, Gemma 3 12b It, GPT 4o/4.1 Mini etc.
Based on these limited tests, here are the leaderboards on formats FWIW:
IMO the biggest takeaway really is: use the best model you can reasonably afford, and then the format chosen will matter less. The cheapest 100%-coverage models are Gemini 2.5 Flash and Deepseek Chat V3.1. However, if you have no control over the model, then use CSV or a Markdown table, as these have the highest chance of success.
The MAJOR issue that we might not want to admit is that there are a thousand confounders that prevent any meaningful canonical learning here. Crucially: the data within the tabular structure itself matters HUGELY. The scary probabilistic nature of LLMs means the very subject of your queries can affect how the query is run, which is quite absurd from an I/O/computing-purity perspective. This is why tooling is so important. Enable the LLM to write and execute code safely, and you don't need to worry about such free-prose frailties.
Bizarre conclusions when, on average, all the formats perform poorly, with an average accuracy of 50%. Sure, 60% is better than 40%, but they are both unusable if you actually care about numbers...
I've been stunned by how many smart people talk so casually about LLMs becoming better at math. Do they just forget that a calculator that is wrong 1% of the time is a de facto calculator that doesn't work and should not be used?
Doing math is not the same as calculating. LLMs can be very useful in doing math; for calculating they are the wrong tool (and even there they can be very useful, but you ask them to use calculating tools, not to do the calculations themselves—both Claude and ChatGPT are set up to do this).
If you're curious, check out how mathematicians like Robert Ghrist or Terence Tao are using LLMs for math research, both have written about it online repeatedly (along with an increasing number of other researchers).
Apart from assisting with research, their ability on e.g. math olympiad problems is periodically measured and objectively rapidly improving, so this isn't just a matter of opinion.
You realize that when typing into a calculator, you probably hit a wrong key more than 1% of the time? Which is why you always type important calculations twice?
I've been stunned by how many smart people talk so casually about how because LLMs aren't perfect, they therefore have no value. Do they just forget that nothing in the world is perfect, and the values of things are measured in degrees?
To hopefully clarify a bit: I intentionally chose input data large enough that the LLM would be scoring in the region of 50% accuracy in order to maximise the discriminative power of the test.
Yeah, I mean for many real-world-scale datasets you don’t want to blow the whole context window on a massive markdown file. Instead you can provide a tool that presents the data as a SQLite database. In my testing Claude Code seems very capable of answering questions via SQLite queries or even `head` and `grep` on CSV files.
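As a concrete illustration of that approach, here's a minimal stdlib-only sketch, assuming a hypothetical employees.csv export of the benchmark table; the actual tool-calling glue (Claude Code, MCP, etc.) is omitted:

```
import csv
import sqlite3

conn = sqlite3.connect("employees.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS employees (id INTEGER, name TEXT, age INTEGER, city TEXT, "
    "department TEXT, salary INTEGER, years_experience INTEGER, project_count INTEGER)"
)
with open("employees.csv", newline="") as f:  # hypothetical CSV dump of the 1000-row table
    rows = [tuple(r.values()) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

def run_sql(query: str) -> str:
    """What the model's tool call returns: a tiny result set, not the whole table."""
    cur = conn.execute(query)
    header = [col[0] for col in cur.description]
    lines = [", ".join(header)] + [", ".join(map(str, row)) for row in cur.fetchall()]
    return "\n".join(lines)

# The model asks for exactly what it needs instead of reading 1000 rows:
print(run_sql("SELECT salary FROM employees WHERE name = 'Charlie A0'"))
```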
But the result from the SQL query is going to be... a table. So at some point, tables need to go into context, and we need to know how well LLMs can incorporate those tables.
This was exactly my thought. Rather than feed the table directly to the LLM, build agents that extract the data and have the LLM act on the extracted data items. Then it’s a preference issue.
The author didn’t see much more than 60% accuracy, which is not very useful for many (most?) real-world tasks.
Well, ironically you then have the issue of how to present your database schema (including important things like the values in some categorical fields) to the LLM and in what format, so you never really escape this issue.
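True, though the schema problem is at least small and bounded. A sketch of one common answer, assuming the employees.db file built in the SQLite sketch above: hand the model the DDL itself plus the distinct values of any categorical columns it will need to filter on.

```
import sqlite3

conn = sqlite3.connect("employees.db")  # hypothetical database from the sketch above

ddl = "\n".join(r[0] for r in conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'"))
departments = [r[0] for r in conn.execute("SELECT DISTINCT department FROM employees")]

system_prompt = (
    "You can query this SQLite database.\n\n"
    f"Schema:\n{ddl}\n\n"
    f"Known departments: {', '.join(departments)}"
)
```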
Great benchmark! It highlights an important but often downstream problem. In real-world pipelines, the bigger issue comes before this: extracting tables from PDFs or scans without breaking their layout. Once the structure is lost (merged headers, nested cells, footnotes, etc.), no data format can fully recover it.
Check out LLMWhisperer from Unstract → it preserves table and layout fidelity when converting documents for LLM use. You can try it on complex PDFs or forms here: https://pg.llmwhisperer.unstract.com (no signup needed)
Layout preservation upstream often improves downstream accuracy more than choosing between CSV, JSON, or Markdown. Find more details here: https://unstract.com/llmwhisperer/
> To reproduce it yourself, clone the repo, add a .env file with OPENAI_API_KEY, `uv sync`, and then run: …

Unfortunately I started getting "quota exceeded" almost immediately, but it did give 6/6 correct answers before it crapped out.
> - changing the order of the data on each table type
> - changing the order of the questions

Good idea
> I've been stunned by how many smart people talk so casually about LLMs becoming better at math.

Could they be referring to this?
"Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad" https://deepmind.google/discover/blog/advanced-version-of-ge...
Yes LLMs suck at calculating stuff. However they can manipulate equations and such, and sometimes impressively so.