I think it very likely that gpt-4o was trained on this. I mean, why would you not? Innnput, innnput, Johnny five need more tokens.
I wonder why the NIAN team don't generate their limericks using different models, and check to make sure they're not in the dataset? Then you'd know the models couldn't possibly be trained on them.
I tested the LLMs to make sure they could not answer the questions unless the limerick was given to them. Other than 4o, they do very badly on this benchmark, so I don't think the test is invalidated by their training.
Why wouldn't it still be invalidated if it was indeed trained on it? The others may do worse, and may or may not have been trained on it, but their failing doesn't by itself imply that 4o can do this well without the task being present in the corpus.
It would be interesting to know how it acts if you ask it about one that isn't present, or even lie to it (e.g. take a limerick that is present but change some words and ask it to complete it)
Maybe some models hallucinate or even ignore your mistake vs others correcting it (depending on the context, ignoring or calling out the error might be the more 'correct' approach).
We have run that test: generate random-string names for values (not generated by an LLM), then ask the LLM to do math (algebra) using those strings. It tests logic and is 100% not in the dataset. GPT-2 was around 50% accurate; now we're up around 90%.
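For anyone who wants to try something like this, here is a minimal sketch of that kind of probe in Python (my own illustration, not the parent's actual harness): the variable names are random strings, so the exact problem can't be in any training corpus, and the expected answer is computed for scoring.

    import random
    import string

    def make_algebra_probe(seed=None):
        """Build a prompt with gibberish variable names and a checkable numeric answer."""
        rng = random.Random(seed)
        names = ["".join(rng.choices(string.ascii_lowercase, k=8)) for _ in range(3)]
        values = [rng.randint(2, 20) for _ in range(3)]
        definitions = "\n".join(f"{n} = {v}" for n, v in zip(names, values))
        expression = f"{names[0]} * {names[1]} + {names[2]}"
        answer = values[0] * values[1] + values[2]
        prompt = f"{definitions}\n\nCompute {expression}. Reply with just the number."
        return prompt, answer

    prompt, expected = make_algebra_probe(seed=42)
    # Send `prompt` to the model under test and compare its reply to `expected`.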
NIAN is a very cool idea, but why not simply translate it into N different languages (you can even mix services, e.g. DeepL/Google Translate/the LLMs themselves) and ask about them that way?
I just used it to compare two smaller legal documents and it completely hallucinated that items were present in one and not the other. It did this on three discrete sections of the agreements.
Using Ctrl-F I was able to see that the sections were identical in both documents.
Obviously this is a single sample, but 90% seems unlikely. The two documents were around 80k tokens total.
I have the same feeling. I asked it to find duplicates in a list of 6k items and it basically hallucinated the entire answer multiple times. Sometimes it finds some, but it interlaces the duplicates with other hallucinated items. I wasn't expecting it to get it right, because I think this task is challenging with a fixed number of attention heads. However, the answer seems much worse than Claude Opus or GPT-4.
I would note that LLMs handle this task better if you slice the two documents into smaller sections and iterate section by section. They aren’t able to reason and have no memory so can’t structurally analyze two blobs of text beyond relatively small pieces. But incrementally walking through in much smaller pieces that are themselves semantically contained and related works very well.
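A minimal sketch of that incremental approach (my own illustration; ask_llm stands in for whatever completion call you already use, and real code would split on headings or clauses rather than fixed character counts):

    def compare_in_chunks(doc_a, doc_b, ask_llm, chunk_chars=4000):
        """Walk two documents section by section and ask the model to diff only the current pair."""
        chunks_a = [doc_a[i:i + chunk_chars] for i in range(0, len(doc_a), chunk_chars)]
        chunks_b = [doc_b[i:i + chunk_chars] for i in range(0, len(doc_b), chunk_chars)]
        findings = []
        # Note: zip stops at the shorter document; align or pad sections in real use.
        for i, (a, b) in enumerate(zip(chunks_a, chunks_b)):
            prompt = (
                f"Section {i + 1} of document A:\n{a}\n\n"
                f"Section {i + 1} of document B:\n{b}\n\n"
                "List any substantive differences between these two sections, "
                "or reply IDENTICAL if there are none."
            )
            findings.append(ask_llm(prompt))
        return findings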
The assumption that they are magic machines is a flawed one. They have limits and capabilities and like any tool you need to understand what works and doesn’t work and it helps to understand why. I’m not sure why the bar for what is still a generally new advance for 99.9% of developers is effectively infinitely high while every other technology before LLMs seemed to have a pretty reasonable “ok let’s figure out how to use this properly.” Maybe because they talk to us in a way that appears like it could have capabilities it doesn’t? Maybe it’s close enough sounding to a human that we fault it for not being one? The hype is both overstated and understated simultaneously but there have been similar hype cycles in my life (even things like XML were going to end world hunger at one point).
That's a different test than needle-in-a-needlestack, although it is telling how brittle these models are: competent in one area, and crushingly bad in others.
Needle-in-a-needlestack contrasts with needle-in-a-haystack by being about finding a piece of data among similar ones (e.g. one specific limerick among thousands of others), rather than among dissimilar ones.
Also, the article is testing a different task (Needle in a Needlestack, which is similar in spirit to Needle in a Haystack) compared to finding a difference between two documents. It's certainly useful to know that the model does OK on one and really badly on the other, but that does not mean the original test is flawed.
Yeah, I asked for an estimate of the percentage of the US population that lives in the DMV area (DC, Maryland, Virginia) and it was off by 50% of the actual answer, which I only caught because I realized I shouldn't trust its estimates for anything important.
The needle-in-the-haystack test gives a very limited view of a model's actual long-context capabilities. It's mostly used because early models were terrible at it and it's easy to test. In fact, most recent models now do pretty well at this one task, but in practice their ability to do anything complex drops off hugely after 32K tokens.
> Despite achieving nearly perfect performance on the vanilla needle-in-a-haystack (NIAH) test, all models (except for Gemini-1.5-pro) exhibit large degradation on tasks in RULER as sequence length increases.
> While all models claim context size of 32k tokens or greater (except for Llama3), only half of them can effectively handle sequence length of 32K by exceeding a qualitative threshold, Llama2-7b performance at 4K (85.6%). The performance exceeding the threshold is underlined.
I'd like to see this for Gemini Pro 1.5 -- I threw the entirety of Moby Dick at it last week, and at one point all the books Byung-Chul Han has ever published, and in both cases it was able to return the single part of a sentence that mentioned or answered my question verbatim, every single time, without any hallucinations.
A number of people in my lab do research into long context evaluation of LLMs for works of fiction. The likelihood is very high that Moby Dick is in the training data. Instead the people in my lab have explored recently published books to avoid these issues.
I'm not involved in the space, but it seems to me that having a model, in particular a massive model, exposed to a corpus of text like a book in the training data would have very minimal impact. I'm aware that people have been able to pull data 'out of the shadows' of the training data, but to my mind a model being mildly influenced by the weights between different words in this text hardly constitutes hard recall; if anything, it now 'knows' a little of the linguistic style of the author.
But this content is presumably in its training set, no? I'd be interested if you did the same task for a collection of books published more recently than the model's last release.
To test this hypothesis, I just took the complete book "Advances in Green and Sustainable Nanomaterials" [0] and pasted it into the prompt, asking Gemini: "What absorbs thermal radiations and converts it into electrical signals?".
It replied: "The text indicates that graphene sheets present high optical transparency and are able to absorb thermal radiations with high efficacy. They can then convert these radiations into electrical signals efficiently.".
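For reference, reproducing that kind of test takes very little code; a rough sketch assuming the google-generativeai Python SDK and a plain-text dump of the book (the file name here is made up):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-pro-latest")

    # Hypothetical plain-text extraction of the book, pasted into the prompt.
    book_text = open("book.txt", encoding="utf-8").read()
    question = "What absorbs thermal radiations and converts it into electrical signals?"

    response = model.generate_content(f"{book_text}\n\nQuestion: {question}")
    print(response.text)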
Gemini works with brand new books too; I've seen multiple demonstrations of it. I'll try hunting one down. Side note: this experiment is still insightful even using model training material. Just compare its performance with the uploaded book(s) to without.
I would hope that Byung-Chul Han would not be in the training set (at least not without his permission), given he's still alive and not only is the legal question still open but it's also definitely rude.
Just put the 2500 example linked in the article through Gemini 1.5 Flash and it answered correctly ("The tree has diseased leaves and its bark is peeling."): https://aistudio.google.com/
Wow. Cool. I have access to that model and have also seen some impressive context extraction. It also gave a really good summary of a large code base that I dumped in. I saw somebody analyze a huge log file, but we really need something like this needle-in-a-needlestack test to help identify when models might be missing something. At the very least, it could give model developers something to use when analyzing their proposed models.
Funnily enough, I ran a 980k-token log dump against Gemini Pro 1.5 yesterday to investigate an error scenario. It found a single incident of a 429 error being returned by a third-party API provider, reasoning that "based on the file provided and the information that this log file is aggregated of all instances of the service in question, it seems unlikely that a rate limit would be triggered, and additional investigation may be appropriate". It turned out the service had implemented a block against AWS IPs, breaking a system that loads press data from said API provider and leaving the customer affected by it without press data -- we didn't even notice or investigate that, and Gemini just mentioned it without being prompted for it.
Man, we are like 2-5 years away from being able to feed in an ePub and get an accurate graphic novel version in minutes. I am so ready to look at four thousand paintings of Tolkien trees.
What version of Gemini is built into Google Workspace? (I just got the ability today to ask Gemini anything about emails in my work Gmail account, which seems like something that would require a large context window)
Someone needs to come up with a "synthesis from haystack" test that tests not just retrieval but depth of understanding, connections, abstractions across diverse information.
When a person reads a book, they have an "overall intuition" about it. We need some way to quantify this. Needle-in-a-haystack tests feel like simple tests that don't go far enough.
An elaborate Agatha Christie style whodunit, with a series of plot-twists and alibis which can be chopped off the end of the piece to modify who is the most likely suspect
My idea is to buy an unpublished novel or screenplay with a detailed, internally consistent world built into it and a cast of characters with well-crafted motivations, then ask it to continue writing from an arbitrary post-midpoint by creating a new plot line that combines two characters who haven't yet met in the story. If it understands the context, it should be able to write a new part of the story and use a reader's intuitive sense of the characters' motivations to move through their arcs.
This whole thing would have to be kept under lock-and-key in order to be useful, so it would only serve as a kind of personal benchmark. Or it could possibly be a prestige award that is valued for its conclusions and not for its ability to use the methodology to create improvements in the field.
I was thinking about something similar -- make the first part of the question contain enough information that the LLM can find the limerick. Then the second part would ask something that requires a deeper understanding of the limerick (or other text).
GPT4o still can't do the intersection of two different ideas that are not in the training set. It can't even produce random variations on the intersection of two different ideas.
Further, though, we shouldn't expect the model to do this. It isn't fair to the model, given its actual usefulness and how amazing it is what these models can do with zero understanding. To believe the model understands is to fool yourself.
I wonder if there is some way to have an AI help humans improve their "reading comprehension" aka reasoning across a large body of text. As far as I can tell the only way to do this is to cut out mindless scrolling and force yourself to read a lot of books in the hopes that this skill might be improved.
I am many years out of grade school, where I was required to read a multitude of novels every year, and I guess years of mindless Reddit scrolling plus focusing on nothing but mathematics and the sciences in college have taken their toll: I read long articles or books but completely miss the deeper meaning.
As an example: my nerd-like obsession with random topics from the decade before I was born (until I get bored) caused me to read numerous articles and all of Wikipedia plus sources on the RBMK reactors and the Chernobyl nuclear accident, as well as the stories of the people involved.
But it wasn't until I sat down and watched that famous HBO miniseries that I finally connected the dots of how the lies and secretive nature of the Soviet system led to the design flaws in the reactor, and how the subsequent suicide of Valery Legasov helped finally expose them to the world, where they could no longer be hidden.
It's like I knew of all these events and people separately but could not connect them together to form a deep realization, and when I saw it acted out on screen it all finally hit me like a ton of bricks. How had I not seen it?
Hoping one day AI can just scan my existing brain structure and recommend activities to change the neuronal makeup to what I want it to be. Or even better, since I'm a lazy developer, it should just do it for me.
It's hard, but if you have a piece of fiction or non-fiction it hasn't seen before, then a deep reading comprehension question can be a good indicator. But you need to be able to separate a true answer from BS.
"What does this work says about our culture? Support your answer with direct quotes."
I found both gpt-4 and haiku to do alright at this, but sometimes give answers that imply fixating on certain sections of a 20,000 k context. You could compare it against chunking the text, getting the answer for each chunk and combining them.
I suspect if you do that then the chunking would win for things that are found in many chunks, like the work is heavy handed on a theme, but the large context would be better for a sublter message, except sometimes it would miss it altogether and think a Fight Club screenplay was a dark comedy.
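A rough sketch of that chunk-then-combine comparison (my own illustration; ask_llm is any prompt-to-answer function you already have):

    def answer_by_chunks(text, question, ask_llm, chunk_chars=8000):
        """Map-reduce style: ask the question of each chunk, then synthesize the partial answers."""
        chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
        partials = [
            ask_llm(f"{question}\n\nAnswer using only this excerpt, with direct quotes:\n{chunk}")
            for chunk in chunks
        ]
        combined = "\n\n".join(partials)
        return ask_llm(
            f"{question}\n\nSynthesize a single answer from these per-excerpt answers:\n{combined}"
        )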
Well, I can now use GPT to transform raw dynamic data into beautiful HTML layouts on the fly for low-traffic pages, such as change/audit logs, saving a ton of development time and keeping my HTML updated even when the data structure has changed. My last attempt did not consistently work because GPT4-Turbo sometimes ignored the context and instructions almost entirely.
Here is the entire prompt (a rough sketch of wiring it up through the API follows below). I used rules to ensure the formatting is consistent, as otherwise it might format dates one way sometimes and an entirely different way other times.
Imagine a truly dynamic and super-personal site, where the layout, navigation, styling and everything else gets generated on the fly from the user's usage behavior and other preferences. Man!
---------------------------------------------
{JSON}
------
You are an auditing assistant. Your job is to convert the ENTIRE JSON containing "Order Change History" into a human-readable Markdown format. Make sure to follow the rules given below by letter and spirit. PLEASE CONVERT THE ENTIRE JSON, regardless of how long it is.
---------------------------------------------
RULES:
- Provide markdown for the entire JSON.
- Present changes in a table, grouped by date and time and the user, i.e., 2023/12/11 12:40 pm - User Name.
- Hide seconds from the date and time and format using the 12-hour clock.
- Do not use any currency symbols.
- Format numbers using 1000 separator.
- Do not provide any explanation, either before or after the content.
- Do not show any currency amount if it is zero.
- Do not show IDs.
- Order by date and time, from newest to oldest.
- Separate each change with a horizontal line.
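To make it concrete, here is a rough sketch of wiring that prompt up through the OpenAI Python SDK (the function name and model choice are mine, not part of the setup described above):

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    # The instructions and RULES block from the prompt above, verbatim.
    AUDIT_INSTRUCTIONS = """You are an auditing assistant. ... RULES: ..."""

    def render_change_history(change_history_json: str) -> str:
        """Send the order-change-history JSON plus the instructions and return the Markdown."""
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,
            messages=[{"role": "user", "content": f"{change_history_json}\n------\n{AUDIT_INSTRUCTIONS}"}],
        )
        return response.choices[0].message.content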
The article shows how much better GPT-4o is at paying attention across its input window compared to GPT-4 Turbo and Claude-3 Sonnet.
We've needed an upgrade to needle in a haystack for a while and this "Needle In A Needlestack" is a good next step! NIAN creates a prompt that includes thousands of limericks and the prompt asks a question about one limerick at a specific location.
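For readers who haven't clicked through, the core of the setup is roughly this (my own paraphrase of the idea, not the benchmark's actual code):

    def build_needlestack_prompt(filler_limericks, needle, position, question):
        """Bury one target limerick at a chosen position in a stack of filler limericks,
        then ask a question whose answer appears only in that limerick."""
        stack = filler_limericks[:position] + [needle] + filler_limericks[position:]
        return (
            "Here is a long list of limericks:\n\n"
            + "\n\n".join(stack)
            + f"\n\n{question}\nAnswer using only the limericks above."
        )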
I agree; I paid for Claude for a while. Even though they swear the context is huge, and having a huge context uses up tokens like crack, it's near useless when source code in the context is just a few pages back. It was so frustrating, as everything else was as good as anything and I liked the 'vibe'.
I used 4o last night and it was still perfectly aware of a C++ class I pasted 20 questions ago. I don't care about smart, I care about useful and this really contributes to the utility.
I'm just glad that we are finally past the "Who was the 29th president of the United States" and "Draw something in the style of Van Gogh" LLM evaluation test everyone did in 2022-2023.
Using limericks is a very nifty idea!
This is such an anti-intellectual comment to make, can't you see that?
You mention "sample" so you understand what statistics is, then in the same sentence claim 90% seems unlikely with a sample size of 1.
The article has done substantial research.
Here's a much simpler and correct description that almost everyone can understand: it fucks up constantly.
Getting something wrong even once can make it useless for most people. No amount of pedantry will change this reality.
Also: would you expect random people to fare any better?
It's purported to be a major use case.
RULER is a much better test:
https://github.com/hsiehjackson/RULER
1. The article is not about NIAH; it's their own variation, so it could be more relevant.
2. The whole claim of the article is that GPT-4o does better, but the test you're pointing to hasn't benchmarked it.
See BooookScore (https://openreview.net/forum?id=7Ttk3RzDeu), which was just presented at ICLR last week, and FABLES (https://arxiv.org/abs/2404.01261), a recent preprint.
Could be a simpler setup than RAG for slow-changing documentation, especially for read-heavy cases.
How far off am I?
FABLES/booklist.md: https://github.com/mungg/FABLES/blob/main/booklist.md
/gscholar_related? FABLES: https://scholar.google.com/scholar?q=related:Y-Hx-kplbEUJ:sc...
/gscholar_citations? BooookScore: https://scholar.google.com/scholar?cites=1796862036168524911...
From that one day awhile ago: https://news.ycombinator.com/item?id=38347868#38354679 :
> "LLMs cannot find reasoning errors, but can correct them" [ https://arxiv.org/abs/2311.08516 ] https://news.ycombinator.com/item?id=38353285
Screenshot of the PDF with the relevant sentence (the graphene passage quoted above) highlighted: https://i.imgur.com/G3FnYEn.png
[0] https://www.routledge.com/Advances-in-Green-and-Sustainable-...
This doesn't mean you're wrong, though.
Generate 1000 generic facts about Alice and the same 1000 facts about Eve. Randomise the order, change one minor detail, then ask how they differ.
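A quick sketch of how that could be generated (my own illustration; the fact templates and the "minor detail" tweak are placeholders):

    import random

    def build_alice_eve_probe(facts, seed=0):
        """State the same facts about Alice and Eve, shuffle them together, and alter one
        detail on the Eve side. Returns the prompt and the altered fact for scoring."""
        rng = random.Random(seed)
        alice = [f"Alice {f}." for f in facts]
        eve = [f"Eve {f}." for f in facts]
        i = rng.randrange(len(eve))
        eve[i] = eve[i].rstrip(".") + ", but only on Tuesdays."  # the one minor difference
        lines = alice + eve
        rng.shuffle(lines)
        prompt = "\n".join(lines) + "\n\nHow do Alice and Eve differ? Be specific."
        return prompt, eve[i]

    prompt, changed_fact = build_alice_eve_probe(["owns a red bicycle", "works as a baker"])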
Interpretation is hard I guess.