I have run a few queries on Elicit to understand the product a bit more. I asked about media bias detection and used the topic analysis feature. A minute or so later, I had a list of concepts with citations and links to papers I can look at further. This feels like an _amazing_ tool to do literature overviews and to dive into new academic domains with which one is not familiar.
I gave it a topic I researched in depth recently. It gave me mostly incorrect summaries (one said hypothesis X is confirmed; nope, it hasn't been, that would have been all over the news), missed key papers, and dug up obscure and irrelevant ones. Par for the course with LLMs.
Edit: After looking at the examples on the front page "What are the benefits of taking l-theanine?" this seems geared for the general public, so maybe it wasn't the right test.
I think of AI as a super keen, arrogant intern. With GitHub Copilot, it's an intern that constantly interrupts me.
When that works for me, I am probably weak on the subject material myself. eg writing quirky love poems to my wife in different styles.
For research tasks, because the AI is not deeply self-reflective, it can output inconsistent and incoherent results. What it does is present text that only *looks* as if it confidently knows what it is talking about.
For domains where high rational quality doesn’t matter like love poetry, it is amazing. For other domains, be wary. If you can’t tell the difference between what is actually good and what merely looks good superficially, you will be in trouble.
I generally agree with your take at first, but the following statements are funny to me:
“ I am probably weak on the subject material myself. eg writing quirky love poems to my wife in different styles.”
“ For domains where high rational quality doesn’t matter like love poetry, it is amazing.”
So, self-described weak at love poetry, but confident that it is a domain that LLMs excel at. That is an interesting take. Perhaps the LLM is just as weak at liberal arts as it is at hard science, but it is just more difficult to measure since you aren’t in the domain. Most poetry I’ve seen from LLMs has been pretty rote and boring, although, as you say, that's not a “rational quality” I suppose.
"Geared for the general public" is an interesting and revealing observation.
You might not have given the "right test" in terms of the actual userbase, but it is absolutely the right test in terms of Elicit's marketing claims. Elicit might be implicitly geared for the general public, but they are explicitly marketing to scientists.
I suspect a lot of Elicit's target users want to use scientific knowledge in their personal/professional lives, but without doing the hard work of gaining scientific understanding. However, they're not going to spend money on a product that says "we use AI to create sciencey bullshit that sounds plausible in conversation." They want a product that Real Scientists would use. (Similar to how purely decorative Damascus steel Bowie knives are gussied up by an outdoorsman pretending to use the knife to gut a fish or whatever.)
> I suspect a lot of Elicit's target users want to use scientific knowledge in their personal/professional lives, but without doing the hard work of gaining scientific understanding. However, they're not going to spend money on a product that says "we use AI to create sciencey bullshit that sounds plausible in conversation." They want a product that Real Scientists would use.
It sounds kind of like toy marketing: want to sell a toy to 5 year olds? Show 7 or 8 year olds playing with it, even if they'd never actually choose the toy in real life.
It concerned extant liquid water flows on Mars. I checked again and the tool actually correctly summarised a conference paper [1]. The authors changed the wording in the abstract from "confirms" to "strongly supports" in the actual paper [2]. So the mistake the AI made here was in selecting the (obscure) conference paper with 7 citations over the actual paper with over 400.
We now know though that the perchlorate detection may have been an instrument error, and satellite imagery constrains the water content of the RSL to below what would be expected from brine flows. It's not conclusive though, and there is no consensus on whether the RSL are caused by liquid or dry processes or some combination of both.
I’ve learned to discard those as being either totally fake and made up (the excuse would be something like we used a template to get started quickly and forgot to change the logos), or that someone (probably an intern) signed up to the service with @company email and they just splat the logo as an official endorsement.
As someone who also tried (and stopped trying) to use customer logos, I also wonder whenever I see this pattern. So many startups do this, and yet I know how difficult it is to get official permission and how using a name without permission can lead to serious consequences.
Do startups just roll with it and use the logos without permission?
Testing Elicit gave me considerably worse results than using PaperQA by FutureHouse. While PaperQA could understand a bit of the nuance of a scientific query, Elicit did not.
Too bad the internal paperqa system at scihouse isn't available for public use...
To me it is reminiscent of the advent of scholarly databases. The main effect I saw is that researchers started using exclusively those databases, sometimes publisher-specific databases (so they were citing only from one publisher!) and were missing all the papers that were not indexed there, in particular a big chunk of the older literature that wasn't yet OCRed (it is better now, but still not fabulous). This led to so many "we discovered a new X" papers that the older people in the crowd at conferences were always debunking them: "that was known since at least the 60s".

While these AI tools can clearly help with initial discovery around a subject, it worries me that they will reduce searching in other databases, or digging into paper references. It is often enlightening to unravel references and go back in time to realize that all recent papers were basing their work on a false or misunderstood premise. Not to mention the cases where the citation was copied from another paper and either doesn't exist or had nothing to do with the subject. There was a super interesting article about the "mutations" of citations and how you could, by using tools similar to genetic analysis, generate an evolutionary tree of who copied from whom and introduced slight errors that would get reproduced by the next one.
Yes, but even the best scientists aren't born with knowledge of what came before. It has to be discovered, and where the discovery process is broken it needs to be fixed. On the individual level, "spend hours chasing rumors about the perfect paper that lives in the stacks, find out the physical stacks are on a different continent, and then sit down and struggle through a pile of shitty scans that are more JPEG artifact than text" makes sense because it's out of scope for a single PhD to fix the academic world, but on the institutional level the answer that scales isn't to berate grad students for failing to struggle enough with broken discovery/summarization tools, it's to fix the tools. Make better scans, fix the OCR, un-f** the business model that creates a thousand small un-searchable silos of papers -- these things need to be done.
It seems like it should hallucinate less, as it directly quotes, but nope, it still hallucinates just as much and then gives a quote that directly contradicts its statement.
Accuracy and supportedness of the claims made in Elicit are two of the most central things we focus on—it's a shame it didn't work as well as we'd like in this case.
I'd appreciate knowing more about the specifics so we can understand and improve
> A good rule of thumb is to assume that around 90% of the information you see in Elicit is accurate. While we do our best to increase accuracy without skyrocketing costs, it’s very important for you to check the work in Elicit closely. We try to make this easier for you by identifying all of the sources for information generated with language models.
A 90% accuracy rate seems like the sweet spot between "an annoying waste of time" for honest researchers and "good enough to publish" for dishonest careerists.
I don't like disparaging the technology experts who work on these things. But as a business matter, 1/10 answers being wrong just is not good enough for a whole lot of people.
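To put a rough number on that, Elicit's own 90% rule of thumb compounds quickly across a results table. A back-of-the-envelope sketch (the per-claim figure is theirs; the table sizes are arbitrary, chosen only for illustration):

```python
# Illustrative arithmetic only: how a 10% per-claim error rate compounds.
# The 90% figure is Elicit's stated rule of thumb; table sizes are made up.
per_claim_accuracy = 0.9

for n_claims in (5, 10, 30, 100):
    p_all_correct = per_claim_accuracy ** n_claims
    expected_errors = n_claims * (1 - per_claim_accuracy)
    print(f"{n_claims:>3} claims: P(no errors) = {p_all_correct:.1%}, "
          f"expected errors = {expected_errors:.1f}")
```

Even a modest 30-row literature-review table has only about a 4% chance of being error-free, which is why "check the work closely" is doing a lot of heavy lifting in that disclaimer.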
I don't think the number is as important as the question of how would someone be expected to magically know which 10% is wrong and needs to be corrected?
This is a good point! (Hopefully) obviously, if we knew a particular claim was fishy, we wouldn't make it in the app in the first place.
However, we do do a couple of things which go towards addressing your concern:
1. We can be more or less confident in the answers we're giving in the app, and if that confidence dips below a threshold we mark that particular cell in the results table with a red warning icon which encourages caution and user verification (a rough sketch of this kind of gating follows below). This confidence level isn't perfectly calibrated, of course, but we are trying to engender a healthy, active wariness in our users so that they don't take Elicit results as gospel.
2. We provide sources for all of the claims made in the app. You can see these by clicking on any cell in the results table. We encourage users to check—or at least spot-check—the results which they are updating on. This verification is generally much faster than doing the generation of the answer in the first place.
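For the curious, the gating in point 1 amounts to something like the sketch below. This is a minimal hypothetical illustration based only on the description above: the field names and the 0.7 cutoff are invented, not Elicit's actual code.

```python
from dataclasses import dataclass

# Hypothetical structures with invented names; not Elicit's internal API.
@dataclass
class Cell:
    claim: str
    sources: list[str]       # links the user can click through to verify
    confidence: float        # model-reported confidence in [0, 1]

WARNING_THRESHOLD = 0.7      # arbitrary cutoff, purely for illustration

def needs_warning(cell: Cell) -> bool:
    """Flag cells whose confidence dips below the threshold."""
    return cell.confidence < WARNING_THRESHOLD

cells = [
    Cell("Hypothesis X is strongly supported", ["paper A"], 0.55),
    Cell("Sample size was n = 120", ["paper B"], 0.92),
]
for cell in cells:
    marker = "warn: verify" if needs_warning(cell) else "ok"
    print(f"[{marker}] {cell.claim} ({len(cell.sources)} source(s))")
```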
This is true but if the error rate were 1/1000 I could see the risk management argument for using this thing. 1/100 is pushing it. 1/10 seems unconscionably reckless and lazy.
> if it takes 1 hour to get one answer by hand, but only 20 minutes for the machine, and 20 minutes to check the answer, the user still comes out ahead
Those numbers are arbitrary and fictional, and the more relevant made-up quantity would be the variance rather than the mean. It doesn't really matter if the "average user" saves time over 10,000 queries. I am much more concerned about the numerous edge cases, especially if those cases might be "edge fields" like animal cognition (see below).
In my experience it takes quite a bit longer to falsify GPT-4's incorrect answers than it does to do a Google search and get the right answer. It might take 30 seconds to check a correct answer (jump to the relevant paragraph and check), but 30 minutes to determine where an incorrect answer actually went wrong (you have to read the whole paper in close detail, and maybe even relevant citations). More specifically, it is somewhat quick to falsify something if it is directly contradicted by the text. It is much harder to falsify unsupported generalizations or summaries.
As a specific example, I recently asked GPT for information on arithmetic abilities in amphibians. It made up a study - that was easy to check - but it also made up a bunch of results without citing specific studies. That was not easy to check[1]: each paragraph of text GPT generated needed to be cross-checked with Google Scholar to try and find a relevant paper. It turned out that everything GPT said, over 1000 words of output, was contradicted by actual published research. But I had to read three papers to figure that out. I would have been much better off with Google Scholar. But I am concerned that a large minority of cynical, lazy people will say "90% is good enough, I don't want to read all these papers and nobody's gonna check the citations anyway" and further drag down the reliability of published research.
[1] This was a test of GPT. If I were actually using it for work, obviously I would have stopped at the fake citation.
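Putting the thread's hypothetical numbers together (20 minutes of machine time from the parent, ~30 seconds to confirm a correct answer, ~30 minutes to run down an incorrect one, and the 90% rule of thumb), a quick sketch of the mean-versus-tail point. All figures are the hypotheticals already quoted, not measurements:

```python
# All figures are hypotheticals taken from this thread, in minutes.
machine_time = 20        # parent comment: the tool answers in 20 minutes
check_correct = 0.5      # ~30 s to confirm an answer that is right
check_incorrect = 30     # ~30 min to work out where a wrong answer went wrong
accuracy = 0.9           # Elicit's stated rule of thumb

mean_time = machine_time + accuracy * check_correct + (1 - accuracy) * check_incorrect
bad_case = machine_time + check_incorrect   # and you still have no usable answer

print(f"average per query: ~{mean_time:.1f} min")
print(f"1-in-10 bad case : ~{bad_case:.0f} min, ending with nothing usable")
```

The average still beats the one-hour-by-hand figure from the parent, which is exactly why the variance rather than the mean is the interesting number: roughly one query in ten costs about fifty minutes and leaves you without a correct answer.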
I am not sure what you mean by this comment. I took the language from the developers. If you mean that commercial AI providers should give more specific information then I agree wholeheartedly.
I assume it is difficult for Elicit to give specific numbers because they lack the data, and confabulations are highly dependent on what research area you are asking about. So the "rule of thumb" is a way of flattening this complexity into a usage guideline.
When applying AI to education, it will be easier to succeed in basic education than in AI research (one of the hardest fields in the world).
I think it's a little early to bring AI to research fields, which need accuracy and rigor.
We're glad you're enjoying it.
[1] https://meetingorganizer.copernicus.org/EPSC2015/EPSC2015-83...
[2] https://www.nature.com/articles/ngeo2546
How did you get all those blue-chip orgs to give you permission to use their names? None of our blue-chip clients allow us to do so.
I would start with Hanlon’s Razor, with a 10% chance of malice.
edit: various typos
Oh no
That's basically the same as the percentage of people who actually read news stories before responding to or sharing the headline.
There is also https://www.researchgate.net/publication/323202394_Opinion_M...
and there was yet another one but I can't find it
Actual quote from the abstract: “ No tryptamines were detected in the basidiospores, and only psilocin was present at 0.47 wt.% in the mycelium.”
It does not differentiate between psilocin and psilocybin; those are two different molecules.
IF the machine actually got it right.