Surveys are incredibly important for user and market research, but are expensive and take months to design, run, and analyze. By simulating responses, our users can get results in seconds and make decisions faster. See https://roundtable.ai/showcase for a bunch of examples, and https://www.loom.com/share/eb6fb27acebe48839dd561cf1546f131 for a demo video.
Our product lets you add questions (e.g. “how old are you”) and conditions (e.g. “is a Hacker News user”) and then see how these affect the survey results. For example, for the survey question “Are you interested in buying an e-bike?”, 28% answer ‘yes’ [1]. But if you narrow it down to people who own a Tesla, ‘yes’ jumps to 52% [2]. Another example: if you survey “where did you learn to code”, conditioning on “how old are you?” makes a dramatic difference: 55% of people “45 or older” answer “books” [3], but 76% of people “younger than 45” answer “online” [4]. One more: 5% of people answer “legroom” to the question “Which of the following factors is most important for choosing which airline to fly?” [5], and this jumps to 20% when you condition on people over six feet tall [6].
You wouldn’t think (well, we didn’t think) that such simulated surveys would work very well, but empirically they work a lot better than expected—we have run many surveys in the wild to validate Roundtable's results (e.g. comparing age demographics to U.S. Census data). We’re still trying to figure out why. We believe that LLMs that are pre-trained on the public Internet have internalized a lot of information/correlations about communities (e.g. Tesla drivers, Hacker News, etc.) and can reasonably approximate their behavior. In any case, researchers are seeing the same things that we are. A nice paper by a BYU group [7] discusses extracting sub-population information from GPT/LLMs. A related paper from Microsoft [8] shows how GPT can simulate different human behaviors. It’s an active research topic, and we hope we can get a sense of the theoretical basis relatively soon.
Because these models are primarily trained on Internet data, they start out skewed towards the demographics of heavy Internet users (e.g., high-income, male). We addressed this by fine-tuning GPT on the GSS (General Social Survey [9] - the gold standard of demographic surveys in the US) so our models emulate a more representative U.S. population.
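For intuition, here is a minimal sketch of what building that kind of fine-tuning data could look like. The field names, records, and prompt format below are illustrative only, not our actual pipeline or schema:

    import json

    # Illustrative GSS-style records: respondent attributes plus a question and answer.
    # These field names and values are made up for the example, not our real schema.
    gss_rows = [
        {"year": 2021, "age": 34, "sex": "female", "region": "South",
         "question": "Do you favor or oppose gun permits?", "answer": "Favor"},
        {"year": 2021, "age": 61, "sex": "male", "region": "Midwest",
         "question": "Do you favor or oppose gun permits?", "answer": "Oppose"},
    ]

    def to_finetune_example(row):
        # One prompt/completion pair per respondent-question pair, so the model learns
        # answer distributions conditioned on demographics and year rather than on
        # whoever happens to post the most online.
        prompt = (
            f"Year: {row['year']}\n"
            f"Respondent: age {row['age']}, {row['sex']}, {row['region']}\n"
            f"Question: {row['question']}\n"
            f"Answer:"
        )
        return {"prompt": prompt, "completion": " " + row["answer"]}

    with open("gss_finetune.jsonl", "w") as f:
        for row in gss_rows:
            f.write(json.dumps(to_finetune_example(row)) + "\n")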
We’ve built a transparency feature that shows how similar your survey question is to the training data and thus gives a confidence metric for our accuracy. If you click ‘Investigate Results’, we report the most similar (in terms of cosine distance between LLM embeddings) GSS questions as a way of estimating how much extrapolation / interpolation is going on. This doesn’t quite address the accuracy of the subpopulations / conditioning questions (we are working on this), but we thought we were at a sufficiently advanced point to share what we’ve built with you all.
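Under the hood, that similarity lookup is conceptually simple; here is a rough sketch. The embedding vectors and GSS questions below are placeholders, not our production embedding model or data:

    import numpy as np

    # Placeholder embeddings; in practice these come from an LLM embedding model.
    rng = np.random.default_rng(0)
    gss_questions = [
        "Do you favor or oppose gun permits?",
        "How often do you use the internet?",
        "How old were you when you first married?",
    ]
    gss_embeddings = rng.normal(size=(len(gss_questions), 384))
    target_embedding = rng.normal(size=384)

    def cosine_similarity(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Rank GSS questions by similarity to the target survey question; low scores
    # suggest the model is extrapolating rather than interpolating.
    scores = [cosine_similarity(target_embedding, e) for e in gss_embeddings]
    for question, score in sorted(zip(gss_questions, scores), key=lambda t: -t[1]):
        print(f"{score:.3f}  {question}")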
We're graduating PhD students from Princeton University in cognitive science and AI. We ran a ton of surveys and behavioral experiments and were often frustrated with the pipeline. We were looking to leave academia and saw an opportunity to make the survey pipeline better. User and market research is a big market, and many of the tools and methods the industry uses are clunky and slow. Mayank’s PhD work used large datasets and ML to develop interpretable scientific theories, and Matt’s PhD work developed complex experimental software to study coordinated group decision-making. We see Roundtable as operating at the intersection of our interests.
We charge per survey. We are targeting small and mid-market businesses who have market research teams, and ask for a minimum subscription amount. Pricing is at the bottom of our home page.
We are still in the early stages of building this product, and we’d love for you all to play around with the demo and provide us feedback. Let us know whatever you see - this is our first major endeavor into the private sector from academia, and we’re eager to hear whatever you have to say!
[1] https://roundtable.ai/sandbox/e02e92a9ad20fdd517182788f4ae7e...
[2] https://roundtable.ai/sandbox/6b4bf8740ad1945b08c0bf584c84c1...
[3] https://roundtable.ai/sandbox/d701556248385d05ce5d26ce7fc776...
[4] https://roundtable.ai/sandbox/8bd80babad042cf60d500ca28c40f7...
[5] https://roundtable.ai/sandbox/0450d499048c089894c34fba514db4...
[6] https://roundtable.ai/sandbox/eeafc6de644632af303896ec19feb6...
[7] https://arxiv.org/abs/2209.06899
If a researcher comes out and says, “Surveys show that people want X, and they do not like Y,” and then others ask the researcher if they surveyed people, the answer would be “no.”
Fundamentally, people wanting feedback from humans will not get that by using your product.
The best you can say is this: “Our product is guessing people will say X.”
Out of One, Many: Using Language Models to Simulate Human Samples (https://arxiv.org/abs/2209.06899)
There's been some research in this vein. To answer your question: seemingly, it's very valid.
Internal purposes include stuff like optimally rewording questions and getting priors.
A hybrid approach would be something like: hey, let's not ask someone 100 questions, because we can accurately predict 80% of them. Let's just ask them the 20 hard-to-estimate questions.
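A rough sketch of what that split could look like, assuming the model exposes a per-question confidence score (the questions, scores, and threshold below are all made up):

    # Hypothetical per-question confidence scores in [0, 1] from the simulation model.
    questions = [
        {"text": "How old are you?", "confidence": 0.95},
        {"text": "Would you switch banks for a 0.5% better savings rate?", "confidence": 0.40},
        {"text": "Do you own a car?", "confidence": 0.90},
    ]

    THRESHOLD = 0.75  # made-up cutoff

    # Simulate the easy-to-estimate questions, send the rest to real respondents.
    simulate = [q["text"] for q in questions if q["confidence"] >= THRESHOLD]
    ask_humans = [q["text"] for q in questions if q["confidence"] < THRESHOLD]

    print("Simulate:", simulate)
    print("Ask real respondents:", ask_humans)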
This kind of concerns me, because you could use this to bias surveys in different directions. This obviously already happens, so maybe it's just part of the status quo.
I suspect people would use this product as a quick gut check to decide if it is warranted to spend the time and money on a full scale quant study.
This is like a 10/10.
[1] https://www.youtube.com/watch?v=G0ZZJXw4MTA
I see the problem as this: although you can create lots of examples that are correct / follow real-world opinions, you can never prove that a particular question's result is correct / follows real-world opinion. I'm not sure who would trust the output enough to rely on it for decision making.
On a more personal note, while all of the AI advances have been very interesting, I worry that AI will reduce human connection, and a product like this sure seems to do that. You are telling users that they don't need to talk to real people, and can just get feedback from a model instead.
Edit: for example, here's your dataset by race: https://imgur.com/a/134epoN
I asked, "Which race is most likely to commit a crime?": https://imgur.com/a/4QJZo2O
2. We added the transparency feature (click on 'Investigate Results') that shows how in- vs. out-of-distribution the target question is. For out-of-distribution questions, we suggest people run traditional surveys.
More broadly, I think your point is really interesting when it comes to qualitative data. That is one reason we haven't generated qualitative survey data, but a lot of potential customers have already started to ask for it.
----
[a] https://roundtable.ai/sandbox/baa3d5f25236b91f1608c9f606b315...
[b] https://roundtable.ai/sandbox/7a9ee27872eb29087be2386ccd19f7...
We definitely need to think about how to handle your question so that it's clear where survey data converges with / diverges from reality.
What metric(s) are you using to measure bias in general, and what do those metric(s) look like before and after your tuning?
Speaking as a potential user, my biggest hang-up is trust. How can I trust that Roundtable’s results are accurate and not the result of hallucination?
One of the powerful things about data is that they surprise you. This is why data integrity is so important (“crap in, crap out” as the old adage goes). But if I get a surprising result from Roundtable, how can I verify it? I think you two are already thinking about this and building features to address it.
I’m also wondering if trying to verify a surprising result from Roundtable is the wrong response… Why would an LLM give me that answer? There may be something useful to understand about why the LLM is “hallucinating.” In terms of features, it may be interesting to see whether Roundtable’s LLM could explain its answer.
The UX could be like having a brilliant but inscrutable research assistant…
LLMs model a static distribution, whereas consumer preferences change over time to the point that companies regularly run the same survey at different points in time. At my old fund we would run the same surveys every month to track changes on various companies. How do you counteract this time effect? Presumably a lot of your training data is from the past.
To give one example from your summary: the demographics of Tesla owners have changed significantly over time, from a pure luxury, avant-garde market to a much more mass-market one. So info about Tesla from 5 years ago is not that useful.
The data we trained on includes the year, so we can specify the year in which the question is asked (the default is 2023). You can also see how answers change over time: [1] shows how the distribution for "Do you support the President?" changes from 2000 to 2023 (see the 9/11 spike, end of the Bush era, the Obama era, the Trump era, etc.).
[1] https://roundtable.ai/sandbox/2dd4e9d32c24e9abff01810695e948...
I’d also be interested in how much you think your platform is just capturing, say, reported surveys/data. Presidential polling is something that must be all over LLM datasets - isn’t that just replicating the training data?
I think you could do a better job of showing the following on your website: here are some unusual survey results we generated from the model (i.e. stuff definitely not in the training data), and here’s the data we actually got when we ran that survey for real.
One related pain point I have seen many times with surveys is that the people writing them don't know what they're doing and get bad data as a result of biased questions.
Could be cool to add functionality down the line to help people craft better questions. For example, your app could provide alternate ways of phrasing questions and then simulate how results would differ based on the wording.
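Something like the sketch below, where simulate_survey is a hypothetical stand-in for whatever simulation API the product exposes (the stub just returns random shares so the example runs end to end):

    import random

    def simulate_survey(question, options):
        # Hypothetical stand-in for the real simulation API; returns a fake random
        # distribution here purely so the sketch runs.
        weights = [random.random() for _ in options]
        total = sum(weights)
        return {o: round(w / total, 2) for o, w in zip(options, weights)}

    phrasings = [
        "How satisfied are you with your current bank?",
        "How happy are you with your current bank?",
        "Overall, how would you rate your current bank?",
    ]
    options = ["Very satisfied", "Somewhat satisfied", "Not satisfied"]

    # Compare simulated distributions across wordings to flag phrasing effects.
    for phrasing in phrasings:
        print(phrasing, simulate_survey(phrasing, options))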
Excited to see where this goes! Going to share with my partner who works for a survey software company and see what she thinks.
Thank you for the kind words / reference
The answer seems plausible with that interpretation.
https://docs.google.com/spreadsheets/d/1YtvcLkC-xaTw3q6LOxCq...
The average delta between actual selected response % and simulated % across 11 questions was 7%. Seems like a good start - it would make it useful for certain low-impact, high-speed business decisions.
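(For clarity, "average delta" here is just the mean absolute difference between actual and simulated shares. The numbers below are made up to illustrate the calculation, not the values from the linked spreadsheet.)

    # Made-up actual vs. simulated response shares (percent) for 11 questions,
    # purely to illustrate the calculation; see the linked sheet for the real data.
    actual    = [28, 52, 55, 76, 5, 20, 33, 41, 62, 18, 47]
    simulated = [33, 45, 63, 70, 14, 27, 30, 49, 55, 28, 40]

    deltas = [abs(a - s) for a, s in zip(actual, simulated)]
    print(f"Average delta: {sum(deltas) / len(deltas):.1f}%")  # -> 7.0%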
It's not surprising that LLMs can predict the answers to survey questions, but really good primary research generates surprising insights that are outside of existing distributions. Have you found that businesses trust your results? I have found that most businesses don't trust survey research much at all, and this seems like it might be even less reliable.
-----
Context: I co-founded & sold a survey software company (YC W20).
Trust is one of the biggest issues we're trying to solve. This motivated the t-SNE plots and similarity scores under 'Investigate Results', but we definitely have a long way to go. Generally speaking, survey practitioners trust us more than their clients do (perhaps not surprising).
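For context, the t-SNE view is conceptually just a 2-D projection of the question embeddings. A minimal sketch with random placeholder vectors (not our real embeddings):

    import numpy as np
    from sklearn.manifold import TSNE

    # Placeholder embeddings for GSS training questions plus one target question.
    rng = np.random.default_rng(0)
    gss_embeddings = rng.normal(size=(100, 384))
    target_embedding = rng.normal(size=(1, 384))
    all_embeddings = np.vstack([gss_embeddings, target_embedding])

    # Project to 2-D; the target question's position relative to the GSS cloud
    # gives a visual sense of how in- vs. out-of-distribution it is.
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(all_embeddings)
    gss_xy, target_xy = coords[:-1], coords[-1]
    print("Target question 2-D position:", target_xy)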
https://news.ycombinator.com/item?id=36868552