I agree with you on open source in the original, home tinkerer sense.
If your focus is narrow enough, vanilla GPT can still provide good enough results. We narrow the scope for GPT and ask it to answer binary questions, and with that we get good results.
Your approach is better for supporting broader questions. We support those as well, and there the results aren't as good.
Specific failure modes can be something as simple as extracting beneficiary information from a trust document. Sometimes it works, but a lot of the time it doesn't, even with startups whose AI products are built specifically for extracting information from documents. For example, the model will produce an incomplete list of beneficiaries, or, if there are contingent beneficiaries, it won't know what to do. And that's not even a hard question about the contingency, just a simple list with percentages: if no one dies, what is the distribution?
Furthermore, trying to get an AI to describe the contingency is a crapshoot.
While I expect these options to get better and better, I have fun trying them out and seeing what basic thing will break. :)
If the example is representative, I see two problems: a simple extraction of information that is laid out in plain sight (the list of beneficiaries), and reasoning to interpret the section on contingent beneficiaries and connect it to facts from other parts of the document. Is that correct?
If that's the case, then Hotseat is miles ahead when it comes to analyzing regulations (from the civil law tradition, which is different from the US), and dealing with the categories of problems you mentioned.
In terms of contract review, what I've found is that GPT is better at analyzing a document than generating one, which is what this paper supports. However, I have used several startups' AI document review tools, and they all fall apart under any sort of prodding for specific answers. This paper looks like the model just had to locate the relevant section, not sustain the back-and-forth conversation about the contract that a lawyer and client would have.
There is also no legal liability for GPT giving the wrong answer. So it works well for someone smart who is doing their own research, just as a smart person could use Google before to do their own research.
My feeling on contract generation is that for the majority of cases, people would be better served if better boilerplate contracts were simply available. Lawyers hoard their contracts, and on our journey it was very difficult to find lawyers willing to write contracts we would turn into templates, because they would essentially be putting themselves and their professional community out of future income streams. But people don't need a unique contract generated on the fly by GPT every time, when a template of a well-written, well-reviewed contract does just fine. It cost hundreds of millions to train GPT-4. If $10m were spent building a repository of well-reviewed contracts instead, it would be more useful than spending the equivalent money training a GPT to generate them.
People ask a pretty wide range of questions about what they want to do with their documents, and GPT didn't do a great job with them, so for the near future it looks like lawyers still have a job.
I'm working on Hotseat - a legal Q&A service where we put regulations in a hot seat and let people ask sophisticated questions. My experience aligns with your comment that vanilla GPT often performs poorly when answering questions about documents. However, if you combine focused effort on squeezing out GPT's performance with careful product design, you can go pretty far.
I wonder if you have written about the specific failure modes you've seen in answering questions from documents? I'd love to check whether Hotseat handles them well.
If you're curious, I've written about some of the design choices we've made on our way to creating a compelling product experience: https://gkk.dev/posts/the-anatomy-of-hotseats-ai/
(on a larger point of the AI Act leaving much to be desired, I agree)
I asked during the Product Hunt launch and am still waiting for an answer. There should be an option to provide an e-mail address and get notified.
Re email: the submission form has a second step where you can opt-in to leave your email address to get notified. Did it not show for you?
One of the most non-obvious discoveries we made was that for such long documents, turning the document into Markdown (with marked headings), as opposed to plain text, made a night-and-day difference in the LLM's reasoning performance. I have my guesses as to why this could be the case, but I'm curious to hear your hypothesis and whether you've seen similar effects in the wild?
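For concreteness, a minimal sketch of the kind of plain-text-to-Markdown preprocessing described above. The heuristic (short all-caps lines are headings) is an assumption for illustration only; real pipelines need format-aware heading detection.

```python
# Hypothetical sketch: promote likely section headings to Markdown
# ATX headings before sending the document to an LLM.
# The all-caps heuristic is an illustrative assumption, not Hotseat's method.

def to_markdown(plain_text: str) -> str:
    out = []
    for line in plain_text.splitlines():
        stripped = line.strip()
        # Treat short, all-caps lines with at least one letter as headings.
        if (stripped
                and stripped == stripped.upper()
                and len(stripped.split()) <= 8
                and any(c.isalpha() for c in stripped)):
            out.append(f"## {stripped}")
        else:
            out.append(line)
    return "\n".join(out)
```

One plausible reason this helps: explicit `##` markers give the model unambiguous section boundaries to anchor its reasoning to, instead of having to infer structure from whitespace.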
The product analyzes customer interview transcripts to catch when founders slip into "pitch mode" instead of learning. It's based on principles from The Mom Test book - essentially a digital coach that flags your mistakes and gives you personalized advice on how to do better.
Why this project for our trial:
- Real problem we'd witnessed (founders talking too much in user interviews)
- Tight scope but production-grade requirement
- Chance to push AI-accelerated development to its limits
Tech: Next.js 15, Supabase, Trigger.dev, GPT-4.1 via Vercel AI SDK. We used Cursor, Claude Code, V0, and (briefly) Grok for development.
Key learning: AI development requires adopting new working patterns. You can think of AI as a chaotic software engineering intern. You need to be highly intentional in guiding the AI to do the right thing. Just like with human teams, bad managers get bad output from their people, and the same applies to managing AI.
If you're an experienced software engineer, you have a lot of implicit assumptions about how to build software, how to rate the importance of tasks, etc. You need to transfer these to the AI, and we think we found early patterns for how to do this well.
For example, we used the "walking skeleton" and "tracer bullet" concepts to structure the project planning we did with AI. We found that the basic pattern of think, research, brainstorm, and plan before writing any code dramatically improves the quality of AI coding as the project gets more complex. E.g., we'd plan error handling with the AI first, save it as a doc, then use that doc as context for implementation; this kept the AI consistent across the codebase.
We shared details of this approach at Warsaw AI Tinkerers (over 200 attendees) a couple of weeks ago.
The co-founder trial worked: we built a working mini-product in 5 weeks, figured out how we each approach this alien technology that is modern AI, and uncovered many interesting personal quirks in each other (everyone has them).
You can check out Unpitched at https://unpitched.app. Sadly, we require sign-up, as the underlying LLM calls are a little expensive.
We wrote more about how we approached the co-founder trial process at https://unpitched.app/about. Let us know if you have any questions about our trial, share your own stories of looking for co-founders, or send us any feedback on the app!
PS. Shoutout to the Circleback team (YC W24), the only note-taking app we found with working webhooks that we could integrate with Unpitched.
-- gkk & ykka