tmm84 · 2 years ago
I felt most of this was just plain common sense. People read things by headings and subheadings. People look for relevant documentation for product X under product X, not Y. Q&A and code samples prime the LLM for what most developers like myself hunt for (a quick answer or a simple code snippet).

The forum part got me, seeing as forums tend to vary in the quality and quantity of their info. If the author(s) suggest forums, why not Discord servers and Gitter chat as well? I know of several projects where the real documentation, examples, and help are locked up in the Discord/Gitter channels. Also, in the same vein, why not GitHub PRs/issues? Having the LLM diagnose when an issue was cleared up, migration strategies, etc. from GitHub PRs/issues (as I've had to do from time to time) would be great too.

Of course, GitHub/Discord/Gitter would require some kind of filtering to make sure the data is worth ingesting into the LLM, but if it can identify what's worth ingesting, then perhaps it could also suggest to the documentation team something worth documenting.
emil_sorensen · 2 years ago
You nailed it. All of the above sources are also super helpful for LLMs, but you have to be careful about how you ingest/parse them. For a Discourse forum, for example, only including questions that have been marked "Resolved" by an official team member can work quite well. The same goes for Discord/Slack forums, GitHub Discussions, etc.
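The "only ingest resolved threads" filter above can be sketched in a few lines. This is a minimal sketch, not any particular forum's API: the thread fields (`resolved`, `resolved_by_role`, `question`, `accepted_answer`) are hypothetical names standing in for whatever your forum export actually provides.

```python
# Sketch: keep only forum threads resolved by a team member before
# ingesting them into an LLM corpus. Field names are hypothetical;
# adapt them to your forum's export format.

def ingestible_threads(threads):
    """Yield (question, answer) pairs worth ingesting."""
    for t in threads:
        if t.get("resolved") and t.get("resolved_by_role") == "team":
            yield t["question"], t["accepted_answer"]

threads = [
    {"resolved": True, "resolved_by_role": "team",
     "question": "How do I rotate API keys?",
     "accepted_answer": "Use the /keys/rotate endpoint."},
    {"resolved": False, "question": "Any ETA on v2?"},
]
pairs = list(ingestible_threads(threads))
print(pairs)  # only the resolved, team-answered thread survives
```

The same shape works for Discord/Slack if you treat a team member's final message in a thread as the "accepted answer."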
nerpderp82 · 2 years ago
A lot of innovation is common sense because oddly enough, common sense is not widely distributed.
islandert · 2 years ago
I love that writing LLM-friendly docs is just... writing good docs. There's a ton of overlap between accessibility work and preparing things to be used by LLMs.

I wonder if an unintended side effect of this AI hype cycle is a huge investment in more accessible applications.

emil_sorensen · 2 years ago
Exactly! It's very much garbage-in-garbage-out.

One surprising (to me at least) benefit of hooking up an LLM to your docs is that it is actually a really useful way to find gaps in your docs. For example, when an LLM cannot answer a user question, there's a good chance it's because the answer is not documented anywhere.
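The gap-finding loop described here can be sketched as a simple log analysis. The refusal-detection below is a naive keyword heuristic I'm assuming for illustration; a real setup would use LLM evals, as mentioned elsewhere in the thread.

```python
# Sketch: treat "I don't know" answers from a docs chatbot as a
# signal of documentation gaps. The refusal check is a naive
# keyword heuristic, not a real eval.
from collections import Counter

REFUSAL_MARKERS = ("i don't know", "not documented", "cannot find")

def collect_doc_gaps(qa_log):
    """Count user questions the bot failed to answer."""
    gaps = Counter()
    for question, answer in qa_log:
        if any(m in answer.lower() for m in REFUSAL_MARKERS):
            gaps[question] += 1
    return gaps

log = [
    ("How do I export to CSV?", "I don't know, sorry."),
    ("How do I export to CSV?", "That is not documented anywhere."),
    ("What is the rate limit?", "100 requests/minute."),
]
print(collect_doc_gaps(log).most_common(1))
```

Sorting the counter gives the docs team a ranked backlog of missing pages.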

thfuran · 2 years ago
What about when it answers with confabulation?
thwarted · 2 years ago
Just like how (it used to be) that writing good content would rank high in search results. But writing good content is hard. So we'll kid ourselves into writing good content for LLMs. Which is equally hard, but we'll feel like we're getting a leg up over everyone else -- who are all also doing the same thing.

It's unfortunate that people are more motivated to write for LLMs that are then used by humans than to write for humans to begin with. Especially when the reason to use LLMs is that, on average, content is subpar, making it difficult to find the good content.

Another case of Tragedy of the Commons Ruins Everything Around Me.

emil_sorensen · 2 years ago
Fair point. Although good writing for humans = good writing for LLMs, and vice versa. So I'm hopeful that, if anything, this new excitement around AI for docs will encourage folks to put even more effort into writing great docs.
ben_w · 2 years ago
My current working hypothesis is that the way to get the best out of an LLM (and any AI which uses them as the human interface layer) is the same way to get the best out of a human — because it's trained on humans interacting with other humans.

If you yell and swear at the chatbot, you'll get the response most similar to how a human would respond to yelling and swearing. I know the stereotype about drill instructors, but does that even work for marines, or is it just an exercise in learning to cope with stress?

ben_w · 2 years ago
I wonder how many of these groups went the opposite direction — creating the structure of those web pages by using an LLM?

I've (obviously, like almost everyone) experimented with creating stuff with ChatGPT, and… hmm. I was going to write "it made web pages like that", but: Clever Hans. I don't know if I might have subconsciously primed it to, because that's also something I like.

da39a3ee · 2 years ago
Presumably all the structure and section headings that they recommend don't have to be rendered by a browser as visible to humans. The LLMs should be smart enough to understand HTML directives that don't add a lot of unnecessary visual structure.
emil_sorensen · 2 years ago
They don't have to be rendered in the browser, but having all of the structure and section headings helps humans too, so I would recommend it. :)
da39a3ee · 2 years ago
It can go too far. Too many section headings and it becomes unreadable, like an undergraduate textbook where you're constantly being distracted by sections and boxes.
ChrisMarshallNY · 2 years ago
This is useful stuff, but it is … familiar.
Havoc · 2 years ago
Isn’t there enough markdown floating around that the LLM can deduce structure from the three levels of headings?
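The heading-based structure being asked about here is straightforward to recover mechanically. A minimal sketch, assuming plain ATX-style markdown headings (`#`, `##`, `###`):

```python
import re

# Sketch: recover a section outline from raw markdown by heading
# level -- the kind of structure an LLM (or a chunker) can lean on.
def outline(markdown_text):
    """Return (level, title) pairs for ATX headings up to level 3."""
    pattern = re.compile(r"^(#{1,3})\s+(.*)$", re.MULTILINE)
    return [(len(h), t.strip()) for h, t in pattern.findall(markdown_text)]

doc = """# Install
## From pip
### Troubleshooting
"""
print(outline(doc))  # [(1, 'Install'), (2, 'From pip'), (3, 'Troubleshooting')]
```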
zerop · 2 years ago
Can I ask: for a RAG-based system using an LLM, would it be more effective to structure the source documents this way?
emil_sorensen · 2 years ago
Yes, definitely!
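One reason heading structure helps RAG in particular: it gives you natural chunk boundaries, so each retrieved passage carries its section title as context. A minimal sketch of such a splitter (real pipelines would also cap chunk length by tokens, which is omitted here):

```python
import re

# Sketch: split a markdown doc into heading-scoped chunks for a
# RAG index. Naive splitter for illustration only.
def chunk_by_headings(markdown_text):
    chunks, title, body = [], "Introduction", []
    for line in markdown_text.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)$", line)
        if m:
            if body:  # close out the previous section
                chunks.append((title, "\n".join(body).strip()))
            title, body = m.group(1).strip(), []
        else:
            body.append(line)
    if body:
        chunks.append((title, "\n".join(body).strip()))
    return chunks

doc = "## Auth\nUse an API key.\n## Limits\n100 req/min."
print(chunk_by_headings(doc))
# [('Auth', 'Use an API key.'), ('Limits', '100 req/min.')]
```

Embedding the `(title, body)` pair together, rather than the body alone, tends to make retrieval queries like "auth setup" land on the right chunk.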
emil_sorensen · 2 years ago
Author here. Happy to answer any questions!
gwern · 2 years ago
How do you measure any of these improvements?
emil_sorensen · 2 years ago
Extensive LLM evals. It's a real rabbit hole.
tschellenbach · 2 years ago
What are the competing solutions for connecting github, docs to LLMs? Who are your main competitors?