Jetwu commented on Show HN: Evaluating LLMs on creative writing via reader usage, not benchmarks   narrator.sh/... · Posted by u/Jetwu
skyzouwdev · 12 days ago
Really like the shift from synthetic benchmarks to actual reader engagement — feels way more aligned with what “good writing” actually means. Curious if you’ve noticed certain models consistently improving more with feedback than others.
Jetwu · 12 days ago
Thanks! Anecdotally, I'd say Claude 3.7 tends to improve the most, but it seems (via the leaderboard) that some people really prefer Grok-3 lol.
BoorishBears · 13 days ago
I run a site that does something similar, but on a more granular level (prompts at the page level rather than the chapter level).

I think right now we're at the point where novelcrafter is an excellent proxy for the best models for readers, because LLMs are still mostly losing engagement due to technical errors as opposed to subjective ones:

That's repetition problems, moralizing/soft-censorship, grammatical quirks, missing instructions, forgetting major plot points, etc.

Those kinds of errors are so obvious you can almost rank these models with an N=1 vibe test, and they limit how much people will consume unless you're scratching certain itches like NSFW.

-

However I do think with enough post-training you can beat that level of problems and move to a stage where the writing is technically sound (and that's what I've spent most of the last year working on).

From there you get to more challenging problems that require much more feedback along some level of specialization per user (like what Midjourney does during onboarding to build up a style profile). Once you're not making technical mistakes, you now have to codify the ethereal concept of "user taste", and that will be a really interesting challenge for LLMs.

Jetwu · 13 days ago
Thanks for the comment! Do you mind linking the site? I'd love to check it out! That's a very fair point about the technical error aspect. With all the confounding variables (author skill differences, model selection based on price/speed, etc.), I'd say it's probably the most mature signal we have right now, but still far from ideal.

Really interested in what you've been working on for the past year! Are you doing custom fine-tuning or more on the prompting/post-processing side? Also I definitely need to check out the Midjourney onboarding, it sounds super interesting for inspo regarding your point about personalization + taste!

Der_Einzige · 13 days ago
Btw, creative writing is something where good sampler settings uniquely improve your experience a lot. That's why the coomer/ERP crowd is usually the first to implement a new sampler technique.

You should explore high temperature (far above 2) sampling with good truncations like min_p, top n sigma, TFS, mirostat, typicality sampling, etc. Basically anything that isn't top_p/top_k. This is the path to highly diverse outputs.
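For anyone unfamiliar with min_p, the idea from the comment above can be sketched in a few lines. This is a simplified, dependency-free illustration (not any particular library's implementation): scale logits by a high temperature, then discard every token whose probability falls below a fraction (`min_p`) of the top token's probability, and sample from the renormalized survivors.

```python
import math
import random

def min_p_sample(logits, temperature=3.0, min_p=0.1, rng=random):
    """Sample a token index using min_p truncation at high temperature."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # min_p truncation: keep only tokens whose probability is at least
    # min_p times the most likely token's probability.
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]

    # Renormalize over the surviving tokens and draw one.
    z = sum(p for _, p in kept)
    r = rng.random() * z
    acc = 0.0
    for i, p in kept:
        acc += p
        if r <= acc:
            return i
    return kept[-1][0]
```

The appeal over plain top_p/top_k is that the cutoff adapts to the shape of the distribution: when the model is confident, the tail is cut aggressively even at temperature well above 2, so outputs stay coherent while still being far more diverse.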

Jetwu · 13 days ago
This is an amazing suggestion! I'll definitely try to figure out a way to incorporate this into the leaderboard without locking it to a single setting each time. I'm currently using OpenRouter's default parameters, which was totally a brainfart on my part.
johnnyfeng · 13 days ago
Nice approach! Reader engagement beats synthetic benchmarks any day. Bookmarked to try later - curious which models actually hook readers vs just score well on tests.
Jetwu · 13 days ago
Thanks Johnny! I totally agree with you, really appreciate you for checking out my project!
