huac commented on Some critical issues with the SWE-bench dataset   arxiv.org/abs/2410.06992... · Posted by u/joshwa
sebzim4500 · a year ago
LLMs do not reliably reproduce their training data. This is quite easy to demonstrate: every LLM has been trained on all of Wikipedia (at minimum), and yet if you ask one about a niche fact mentioned once on Wikipedia, it is highly likely to get it wrong.
huac · a year ago
that comment refers to test-time inference, i.e. what the model is prompted with, not what it was trained on. this is, of course, also a tricky problem (especially over long context - needle in a haystack), but it should be much easier than memorization.

anyway, another interpretation is that the model also needs to decide whether the code in the issue is a reliable fix or not.

huac commented on Some critical issues with the SWE-bench dataset   arxiv.org/abs/2410.06992... · Posted by u/joshwa
huac · a year ago
> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.

Looking at the benchmark, https://www.swebench.com/, about half of the scored submissions get under 1/3 of problems correct? So they're either not cheating, or not cheating effectively?

huac commented on Test-driven development with an LLM for fun and profit   blog.yfzhou.fyi/posts/tdd... · Posted by u/crazylogger
xianshou · a year ago
One trend I've noticed, framed as a logical deduction:

1. Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.

2. Coding agents do massively better when they have a test-driven reward signal (a sketch follows this comment).

3. If a problem can be framed in a way that a coding agent can solve, that speeds up development at least 10x from the base case of human + assistant.

4. From (1)-(3), if you can get all the necessary context into 50k tokens and measure progress via tests, you can speed up development by 10x.

5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.

Sure enough, I see HN projects evolving in that direction.
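
a minimal sketch of the test-driven loop in (2) - `llm_propose_patch` is a hypothetical stand-in for whatever coding agent or API you use:

```python
# run the test suite, feed failures back to the model, repeat until green
import subprocess

def llm_propose_patch(failures: str) -> None:  # hypothetical agent call
    ...  # ask the model to edit files, given the failing-test output

def run_tests() -> tuple[bool, str]:
    proc = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def agent_loop(max_iters: int = 5) -> bool:
    for _ in range(max_iters):
        ok, output = run_tests()
        if ok:
            return True  # green suite is the reward signal
        llm_propose_patch(output)  # let the agent react to the failures
    return False
```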

huac · a year ago
> Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.

I had a very similar impression (wrote more in https://hua.substack.com/p/are-longer-context-windows-all-yo...).

One framing is that effective context window (i.e. the length the model can effectively reason over) determines how useful the model is. A human new-grad programmer might effectively reason over hundreds or thousands of tokens, but not millions - which is why we carefully scope their work and explain exactly where to look for relevant context. But a principal engineer might reason over many millions of tokens of context - code, yes, but also organizational and business context.

Trying to carefully select those 50k tokens is extremely difficult for LLMs/RAG today. I expect models to get much longer effective context windows, but there are hardware/cost constraints that make this difficult.
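
to make the difficulty concrete, a deliberately naive sketch of budgeted context selection - the relevance scores are assumed to come from some retriever, and the hard part (scoring relevance well) is exactly what's being glossed over:

```python
# greedily pack the highest-scoring chunks into a fixed token budget
def select_context(chunks: list[tuple[float, str]], budget_tokens: int = 50_000) -> list[str]:
    picked, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = len(text) // 4  # crude chars-per-token estimate
        if used + cost > budget_tokens:
            continue  # chunk doesn't fit; try the next-best one
        picked.append(text)
        used += cost
    return picked
```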

huac commented on Nepenthes is a tarpit to catch AI web crawlers   zadzmo.org/code/nepenthes... · Posted by u/blendergeek
huac · a year ago
from an AI research perspective -- it's pretty straightforward to mitigate this attack

1. perplexity filtering - a small LLM scores how likely each document is under its own distribution. if the perplexity is too high (gibberish like this tarpit emits) or too low (likely already LLM-generated at low temperature, or already memorized), toss it out. (rough sketch below.)

2. models can learn to prioritize/deprioritize data just based on the domain name of where it came from. essentially they can learn 'wikipedia good, your random website bad' without any other explicit labels. https://arxiv.org/abs/2404.05405 and also another recent paper that I don't recall...
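
a rough sketch of the perplexity filter in (1), using GPT-2 small as the scoring model - the thresholds are made up and would need tuning against real data:

```python
# minimal perplexity filter sketch; thresholds are illustrative, not tuned
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # average next-token loss under the small LM, exponentiated
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

def keep(text: str, low: float = 8.0, high: float = 200.0) -> bool:
    # too low: likely LLM-generated or memorized; too high: gibberish/tarpit output
    return low < perplexity(text) < high
```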

huac commented on Daisy, an AI granny wasting scammers' time   news.virginmediao2.co.uk/... · Posted by u/ortusdux
seabass-labrax · a year ago
I'm imagining this is just a publicity stunt, and I'll say it's a very good one. However, I can't see it being very practical. There are lots of scam calls to keep up with, and LLMs and text-to-speech models are expensive to run. If they do run this in production, the costs of running hundreds of 'Daisies' will inevitably be passed on to the consumer; worse still, if the scammers are calling in through PSTN lines or cellular, this will use up our already scarce bandwidth. I've frequently had difficulty connecting through trunk lines from Belgium and Germany to numbers in Britain, and that's without a legion of AI grannies sitting on the phone!
huac · a year ago
real-time full duplex, like OpenAI's GPT-4o, is pretty expensive. cascaded approaches (usually about 800ms-1s of delay) are slower and worse, but very, very cheap. when I built this a year ago, I estimated the LLM + TTS + other serving costs to be less than the Twilio costs.
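
for illustration, the shape of a cascaded pipeline - the three stage functions are hypothetical stubs (you'd swap in real STT/LLM/TTS services), and the per-stage latencies in the comments are ballpark assumptions:

```python
# sketch of a cascaded voice pipeline (STT -> LLM -> TTS); all stages are stubs
import time

def transcribe(audio_chunk: bytes) -> str:  # hypothetical STT stage
    return "caller said something"

def respond(transcript: str) -> str:  # hypothetical LLM stage
    return f"reply to: {transcript}"

def synthesize(text: str) -> bytes:  # hypothetical TTS stage
    return text.encode()

def handle_turn(audio_chunk: bytes) -> bytes:
    t0 = time.monotonic()
    text = transcribe(audio_chunk)  # ~100-300ms with streaming STT (assumed)
    reply = respond(text)           # ~300-500ms to first token (assumed)
    audio = synthesize(reply)       # ~100-300ms with a fast TTS voice (assumed)
    print(f"turn latency: {time.monotonic() - t0:.3f}s")
    return audio
```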
huac commented on Go library for in-process vector search and embeddings with llama.cpp   github.com/kelindar/searc... · Posted by u/kelindar
huac · a year ago
nice work! I wrote a similar library (https://github.com/stillmatic/gollum/blob/main/packages/vect...) and similarly found that exact search (w/the same simple heap + SIMD optimizations) is quite fast. with 100k objects, retrieval queries complete in <200ms on an M1 Mac. no need for a fancy vector DB :)

that library used `viterin/vek` for SIMD math: https://github.com/viterin/vek/
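
the same brute-force top-k idea, sketched in Python/NumPy rather than Go - sizes and dims are illustrative:

```python
# exact nearest-neighbor search: one matmul + partial selection, no index needed
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 384)).astype(np.float32)  # stored embeddings
query = rng.standard_normal(384).astype(np.float32)

def top_k(q: np.ndarray, corpus: np.ndarray, k: int = 10) -> np.ndarray:
    scores = corpus @ q                     # one SIMD-friendly matrix-vector product
    idx = np.argpartition(scores, -k)[-k:]  # O(n) partial selection of the top k
    return idx[np.argsort(scores[idx])[::-1]]  # sort just those k, best first

ids = top_k(query, corpus)
```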

huac commented on Show HN: Mdx – Execute your Markdown code blocks, now in Go   github.com/dim0x69/mdx... · Posted by u/dim0x69
huac · a year ago
reminds me a lot of rmarkdown, which allows you to run many languages in a similar fashion: https://rmarkdown.rstudio.com/
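
the core trick behind tools like this, sketched - not mdx's actual implementation, and the language-to-command map is a simplified assumption:

```python
# extract fenced code blocks from a markdown file and run each through its interpreter
import re
import subprocess
import sys

RUNNERS = {"python": ["python3"], "bash": ["bash"], "sh": ["sh"]}  # assumed mapping
FENCE = re.compile(r"```(\w+)\n(.*?)```", re.DOTALL)

def run_blocks(markdown: str) -> None:
    for lang, code in FENCE.findall(markdown):
        cmd = RUNNERS.get(lang)
        if cmd is None:
            continue  # skip blocks in languages we don't know how to run
        subprocess.run(cmd, input=code, text=True, check=True)

if __name__ == "__main__":
    run_blocks(open(sys.argv[1]).read())
```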
huac commented on Taiwan is heading toward an energy crunch?   wired.com/story/taiwan-ma... · Posted by u/vunderba
silisili · a year ago
You mentioned offshore wind, is offshore solar not a thing? Seems it'd be rather easy to float a farm of them...easier than floating a giant windmill, at least.
huac · a year ago
shouldn't there be more clouds over the ocean, since that's where clouds tend to form?
huac commented on Moshi: A speech-text foundation model for real time dialogue   github.com/kyutai-labs/mo... · Posted by u/gkucsko
artsalamander · a year ago
I've been building solutions for real-time voice -> llm -> voice output, and I think the most exciting part of what you're building is the streaming neural audio codec, since you're never really able to stream STT with whisper.

However from a product point of view I wouldn't necessarily want to pipe that into an LLM and have it reply, I think in a lot of use-cases there needs to be a tool/function calling step before a reply. Down to chat with anyone reading this who is working along these lines!

edit: tincans as mentioned below looks excellent too

editedit: noooo, apparently tincans development has ended. there's 10000% space for something in this direction - Chris, if you read this, please let me pitch you on the product/business use-cases this solves, regardless of how good llms get...

huac · a year ago
> there needs to be a tool/function calling step before a reply

I built that almost exactly a year ago :) it was good but not fast enough - hence building the joint model.
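
a sketch of that tool-call gate in a cascaded agent - the `llm` stub, the `lookup_order` tool, and the JSON routing protocol are all hypothetical; the point is the extra LLM round-trip before the spoken reply, which is where the latency goes:

```python
# route through a tool call (if needed) before generating the spoken reply
import json

def lookup_order(order_id: str) -> str:  # hypothetical tool
    return json.dumps({"order_id": order_id, "status": "shipped"})

TOOLS = {"lookup_order": lookup_order}

def llm(prompt: str) -> str:  # hypothetical LLM call
    if "tool result" in prompt:
        return "good news, your order has shipped!"  # second pass: plain reply
    return '{"tool": "lookup_order", "args": {"order_id": "123"}}'  # first pass: tool call

def reply_to(transcript: str) -> str:
    decision = llm(transcript)  # first pass: decide whether a tool is needed
    try:
        call = json.loads(decision)
    except json.JSONDecodeError:
        return decision  # no tool needed, speak the reply directly
    result = TOOLS[call["tool"]](**call["args"])
    return llm(f"{transcript}\ntool result: {result}")  # second pass: compose reply
```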
