> We cleaned 860K English and 180K Chinese e-books from Anna’s Archive (Anna’s Archive, 2024) alongside millions of K-12 education exam questions.

From the DeepSeek-VL paper: https://arxiv.org/abs/2403.05525
On coding agents: I wonder if there are any good / efficient ways to measure how well different implementations actually work on coding tasks. SWE-bench seems good, but it's expensive to run. Effectively I'm curious about things like: given tool definition X vs Y (e.g. diff edits vs full-file edits), the prompt for tool X vs Y (how it's described, whether it uses examples), model choice (e.g. MCP with Claude vs inline python-exec with GPT-5), sub-agents, todo lists, etc., how much does each ablation actually matter? And I'd want to measure not just success, but cost to success too (efficiency).
Overall, it seems like in the phase space of options everything “kinda works”, but I'm very curious whether there are any major lifts, big gotchas, etc.
I ask because, subjectively, the Claude Code CLI always feels like it does a little bit better for me, but I haven't seen an LMArena-style ranking or a clear A-vs-B comparison or measurement.
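I'm not aware of a standard harness for this, so as a rough sketch of what I mean by "cost to success" across ablations (everything named here, run_agent / tasks / check_success, is a hypothetical placeholder, not any existing benchmark API):

```python
# Toy ablation harness: run each agent variant (tool definition, prompt style,
# model, sub-agents on/off, ...) over the same task set and record success AND
# cost. run_agent, tasks, and check_success are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class TrialResult:
    solved: bool
    total_tokens: int     # prompt + completion tokens across the whole episode
    tool_calls: int
    wall_seconds: float

def evaluate(variant_name, run_agent, tasks, check_success):
    results = []
    for task in tasks:
        trace = run_agent(task)   # assumed to expose tokens/tool_calls/wall_seconds
        results.append(TrialResult(
            solved=check_success(task, trace),
            total_tokens=trace.total_tokens,
            tool_calls=trace.tool_calls,
            wall_seconds=trace.wall_seconds,
        ))
    solved = [r for r in results if r.solved]
    rate = len(solved) / len(results)
    # "cost to success": tokens spent per solved task, counting failed attempts too
    tokens_per_success = sum(r.total_tokens for r in results) / max(len(solved), 1)
    print(f"{variant_name}: {rate:.0%} solved, "
          f"{tokens_per_success:,.0f} tokens per solved task")
    return results

# e.g. evaluate("diff edits + examples", run_diff_agent,     tasks, check_success)
#      evaluate("full-file edits",       run_fullfile_agent, tasks, check_success)
```

Even something this crude would answer "does diff-editing vs full-file editing matter more than the model choice" if the task set is held fixed.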
It's fun that it works, but the prefill time makes it feel unusable (2-3 minutes per tool use / completion). That means a ~10-20 tool-use interaction could take 30-60 minutes.
(This was editing a single server.py file that was ~1000 lines; the tool definitions + Claude context came to around 30k input tokens, and after the file read the input was around ~50k tokens. It could definitely be optimized. Also, I'm not sure whether ollama supports a KV cache between invocations of /v1/completions, which could help.)
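If anyone wants to see where the time goes, a minimal timing sketch against an OpenAI-compatible local endpoint (ollama's default is http://localhost:11434/v1; the model name and prompt below are placeholders): the per-call wall clock vs. prompt token count makes it obvious that re-sending the same ~30-50k-token prefix every turn is what dominates.

```python
# Rough timing harness for local tool-use calls; assumes an OpenAI-compatible
# endpoint (e.g. ollama serving at localhost:11434/v1). Model is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def timed_completion(messages, model="qwen2.5-coder:32b"):
    t0 = time.perf_counter()
    resp = client.chat.completions.create(model=model, messages=messages)
    dt = time.perf_counter() - t0
    usage = resp.usage
    print(f"{dt:6.1f}s  prompt_tokens={usage.prompt_tokens}  "
          f"completion_tokens={usage.completion_tokens}")
    return resp.choices[0].message

# Re-sending the long prefix (tool defs + file contents) each turn is the cost;
# any cross-call prompt/KV caching on the server side would cut it directly.
timed_completion([{"role": "user", "content": "Summarize server.py in one line."}])
```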
If it goes well, I could open source it.
What are the things you would want to optimize with such a framework? (So far I've been focusing on optimizing ML training and architecture search itself.) Hearing other ideas would help motivate me to open source it, if there's real demand for something like this.
What are some of the edge cases where ForeverVM does and doesn't work? I don't see anything in the documentation about installing new packages: do you pre-bake what's available, and how can you see which libraries are installed?
I do like that the ForeverVM REPL also seems to capture the state of the local drive (e.g. you can open a file, write to it, and then read it back later).
For context on what I've tried: I used CRIU [1] to make dumps of the process state and then reload them. It worked for basic things, but I ran into the issues stated above and abandoned the project. (I was trying to create a stack / undo context for REPLs that LLMs could use, since they often put themselves into bad states, and reverting to previous states seemed useful.) If I remember correctly, I also ran into trouble because capturing the various outputs (the ipython capture_output concept) proved difficult outside a Jupyter environment, and Jupyter environments themselves were even harder to snapshot. In the end I settled for ephemeral but still real-server Jupyter kernels where, via a wrapper, I managed locals() and globals() as a cache and re-executed commands in order to rebuild state after the server restarted or crashed. This also let me pip install new packages, so it proved more useful than statically building my image/environment. But I did lose the "serialization" property of the machine state, which was something I wanted.
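For concreteness, a stripped-down sketch of that replay approach (a wrapper that logs successfully executed cells and re-runs them in order to rebuild the namespace after a restart). This is illustrative only, not the actual code; a real version runs inside a Jupyter kernel and has to capture rich outputs, which is the hard part mentioned above.

```python
# Minimal sketch of "rebuild REPL state by replaying the command log".
import contextlib, io

class ReplayREPL:
    def __init__(self):
        self.log = []          # successfully executed source strings, in order
        self.ns = {}           # single dict used as both globals and locals

    def run(self, source: str) -> str:
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(source, self.ns)      # raises on error; failed cells aren't logged
        self.log.append(source)
        return buf.getvalue()

    def rebuild(self):
        """Recreate state after a kernel crash/restart by replaying the log."""
        self.ns = {}
        for source in self.log:
            with contextlib.redirect_stdout(io.StringIO()):
                exec(source, self.ns)

repl = ReplayREPL()
repl.run("x = 41")
repl.run("x += 1")
repl.rebuild()                 # simulate a restart
print(repl.run("print(x)"))    # -> 42
```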
That said, even though I personally abandoned the project, I still hold onto the dream of a full tree/graph of VMs (where each edge is code that was executed) in which each VM state can be analyzed (files, memory, etc.). Love what ForeverVM is doing and the early promise here.
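Since the tree-of-VMs idea is basically "each node is a state, each edge is executed code", the same replay trick gives a cheap approximation: a node is just the path of cells from the root, and forking is copying that path. Again a toy sketch, with no real snapshotting of files or memory:

```python
# Toy tree of REPL states: an edge is a source string; a node's state is
# reconstructed by replaying the edges from the root.
class StateNode:
    def __init__(self, history=()):
        self.history = tuple(history)    # code executed from the root to here

    def child(self, source: str) -> "StateNode":
        return StateNode(self.history + (source,))

    def materialize(self) -> dict:
        """Rebuild this node's namespace by replaying its edge history."""
        ns = {}
        for source in self.history:
            exec(source, ns)
        return ns

root = StateNode()
a = root.child("xs = [1, 2, 3]")
good = a.child("total = sum(xs)")
bad = a.child("xs = None")            # a branch the LLM can simply abandon
print(good.materialize()["total"])    # -> 6; the 'bad' branch never pollutes 'good'
```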
https://arxiv.org/abs/2501.00663
https://arxiv.org/pdf/2504.13173
Is there any other company that's openly publishing their research on AI at this level? Google should get a lot of credit for this.
Recently, my favorite from them was Lumine: https://arxiv.org/abs/2511.08892
Here's their official page: https://seed.bytedance.com/en/research