There is clearly a strategy here - and I'm trying to figure it out.
Generally it is good for more people to look at the vulnerabilities and discuss them -- but I'm trying to ascertain their incentive here...
Also a recruiting and branding effort.
All of this is educated guessing, but that's my feeling. I do think the post could have been clearer about the practical dangers of poisoning. Is it to spew misinformation? Is it to cause a corporate LLM-powered application to leak data it shouldn't? I'm not really sure here.
Some of the other main tricks: compress the model to 8-bit floating-point formats or even lower. This reduces the amount of data that has to stream to the compute unit, and newer GPUs can do math directly in 8-bit or 4-bit floating point. Mixture-of-experts (MoE) models are another trick: for a given token, a router in the model decides which subset of the parameters is used, so not all weights have to be streamed. Another one is speculative decoding, which uses a smaller model to generate several possible future tokens and then, in parallel, checks which of those match what the full model would have produced (rough sketch below).
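To make speculative decoding concrete, here's a toy sketch of the control flow. sample_next and verify are hypothetical stand-ins for the real model calls, and real systems run the verification as one batched forward pass on the GPU:

    def speculative_decode(target_model, draft_model, tokens, k=4, max_new=64):
        start = len(tokens)
        while len(tokens) - start < max_new:
            # 1. The cheap draft model proposes k tokens autoregressively.
            draft = []
            for _ in range(k):
                draft.append(draft_model.sample_next(tokens + draft))
            # 2. The full model checks all k proposals in a single parallel pass
            #    and returns the longest prefix it agrees with.
            accepted = target_model.verify(tokens, draft)
            tokens += accepted
            if len(accepted) < k:
                # First mismatch: take the full model's own token and resume.
                tokens.append(target_model.sample_next(tokens))
        return tokens

When the draft model guesses well, you stream the big model's weights once per k tokens instead of once per token, which is where the speedup comes from.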
Add all of these up and you get efficiency! Source: I was director of the inference team at Databricks.
I do not have a good sense of how well quality scales with narrow MoEs, but even if we get something like Llama 3.3 70B quality at only 8B active parameters, people could do a ton locally.
I've been out of the loop with HuggingFace models.
What can you do with these models?
1. Can you download them and run them on your laptop via JupyterLab?
2. What benefits does that get you?
3. Can you update them regularly (with new data on the internet, e.g.)?
4. Can you finetune them for a specific use case (e.g., geospatial data)?
5. How difficult and time-consuming (person-hours) is it to finetune a model?
(If HuggingFace has answers to these questions, please point me to the URL. HuggingFace, to me, seems like the early days of GitHub. A small number were heavy users, but the rest were left scratching their heads and wondering how to use it.)
Granted, these are newbie questions, but answers will be beneficial to a lot of us out there.
Yes you can. The community creates quantized variants of these that can run on consumer GPUs. A 4-bit quantization of Llama 70B works pretty well on MacBook Pros; the Neural Engine with unified memory is quite solid for these. Discrete GPUs are a bit tougher because consumer GPU RAM is still kinda small.
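If you want a concrete starting point, here's roughly what that looks like with llama-cpp-python; the model filename is a placeholder for whatever community GGUF quantization you download from Hugging Face:

    # pip install llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-3.3-70b-instruct.Q4_K_M.gguf",  # placeholder: any local GGUF file
        n_ctx=4096,       # context window
        n_gpu_layers=-1,  # offload all layers to Metal/CUDA when available
    )

    out = llm("Q: Why does unified memory help local inference? A:", max_tokens=128)
    print(out["choices"][0]["text"])

This runs fine in a JupyterLab cell; the main constraint is that the quantized weights have to fit in your machine's (unified) memory.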
You can also fine-tune them. There are a lot of frameworks, like unsloth, that make this easier: https://github.com/unslothai/unsloth . Fine-tuning can be pretty tricky to get right (you need to be aware of things like learning rates), but there are good resources on the internet where a lot of hobbyists have gotten things working. You do not need a PhD in ML to accomplish this. You will, however, need data that you can represent textually. A rough sketch of the workflow is below.
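Hedged sketch of a typical LoRA fine-tune with unsloth plus trl; the APIs evolve, so treat the exact arguments as approximate, and the dataset path is a placeholder for your own data:

    # pip install unsloth trl datasets
    from unsloth import FastLanguageModel
    from trl import SFTTrainer
    from transformers import TrainingArguments
    from datasets import load_dataset

    # 4-bit base model small enough for a single consumer GPU
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-3-8b-bnb-4bit",
        max_seq_length=2048,
    )
    model = FastLanguageModel.get_peft_model(model, r=16)  # train small LoRA adapters only

    # placeholder: your use case, one {"text": "..."} example per line
    dataset = load_dataset("json", data_files="geospatial_examples.jsonl")["train"]

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        args=TrainingArguments(
            per_device_train_batch_size=2,
            learning_rate=2e-4,  # one of the knobs you actually have to watch
            max_steps=100,
            output_dir="outputs",
        ),
    )
    trainer.train()

The reason this is feasible on consumer hardware is LoRA: you freeze the 4-bit base weights and train only a small set of adapter matrices.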
Source: Director of Engineering for model serving at Databricks.
That said, the conclusion that it's a good model for cheap is true. I'd just be hesitant to say it's a great model.
You could, of course, use us and get that out of the box if you have access to Databricks.