Exactly which part of their writings comes off as arrogant to you? The only point in Amodei's article[0] that could even remotely be interpreted as arrogant is this:
All of this is to say that DeepSeek-V3 is not a unique breakthrough or something that fundamentally changes the economics of LLM’s; it’s an expected point on an ongoing cost reduction curve. What’s different this time is that the company that was first to demonstrate the expected cost reductions was Chinese.
Maybe I'm different, but it really does sound like a reasonable judgement to me.

[0]: https://darioamodei.com/on-deepseek-and-export-controls#deep...
But seriously, I wonder how much acoustic energy - that they point out you can't hear - is still entering your ear and interacting with the delicate structures there.
So it doesn't really, as the title claims, turn recordings into images (it already has the images), and the distorted fake images it creates are only "accurate" in that they broadly slot into the right category in terms of urban/rural setting, amount of greenery, and amount of sky shown.
It sounds like the matching is the useful part and the "generative" part is just a huge disadvantage. The paper doesn't seem to say if the LLM is any better than other types of models at the matching part.
But, no demo anywhere?