ilyakaminsky (u/ilyakaminsky)

ilyakaminsky commented on Nano Banana can be prompt engineered for nuanced AI image generation minimaxir.com/2025/11/nan... · Posted by u/minimaxir

sorcercode · 4 months ago

@simonw: slight tangent but super curious how you managed to generate the preview of that gemini-cli terminal session gist - https://gistpreview.github.io/?17290c1024b0ef7df06e9faa4cb37...

is this just a manual copy/paste into a gist with some html css styling; or do you have a custom tool à la amp-code that does this more easily?

ilyakaminsky · 4 months ago

I use Gemini CLI on a daily basis. It used to crash often and I'd lose the chat history. I found this tool called ai-cli-log [1] and it does something similar out of the box. I don't run Gemini CLI without it.

[1] https://github.com/alingse/ai-cli-log

ilyakaminsky commented on Launch HN: Strata (YC X25) – One MCP server for AI to handle thousands of tools · Posted by u/wirehack

wirehack · 6 months ago

Strata supports connecting to custom external MCP servers via API: https://docs.klavis.ai/api-reference/strata/create. For the servers on our website, most of them are created by ourselves.

ilyakaminsky · 6 months ago

How can I submit my service to your website? Is there a simpler way than creating a PR here? https://github.com/Klavis-AI/klavis/tree/main/mcp_servers

ilyakaminsky commented on Show HN: Whispering – Open-source, local-first dictation you can trust github.com/epicenter-so/e... · Posted by u/braden-w

mrgaro · 7 months ago

I'd love to find a tool which could recognise a few different speakers so that I could automatically dictate 1:1 sessions. In addition, I definitely would want to feed that to an LLM to clean up the notes (to remove all "umm" and similar nonsense) and to do context-aware spell checking.

The LLM part should be very much doable, but I'm not sure if speaker recognition exists in a sufficiently working state?

ilyakaminsky · 7 months ago

Shameless plug -- check out speechischeap.com

I spent three months perfecting the speaker diarization pipeline and I think you'll be quite pleased with the results.

ilyakaminsky commented on Fast catherinejue.com/fast... · Posted by u/gaplong

sipjca · 7 months ago

ive approached the same thing but slightly differently. i can run it on consumer hardware for vastly cheaper than the cloud and don't have to worry about image sizes at all. (bare metal is 'faster') offering 20,000 minutes of transcription for free up to the rate limit (1 Request Every 5 Seconds)

https://geppetto.app

I contributed "whisperfile" as a result of this work:

* https://github.com/Mozilla-Ocho/llamafile/tree/main/whisper....

* https://github.com/cjpais/whisperfile

if you ever want to chat about making transcription virtually free or so cheap for everyone let me know. I've been working on various projects related to it for a while. including open source/cross-platform superwhisper alternative https://handy.computer

ilyakaminsky · 7 months ago

> i can run it on consumer hardware for vastly cheaper than the cloud

Woah, that's really cool, CJ! I've been toying with the idea of standing up a cluster of older iPhones to run Apple's Speech framework. [1] The inspiration came from this blog post [2] where the author is using it for OCR. A couple of things are holding me back: (1) the OSS models are better according to the current benchmarks and (2) I have customers all over the world, so geographical load-balancing is a real factor. With that said, I'll definitely spend some time checking out your work. Thanks for sharing!

[1] https://developer.apple.com/documentation/speech

[2] https://terminalbytes.com/iphone-8-solar-powered-vision-ocr-...

ilyakaminsky commented on Fast catherinejue.com/fast... · Posted by u/gaplong

willsmith72 · 7 months ago

Not in development and maintenance dollars it's not

ilyakaminsky · 7 months ago

Hmm… That's a good point. I recall a few instances where I went too far to the detriment of production. Having a trusty testing and benchmarking suite thankfully helped with keeping things more stable. As a solo developer, I really enjoy the development process, so while that bit is costly, I didn't really consider that until you mentioned it.

ilyakaminsky commented on Problem solving using Markov chains (2007) [pdf] math.uchicago.edu/~shmuel... · Posted by u/Alifatisk

mindcrime · 7 months ago

For the Monte Carlo Method stuff in particular[1], I get the sense that the most iconic "Hello, World" example is using MC to calculate the value of pi. I can't explain it in detail from memory, but it's approximately something like this:

Define a square of some known size (1x1 should be fine, I think)

Inscribe a circle inside the square

Generate random points inside the square

Look at how many fall inside the square but not the circle, versus the ones that do fall in the circle.

From that, using what you know about the area of the square and circle respectively, the ratio of "inside square but not in circle" and "inside circle" points can be used to set up an equation for the value of pi.
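
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from any linked simulator: the function name `estimate_pi` and its parameters are made up for the example. It assumes a 1x1 square with an inscribed circle of radius 0.5, so the circle covers pi/4 of the square and pi is roughly 4 times the fraction of points that land inside the circle.

```python
import random


def estimate_pi(num_points, seed=None):
    """Estimate pi by sampling random points in a 1x1 square.

    The inscribed circle has radius 0.5 and is centered at (0.5, 0.5),
    so its area is pi * 0.25 and the hit fraction approaches pi / 4.
    """
    rng = random.Random(seed)  # seedable so runs can be reproduced
    inside_circle = 0
    for _ in range(num_points):
        x = rng.random()
        y = rng.random()
        # Squared distance from the center, compared against radius squared.
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            inside_circle += 1
    return 4 * inside_circle / num_points


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(n, estimate_pi(n, seed=42))
```

The estimate converges slowly: the error shrinks roughly with the square root of the number of points, so each extra digit of accuracy costs about 100 times more samples.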

Somebody who's more familiar with this than me can probably fix the details I got wrong, but I think that's the general spirit of it.

For Markov Chains in general, the only thing that jumps to mind for me is generating text for old school IRC bots. :-)

[1]: which is probably not the point of this essay. Sorry for muddying the waters, I have both concepts kinda 'top of mind' in my head right now after watching the Veritasium video.
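
For the IRC-bot style text generation mentioned above, a first-order word-level Markov chain might look roughly like this. It is only a sketch under simple assumptions (whitespace tokenization, one word of context), and the names `build_chain` and `generate` are hypothetical, not taken from any particular bot.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        # Repeated followers stay in the list, which weights the random choice.
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words, seed=None):
    """Walk the chain from `start`, picking a random follower at each step."""
    rng = random.Random(seed)
    word = start
    output = [word]
    for _ in range(max_words - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word never had a successor in the text
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the cat ran to the door"
    print(generate(build_chain(corpus), "the", 12))
```

The next word depends only on the current word, which is the Markov property; longer contexts (pairs or triples of words as keys) tend to give more coherent output at the cost of needing more training text.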

ilyakaminsky · 7 months ago

TIL, thanks! I asked Claude to generate a simulator [1] based on your comment. I think it came out well.

[1] https://claude.ai/public/artifacts/1b921a50-897e-4d9e-8cfa-0...

ilyakaminsky commented on Fast catherinejue.com/fast... · Posted by u/gaplong

ilyakaminsky · 7 months ago

Fast is also cheap. Especially in the world of cloud computing where you pay by the second. The only way I could create a profitable transcription service [1] that undercuts the rest was by optimizing every little thing along the way. For instance, just yesterday I learned that the image size I've put together is 2.5× smaller than the next open source variant. That means faster cold boots, which reduces the cost (and provides a better service).

[1] https://speechischeap.com

ilyakaminsky commented on Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic github.com/openai/whisper... · Posted by u/edent

poly2it · 8 months ago

How does this make a profit? Whisper should be $0.006 to $0.010 per minute, but you rate less than $0.001? Do you 10x the audio?

ilyakaminsky · 8 months ago

Thanks for noticing. It took a lot of effort to optimize the pipeline every step of the way: VAD, inference server, hardware optimization, etc. But nothing that would compromise on quality. The audio is currently transcribed at its original speed. I'll be sure to publish something if I manage to speed it up without incurring any losses to the WER.

ilyakaminsky commented on Complete silence is always hallucinated as "ترجمة نانسي قنقر" in Arabic github.com/openai/whisper... · Posted by u/edent

dandiep · 8 months ago

Whisper is unusable IMO because of the hallucinations. Widely documented. Removing silence from audio clips helps, but even then it will auto-correct grammar, translate bilingual speech, etc. Improved in the latest audio models but not solved [1]

1. https://news.ycombinator.com/item?id=43427376

ilyakaminsky · 8 months ago

I wouldn't describe it as "unusable" so much as needing to understand its constraints and how to work around them. I built a business on top of Whisper [1] and one of the early key insights was to implement a good voice activity detection (VAD) model in order to reduce Whisper's hallucinations on silence.

[1] https://speechischeap.com
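
To illustrate the idea of gating audio before it reaches Whisper, here is a deliberately naive sketch. It swaps in a simple energy threshold for the VAD model described above, so it is not what the service uses, and a real VAD model handles noise far better. The function names, the 30 ms frame size, and the 0.01 threshold are all assumptions made for the example; samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(samples, sample_rate, frame_ms=30, threshold=0.01):
    """Return (start, end) sample indices of regions louder than `threshold`.

    Only these regions would be passed on for transcription; silent
    stretches are dropped so the model never sees them.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    segments = []
    start = None
    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        is_speech = frame_rms(frame) >= threshold
        if is_speech and start is None:
            start = offset  # a speech region begins
        elif not is_speech and start is not None:
            segments.append((start, offset))  # the region just ended
            start = None
    if start is not None:
        segments.append((start, len(samples)))  # audio ended mid-speech
    return segments
```

A clip that is entirely silent yields an empty list, which means nothing is sent to the model at all and there is nothing for it to hallucinate on.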

ilyakaminsky commented on OpenAI charges by the minute, so speed up your audio george.mand.is/2025/06/op... · Posted by u/georgemandis

satvikpendem · 8 months ago

Can it do real-time transcription with diarization? I'm looking for that for a product feature I'm working on. Currently I've seen Speechmatics do this well, haven't heard of others.

ilyakaminsky · 8 months ago

Not yet. The gains in efficiency come from optimizing the speedup factor. Real-time audio cannot be processed any faster than 1× by definition.
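
One way to read the 1× limit, assuming "speedup factor" means audio duration divided by wall-clock processing time (an interpretation, not something stated above): a recorded file can be processed as fast as the hardware allows, but a live stream cannot finish before the audio itself has finished arriving. The helper names below are hypothetical.

```python
def speedup_factor(audio_seconds, processing_seconds):
    """Seconds of audio handled per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def live_speedup_factor(audio_seconds, compute_seconds):
    """Speedup for a live stream.

    The last sample only arrives after `audio_seconds` have elapsed, so
    the wall-clock time can never be shorter than the audio itself.
    """
    wall_clock = max(audio_seconds, compute_seconds)
    return speedup_factor(audio_seconds, wall_clock)


if __name__ == "__main__":
    # One hour of audio that needs 60 seconds of compute.
    print(speedup_factor(3600, 60))       # pre-recorded file
    print(live_speedup_factor(3600, 60))  # same audio arriving live
```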