Readit News
simedw commented on Is Gemini 2.5 good at bounding boxes?   simedw.com/2025/07/10/gem... · Posted by u/simedw
thegeomaster · 2 months ago
A detail that is not mentioned is that Google models >= Gemini 2.0 are all explicitly post-trained for this task of bounding box detection: https://ai.google.dev/gemini-api/docs/image-understanding

Given that the author is using the specific `box_2d` format, he seems to be taking advantage of this feature, so I wanted to highlight it. My intuition is that a base multimodal LLM without this type of post-training would have much worse performance.

simedw · 2 months ago
That's true; it's also why I didn't benchmark against any other model provider.

It has been tuned so heavily on this specific format that even a tiny change, like switching the order in the `box_2d` format from `(ymin, xmin, ymax, xmax)` to `(xmin, ymin, xmax, ymax)`, causes performance to tank.
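For context, the docs linked above define the output format: each detection is a JSON object whose `box_2d` is `[ymin, xmin, ymax, xmax]`, normalized to a 0-1000 grid regardless of image size. A minimal sketch of converting that back to pixel coordinates (the `label` field follows the docs' examples; everything else is plain parsing):

```python
import json

def parse_gemini_boxes(response_text: str, width: int, height: int) -> list[dict]:
    """Convert Gemini's normalized box_2d output to pixel coordinates.

    Each box_2d is [ymin, xmin, ymax, xmax] on a 0-1000 grid,
    per the Gemini image-understanding docs linked above.
    """
    boxes = []
    for item in json.loads(response_text):
        ymin, xmin, ymax, xmax = item["box_2d"]
        boxes.append({
            "label": item.get("label", ""),
            # scale from the 0-1000 grid back to this image's pixels
            "xyxy": (
                xmin / 1000 * width,
                ymin / 1000 * height,
                xmax / 1000 * width,
                ymax / 1000 * height,
            ),
        })
    return boxes
```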

simedw commented on Is Gemini 2.5 good at bounding boxes?   simedw.com/2025/07/10/gem... · Posted by u/simedw
serjester · 2 months ago
I wrote a similar article a couple of months ago, but focusing instead on PDF bounding boxes—specifically, drawing boxes around content excerpts.

Gemini is really impressive at these kinds of object detection tasks.

https://www.sergey.fyi/articles/using-gemini-for-precise-cit...

simedw · 2 months ago
That's really interesting, thanks for sharing!

Are you using that approach in production for grounding when PDFs don't include embedded text, like in the case of scanned documents? I did some experiments for that use case, and it wasn't really reaching the bar I was hoping for.

simedw commented on Is Gemini 2.5 good at bounding boxes?   simedw.com/2025/07/10/gem... · Posted by u/simedw
EconomistFar · 2 months ago
Really interesting piece, the bit about tight vs loose bounding boxes got me thinking. Small inaccuracies can add up fast, especially in edge cases or when training on limited data.

Has anyone here found good ways to handle bounding box quality in noisy datasets? Do you rely more on human annotation or clever augmentation?

simedw · 2 months ago
Thank you! Better training data is often the key to solving these issues, though it can be a costly solution.

In some cases, running a model like SAM 2 on a loose bounding box can help refine the results. I usually add about 10% padding in each direction to the bounding box, just in case the original was too tight. Then, if you don't actually need the mask, you can just convert it back to a bounding box.
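A minimal sketch of that pad-then-tighten round trip. The SAM 2 call itself is elided, since the exact predictor interface is an assumption here; the geometry on either side of it is just NumPy:

```python
import numpy as np

def pad_box(box, width, height, pad=0.10):
    """Expand an (xmin, ymin, xmax, ymax) box by ~10% per side,
    clipped to the image, in case the original was too tight."""
    xmin, ymin, xmax, ymax = box
    w, h = xmax - xmin, ymax - ymin
    return (
        max(0.0, xmin - pad * w),
        max(0.0, ymin - pad * h),
        min(float(width), xmax + pad * w),
        min(float(height), ymax + pad * h),
    )

def mask_to_box(mask: np.ndarray):
    """Collapse a binary (H, W) mask back to a tight bounding box."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("segmentation returned an empty mask")
    return (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)

# In between these two steps: prompt SAM 2 with the padded box,
# keep the best-scoring mask, then call mask_to_box on it.
```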

simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages   simedw.com/2025/06/23/int... · Posted by u/simedw
andrepd · 2 months ago
Which it apparently does by completely changing the recipe in random places including ingredients and amounts thereof. It is _indeed_ a very good microcosm of what LLMs are, just not in the way these comments think.
simedw · 2 months ago
It was actually a bit worse than that: the LLM never got the full recipe due to some truncation logic I had added. So it regurgitated the recipe from its training data, and apparently it couldn't do both that and convert units at the same time with the Lite model (it worked with the plain Flash model).

I should have caught that, and there are probably other bugs too waiting to be found. That said, it's still a great recipe.

simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages   simedw.com/2025/06/23/int... · Posted by u/simedw
mossTechnician · 2 months ago
Changes Spegel made to the linked recipe's ingredients:

Pounds of lamb become kilograms (more than doubling the quantity of meat), a medium onion turns large, one celery stalk becomes two, six cloves of garlic turn into four, tomato paste vanishes, we lose nearly half a cup of wine, beef stock gets an extra ¾ cup, rosemary is replaced with oregano.

simedw · 2 months ago
Fantastic catch! It led me down a rabbit hole, and I finally found the root cause.

The recipe site was so long that it got truncated before being sent to the LLM. Then, based on the first 8000 characters, Gemini hallucinated the rest of the recipe; it was definitely in its training set.

I have fixed it and pushed a new version of the project. Thanks again; it really highlights how we can never fully trust models.
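The actual fix isn't shown in the thread, but the failure mode generalizes: silently slicing the input is precisely what invites the model to "complete" the rest from training data. A hypothetical guard (the names and the 8000-character limit are illustrative, not Spegel's code):

```python
MAX_CHARS = 8_000  # illustrative; the thread mentions an ~8000-char cut-off

def prepare_content(text: str) -> str:
    """Fail loudly instead of truncating: a model handed a partial page
    will happily invent the rest rather than flag the gap."""
    if len(text) <= MAX_CHARS:
        return text
    raise ValueError(
        f"Content is {len(text)} chars (> {MAX_CHARS}); "
        "chunk it or raise the budget instead of slicing silently."
    )
```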

simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages   simedw.com/2025/06/23/int... · Posted by u/simedw
coder543 · 2 months ago
Just a typo note: the flow diagram in the article says "Gemini 2.5 Pro Lite", but there is no such thing.
simedw · 2 months ago
You are right, it's Gemini 2.5 Flash Lite.
simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages   simedw.com/2025/06/23/int... · Posted by u/simedw
__MatrixMan__ · 2 months ago
It would be cool if it were smart enough to figure out whether it was necessary to rewrite the page on every visit. There's a large chunk of the web where one of us could visit once, rewrite to markdown, and then serve the cleaned-up version to each other without requiring a distinct rebuild on each visit.
simedw · 2 months ago
If the goal is to have a more consistent layout on each visit, I think we could save the last page's markdown and send it to the model as a one-shot example...
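A hypothetical sketch of that idea, caching the last render per domain and feeding it back as a one-shot example (none of these names are Spegel's actual code):

```python
from urllib.parse import urlparse

layout_cache: dict[str, str] = {}  # domain -> last markdown render

def build_prompt(url: str, cleaned_html: str) -> str:
    """Prepend the previous render for this domain as a one-shot
    example so the model keeps the layout stable across visits."""
    example = layout_cache.get(urlparse(url).netloc)
    parts = ["Rewrite this page as clean markdown."]
    if example:
        parts.append(f"Match the structure of this earlier render:\n{example}")
    parts.append(f"Page HTML:\n{cleaned_html}")
    return "\n\n".join(parts)

def remember(url: str, markdown: str) -> None:
    """Call after a successful render to seed the next visit."""
    layout_cache[urlparse(url).netloc] = markdown
```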
simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages   simedw.com/2025/06/23/int... · Posted by u/simedw
nextaccountic · 2 months ago
In your cleanup step, after cleaning obvious junk, I think you should do whatever Firefox's reader mode does to further clean up, and if that fails, bail out to the current output. That should reduce the number of tokens you send to the LLM even more.

You should also have some way for the LLM to indicate there is no useful output, because perhaps the page is supposed to be an SPA. This would force you to execute JavaScript to render that particular page, though.

simedw · 2 months ago
Just had a look, and there is quite a lot going into Firefox's reader mode.

https://github.com/mozilla/readability
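For a Python pipeline, the closest off-the-shelf equivalent is `readability-lxml`, a port of the same Arc90 extraction that reader mode descends from. A sketch of the suggested pre-clean with a bail-out, assuming a simple length threshold as the "extraction failed" signal:

```python
from readability import Document  # pip install readability-lxml

def extract_main_content(html: str, fallback: str) -> str:
    """Try reader-mode-style extraction; fall back to the existing
    cleanup output when it fails or comes back near-empty
    (e.g. an SPA whose content only exists after running JavaScript)."""
    try:
        summary = Document(html).summary()  # main-content HTML fragment
    except Exception:
        return fallback
    return summary if len(summary) > 500 else fallback
```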

simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages   simedw.com/2025/06/23/int... · Posted by u/simedw
stared · 2 months ago
Any chance it would work for pages like Facebook or LinkedIn? I would love to have a distraction-free way of searching information there.

Obviously, against the wishes of these social networks, which want us to be addicted... I mean, engaged.

simedw · 2 months ago
We’ll probably have to add some custom code to log in, get an auth token, and then browse with it. Not sure if LinkedIn would like that, but I certainly would.

u/simedw

Karma: 269 · Cake day: June 20, 2012
About
CTO/co-founder of V7, before that Aipoly.

blog: simedw.com X: https://x.com/simedw
