simedw (u/simedw) - Readit News

simedw commented on Is Gemini 2.5 good at bounding boxes? simedw.com/2025/07/10/gem... · Posted by u/simedw

thegeomaster · 2 months ago

A detail that is not mentioned is that Google models >= Gemini 2.0 are all explicitly post-trained for this task of bounding box detection: https://ai.google.dev/gemini-api/docs/image-understanding

Given that the author is using the specific `box_2d` format, it suggests that he is taking advantage of this feature, so I wanted to highlight it. My intuition is that a base multimodal LLM without this type of post-training would have much worse performance.

simedw · 2 months ago

That's true; it's also why I didn't benchmark against any other model provider.

It has been tuned so heavily on this specific format that even a tiny change, like switching the order in the `box_2d` format from `(ymin, xmin, ymax, xmax)` to `(xmin, ymin, xmax, ymax)`, causes performance to tank.
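Since the model is tied to the `(ymin, xmin, ymax, xmax)` order, it is usually safer to keep that order in the prompt and reorder on the client side. Below is a minimal sketch of that conversion. It assumes the coordinates are normalized to a 0-1000 scale, which is how I read Google's image-understanding documentation; the function name `box_2d_to_pixels` is made up for this example.

```python
def box_2d_to_pixels(box_2d, image_width, image_height):
    """Convert a Gemini-style box to pixel coordinates.

    Assumption: box_2d is [ymin, xmin, ymax, xmax] on a 0-1000 scale.
    Returns (xmin, ymin, xmax, ymax) in pixels.
    """
    ymin, xmin, ymax, xmax = box_2d
    return (
        round(xmin * image_width / 1000),
        round(ymin * image_height / 1000),
        round(xmax * image_width / 1000),
        round(ymax * image_height / 1000),
    )


if __name__ == "__main__":
    # A box covering the middle of a 640x480 image.
    print(box_2d_to_pixels([100, 200, 500, 800], 640, 480))
```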

simedw commented on Is Gemini 2.5 good at bounding boxes? simedw.com/2025/07/10/gem... · Posted by u/simedw

serjester · 2 months ago

I wrote a similar article a couple of months ago, but focusing instead on PDF bounding boxes—specifically, drawing boxes around content excerpts.

Gemini is really impressive at these kinds of object detection tasks.

https://www.sergey.fyi/articles/using-gemini-for-precise-cit...

simedw · 2 months ago

That's really interesting, thanks for sharing!

Are you using that approach in production for grounding when PDFs don't include embedded text, like in the case of scanned documents? I did some experiments for that use case, and it wasn't really reaching the bar I was hoping for.

simedw commented on Is Gemini 2.5 good at bounding boxes? simedw.com/2025/07/10/gem... · Posted by u/simedw

EconomistFar · 2 months ago

Really interesting piece, the bit about tight vs loose bounding boxes got me thinking. Small inaccuracies can add up fast, especially in edge cases or when training on limited data.

Has anyone here found good ways to handle bounding box quality in noisy datasets? Do you rely more on human annotation or clever augmentation?

simedw · 2 months ago

Thank you! Better training data is often the key to solving these issues, though it can be a costly solution.

In some cases, running a model like SAM 2 on a loose bounding box can help refine the results. I usually add about 10% padding in each direction to the bounding box, just in case the original was too tight. Then, if you don't actually need the mask, you just convert it back to a bounding box.
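The padding and the mask-to-box steps can be sketched as follows. This is an illustration, not the exact code from the comment: it assumes "10%" means 10% of the box width and height, represents the mask as a plain list of 0/1 rows, and leaves out the SAM 2 call itself. The helper names `pad_box` and `mask_to_box` are hypothetical.

```python
def pad_box(box, image_width, image_height, padding=0.10):
    """Grow (xmin, ymin, xmax, ymax) by `padding` of its size on each side,
    clamped to the image bounds."""
    xmin, ymin, xmax, ymax = box
    pad_x = (xmax - xmin) * padding
    pad_y = (ymax - ymin) * padding
    return (
        max(0, round(xmin - pad_x)),
        max(0, round(ymin - pad_y)),
        min(image_width, round(xmax + pad_x)),
        min(image_height, round(ymax + pad_y)),
    )


def mask_to_box(mask):
    """Return the tight (xmin, ymin, xmax, ymax) around the nonzero pixels
    of a 2D mask (list of rows), or None if the mask is empty.
    xmax and ymax are exclusive."""
    if not mask:
        return None
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


if __name__ == "__main__":
    padded = pad_box((100, 100, 200, 300), 640, 480)
    print(padded)
    # A segmentation model would be prompted with `padded` here;
    # this toy mask stands in for its output.
    toy_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    print(mask_to_box(toy_mask))
```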

Posted by u/simedw 2 months ago

Is Gemini 2.5 good at bounding boxes? simedw.com/2025/07/10/gem...

simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages simedw.com/2025/06/23/int... · Posted by u/simedw

andrepd · 2 months ago

Which it apparently does by completely changing the recipe in random places including ingredients and amounts thereof. It is _indeed_ a very good microcosm of what LLMs are, just not in the way these comments think.

simedw · 2 months ago

It was actually a bit worse than that: the LLM never got the full recipe due to some truncation logic I had added. So it regurgitated the recipe from its training data, and apparently it couldn't do both that and convert units at the same time with the Lite model (it worked with regular Flash).

I should have caught that, and there are probably other bugs too waiting to be found. That said, it's still a great recipe.

simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages simedw.com/2025/06/23/int... · Posted by u/simedw

mossTechnician · 2 months ago

Changes Spegel made to the linked recipe's ingredients:

Pounds of lamb become kilograms (more than doubling the quantity of meat), a medium onion turns large, one celery stalk becomes two, six cloves of garlic turn into four, tomato paste vanishes, we lose nearly half a cup of wine, beef stock gets an extra ¾ cup, rosemary is replaced with oregano.

simedw · 2 months ago

Fantastic catch! It led me down a rabbit hole, and I finally found the root cause.

The recipe page was so long that it got truncated before being sent to the LLM. Then, based on the first 8000 characters, Gemini hallucinated the rest of the recipe; it was definitely in its training set.

I have fixed it and pushed a new version of the project. Thanks again, it really highlights how we can never fully trust models.
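One way to guard against this class of bug is to make truncation explicit, so that both the caller and the model know the input is incomplete. The sketch below is not the fix that was actually pushed (the thread does not say what that was). The function name `truncate_for_llm`, the marker text, and the default limit of 8000 characters (taken from the comment above) are assumptions for illustration.

```python
TRUNCATION_MARKER = "\n[TRUNCATED: the source page continues beyond this point]"


def truncate_for_llm(text, max_chars=8000):
    """Return (text, was_truncated).

    If the text is too long, cut at the last newline before the limit
    (or at the limit if there is none) and append a visible marker so
    the model is not tempted to invent the missing part.
    """
    if len(text) <= max_chars:
        return text, False
    cut = text.rfind("\n", 0, max_chars)
    if cut <= 0:
        cut = max_chars
    return text[:cut] + TRUNCATION_MARKER, True


if __name__ == "__main__":
    page = "ingredient line\n" * 1000
    prompt_text, was_truncated = truncate_for_llm(page)
    if was_truncated:
        print("warning: page was truncated before being sent to the model")
```

Returning the flag lets the caller warn the user or split the page into chunks instead of silently dropping content.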

simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages simedw.com/2025/06/23/int... · Posted by u/simedw

coder543 · 2 months ago

Just a typo note: the flow diagram in the article says "Gemini 2.5 Pro Lite", but there is no such thing.

simedw · 2 months ago

You are right, it's Gemini 2.5 Flash Lite

simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages simedw.com/2025/06/23/int... · Posted by u/simedw

__MatrixMan__ · 2 months ago

It would be cool if it were smart enough to figure out whether it was necessary to rewrite the page on every visit. There's a large chunk of the web where one of us could visit once, rewrite to markdown, and then serve the cleaned up version to each other without requiring a distinct rebuild on each visit.

simedw · 2 months ago

If the goal is to have a more consistent layout on each visit, I think we could save the last page's markdown and send it to the model as a one-shot example...
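A rough sketch of that idea: keep the last markdown per URL and, when it exists, include it in the prompt as an example of the desired layout. Everything here is hypothetical (the `PageCache` class, the `build_prompt` function, and the prompt wording); it is not how Spegel is implemented.

```python
class PageCache:
    """In-memory store of the last markdown rendering for each URL."""

    def __init__(self):
        self._store = {}

    def get(self, url):
        return self._store.get(url)

    def put(self, url, markdown):
        self._store[url] = markdown


def build_prompt(page_content, previous_markdown=None):
    """Build the rewrite prompt, adding the previous rendering as a
    one-shot layout example when one is available."""
    parts = ["Rewrite the following page as markdown."]
    if previous_markdown is not None:
        parts.append(
            "Match the layout of this earlier rendering of the same page:\n"
            + previous_markdown
        )
    parts.append("Page content:\n" + page_content)
    return "\n\n".join(parts)


if __name__ == "__main__":
    cache = PageCache()
    url = "https://example.com/recipe"
    first = build_prompt("<p>Lamb stew</p>", cache.get(url))
    cache.put(url, "# Lamb stew\n\n- 2 lb lamb")
    second = build_prompt("<p>Lamb stew</p>", cache.get(url))
    print(second)
```

The same cache could also serve the stored markdown directly when the page has not changed, which is closer to what the parent comment asks for.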

simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages simedw.com/2025/06/23/int... · Posted by u/simedw

nextaccountic · 2 months ago

In your cleanup step, after cleaning obvious junk, I think you should do whatever Firefox's reader mode does to further clean up, and if that fails bail out to the current output. That should reduce the number of tokens you send to the LLM even more.

You should also have some way for the LLM to indicate there is no useful output because perhaps the page is supposed to be a SPA. This would force you to execute JavaScript to render that particular page though.

simedw · 2 months ago

Just had a look and there is quite a lot going into Firefox's reader mode.

https://github.com/mozilla/readability
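The "try reader mode, fall back to the current output" suggestion can be sketched as a small wrapper. The two extractors are passed in as plain callables because the real ones (the existing cleanup step and a Readability port) are outside this sketch; `extract_main_content`, its parameters, and the 200-character threshold are all assumptions for illustration.

```python
def extract_main_content(html, basic_clean, reader_extract, min_chars=200):
    """Run the basic cleanup, then try a reader-mode style extractor.

    Falls back to the basic cleanup output if the extractor raises,
    returns nothing, or returns suspiciously little text.
    """
    cleaned = basic_clean(html)
    try:
        article = reader_extract(cleaned)
    except Exception:
        article = None
    if article and len(article.strip()) >= min_chars:
        return article
    return cleaned


if __name__ == "__main__":
    # Stand-in extractors for demonstration only.
    def basic_clean(html):
        return html.replace("<script>junk</script>", "")

    def reader_extract(html):
        return None  # pretend reader mode found no article

    print(extract_main_content("<p>hi</p><script>junk</script>", basic_clean, reader_extract))
```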

simedw commented on Show HN: Spegel, a Terminal Browser That Uses LLMs to Rewrite Webpages simedw.com/2025/06/23/int... · Posted by u/simedw

stared · 2 months ago

Any chance it would work for pages like Facebook or LinkedIn? I would love to have a distraction-free way of searching information there.

Obviously, against wishes of these social networks, which want us to be addicted... I mean, engaged.

simedw · 2 months ago

We’ll probably have to add some custom code to log in, get an auth token, and then browse with it. Not sure if LinkedIn would like that, but I certainly would.

u/simedw

Karma: 269 · Cake day: June 20, 2012

About

CTO/co-founder of V7, before that Aipoly.

blog: simedw.com X: https://x.com/simedw
