The fact that the author is using the specific `box_2d` format suggests he is taking advantage of this feature, so I wanted to highlight it. My intuition is that a base multimodal LLM without this type of post-training would have much worse performance.
It has been tuned so heavily on this specific format that even a tiny change, like switching the order in the `box_2d` format from `(ymin, xmin, ymax, xmax)` to `(xmin, ymin, xmax, ymax)`, causes performance to tank.
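For reference, here's a minimal sketch of how that output is typically consumed, assuming the documented convention that `box_2d` values are `[ymin, xmin, ymax, xmax]` normalized to a 0–1000 grid (which is exactly why silently reordering the coordinates is so damaging):

```python
import json

def box_2d_to_pixels(box_2d, img_w, img_h):
    """Convert a Gemini-style box_2d ([ymin, xmin, ymax, xmax],
    normalized to a 0-1000 grid) into pixel coordinates."""
    ymin, xmin, ymax, xmax = box_2d  # note: y comes first
    return (
        int(xmin / 1000 * img_w),
        int(ymin / 1000 * img_h),
        int(xmax / 1000 * img_w),
        int(ymax / 1000 * img_h),
    )

# Example response fragment from the model in JSON mode:
response_text = '[{"label": "dog", "box_2d": [120, 340, 560, 720]}]'
for det in json.loads(response_text):
    print(det["label"], box_2d_to_pixels(det["box_2d"], img_w=1280, img_h=960))
```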
Gemini is really impressive at these kinds of object detection tasks.
https://www.sergey.fyi/articles/using-gemini-for-precise-cit...
Are you using that approach in production for grounding when PDFs don't include embedded text, like in the case of scanned documents? I did some experiments for that use case, and it didn't really reach the bar I was hoping for.
Has anyone here found good ways to handle bounding box quality in noisy datasets? Do you rely more on human annotation or clever augmentation?
In some cases, running a model like SAM 2 on a loose bounding box can help refine the results. I usually add about 10% padding in each direction to the bounding box, just in case the original was too tight. Then, if you don't actually need the mask, you just convert it back to a bounding box.
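Roughly, the loop looks like the sketch below. It assumes the `SAM2ImagePredictor` interface from Meta's `sam2` package; treat the exact model name and call signature as assumptions rather than gospel:

```python
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor  # assumes Meta's sam2 package

def pad_box(box, img_w, img_h, pad=0.10):
    """Expand an (x0, y0, x1, y1) box by ~10% per side, clamped to the image."""
    x0, y0, x1, y1 = box
    dx, dy = (x1 - x0) * pad, (y1 - y0) * pad
    return (max(0, x0 - dx), max(0, y0 - dy),
            min(img_w, x1 + dx), min(img_h, y1 + dy))

def mask_to_box(mask):
    """Collapse a binary mask back into a tight (x0, y0, x1, y1) box."""
    ys, xs = np.where(mask)
    return (xs.min(), ys.min(), xs.max(), ys.max())

image = np.array(Image.open("page.png").convert("RGB"))
h, w = image.shape[:2]

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
predictor.set_image(image)

loose_box = (100, 200, 400, 500)  # e.g. what the LLM gave you
masks, scores, _ = predictor.predict(
    box=np.array(pad_box(loose_box, w, h)),
    multimask_output=False,
)
refined_box = mask_to_box(masks[0] > 0.5)
```

The padding matters because a box prompt that clips the object tends to produce a clipped mask, while SAM has no trouble shrinking a slightly oversized box down to the object.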
I should have caught that, and there are probably other bugs waiting to be found, too. That said, it's still a great recipe.
Pounds of lamb become kilograms (more than doubling the quantity of meat); a medium onion turns large; one celery stalk becomes two; six cloves of garlic turn into four; tomato paste vanishes; we lose nearly half a cup of wine; beef stock gets an extra ¾ cup; and rosemary is replaced with oregano.
The recipe site was so long that it got truncated before being sent to the LLM. Then, based on the first 8000 characters, Gemini hallucinated the rest of the recipe; it was definitely in its training set.
I have fixed it and pushed a new version of the project. Thanks again; it really highlights how we can never fully trust models.
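For anyone hitting the same failure mode: the dangerous pattern is a silent `text[:8000]`. A sketch of a guard, where the limit and function name are hypothetical rather than the project's actual code:

```python
MAX_CHARS = 8000  # hypothetical limit; tune to your model's context budget

def prepare_page_text(text: str) -> str:
    """Refuse to silently truncate: either the page fits, or we fail loudly
    so the caller can chunk it instead of letting the model invent the rest."""
    if len(text) > MAX_CHARS:
        raise ValueError(
            f"Page is {len(text)} chars, over the {MAX_CHARS} limit; "
            "split it into chunks rather than truncating."
        )
    return text
```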
You should also give the LLM some way to indicate that there is no useful output, e.g. because the page is an SPA. That would force you to execute JavaScript to render that particular page, though.
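One way to do that with structured output is an explicit escape hatch in the schema. This sketch uses pydantic; the field names are illustrative, and the google-genai SDK usage in the comments should be treated as an assumption:

```python
from typing import Optional
from pydantic import BaseModel

class Extraction(BaseModel):
    # Escape hatch: lets the model say "nothing extractable here"
    # (e.g. a JS-rendered SPA shell) instead of inventing content.
    no_useful_content: bool
    reason: Optional[str] = None
    recipe_text: Optional[str] = None

# With the google-genai SDK (treat the exact call as an assumption),
# the schema would be passed via the generation config:
#
#   from google import genai
#   client = genai.Client()
#   resp = client.models.generate_content(
#       model="gemini-2.0-flash",
#       contents=page_text,
#       config={"response_mime_type": "application/json",
#               "response_schema": Extraction},
#   )
#   result = Extraction.model_validate_json(resp.text)
#
# If result.no_useful_content is True, fall back to rendering the page
# with a headless browser and re-extracting.
```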