Readit News
danielcampos93 commented on Gemini 3 Pro Model Card [pdf]   storage.googleapis.com/de... · Posted by u/virgildotcodes
danielcampos93 · a month ago
Mum's the word on Flash?
danielcampos93 commented on Gemini 3 Pro Model Card [pdf]   storage.googleapis.com/de... · Posted by u/virgildotcodes
scrlk · a month ago
Benchmarks from page 4 of the model card:

    | Benchmark             | 3 Pro     | 2.5 Pro | Sonnet 4.5 | GPT-5.1   |
    |-----------------------|-----------|---------|------------|-----------|
    | Humanity's Last Exam  | 37.5%     | 21.6%   | 13.7%      | 26.5%     |
    | ARC-AGI-2             | 31.1%     | 4.9%    | 13.6%      | 17.6%     |
    | GPQA Diamond          | 91.9%     | 86.4%   | 83.4%      | 88.1%     |
    | AIME 2025             |           |         |            |           |
    |   (no tools)          | 95.0%     | 88.0%   | 87.0%      | 94.0%     |
    |   (code execution)    | 100%      | -       | 100%       | -         |
    | MathArena Apex        | 23.4%     | 0.5%    | 1.6%       | 1.0%      |
    | MMMU-Pro              | 81.0%     | 68.0%   | 68.0%      | 80.8%     |
    | ScreenSpot-Pro        | 72.7%     | 11.4%   | 36.2%      | 3.5%      |
    | CharXiv Reasoning     | 81.4%     | 69.6%   | 68.5%      | 69.5%     |
    | OmniDocBench 1.5      | 0.115     | 0.145   | 0.145      | 0.147     |
    | Video-MMMU            | 87.6%     | 83.6%   | 77.8%      | 80.4%     |
    | LiveCodeBench Pro     | 2,439     | 1,775   | 1,418      | 2,243     |
    | Terminal-Bench 2.0    | 54.2%     | 32.6%   | 42.8%      | 47.6%     |
    | SWE-Bench Verified    | 76.2%     | 59.6%   | 77.2%      | 76.3%     |
    | τ2-bench              | 85.4%     | 54.9%   | 84.7%      | 80.2%     |
    | Vending-Bench 2       | $5,478.16 | $573.64 | $3,838.74  | $1,473.43 |
    | FACTS Benchmark Suite | 70.5%     | 63.4%   | 50.4%      | 50.8%     |
    | SimpleQA Verified     | 72.1%     | 54.5%   | 29.3%      | 34.9%     |
    | MMLU                  | 91.8%     | 89.5%   | 89.1%      | 91.0%     |
    | Global PIQA           | 93.4%     | 91.5%   | 90.1%      | 90.9%     |
    | MRCR v2 (8-needle)    |           |         |            |           |
    |   (128k avg)          | 77.0%     | 58.0%   | 47.1%      | 61.6%     |
    |   (1M pointwise)      | 26.3%     | 16.4%   | n/s        | n/s       |
n/s = not supported

EDIT: formatting, hopefully a bit more mobile friendly

danielcampos93 · a month ago
I would love to know how much token usage has increased across these models on the benchmarks. The models keep getting better, but their token usage keeps growing too. In other words, is the model actually doing better, or just reasoning for longer?
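One rough way to frame that question is to compare the relative score gain with the relative growth in tokens per problem. The sketch below is only an illustration: the scores are the AIME 2025 (no tools) figures from the table above, but the token counts are made-up placeholders, since the model card excerpt does not report them. The `Run` class and both helper functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    score: float       # benchmark accuracy, in percent
    avg_tokens: float  # mean generated tokens per problem (HYPOTHETICAL values below)


def points_per_1k_tokens(run: Run) -> float:
    """Accuracy points earned per 1,000 generated tokens."""
    return run.score / (run.avg_tokens / 1000)


def compare_runs(old: Run, new: Run) -> dict:
    """Compare relative score gain with relative token growth."""
    return {
        "score_ratio": round(new.score / old.score, 3),
        "token_ratio": round(new.avg_tokens / old.avg_tokens, 3),
        "more_efficient": points_per_1k_tokens(new) > points_per_1k_tokens(old),
    }


# Scores come from the table; the token counts are invented for illustration.
old = Run("2.5 Pro", 88.0, 8000)
new = Run("3 Pro", 95.0, 12000)
result = compare_runs(old, new)
print(result)
```

With these placeholder token counts the newer model scores higher but earns fewer points per token, which is the "reasoning for longer" case. Real per-benchmark token counts would be needed to say which case actually holds.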
danielcampos93 commented on Ollama Web Search   ollama.com/blog/web-searc... · Posted by u/jmorgan
simonw · 3 months ago
The reason I care about this is that different providers have different rules about how I can use the results.

Brave: https://api-dashboard.search.brave.com/terms-of-service "Licensee shall not at any time, and shall not permit others to: store the results of the API or any derivative works from the results of the API"

Exa: https://exa.ai/assets/Exa_Labs_Terms_of_Service.pdf "You may not [...] download, modify, copy, distribute, transmit, display, perform, reproduce, duplicate, publish, license, create derivative works from, or offer for sale any information contained on, or obtained from or through, the Services, except for temporary files that are automatically cached by your web browser for display purposes"

Many of the things I want to do with a search API are blocked by these rules! So I need to know which rules I am subject to.

danielcampos93 · 3 months ago
It's pretty wild that Brave's terms of service say that, considering their search API is entirely derived from storing the results of other search systems: https://support.brave.app/hc/en-us/articles/4409406835469-Wh... In other words, Brave is blocking exactly what it does to Bing and Google.
danielcampos93 commented on Will Amazon S3 Vectors kill vector databases or save them?   zilliz.com/blog/will-amaz... · Posted by u/Fendy
storus · 3 months ago
Does this support hybrid search (dense + sparse embeddings)? Pure dense embeddings aren't that great for specific search; they only hit meaning reliably. Amazon's own embeddings also aren't SOTA.
danielcampos93 · 3 months ago
I think you would be very surprised by the number of customers who don't care if the embeddings are SOTA. For every Joe who wants to talk GraphRAG + MTEB + CMTEB and adaptive RAG, there are 50 who just want whatever IT/prodsec has approved.
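For readers unfamiliar with the hybrid search (dense + sparse) raised in this thread: one common way to combine the two result lists is reciprocal rank fusion (RRF). The sketch below is a minimal illustration of that general technique, not a description of how S3 Vectors or any vector database implements it. The function name, document IDs, and the `k=60` default are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Ties are broken by document ID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense (embedding) and a sparse (keyword) retriever.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that both retrievers rank highly rise to the top, and a document found by only one retriever still appears. RRF uses only ranks, so the dense and sparse scores never need to be put on a common scale.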
danielcampos93 commented on Generative AI hype peaking?   bjornwestergard.com/gener... · Posted by u/bwestergard
YetAnotherNick · 9 months ago
Nvidia is up 24% over the last year, compared to <10% for the Nasdaq or S&P. Cherry-picking the point to compare against is bad.
danielcampos93 · 9 months ago
They also 10x'd their revenue. 24% seems low for a company that pulled that off.
danielcampos93 commented on Show HN: Benchmarking VLMs vs. Traditional OCR   getomni.ai/ocr-benchmark... · Posted by u/themanmaran
danielcampos93 · 10 months ago
Using GPT-4o as a judge to evaluate the quality of something GPT-4o itself is not inherently that good at is a red flag.

u/danielcampos93

Karma: 449 · Cake day: June 6, 2013
About
Perfection is the enemy of good https://spacemanidol.github.io/