[0] https://nelly.is
15 blew my mind -- it's too easy to overfit that dataset
Just for context, we’re building Skyvern, an open source AI Agent that can control and interact with browsers using prompts, similar to OpenAI’s Operator.
The MCP Server can:
- Let Claude navigate to docs websites / Stack Overflow and look up information, such as the top posts on Hacker News - https://github.com/Skyvern-AI/skyvern/tree/main/integrations...
- Let Cursor apply for jobs / fill out contact forms / log in + download files / etc - https://github.com/Skyvern-AI/skyvern/tree/main/integrations...
- Connect Windsurf to take over your Chrome while running Skyvern in “local” mode - https://github.com/Skyvern-AI/skyvern/tree/main/integrations...
We built this mostly for fun, but we can see it being integrated into AI agents to give them custom access to browsers and let them execute complex tasks like booking appointments, downloading your electricity statements, looking up freight shipment information, etc.
We at Skyvern are still doing patch versions only
While the extraction/2FA flows aren't super relevant to us, this saves us the time of building our own set of benchmarks. Really appreciate it, and we hope we can contribute to making this a really large set.