I activated the demo and Gemini decided to start by navigating to www.google.com in order to search for “hacker news”. But Google served a CAPTCHA challenge, presumably because of a large volume of suspicious traffic from the Browserbase IP range.
It went through a few rounds of this, solved all of them and continued on to Google Search, where it ran the search for “hacker news”, navigated to the site and then did an admittedly unimpressive job of solving the original prompt. It looked at just one thread and reported back on what it found there. I was hoping it would consider more than one option to discover the “most controversial post from today”.
The Gemini 2.5 Computer Use Model card (PDF) talks about training the model to “recognize when it is tasked with a high-stakes action” and request user confirmation before proceeding, but doesn’t have anything to say about not solving CAPTCHAs. So I guess this behaviour is the model working as intended!
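To make the confirmation behaviour concrete, here’s a minimal sketch of how a client-side agent loop could surface such a request to the user. This is not the actual Gemini API; every name below is hypothetical, and the only grounded detail is the model card’s claim that the model requests confirmation before high-stakes actions.

```python
from dataclasses import dataclass


@dataclass
class ProposedAction:
    description: str
    requires_confirmation: bool  # set when the model flags a high-stakes step


def confirm_if_high_stakes(action: ProposedAction) -> bool:
    """Return True if the harness should go ahead and execute the action."""
    if not action.requires_confirmation:
        return True
    answer = input(f"Model wants to: {action.description!r}. Proceed? [y/N] ")
    return answer.strip().lower() == "y"


# Hypothetical example: the model proposes submitting an order form.
action = ProposedAction("click the 'Place order' button", requires_confirmation=True)
if confirm_if_high_stakes(action):
    print("executing:", action.description)
else:
    print("skipped; the harness would report the refusal back to the model")
```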
Something that did impress me—aside from the unprompted CAPTCHA solve against Google’s very own system—was the quality of the mouse usage. I’ve written about Computer Use models before from both Anthropic and OpenAI (they called their version “Operator”) and by far the biggest challenge for them is accurately clicking the right targets with the mouse.
It would take a formal eval to determine whether Gemini really is best at this, but given the Gemini models’ previous demonstrations of both bounding boxes and image segmentation masks it doesn’t surprise me that a Gemini model can do a great job of clicking on the right elements in a screenshot of an operating system or browser.
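For what it’s worth, Gemini’s documented bounding-box convention returns `[ymin, xmin, ymax, xmax]` coordinates normalized to a 0–1000 grid, so a harness has to scale those back to real screenshot pixels before it can click. Here’s a minimal sketch of that conversion; the helper function is my own, not part of any SDK:

```python
def box_to_click_point(box, screenshot_width, screenshot_height):
    """Convert a normalized [ymin, xmin, ymax, xmax] box (0-1000 grid)
    into the pixel coordinates of the box centre, ready for a click."""
    ymin, xmin, ymax, xmax = box
    x = (xmin + xmax) / 2 / 1000 * screenshot_width
    y = (ymin + ymax) / 2 / 1000 * screenshot_height
    return round(x), round(y)


# Example: a box around a button in a 1280x800 screenshot.
print(box_to_click_point([150, 400, 200, 600], 1280, 800))  # -> (640, 140)
```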