Highlights

  • Amid today’s AI boom, it’s disconcerting that we still don’t know how to measure how smart, creative, or empathetic these systems are. Our tests for these traits, never great in the first place, were made for humans, not AI. Plus, our recent paper testing prompting techniques finds that AI test scores can change dramatically based simply on how questions are phrased. Even famous challenges like the Turing Test, where humans try to differentiate between an AI and another person in a text conversation, were designed as thought experiments at a time when such tasks seemed impossible. But now that a new paper shows that AI passes the Turing Test, we need to admit that we really don’t know what that actually means.
  • Given all this, it was interesting to see this post by influential economist and close AI observer Tyler Cowen declaring that o3 is AGI. Why might he think that?
  • First, a little context. Over the past couple of weeks, two new AI models, Gemini 2.5 Pro from Google and o3 from OpenAI, were released. These models, along with a set of slightly less capable but faster and cheaper models (Gemini 2.5 Flash, o4-mini, and Grok-3-mini), represent a pretty large leap in benchmarks. But benchmarks aren’t everything, as Tyler pointed out. For a real-world example of how much better these models have gotten, we can turn to my book. To illustrate a chapter on how AIs can generate ideas, a little over a year ago I asked ChatGPT-4 to come up with marketing slogans for a new cheese shop:
  • Today I gave the latest successor to GPT-4, o3, an ever so slightly more involved version of the same prompt: “Come up with 20 clever ideas for marketing slogans for a new mail-order cheese shop. Develop criteria and select the best one. Then build a financial and marketing plan for the shop, revising as needed and analyzing competition. Then generate an appropriate logo using image generator and build a website for the shop as a mockup, making sure to carry 5-10 cheeses that fit the marketing plan.” With that single prompt, in less than two minutes, the AI not only provided a list of slogans, but ranked and selected an option, did web research, developed a logo, built marketing and financial plans, and launched a demo website for me to react to. The fact that my instructions were vague, and that common sense was required to make decisions about how to address them, was not a barrier.
  • In addition to being, presumably, a larger model than GPT-4, o3 also works as a Reasoner - you can see its “thinking” in the initial response. It is also an agentic model, one that can use tools and decide how to accomplish complex goals. You can see how it took multiple actions with multiple tools, including web searches and coding, to come up with the extensive results that it did.
  • And this isn’t the only extraordinary example: o3 can also do an impressive job guessing locations from photos if you just give it an image and prompt “be a geo-guesser” (with some quite profound privacy implications). Again, you can see the agentic nature of this model at work, as it zooms into parts of the picture, adds web searches, and does multi-step processes to get the right answer.
  • I also gave o3 a large dataset of historical machine learning systems as a spreadsheet and asked “figure out what this is and generate a report examining the implications statistically and give me a well-formatted PDF with graphs and details” and got a full analysis with a single prompt. (I did give it some feedback to make the PDF better, though, as you can see.)
  • This is all pretty impressive stuff and you should experiment with these models on your own. Gemini 2.5 Pro is free to use and as “smart” as o3, though it lacks the same full agentic ability. If you haven’t tried it or o3, take a few minutes to do it now. Try giving Gemini an academic paper and asking it to turn the paper into a game or have it brainstorm with you for startup ideas, or just ask the AI to impress you (and then keep saying “more impressive”). Ask the Deep Research option to do a research report on your industry, or to research a purchase you are considering, or to develop a marketing plan for a new product.
  • You might find yourself “feeling the AGI” as well. Or maybe not. Maybe the AI failed you, even when you gave it the exact same prompt I used. If so, you just encountered the jagged frontier.
  • My co-authors and I coined the term “Jagged Frontier” to describe the fact that AI has surprisingly uneven abilities. An AI may succeed at a task that would challenge a human expert but fail at something incredibly mundane. For example, consider this puzzle, a variation on a classic old brainteaser (a concept first explored by Colin Fraser and expanded by Riley Goodside): “A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the surgeon says, “I can operate on this boy!” How is this possible?”
  • o3 insists the answer is “the surgeon is the boy’s mother,” which is wrong, as a careful reading of the brainteaser will show. Why does the AI come up with this incorrect answer? Because that is the answer to the classic version of the riddle, meant to expose unconscious bias: “A father and son are in a car crash, the father dies, and the son is rushed to the hospital. The surgeon says, ‘I can’t operate, that boy is my son,’ who is the surgeon?” The AI has “seen” this riddle in its training data so much that even the smart o3 model fails to generalize to the new problem, at least initially. And this is just one example of the kinds of issues and hallucinations that even advanced AIs can fall prey to, showing how jagged the frontier can be.
  • But the fact that the AI often messes up on this particular brainteaser does not take away from the fact that it can solve much harder brainteasers, or that it can do the other impressive feats I have demonstrated above. That is the nature of the Jagged Frontier. In some tasks, AI is unreliable. In others, it is superhuman. You could, of course, say the same thing about calculators, but it is also clear that AI is different. It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given the definitional problems, I really don’t know, but I do think they can be credibly seen as a form of “Jagged AGI” - superhuman in enough areas to result in real changes to how we work and live, but also unreliable enough that human expertise is often needed to figure out where AI works and where it doesn’t. Of course, models are likely to become smarter, and a good enough Jagged AGI may still beat humans at every task, including those the AI is weak in.
  • Returning to Tyler’s post, you will notice that, despite thinking we have achieved AGI, he doesn’t think that threshold matters much to our lives in the near term. That is because, as many people have pointed out, technologies do not instantly change the world, no matter how compelling or powerful they are. Social and organizational structures change much more slowly than technology, and technology itself takes time to diffuse. Even if we have AGI today, we have years of trying to figure out how to integrate it into our existing human world.
  • Of course, that assumes AI acts like a normal technology, and that its jaggedness will never be completely solved. There is the possibility that this may not be true. The agentic capabilities we’re seeing in models like o3, like the ability to decompose complex goals, use tools, and execute multi-step plans independently, might actually accelerate diffusion dramatically compared to previous technologies. If and when AI can effectively navigate human systems on its own, rather than requiring integration, we might hit adoption thresholds much faster than historical precedent would suggest.