
Metadata

  • Author: The Batch @ DeepLearning.AI
  • Full Title: Ling-1t Leads Non-Reasoning Performance, MCP Poses Security Risks, California Regulates AI, Auto-Tune for Agentic Prompts

Highlights

  • I explained how effective agentic AI development needs a disciplined evals and error analysis process, and described an approach to performing evals. This week, I’d like to summarize the core ideas behind error analysis and describe some best practices. Given the rapid pace of improvement in LLMs, when error analysis points to a problem, your options for how to address it are greater than before. (View Highlight)
  • Take the problem of building a basic Deep Research agent that searches the web to write a detailed report on a topic like “recent developments in black-hole science.” An agent might take a sequence of steps to generate the final report, such as (i) use an LLM to generate a handful of web search queries related to the topic, (ii) call a web-search API to get lists of results, (iii) use an LLM to identify the most promising sources to fetch, and (iv) ask the LLM to use these sources to write the report.  If the final report is subpar compared to the work of a human researcher following the same steps, the gap in performance could be from any of the steps. A basic error analysis procedure might involve gathering a sample set of topics where the output is subpar, and reading the results of every step of the workflow — called the traces — to see which step most frequently generated results materially worse than a human would have. This is very valuable for deciding what step to focus on improving. (View Highlight)
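A minimal sketch of this kind of per-step error analysis, assuming you have already labeled each step of a sampled trace against human-level performance (the trace structure, step names, and labels here are hypothetical, not from the article):

```python
from collections import Counter

# Hypothetical structure: each trace records, for every step of the Deep Research
# workflow, whether a reviewer judged that step's output materially worse than what
# a competent human researcher would have produced at the same step.
traces = [
    {"topic": "recent developments in black-hole science",
     "step_judgments": {"generate_queries": "ok", "web_search": "ok",
                        "select_sources": "below_human", "write_report": "ok"}},
    # ... more annotated traces for topics where the final report was subpar
]

def worst_steps(traces):
    """Count how often each step fell below human-level performance."""
    counts = Counter()
    for trace in traces:
        for step, judgment in trace["step_judgments"].items():
            if judgment == "below_human":
                counts[step] += 1
    return counts.most_common()

for step, n in worst_steps(traces):
    print(f"{step}: below human level in {n} of {len(traces)} sampled traces")
```

The output of a tally like this is what points you to the step most worth improving first; as the system matures, the same counts over a larger, regularly refreshed sample become the per-step percentages described below.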
  • A common misconception of error analysis is that it takes a lot of work to get started. The key principle is to look at the steps of the workflow and see which steps did a bad job on a given input, often by benchmarking to human level performance (HLP). Assuming we are automating a task where HLP is desirable, then the most important thing is to systematically examine traces to understand when the agent is falling short of HLP. And just as we can get started with evals using a quick-and-dirty initial cut at it (maybe using just a handful of examples) followed by iterating to improve, so too with error analysis. (View Highlight)
  • Specifically, it is fine to start by reading one or just a handful of traces informally to get a sense of what might be going wrong. For example, if you see that the web search query terms in your Deep Researcher — step (i) above — frequently make no sense, that points you to an initial area to focus your efforts on improving. As the system matures, you can move incrementally toward more rigorous error analysis. For example, you might eventually end up with a regularly refreshed dataset of thousands of examples where the performance is poor, and carry out rigorous evaluations that show exactly what percentage of the time each of the steps (i) - (iv) contributed to problems with the final output, and also in what specific ways those steps fell short. (View Highlight)
  • In addition to improving the execution of individual steps, we can change how we decompose a complex task into steps. When it came to pipelines built using machine learning or deep learning rather than LLMs, I found that the structure of the workflow — that is, how you decompose an overall task into a sequence of steps to be carried out — changed rarely. It was a big deal to rearchitect this! But in the past couple of years, because LLMs are improving so rapidly, I see much more rapid iteration on the design of workflows. (View Highlight)
  • For example, one very common pattern is ripping out scaffolding and letting the LLM do more. This is often a good move when you now have access to a smarter LLM than you did when you first built a workflow. For example, perhaps you once used an LLM to clean up downloaded web pages by removing navigational links, ads, extraneous HTML, and the like, before a separate LLM used the cleaned-up pages to write a report. Since LLMs have become smarter, you might decide to skip the clean-up step (which can introduce errors of its own) and feed the messier HTML directly to the final LLM. (View Highlight)
  • Another example: Perhaps a year ago, we used hard-coded rules to decide what web pages to fetch and when to fetch more, but today we might let an LLM-based agent make this decision more autonomously. As LLMs get smarter, I see many teams rearchitecting workflows to remove hard-coded steps or constraints that were previously needed to keep the system from going off the rails. One way to spot opportunities for doing this is if error analysis shows that a sequence of steps collectively underperforms compared to what a human might do, even though the performance of each individual step is good. This might indicate that the way those steps are carried out is too rigid. (View Highlight)
  • Reasoning models typically learn to undertake a separate process of “thinking” through their output before they produce a final response. Ant Group built a top non-reasoning model that can take similar steps as part of its immediate response. What’s new: Ant Group, an affiliate of Alibaba and owner of the online payments provider Alipay, released Ling-1T, a huge, open, non-reasoning model that outperforms both open and closed counterparts. • Input/output: Text in (up to 128,000 tokens), text out (up to 32,000 tokens) • Architecture: Mixture-of-Experts (MoE) transformer, 1 trillion parameters, 50 billion parameters active per token • Performance: Outperformed leading non-reasoning models in 22 of 31 benchmark tests of reasoning, math, coding, general knowledge, and writing • Availability: Weights free to download from Hugging Face and ModelScope for commercial and noncommercial uses under the MIT license; API $0.112/$2.24 per million input/cached/output tokens via zenmux.ai • Undisclosed: Training data, specific training methods. How it works: The team emphasized chain-of-thought reasoning in both the pretraining and fine-tuning phases of development, but it didn’t train the model to undertake a separate reasoning, or thinking, process before producing its final output. This means the model can reason selectively depending on the input. (View Highlight)
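For readers who want to try the open weights, here is a minimal sketch using Hugging Face transformers. The repository id is an assumption (check the model card), and a 1-trillion-parameter MoE model needs a large multi-GPU node even with only 50 billion parameters active per token, so treat this as illustrative rather than a recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "inclusionAI/Ling-1T"  # assumed repo id; verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # load in the checkpoint's native precision
    device_map="auto",       # shard across available GPUs
    trust_remote_code=True,  # MoE checkpoints often ship custom modeling code
)

messages = [{"role": "user",
             "content": "Summarize recent developments in black-hole science."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```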
  • • The team pretrained Ling-1T on 20 trillion tokens. In the last part of pretraining, they used a curated subset in which over 40 percent consisted of chain-of-thought data.  • They fine-tuned the model via supervised fine-tuning on examples that were augmented with chains of thought via CoT-Evo. (View Highlight)
  • CoT-Evo takes a training dataset and generates and evolves chains of thought (CoTs) for each example in the dataset. It evolves CoTs by repeatedly scoring them, selecting them (based on score, difference from other CoTs, and random chance), and modifying them via an LLM. The team fine-tuned Ling-1T on the examples with the highest-scoring CoTs. (View Highlight)
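The highlight doesn’t spell out CoT-Evo’s scoring or mutation operators, but the evolutionary loop it describes might look roughly like the sketch below. `generate_cot`, `score_cot`, and `mutate_cot` are placeholders for LLM calls, and the selection heuristics are assumptions:

```python
import random

def evolve_cots(example, generate_cot, score_cot, mutate_cot,
                pool_size=8, generations=5):
    """Loose sketch of a CoT-Evo-style loop: generate chains of thought (CoTs) for
    one training example, then repeatedly score, select, and mutate them.
    generate_cot/score_cot/mutate_cot stand in for LLM calls."""
    pool = [generate_cot(example) for _ in range(pool_size)]
    for _ in range(generations):
        scored = sorted(((cot, score_cot(example, cot)) for cot in pool),
                        key=lambda pair: pair[1], reverse=True)
        # Select parents mostly by score, with some randomness to keep diversity
        # (the article also mentions selecting for difference from other CoTs).
        parents = [cot for cot, _ in scored[: pool_size // 2]]
        if random.random() < 0.25 and len(scored) > pool_size // 2:
            parents[-1] = random.choice(scored[pool_size // 2:])[0]
        # Mutate parents with an LLM to produce the next generation.
        pool = parents + [mutate_cot(example, cot) for cot in parents]
    best = max(pool, key=lambda cot: score_cot(example, cot))
    return best  # fine-tune on the example paired with its highest-scoring CoT
```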
  • In addition, they fine-tuned the model using a reinforcement learning algorithm developed internally called Linguistic-Unit Policy Optimization (LPO). Unlike GRPO and GSPO, LPO “treats sentences as the natural semantic action units, enabling precise alignment between rewards and reasoning behavior,” the company said. (View Highlight)
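Ant Group hasn’t published LPO’s details, but the quoted idea of sentence-level action units can be sketched very loosely as follows: split each sampled response into sentences and assign a group-normalized advantage to sentences rather than to individual tokens (as in GRPO) or whole sequences (as in GSPO). Everything here is an assumption about how such credit assignment might look, not the actual algorithm:

```python
import re

def sentence_units(response: str):
    """Split a model response into sentence-level 'action units'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]

def sentence_level_advantages(responses, rewards):
    """Assign a group-normalized advantage to every sentence of each sampled
    response. Placeholder for however LPO actually aligns rewards with sentences."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    out = []
    for response, reward in zip(responses, rewards):
        advantage = (reward - mean) / std
        out.append([(sentence, advantage) for sentence in sentence_units(response)])
    return out
```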
  • Results: In Ant Group’s tests, Ling-1T generally outperformed three top non-reasoning models: DeepSeek-V3.1-Terminus (thinking mode disabled), Moonshot Kimi-K2-Instruct, and OpenAI GPT-5 (thinking mode disabled), as well as Google Gemini 2.5 Pro set to minimum thinking (128 tokens). • Ling-1T achieved the highest performance on 22 of 31 benchmarks tested and best or second-best performance on 29 of 31 benchmarks that cover general knowledge, coding, math, reasoning, writing, and agentic tasks. • It performed best in the math and reasoning categories, achieving the top score on every benchmark tested in those categories. For instance, on math questions in AIME 2025, Ling-1T achieved 70.42 percent accuracy, whereas the second-best model, Gemini 2.5 Pro set to minimum thinking, achieved 70.10 percent accuracy. (View Highlight)
  • Behind the news: Concurrently with Ling-1T, Ant Group released a finished version of its 1 trillion-parameter reasoning model, Ring-1T, which was available previously as a preview. While Ling-1T’s performance exceeded that of top non-reasoning models, Ring-1T achieved second-place performance relative to reasoning models on almost every benchmark tested. (View Highlight)
  • Why it matters: Ling-1T generally outperforms the mighty Kimi K2 and closes the gap between open and closed non-reasoning models. A ginormous parameter count and pretraining on chains of thought appear to have been key factors in this accomplishment. Having been pretrained with an intense focus on chains of thought, Ling-1T is primed to generate a chain of thought before it concludes a response, although not in a separate reasoning stage. Such training blurs the line between reasoning and non-reasoning models. We’re thinking: Two years ago, weights for Ling-family models were closed, but in the past year Ant Group has released open weights for several. With consistent effort and investment, Ling has gone from a family that few had heard of to challenging the top dogs. (View Highlight)
  • The ability to easily connect large language models to tools and data sources has made Model Context Protocol popular among developers, but it also opens security holes, research shows. What’s new: Golan Yosef at Pynt, an API security firm, analyzed security risks of Model Context Protocol (MCP) servers. The work shows that when systems use multiple MCP servers, vulnerabilities rise rapidly. (View Highlight)
  • Behind the news: Anthropic launched MCP in November 2024, and OpenAI and Microsoft adopted it by spring 2025. Despite its lax security, the protocol now connects to over 6,000 servers. Authentication remained optional until March, when OAuth 2.1 authorization frameworks were added. The change prevents unauthorized access to MCP servers, but it doesn’t prevent malicious or malformed data from flowing between servers and triggering unintended actions. (View Highlight)
  • Why it matters: Securing individual MCP servers is important but not sufficient, because vulnerabilities can emerge from interactions among servers. Adding more servers can make a system more agentic, but it also compounds vulnerabilities. The study suggests that developers mitigate this “compositional risk” by using only the servers they need, constraining what each one is allowed to do, and testing transfers of data among them. We’re thinking: Securing individual components is a tough task in its own right, but systems of MCP components must be secured at the system level. (View Highlight)
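One way to act on the “test transfers of data among them” advice is a pairwise audit: feed one server’s outputs into the context of an agent connected to another server and check whether the agent is induced to take actions outside an allowlist. The sketch below is an assumption about how such a harness could be structured; `call_server` and `run_agent_with_context` are placeholders, not real MCP SDK calls, and the allowlist is hypothetical:

```python
from itertools import permutations

ALLOWED_ACTIONS = {"web.search", "docs.read"}  # hypothetical per-deployment allowlist

def audit_server_pairs(servers, probe_inputs, call_server, run_agent_with_context):
    """For every ordered pair of MCP servers, pass server A's output into the agent's
    context while it can use server B, and flag any disallowed actions it takes."""
    findings = []
    for src, dst in permutations(servers, 2):
        for probe in probe_inputs:
            payload = call_server(src, probe)               # data produced by server A
            actions = run_agent_with_context(dst, payload)  # actions taken via server B
            disallowed = [a for a in actions if a not in ALLOWED_ACTIONS]
            if disallowed:
                findings.append((src, dst, probe, disallowed))
    return findings
```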
  • In the absence of national laws that specifically regulate AI in the United States, California moved to regulate the technology within its own borders, passing four bills in less than a month. What’s new: Governor Gavin Newsom signed into law SB 53, which requires large AI developers to disclose their safety protocols. In addition, SB 243 regulates chatbots, AB 316 makes developers liable for the actions of autonomous systems they build, and AB 853 requires AI-generated media to be labeled clearly. (View Highlight)
  • How it works: Together, the bills don’t ban any particular applications outright or restrict AI development, but they require extensive disclosures, either to the state or directly to users. Some took effect immediately while others, such as SB 243, will phase in by January 2027. • SB 53 requires that developers of frontier models, defined as those whose training requires processing greater than 10^26 integer or floating-point operations — a level currently associated with very large and powerful models — provide more transparency about their models’ capabilities and potential risks. It also imposes additional requirements on developers whose annual revenue exceeds a threshold set by the bill. The law protects whistleblowers within AI companies against retaliation and provides anonymous channels to report illegal or unsafe behavior. The bill takes effect in June 2026. (View Highlight)
  • SB 243 aims to prevent chatbots from harming minors and other vulnerable users. It bars exposing minors to sexual content and requires developers to disclose that chatbots are AI-generated and provide a general warning that chatbots may not be suited for minors. The bill also requires developers to provide specific support to users who discuss suicide or self-harm and to issue an annual report on mental health issues related to using their chatbots. (View Highlight)
  • AB 316 prohibits defendants in lawsuits from shifting responsibility onto AI systems by claiming that they harmed plaintiffs autonomously. It applies to anyone who develops, modifies, or uses an AI system. (View Highlight)
  • • AB 853 requires that AI-generated media be labeled clearly as such. Furthermore, it requires that all media (AI-generated or not) include information about who made it and how. The bill requires that cameras, audio recorders, computers, and other media-capture devices record such provenance data, and that large-scale media distributors (2,000,000 monthly active users or more) disclose it. (View Highlight)
  • Behind the news: SB 53 modifies parts of SB 1047, which Governor Newsom vetoed in 2024 after opposition from the tech community. That law would have required third-party audits and made companies liable for the uses of their models. Recently, Newsom also vetoed SB 7, which would have required employers to notify employees and applicants if AI systems were used to make employment decisions like hiring and firing. (View Highlight)
  • Why it matters: California is the largest U.S. state by the sizes of its population and economy, as well as home to many of the world’s most prominent tech companies and startups, including Google, OpenAI, and Anthropic. These laws will affect users of California-based tech worldwide along with companies that do business in the state. We’re thinking: While these laws are better for users, innovators, and businesses than the vetoed SB 1047, some of them perpetuate a major mistake of that legislation by placing regulatory burdens on models rather than applications. A model’s potential applications are unknown until someone implements them, and it makes no sense to limit — or burden with disclosure requirements — the good it might do. Applications, on the other hand, bring verifiable benefits and harms, and society would do well to limit the harms. (View Highlight)
  • Honing an agent’s prompt can yield better results than fine-tuning the underlying large language model via reinforcement learning. What’s new: Lakshya A. Agrawal and colleagues at UC Berkeley, Stanford, BespokeLabs.ai, Notre Dame, Databricks, and MIT developed GEPA, an algorithm that improves the performance of agentic systems by improving their prompts. The authors position it as an efficient alternative to fine-tuning an agent’s large language model via reinforcement learning. (View Highlight)
  • Key insight: Agentic models trained via reinforcement learning typically must take a complicated series of actions to earn a simple reward, including calling a large language model multiple times for the different purposes, or modules, of the workflow. But a well-designed prompt can take into account the various problems an agent may run into and thus guide the model more efficiently. The trick is to write prompts that anticipate such problems. To accomplish this, a large language model can analyze an agent’s behavior as it responds to a given prompt, identify associations between the prompt and outcome (for instance, a failed tool call), and compose a more effective prompt. (View Highlight)
  • How it works: Prompting agents based on Alibaba’s Qwen3-8B, the authors used GEPA to hone their performance on specific benchmarks. The method iteratively evolves a pool of candidate prompts, beginning with a simple prompt for each LLM call a module of the agent makes, such as “Respond to the query” or “Ensure the response is correct and adheres to the given constraints [specified in the benchmark inputs].” In each cycle, GEPA selects, modifies, and evaluates a prompt to generate a revised prompt that produces better results. (View Highlight)
  • • Given each prompt to be fed to the LLM (initially the default prompts, later revised prompts selected for their effectiveness), the agent responds to a random subset of examples from a benchmark’s training set. • GEPA selects which prompt to modify, alternating between the various modules. A separate Qwen3-8B instance examines the agent’s traces (generated text, tool calls, and results) and revises the prompt. • GEPA evaluates the revised prompt in a two-step process. First, it feeds the revised prompt to the agent along with the examples used previously and the prompts used by other modules. If the revised prompt improves the agent’s performance, GEPA adds it to a pool of candidate prompts and then scores its performance on each example in the benchmark’s validation set. • From the pool, GEPA identifies prompts that achieved the highest score on at least one example. It selects a set of prompts (one for each module) for the next round of revision, prioritizing prompts that excelled on multiple questions. • GEPA repeats the previous steps until it has exhausted a predefined processing budget. It chooses the set of prompts that achieved the highest average score across all examples in the validation set. (View Highlight)
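A compressed sketch of the loop described above, under the assumption that `run_agent` executes the agent with one prompt per module, `score` grades one output, and `reflect_and_rewrite` stands in for the LLM call that reads traces and proposes a revised prompt; the selection and budget details are simplifications of the paper’s procedure:

```python
import random

def gepa_optimize(modules, train_set, val_set, run_agent, score,
                  reflect_and_rewrite, budget=100):
    """Compressed sketch of a GEPA-style loop. `modules` maps each module name to
    its default prompt; run_agent/score/reflect_and_rewrite are placeholders."""
    candidates = [dict(modules)]  # pool of candidate prompt sets, starting from the defaults
    val_scores = [[score(run_agent(candidates[0], ex)) for ex in val_set]]
    spent = len(val_set)  # rough count of agent executions used so far

    while spent < budget:
        # Prefer candidates that are best on at least one validation example.
        best_ids = {max(range(len(candidates)), key=lambda c: val_scores[c][i])
                    for i in range(len(val_set))}
        parent = dict(candidates[random.choice(list(best_ids))])
        module = random.choice(list(modules))  # alternate over the agent's modules

        # Run the agent on a small training minibatch and let an LLM revise one prompt.
        minibatch = random.sample(train_set, k=min(3, len(train_set)))
        traces = [run_agent(parent, ex) for ex in minibatch]
        old_total = sum(score(t) for t in traces)
        parent[module] = reflect_and_rewrite(module, parent[module], traces)
        new_total = sum(score(run_agent(parent, ex)) for ex in minibatch)
        spent += 2 * len(minibatch)

        if new_total > old_total:  # keep prompts that help, then score them on validation
            candidates.append(parent)
            val_scores.append([score(run_agent(parent, ex)) for ex in val_set])
            spent += len(val_set)

    # Return the candidate prompt set with the highest average validation score.
    return candidates[max(range(len(candidates)),
                          key=lambda c: sum(val_scores[c]) / len(val_set))]
```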
  • Results: The authors pitted custom and open-source agents that used GEPA against versions for which Qwen3-8B was fine-tuned on a given benchmark via Group Relative Policy Optimization (GRPO). They measured both the agents’ performance and the number of agent executions required.  • Across HotpotQA (questions that require reasoning over multiple paragraphs), IFBench (following instructions), HoVer (verifying facts), and PUPA (which gauges balance between helpfulness and unwanted sharing of personal information), agents that used GEPA consistently achieved better performance on all four.  • Moreover, they did this with far greater efficiency, requiring up to 35 times fewer agent executions.  Yes, but: The authors compared GEPA to fine-tuning via reinforcement learning using a single, relatively small model. Questions remain regarding how the results would scale to larger models or generalize to other models, and how GEPA would compare to supervised fine-tuning. (View Highlight)