Today, we are introducing Kimi K2 Thinking, our best open-source thinking model.
Built as a thinking agent, it reasons step by step while using tools, achieving state-of-the-art performance on Humanity’s Last Exam (HLE), BrowseComp, and other benchmarks, with major gains in reasoning, agentic search, coding, writing, and general capabilities.
Kimi K2 Thinking can execute up to 200–300 sequential tool calls without human intervention, reasoning coherently across hundreds of steps to solve complex problems.
It marks our latest effort in test-time scaling, which scales both the number of thinking tokens and the number of tool-calling steps.
K2 Thinking is now live on kimi.com under the chat mode [1], with its full agentic mode available soon. It is also accessible through the Kimi K2 Thinking API.
Kimi K2 Thinking sets new records across benchmarks that assess reasoning, coding, and agent capabilities. K2 Thinking achieves 44.9% on HLE with tools, 60.2% on BrowseComp, and 71.3% on SWE-Bench Verified, demonstrating strong generalization as a state-of-the-art thinking agent model.
K2 Thinking demonstrates outstanding reasoning and problem-solving abilities. On Humanity’s Last Exam (HLE), a rigorously crafted, closed-ended benchmark spanning thousands of expert-level questions across more than 100 subjects, K2 Thinking achieved a state-of-the-art score of 44.9% with search, Python, and web-browsing tools, establishing a new record in multi-domain, expert-level reasoning performance.
By reasoning while actively using a diverse set of tools, K2 Thinking can plan, reason, execute, and adapt across hundreds of steps to tackle some of the most challenging academic and analytical problems. In one instance, it solved a PhD-level mathematics problem through 23 interleaved reasoning steps and tool calls, exemplifying its capacity for deep, structured reasoning and long-horizon problem solving.
K2 Thinking demonstrates strong performance in agentic search and browsing scenarios. On BrowseComp, a challenging benchmark designed to evaluate a model's ability to continuously browse, search, and reason over hard-to-find real-world web information, K2 Thinking achieved a score of 60.2%, significantly outperforming the human baseline of 29.2%. This result highlights K2 Thinking’s capability for goal-directed, web-based reasoning and its robustness in dynamic, information-rich environments.
K2 Thinking can execute 200–300 sequential tool calls, driven by long-horizon planning and adaptive reasoning. It performs dynamic cycles of think → search → browser use → think → code, continually generating and refining hypotheses, verifying evidence, reasoning, and constructing coherent answers. This interleaved reasoning allows it to decompose ambiguous, open-ended problems into clear, actionable subtasks.
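The think → tool → think cycle described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not Kimi's implementation or API: `run_agent`, `Step`, `Trace`, `scripted_policy`, and the three stub tools are all hypothetical names invented for illustration, and the "model" is a fixed script rather than a real reasoning model. What it shows is the control flow: each iteration produces a thought, optionally calls one tool, feeds the observation back in, and stops when the policy answers or a tool-call budget (here defaulting to 300) runs out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One policy decision: a thought plus an optional tool call."""
    thought: str
    tool: Optional[str] = None  # None means "this thought is the final answer"
    arg: str = ""


@dataclass
class Trace:
    """Record of (thought, tool name, observation) triples and the final answer."""
    steps: List[Tuple[str, str, str]] = field(default_factory=list)
    answer: Optional[str] = None


def run_agent(
    policy: Callable[[str, List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_tool_calls: int = 300,  # hypothetical budget, echoing the 200-300 figure
) -> Trace:
    """Interleave thinking and tool use until the policy answers or the budget ends."""
    trace = Trace()
    observations: List[str] = []
    for _ in range(max_tool_calls):
        step = policy(question, observations)      # think
        if step.tool is None:                      # policy chose to answer
            trace.answer = step.thought
            return trace
        result = tools[step.tool](step.arg)        # act
        observations.append(result)                # observe, then think again
        trace.steps.append((step.thought, step.tool, result))
    return trace  # budget exhausted; answer stays None


def scripted_policy(question: str, observations: List[str]) -> Step:
    """Stand-in for a model: follows a fixed search -> browse -> code plan."""
    plan = ["search", "browse", "code"]
    if len(observations) < len(plan):
        tool = plan[len(observations)]
        return Step(thought=f"I should use {tool} next", tool=tool, arg=question)
    return Step(thought="Answer based on: " + " | ".join(observations))


# Stub tools that return canned strings instead of touching the network.
toy_tools: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"search results for {q!r}",
    "browse": lambda q: f"page text about {q!r}",
    "code": lambda q: "computed value 42",
}

trace = run_agent(scripted_policy, toy_tools, "toy question")
```

In a real system the scripted policy would be replaced by a model call that sees the full history, and the budget would cap how far the interleaved reasoning can run before a final answer is forced or the run is abandoned.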
Practical Writing: K2 Thinking demonstrates marked gains in reasoning depth, perspective breadth, and instruction adherence. It follows prompts with higher precision, addressing each requirement clearly and systematically, and often expands on every point mentioned to ensure thorough coverage. In academic, research, and long-form analytical writing, it excels at producing rigorous, logically coherent, and substantively rich content, making it particularly effective in scholarly and professional contexts.
Personal & Emotional: When addressing personal or emotional questions, K2 Thinking responds with more empathy and balance. Its reflections are thoughtful and specific, offering nuanced perspectives and actionable next steps. It helps users navigate complex decisions with clarity and care, in a tone that is grounded, practical, and genuinely human.