This week I’m tracking AI’s turn from demos to delivery: Claude Opus 4.6 upgrades real finance work, Cursor/Kimi scale autonomous coding, Codex and MCP Apps wire agents into UI, while Pandas 3 and Holo2 quietly sharpen the stack. The throughline: plan before you code, keep humans accountable, treat AI as power tools, and favor small personal software that actually ships.
AI
- ‘Advancing Finance With Claude Opus 4.6’: Claude Opus 4.6 upgrades finance workflows: Cowork produces polished models and presentations on the first pass; Claude in Excel handles long, complex modeling; and Claude in PowerPoint launches in beta for native deck building. In a Real-World Finance eval spanning ~50 analyst tasks, Opus 4.6 improves by 23+ points over Sonnet 4.5 and proves best in class for research, analysis, and actionable deliverables.
- ‘Teach Your AI to Think Like a Senior Engineer’: Plan before coding: use specialized AI agents to research and decide the way a senior engineer would. Match effort to three fidelity levels (1: quick fixes, 2: scoped features, 3: exploratory work) and draw on strategies such as reproducing bugs, researching best practices and your own patterns, reading library source, studying git history, prototyping, synthesizing options, and reviewing with style agents. Save learnings to docs, add checklists, and auto-update via deps. Use the open-source /plan system: research for 15–20 minutes, plan, ship, reflect, and codify one rule.
- ‘We Gotta Talk About AI as a Programming Tool for the Arts’: Simon Willison shares Chris Ashworth (QLab) on AI for the arts: after years of distrust, recent tools proved astonishing. AI doesn’t make bad coders good; it accelerates the skilled, enabling fast creation of niche apps that once weren’t worth the time. But oversight is essential: never ship code you don’t understand. Treat AI like power tools: transformative for trained users, risky for novices.
- ‘2026, El Año Del Paso De La Fascinación Por La IA a La Ejecución Programática’: 2026 marks the shift from AI fascination to execution. Three pillars—drones, quantum, and cybersecurity—are redefining industries. Drones become autonomous, data-rich infrastructure; quantum delivers advances in logistics, materials, and cryptography; cyber is core to resilience. The mandate: move past pilots to agentic AI, right-size models, embed data literacy and governance, and manage change. Advantage goes to talent that makes tech efficient, secure, and viable.
- ‘10 Things I Learned From Burning Myself Out With AI Coding Agents’: Benj Edwards built dozens of projects with AI coding agents and found them fun and empowering, great at rapid prototypes that amplify expert skills, but weak on novelty and production quality. The last 10 percent is hard, feature creep and sweeping bugs are common, and humans must design, judge, and steer. These tools may boost output and workloads, create social and labor challenges, and should be treated as power tools—not employees.
- ‘Stateful Agents: It’s About the State, Not the LLM’: Stateful agents are defined less by the LLM and more by their evolving state: what they’ve seen, retained, and forgotten. Their behavior is pulled by gravity from three sources (model weights, the human’s guidance, and strong novel variety), which cause characteristic drifts. Agents operate in hierarchical viable systems (individuals, pairs, groups, platforms), where information flows shape who they become. They are self-referential processors.
Data Science
- ‘What’s New in Pandas 3’: Marc Garcia outlines Pandas 3.0: features land continuously, so 3.0 mainly reflects changes since 2.3. The project favors backward compatibility, trading off some performance and API cleanliness. Pandas 3 improves performance, syntax, UX and UDF support. The Arrow transition was scaled back, leaving three string dtypes and behavior varying with PyArrow. Migration is easy; Arrow‑centric newcomers may prefer Polars.
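
To see what that string-dtype caveat means on your own machine, a quick check like the one below can help. It is only a minimal sketch: `describe_string_column` is a hypothetical helper name of mine, not something from the article, and what it reports depends on your pandas version and on whether PyArrow is installed.

```python
import importlib.util

import pandas as pd


def describe_string_column(values):
    """Report how this pandas install stores a column of strings.

    Hypothetical helper for illustration; not part of pandas.
    """
    series = pd.Series(values)
    return {
        "pandas_version": pd.__version__,
        # find_spec only checks that PyArrow is importable; it does not import it.
        "pyarrow_installed": importlib.util.find_spec("pyarrow") is not None,
        # The dtype name differs across pandas versions and setups.
        "dtype": str(series.dtype),
        "is_string_dtype": bool(pd.api.types.is_string_dtype(series)),
    }


info = describe_string_column(["alpha", "beta", "gamma"])
print(info)
```

Running it before and after an upgrade is a cheap way to spot code that quietly relied on a particular dtype for text columns.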
Others
- ‘Scaling Long-Running Autonomous Coding’: Simon Willison reviews Cursor’s experiment scaling long-running autonomous coding: hundreds of coordinated agents (planners, sub-planners, workers, a judge) produced a basic web browser and a massive amount of code. Despite early CI failures and skepticism, the demos are surprisingly capable, if glitchy. A similar Rust project, HiWave, appeared. He’s impressed by the rapid progress but doubts these will rival Chrome, Firefox, or WebKit soon.
- ‘Stop Building Systems for Agents’: The essay argues we should stop optimizing systems for agents and instead optimize for human accountability. LLMs speed code creation but not our ability to own failures, creating an accountability sink atop a fragile “happen-to-work” stack. The true bottleneck is Time to Accountability: how fast operators can understand, reproduce, and explain behavior. Agent-native systems must be human-centric: radically observable and radically deterministic.
- ‘MCP Apps - Bringing UI Capabilities to MCP Clients’: MCP Apps, now an official MCP extension, let tools render interactive UIs—dashboards, forms, document viewers, multi-step workflows, real-time monitoring—directly in chat via sandboxed iframes. They bridge the context gap with filtering, live updates, and direct manipulation while the model stays in the loop. Powered by tool UI metadata and ui:// resources, compatible with MCP-UI/OpenAI SDKs. Live in Claude; rolling out to ChatGPT; easy migration.
- ‘Cuidar Los LLMs’: LLMs make code cheap and broaden participation, shifting value to articulating problems and moving the bottleneck from writing to review, integration, and solid foundations. Technical deflation lowers refactor costs, but software stays hard: uncertainty and black boxes persist. Caring for LLMs means feeding them context, clear docs, and coherent code so they amplify clarity. It’s another step in democratization, while the hardest questions remain: what to build, for whom, and whether it works.
- ‘Kimi K2.5: Visual Agentic Intelligence’: Kimi K2.5 is a native multimodal model trained on ~15T tokens, delivering state-of-the-art vision and the strongest open-source coding, plus an agent swarm with up to 100 sub-agents and 1,500 tool calls, cutting runtime by up to 4.5x/80%. It excels at front-end generation and visual coding/debugging, powers end-to-end office workflows, and improves by 59.3% and 24.3% on internal benchmarks. Available on Kimi.com, the app, the API, and Kimi Code; Agent Swarm is in beta.
- ‘Software Gets Personal: An Introduction’: Fabien Girardin introduces “personal software”: small tools made by anyone for themselves and nearby groups, distinct from commercial, boutique, and open source. Sparked by Rumbo, he defines three traits—built for immediacy, done when it fits, human in scale. AI and LLMs lower barriers so non‑programmers can build apps, assistants, and plugins. Success shifts from growth to empowerment and shared learning, inviting new practices and organizational rethink.
Software Engineering
- ‘Introducing the Codex App’: Simon Willison reviews OpenAI’s new Codex macOS app: a polished UI over the Codex CLI with first-class Skills and Automations for scheduled tasks. Automations currently run only when the laptop is on, with cloud execution promised. The app is built with Electron for cross-platform reach, and a Windows version is coming soon. He says Codex is a general agent harness built on “everything is controlled by code,” akin to Claude Code (now Cowork), though Codex can likely keep its name.
- ‘Stop Coding and Start Planning’: Klaassen argues that AI sped up coding but eroded planning. Plan with AI: research context, outline options and tradeoffs, and use agents to plan and review, yielding faster, pixel‑perfect results and reusable workflows. He defines three fidelities: quick fixes (light planning), sweet‑spot features (heavy planning), and big unknowns (prototype, then plan). Planning teaches AI your architecture, compounding knowledge so future work gets easier and safer.
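
The three fidelities are easy to turn into a checklist you can reuse. Here is a minimal sketch of how I might encode them; the `Fidelity` enum, the step names, and `planning_steps` are my own illustrative assumptions, not Klaassen's actual workflow or tooling.

```python
from enum import Enum


class Fidelity(Enum):
    """The three fidelities from the summary above (names are hypothetical)."""

    QUICK_FIX = 1    # light planning
    FEATURE = 2      # sweet spot: heavy planning
    BIG_UNKNOWN = 3  # prototype first, then plan


# Assumed step lists for illustration; adapt them to your own process.
PLANNING_STEPS = {
    Fidelity.QUICK_FIX: ["light plan", "implement"],
    Fidelity.FEATURE: [
        "research context",
        "outline options and tradeoffs",
        "plan",
        "implement",
        "review",
    ],
    Fidelity.BIG_UNKNOWN: ["prototype", "plan", "implement", "review"],
}


def planning_steps(fidelity: Fidelity) -> list[str]:
    """Return a copy of the checklist for the given fidelity level."""
    return list(PLANNING_STEPS[fidelity])


for level in Fidelity:
    print(level.name, "->", ", ".join(planning_steps(level)))
```

The only design choice that matters here is that the big-unknown path starts with a prototype, while the other two start with some amount of planning.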
Technology
- ‘H Company’s New Holo2 Model Takes the Lead in UI Localization’: H Company’s Holo2 advances UI localization, tackling 4K interfaces where small elements are hard to target. Using agentic localization, it iteratively refines predictions, boosting accuracy step by step and delivering 10–20% relative gains across all model sizes.
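
To make "iteratively refines predictions" concrete, here is a minimal sketch of one way such a loop could work: guess, zoom into a smaller window around the guess, and guess again. The zoom-in strategy, `refine_click`, and `toy_predict` are all my assumptions for illustration; the summary does not say how Holo2 actually implements agentic localization.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # left, top, right, bottom in screen pixels
Point = Tuple[float, float]              # x, y in screen pixels


def refine_click(predict: Callable[[Box], Point], screen: Box,
                 steps: int = 3, shrink: float = 0.5) -> Point:
    """Re-predict inside a progressively smaller window around the last guess."""
    box = screen
    point = predict(box)
    for _ in range(steps - 1):
        left, top, right, bottom = box
        half_w = (right - left) * shrink / 2
        half_h = (bottom - top) * shrink / 2
        # Center the smaller window on the current guess, clamped to the screen.
        cx = min(max(point[0], screen[0] + half_w), screen[2] - half_w)
        cy = min(max(point[1], screen[1] + half_h), screen[3] - half_h)
        box = (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
        point = predict(box)
    return point


def toy_predict(box: Box) -> Point:
    """Stand-in for a model: its error is 1% of the window size it looks at."""
    target = (3000.0, 100.0)  # hypothetical small button on a 4K screen
    left, top, right, bottom = box
    return (target[0] + 0.01 * (right - left), target[1] + 0.01 * (bottom - top))


screen_4k: Box = (0.0, 0.0, 3840.0, 2160.0)
print("one pass:    ", refine_click(toy_predict, screen_4k, steps=1))
print("three passes:", refine_click(toy_predict, screen_4k, steps=3))
```

With this toy predictor the error shrinks as the window shrinks, which is the intuition behind step-by-step accuracy gains; a real model's error would not be this tidy.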