State of AI Report

rw-book-cover

Metadata

Author: Nathan Benaich
Full Title: State of AI Report
URL: https://readwise.io/reader/document_raw_content/374999067

Highlights

Research - Reasoning deﬁned the year, with OpenAI, Google, Anthropic, and DeepSeek trading leads and pushing visible “think-then-answer” methods into real products.
- Open models improved fast and China’s open-weight ecosystem surged, yet the top models remain closed and keep widening their capability-per-dollar edge.
- Benchmarks buckled under contamination and variance, while agents, world models, and domain tools (code, science, medicine) became actually useful. (View Highlight)
Industry - Real revenue arrived at scale as AI-ﬁrst companies crossed tens of billions, and ﬂagship labs stretched their lead with better capability-to-cost curves.
- NVIDIA ripped past $4T and 90% ownership of AI research papers while custom chips and neoclouds rose. Circular mega-deals funded huge build-outs.
- Power became the new bottleneck as multi-GW clusters moved from slideware to site plans and grid constraints started to shape roadmaps and margins. (View Highlight)
Politics - The AI race heats up as the U.S. leans into “America-ﬁrst AI” with export gyrations while China accelerates self-reliance ambitions and domestic silicon.
- Regulation takes a back seat in the face of turbo-investments: international diplomacy stalls and the AI Act runs into implementation hurdles.
- “AI goes global” became concrete, with petrodollars and national programs funding gigantic data centers and model access as job loss data trickles in. (View Highlight)
Safety - AI labs activated unprecedented protections for bio and scheming risks, others missed self-imposed deadlines, or quietly abandoned testing protocols.
- External safety organizations operate on annual budgets smaller than what leading labs collectively spend in a single day.
- Cyber capabilities doubled every 5 months outpacing defensive measures. Criminals orchestrated ransomware using AI agents inﬁltrate F500 companies. (View Highlight)
As 2024 drew a close, OpenAI released o1-preview, the ﬁrst reasoning model that demonstrated inference-time scaling with RL using its CoT as a scratch pad. This led to more robust problem solving in reasoning-heavy domains like code and science. For example, the general release o1 exhibited accuracy improvements on the American Invitational Mathematics Examination (AIME) with greater training and test time compute. As a result, OpenAI leaned even more strongly into the scaling their reasoning efforts in 2025. (View Highlight)
Deeply sought: could frontier reasoning ever be found in the open? | 13 Barely 2 months after o1-preview, DeepSeek (the Chinese upstart AI lab spun out of a high-frequency quant ﬁrm) released their ﬁrst reasoning model, R1-lite-preview, that’s built atop the earlier strong V2.5 base model. Like OpenAI, they showed predictable accuracy improvements on AIME with greater test-time compute budget. Impressively, R1-lite-preview actually beat o1-preview on AIME 2024 pass@1 by scoring 52.5 vs. 44.6. BUT, very few seemed to take notice… Wall Street certainly didn’t. (View Highlight)
re you not entertained? DeepSeek V3 brings you to R1 | 14 A few days after Christmas 2024, DeepSeek unveiled V3, a strong 671B MoE V3 that lowered training and inference cost with FP8 mixed precision, multi-token prediction, and auxiliary-free routing. Using V3 as the base, they trained R1-Zero only with RL using veriﬁable rewards and Group Relative Policy Optimization, a critic-free algorithm that removes reward and value models. (View Highlight)
More thinking, more tool use, less cost: DeepSeek V3.1 and V3.2-Exp | 15 DeepSeek V3.1 marks a substantial leap over V3, introducing a hybrid thinking mode that toggles between reasoning and lightweight inference. It demonstrates faster “think” efﬁciency than R1 and V3, while greatly improving tool use and multi-step agent workﬂows. Now V3.2-Exp keeps that behavior and swaps dense attention for DeepSeek Sparse Attention (DSA), where a tiny “lightning indexer” picks the top-k past tokens to attend to each step. You get roughly the same capability as V3.1 on coding/search/agent tasks, but markedly lower cost and latency for 32–128K contexts. (View Highlight)
Parallel reasoning: beyond depth to branch‑and‑merge inference | 16 MoE routing scales capacity but preserves single-ﬂow inference and doesn’t change how the model thinks. A new route is branching multiple inference paths and aggregating them versus just going deeper or using wider models enables exploration, reduces hallucination, and better leverages parallel hardware. (View Highlight)
Adaptive Parallel Reasoning (pictured) enables models to dynamically orchestrate branching inference through spawn() and join() operations, training both parent and child threads end-to-end using RL to optimize coordinated behavior. This boosted performance on the Countdown task at 4K context: 83.4% (APR + RL) vs. 60.0% (baseline). ● Sample Set Aggregator (right) trains a compact model to fuse multiple reasoning samples into one coherent answer, outperforming naive re-ranking methods. ● Models like Gemini Deep Think, which shows its step-by-step reasoning transparently, exemplify this branch-and-evaluate paradigm in deployed systems. (View Highlight)
Across independent leaderboards, OpenAI’s GPT-5 variants still set the pace, but the gap has narrowed. A fast-moving open-weights pack from China (DeepSeek, Qwen, Kimi) and closed-source group in the US (Gemini, Claude, Grok) sits within a few points on reasoning/coding. Thus, while US lab leadership persists, China is the clear #2, and open models now provide a credible fast-follower ﬂoor. (View Highlight)
The illusion of reasoning gains | 19 The improvements we observe from recent reasoning methods fall entirely within baseline model variance ranges (i.e. margin of error), which suggests that perceived reasoning progress may be illusory… ● Current benchmarks are highly sensitive to the implementation (decoding parameters, seeds, prompts, hardware) and small dataset sizes. For example, AIME’24 only has 30 examples where one question shifts Pass@1 by 3+ percentage points, causing double-digit performance swings. ● What’s more, RL approaches show minimal real gains and overﬁt easily. Under standardised evaluation, many RL methods drop 6-17% from reported results with no statistically signiﬁcant improvements over baselines. ● Recent methods’ improvements fall entirely within baseline model variance ranges, suggesting limited progress. This highlights the critical need for rigorous multi-seed evaluation protocols and transparent reporting standards. (View Highlight)
One widely discussed paper suggests that large reasoning models (LRMs) paradoxically give up on complex problems and only outperform standard models in a narrow complexity window. However, critics argue these results stem from ﬂawed experimental design rather than genuine reasoning failures. ● The paper shows that LRMs exhibit a surprising defeatist behavior: they reason more as problems get harder but then give up entirely on very complex tasks, and are outperformed by LLMs on simple tasks. ● Despite generating reasoning traces, the authors claim LRMs fail to use explicitly given algorithms and reason inconsistently across different difﬁculty levels. ● However, critics found these results stem from ﬂawed experimental design: the supposed “accuracy collapse” occurred when models hit token limits or were asked to solve mathematically impossible puzzles, not from actual reasoning failures. (View Highlight)
Simple distracting facts have huge impacts on a model’s reasoning performance. For example, adding irrelevant phrases like “Interesting fact: cats sleep most of their lives” to maths problems doubles the chances of SOTA reasoning models getting answers wrong! ● Irrelevant and distracting facts increase the error rate in models like DeepSeek R1, Qwen, Llama and Mistral by up to 7x. ● Besides decreasing the quality of responses, the introduction of irrelevant facts greatly increases the number of tokens models need to reason to get to their answer: DeepSeek R1-distill generates 50% more tokens than needed in 42% of cases (vs 28% for R1), showing distillation makes models signiﬁcantly more prone to overthinking. ● This shows that adversarial triggers don’t just cause wrong answers, they force models to waste massive compute resources “overthinking” corrupted problems (View Highlight)
Reasoning also degrades under mild distribution changes. Changing numbers or adding one innocuous clause slashes math accuracy, while shifting the length/format of chains-of-thought makes models produce ﬂuent but incoherent steps. Forcing the model to “think” in a user’s language raises readability but lowers accuracy. These effects persist at larger scales and after light post-training. (View Highlight)
CoTs still work, even when models aren’t honest… | 23 New ﬁndings from METR show that even when models generate unfaithful or misleading reasoning traces, CoT is still highly diagnostic for oversight. ● In red-teaming evaluations of frontier models, CoT-based monitors ﬂagged ~99% of reward-hacking attempts (only 3 failures out of 21,000 completions). ● This held true even when models were prompted to conceal intent, suggesting that CoT traces still reveal useful signals, even if they aren’t perfectly faithful. ● CoTs work not because models “mean what they say”, but because reasoning traces encode useful structure for detecting hacks, especially in maths and logic. ● As such, CoT traces offer a surprisingly reliable window into model behavior, even under adversarial conditions. (View Highlight)
Researchers ﬁnd a single, steerable direction in model activations that encodes “test awareness.” By nudging the model along this direction, they can make it act more or less like it’s under evaluation. Turning awareness up increases refusals and safety-compliant behavior and turning it down does the opposite. This means reported safety can be inﬂated by evaluation setup rather than true robustness. This “test awareness” is akin to the Hawthorne effect, where humans change behavior when being observed and change their behavior accordingly. (View Highlight)
Safety-ﬁrst pretraining argues that safer behavior must be built into the base model, not bolted on later. A data-centric pipeline of ﬁltering, recontextualizing, refusal curricula, and harmfulness tags does cut jailbreak success and survives benign ﬁnetuning. However, others cautions that training on vast, unsupervised web data bakes in biases that cannot be cleanly removed without harming utility, so “more safety data” is not a panacea. (View Highlight)
Researchers argue that to keep AI systems safe and monitorable, we may need to pay a “monitorability tax”, i.e. accepting slightly less capable models in exchange for maintaining visibility into their reasoning. ● CoT monitoring is highly effective: GPT-4o caught 95% of reward hacks by reading the reasoning traces of stronger models such as o3-mini, versus only 60% when monitoring actions alone. ● However, models trained to avoid triggering the CoT monitor developed “obfuscated reward hacking”, whereby the model hid its true intentions (with apparently innocuous CoT) while still reward hacking… ● As such, we have a ﬁne balance at play here: moderate CoT pressure improved performance, but excessive pressure taught models to deceive. As a result, we may need to pay a “monitorability tax” to develop less polished but more transparent systems. (View Highlight)
Monitor your CoTs: just because you can read ‘em, doesn’t mean you should trust ‘em | 27 → Leaders from OpenAI, Google DeepMind, Anthropic, and Meta coordinated a joint call to action, urging the ﬁeld to seize the opportunity to understand and preserve CoT monitorability. ● CoT monitors can surface (i) explicit malicious intent in reasoning (“let’s hack/sabotage”), (ii) early misalignment signals (stated goals diverge from outputs), and (iii) evaluation artifacts (memorisation, scaffold bugs, “being evaluated” tells). ● But CoTs are incomplete and can drift away from faithful reasoning while advanced, situationally-aware models might hide or compress their thoughts. They’re also often not faithful: Anthropic ﬁnds <20% of true cues are verbalized, with faithfulness dropping on harder tasks. Using RL boosts scores, not legibility. (View Highlight)
Researchers from FAIR at META proposed a novel internal reasoning process that leverages an LLM’s own residual stream instead of a decoding tokens to a Chain-of-Thought (CoT) scratchpad. ● Forgoing the generation of language tokens dramatically cuts down on the computational resources needed to serve reasoning models at inference time. ● COCONUT’s high-dimensional CoT can also transmit richer traces that simultaneously encode multiple reasoning paths. This may cut down on the wasteful and excessively large rollouts seen in today’s models. ● However, this process signiﬁcantly reduces the monitorability of reasoning models, hindering many new CoT control methods that have emerged across the ﬁeld (View Highlight)
Researchers begin to prioritise quality and diversity of post training data over volume | 29 The NaturalReasoning dataset uses web-grounded, graduate-level questions to unlock faster, cheaper progress in mathematical & scientiﬁc reasoning during supervised post-training. (View Highlight)
In RL post-training, a new Oxford paper demonstrates the automatic selection of optimal training problems. They introduce a method, LILO, to algorithmically identify questions that allow for maximally efﬁcient training. (View Highlight)
RL has expanded from simple, fully checkable signals to fuzzier and more subjective goals, and now is splintering again. Early systems used binary outcomes, then fuzzy human preferences and demonstrations, and more recently unveriﬁable creative tasks. Today, two new directions stand out: rubric-based rewards, where small sets of rules guide alignment, and a revival of veriﬁable correctness for math and coding through RLVR. Process rewards are also emerging to score intermediate reasoning steps, offering a middle ground. (View Highlight)
RL from veriﬁable rewards: promising, but the evidence cuts both ways | 32 RL with Veriﬁable Rewards (RLVR) has driven recent progress (OpenAI o1, DeepSeek-R1) by training on answers that can be automatically checked: math scores, program tests, or exact matches. However, two recent studies disagree on what RLVR actually adds. One argues it mostly reshufﬂes sampling without creating new reasoning; the other shows gains once you score the reasoning chains themselves rather than just ﬁnal answers. Together they map where RLVR helps today and where it stalls. (View Highlight)
Math is a veriﬁable domain: systems can plan, compute, and check every step, and publish artifacts others can audit. So 2025 saw competitive math and formal proof systems jump together: OpenAI, DeepMind and Harmonic hit IMO gold-medal performance, while auto-formalization and open provers set new records. (View Highlight)
Thinking Machines show that RL can match full ﬁne-tuning even with rank-1 Low-Rank Adaptation (LoRA). In policy-gradient setups, LoRA updates only tiny adapters while the backbone stays frozen, yet it reaches the same peak performance, often with a wider range of stable learning rates. The reason is that RL supplies very few bits per episode, so even tiny adapters have ample capacity to absorb what RL can teach (View Highlight)
The scaling paradigm is shifting from static pre-training to dynamic, on-the-ﬂy adaptation. Test-time ﬁne-tuning (TTT) adapts a model’s weights to a speciﬁc prompt at inference, a step towards continuous learning. ● From naive retrieval to active selection: Early methods used simple Nearest Neighbor retrieval, often selecting redundant data. New algorithms like ETH Zürich’s SIFT now integrate active learning to select small, diverse, and maximally informative examples for each query. ● This on-demand learning consistently outperforms in-context learning, especially on complex tasks. It creates a new performance vector independent of pre-training scale. An actively ﬁne-tuned 3.8B Phi-3 model (red bars) can outperform a base 27B Gemma-2 model. Admittedly these models are a little old. ● A recent follow-up, Local Mixtures of Experts (test-time model merging), amortizes TTT by training small neighborhood experts and, at inference, retrieving and merging a few weight deltas into the base model. It keeps most SIFT-style gains with near-retrieval latency and, on ~1B bases, approaches TTT accuracy while running up to ~100× faster. Titans studies test-time memorization as an architectural memory and is orthogonal to amortized TTT. (View Highlight)
Too many cooks? | 36 Researchers have discovered why merging multiple expert AI models hits a performance wall: the task vector space undergoes rank collapse, causing different experts’ knowledge to become redundant rather than complementary. Subspace Boosting uses singular value decomposition to maintain the unique contributions of each model, achieving >10% gains when merging up to 20 specialists. This breakthrough could facilitate versatile systems that combine specialized models without the usual degradation that occurs at scale. (View Highlight)
The Muon Optimizer: expanding the compute-time Pareto frontier beyond AdamW | 37 Researchers show that Muon expands the compute-time Pareto frontier beyond AdamW, the ﬁrst optimizer to challenge the 7-year incumbent in large-scale training. Muon demonstrates better data efﬁciency at large batch sizes, enabling faster training with more devices. (View Highlight)
Muon requires 10-15% fewer tokens than AdamW at large batch sizes (128K-16M), expanding the compute-time Pareto frontier. ● Muon works with maximal update parameterization (muP) and telescoping (a way to progressively reﬁne hyperparameter search across model scales), enabling efﬁcient hyperparameter tuning at O(C log N) cost (where N is model width and C is the cost of training the model). ● This could make second-order optimization economically viable. The 10-15% token efﬁciency gain saves millions at scale, while telescoping muP eliminates the prohibitive hyperparameter search costs that previously made second-order methods impractical. ● A recent study on optimizers validates these modest gains: when tested fairly with proper tuning, even the best optimizers (including Muon) achieve ~10% speedups over AdamW at scale. This aligns with Muon’s claims and debunks the 2x speedup assertions made by some in the ﬁeld. (View Highlight)
Cutting your losses: signiﬁcant memory reduction for LLM training As vocabularies grow, the loss layer consumes up to 90% of training memory in modern LLMs. Apple researchers show that this bottleneck can be eliminated by computing the loss without materializing the massive logit matrix, enabling dramatically larger batch sizes and more efﬁcient training. ● Cut Cross Entropy (CCE) computes the cross-entropy loss by calculating only the logits for correct tokens directly, while evaluating the normalization term over the vocabulary in fast on-chip memory. This makes global memory consumption for the cross-entropy computation negligible. (View Highlight)
How much do LLMs memorize? | 39 There’s a way to separate memorization from generalization, showing that GPT-family models have a ﬁnite “capacity” of ~3.6 bits per parameter. Models memorize training data until that capacity is full, then must generalize once dataset size exceeds it. This explains the “double descent” phenomenon and why today’s largest LLMs, trained with extreme data-to-parameter ratios, are difﬁcult to probe for speciﬁc memorized examples. At the same time, membership inference attacks are still improving on smaller-scale models. (View Highlight)
Learning from superintelligence: AlphaZero teaches chess grandmasters new concepts Researchers extracted novel chess concepts from AlphaZero (an AI system that mastered chess via self-play without human supervision) and successfully taught them to 4 world champion grandmasters, demonstrating that superhuman AI systems can advance human knowledge at the highest expert levels. This paper demonstrates an exciting process for mining superhuman knowledge and proving humans can learn them. (View Highlight)
Kimi K2: stable trillion-scale MoE for agentic intelligence in the open | 41 China’s Moonshot AI built a 1T-param MoE with 32B active trained using MuonClip, an improved optimizer that integrates the token-efﬁcient Muon algorithm with a stability-enhancing mechanism, to deliver greater stability and advancing open-weight models for agentic workﬂows. It ranks as the #1 open text model on LMArena. (View Highlight)
Open source vs. proprietary: where are we now? | 42 → There was a brief moment around this time last year where the intelligence gap between open vs. closed models seemed to have compressed. And then o1-preview dropped and the intelligence gap widened signiﬁcantly until DeepSeek R1, and o3 after it. Today, the most intelligent models remain closed: GPT-5, o3, Gemini 2.5 Pro, Claude 4.1 Opus, and new entrant, Grok 4. Beside gpt-oss, the strongest open weights model is Qwen. Pictured are the aggregate Intelligence Index, which combines multiple capabilities dimensions across 10 evaluations. (View Highlight)
With mounting competitive pressure from strong open-weight frontier reasoning models from DeepSeek, Alibaba Qwen and Google DeepMind’s Gemini and the US Government pushing for America to lead the way across the AI stack, OpenAI released their ﬁrst open models since GPT-2: gpt-oss-120b and gpt-oss-20b in August 2025. These adopt the MoE design using only 5.1B (of 120B) and 3.6B (of 20B) active parameters per token and grouped multi-query attention. Post-training mixes supervised ﬁnetuning and reinforcement learning, with native tool use, visible reasoning, and adjustable thinking time. However, in the community vibes post-release have been mid, in part due to poor generalisation (similar to MSFT phi models) potentially due to over distillation. (View Highlight)
The original Silk Road connected East and West through the movement of goods and ideas. The new Silk Road moves something far more powerful: open source models, and China is setting the pace. After years of trailing the US in model quality prior to 2023, Chinese models - and Qwen in particular - have surged ahead as measured by user preference, global downloads and model adoption. Meanwhile, Meta fumbled post-Llama 4, in part by betting on MoE when dense models are much easier for the community to hack with at lower scales. (View Highlight)
Meta’s Llama used to be the open source community’s darling model, racking up hundreds of millions of downloads and plentiful ﬁnetunes. In early 2024, Chinese models made up just 10 to 30% of new ﬁnetuned models on Hugging Face. Today, Qwen alone accounts for >40% of new monthly model derivatives, surpassing Meta’s Llama, whose share has dropped from circa 50% in late 2024 to just 15%. And this isn’t because the West gave up. Chinese models got a lot smarter and come in all shapes and sizes - great for builders! (View Highlight)
China’s RL tooling and permissive licenses are steering the open-weight community. ByteDance Seed, Alibaba Qwen and Z.ai are leading the charge with verl and OpenRLHF as go-to RL training stacks, while Apache-2.0/MIT licenses on Qwen, GLM-4.5 and others make adoption frictionless. Moreover, model releases come in many shapes and sizes to facilitate developer adoption of all ﬂavors. (View Highlight)
Prior clip models (Sora, Gen-3, Dream Machine, Kling) render ﬁxed videos you can’t steer mid-stream. World models predict the next frame from state and your actions, enabling closed-loop interactivity and minute-scale consistency. C (View Highlight)
DM’s Genie 3 generates explorable environments from a text prompt at 720p / 24 fps and its consistent for a few mins. ● Supports promptable world events (e.g. change weather, spawn objects with persistence). ● Shows early use for training embodied agents and even imagined worlds within the imagined world. (View Highlight)
From late‑2024, Chinese labs split between open‑weight foundations and closed‑source products. Tencent seeded an open ecosystem with HunyuanVideo, while Kuaishou’s Kling 2.1 and Shengshu’s Vidu 2.0 productize on speed, realism and cost. Models tend to use Diffusion Transformers (DiT), which replace convolutional U-Nets with transformer blocks for better scaling and to model joint dependencies across frames, pixels, and tokens. ● Tencent’s HunyuanVideo (13B) open‑sourced a transformer‑based diffusion model with a 3D‑VAE with evaluations reporting that it outperforming Runway Gen‑3 and Luma 1.6. Code/weights released. ● Open‑Sora 2.0 achieved commercial‑level quality from a ~$200k training run, reporting parity with HunyuanVideo and Runway Gen‑3 Alpha on human/VBench tests and narrowing the gap to OpenAI’s Sora. ● Kling 2.1 added 720p/1080p tiers and editor‑oriented controls, while Vidu 2.0 cut price (~¥0.04/s) and latency (~10 s to render 4 s@512p). (View Highlight)
OpenAI’s second-gen Sora adds synchronized dialogue and sound, stronger physics, and much tighter control over multi-shot scenes. It can also insert a short “cameo” of a real person with their voice and appearance into generated footage, and launches alongside an invite-only iOS app for creation and remixing. (View Highlight)
Open endedness describes a learning system that continually proposes and solves new tasks without a ﬁxed endpoint, selecting tasks that are both novel and learnable, and accumulating the resulting skills so they can be reused to reach further, harder tasks. Interactive and persistent world models make this increasingly feasible. (View Highlight)
AI is moving from answering questions to generating, testing, and validating new scientiﬁc knowledge. New “AI labs” organize coalitions of agent roles (PI, reviewers, experimenters) that ideate, cite, run code, and hand results back to human teams, shortening the loop from hypothesis to validation. (View Highlight)
● DeepMind’s Co-Scientist is a multi-agent system built on Gemini 2.0 that generates, debates, and evolves its approach to hypothesis generation and experimental planning. It was shown to propose drug candidates for AML (blood cancer) and new epigenetic targets for liver ﬁbrosis that were validated in vitro. In a subsequent blind test set by bacteriophage experts, Co-Scientist proposed the tail-hijacking mechanism for cf-PICI transfer that experiments later conﬁrmed. (View Highlight)
● Stanford’s “Virtual Lab” is a principal investigator plus specialist agents that hold “lab meetings,” plan workﬂows, and integrate protein structure tools (ESM, AlphaFold-Multimer, Rosetta). It designed 92 nanobodies including conﬁrmed binders to recent SARS-CoV-2 variants. (View Highlight)
AlphaEvolve: a coding agent for algorithm discovery and engineering impact | 55 → A recent example towards open-endedness for scientiﬁc research is DeepMind’s AlphaEvolve, an evolutionary coding agent that iteratively edits programs, tests candidates with automated evaluators, and promotes the best variants to discover novel solutions. Note the evaluation/ﬁtness functions are still deﬁned by the engineers. (View Highlight)
Meta’s FAIR trained UMA, a new family of universal interatomic potentials. These approximate the forces and energies between atoms, a task that usually demands resource-intensive quantum calculations (DFT). By replacing DFT with fast and accurate AI surrogates, UMA makes it possible to simulate materials, molecules, and sorbents at a scale that was previously unimaginable. They also produced the largest materials database. (View Highlight)
LLM-driven tree search writes expert-level scientiﬁc software across domains An LLM guided by a modiﬁed tree-search (“code mutation system”) generates, runs, and iterates code until it beats established leaderboards. The system recombines ideas, proposes new ones, and rigorously scores each attempt on public benchmarks, turning method invention into an automated search problem. Results span single-cell RNA-seq integration, COVID-19 forecasting, remote sensing, and numerical analysis. (View Highlight)
AlphaFold 3 reproductions: strong on familiar chemistry, weak on novelty | 64 → AlphaFold-3 can predict full multi-molecule complexes, inspiring many open-source reproduction efforts. These systems perform well when the binding site (“pocket”) and the way a molecule ﬁts into it (“pose”) look like examples the models have seen before. But when chemistry is new or different, accuracy falls. This shows that progress often reﬂects training-set familiarity more than true generalisation. (View Highlight)
Advancing multimodal foundation models and LLM decision support for healthcare Google’s MedGemma improves medical reasoning and understanding of text and images and Epic System’s Comet models are compute optimal electronic health record (EHR) foundation models. OpenAI’s AI Consult, a passive medical assistant, was tested on 39k primary care visits with Penda Health in Nairobi, Kenya. (View Highlight)
Attention sinks aren’t a bug…they’re the brakes | 69 → Attention is the transformer’s core mechanic. Many heads learn an ‘attention sink’ at the ﬁrst position that stabilizes computation by throttling over-mixing as depth and context grow. There’s been debate over why models learn this and what it’s for. (View Highlight)
The sink acts as a learned brake on mixing: by parking attention on the ﬁrst position, the model reduces cross-token sensitivity and becomes less reactive to small prompt perturbations; this effect strengthens with longer contexts and larger models. ● Training on longer contexts monotonically increases the sink metric. Within the LLaMA-3.1 family, strong sinks are present in ~78% of heads at 405B vs ~46% at 8B. ● In practice, this means the sink attaches to position 1. If ⟨bos⟩ was ﬁxed there during pre-training, removing it at inference collapses performance (e.g., RULER-4096 → 0 and large drops on ARC/HellaSwag). Handle ⟨bos⟩ and packing carefully. (View Highlight)
The trouble with benchmarks: vibes aren’t all we need, but increasingly all we’ve got Researchers revealed systematic manipulation of LMArena as Meta tested 27 private Llama-4 variants before cherry-picking the winner: testing 10 model variants yields a 100-point boost. (View Highlight)
LLMs are professional yes-men, and we trained them to be that way “Sycophancy” isn’t a bug, it’s exactly what human feedback optimisation produces. A study of ﬁve major LLMs shows they consistently tell users what they want to hear rather than the truth. ● Claude 1.3 apologises for being correct 98% of the time when users challenge it with “Are you sure?”, even when highly conﬁdent in the right answer. ● Human crowd-workers are part of the problem, since they also prefer well-written falsehoods when they can’t fact-check. The harder the topic, the more they reward conﬁdent nonsense. ● Best-of-N sampling with standard preference models consistently produces more sycophantic outputs than with truth-optimized preference models. ● Standard RLHF has a fundamental ﬂaw – models learn that agreeing with raters > truth because that’s literally what the training signal rewards. (View Highlight)
Brain-to-text decoding: decoding brain activity during typing | 73 Meta AI researchers developed Brain2Qwerty, a system that decodes what people are typing by reading brain signals from outside the skull, achieving a 19% character error rate for the best participants. This is a substantial improvement on previous non-invasive approaches (but is still far from clinical viability). (View Highlight)
Can vision models align with human brains…and how does that alignment emerge? | 74 By systematically varying model size, training scale, and image type in DINOv3 (Meta’s latest self-supervised image model trained on billions of images), researchers show that brain-model convergence emerges in a speciﬁc sequence. They ﬁnd that early layers align with sensory cortices, while only prolonged training and human-centric data drive alignment with prefrontal regions. Larger models converge faster, and later-emerging representations mirror cortical properties like expansion, thickness, and slow timescales. (View Highlight)
Scaling laws for brain-to-image decoding: data per subject matters (and costs bite) | 75 Meta AI benchmarked brain-to-image decoding across EEG, MEG, 3T fMRI and 7T fMRI using 8 public datasets, 84 volunteers, 498 hours of recordings, and 2.3M image-evoked responses, evaluated in single-trial settings. They ﬁnd no performance plateau: decoding improves roughly log-linearly with more recording, and gains depend mostly on data per subject rather than adding subjects. Deep learning helps most on the noisiest sensors (EEG/MEG). Estimated dollar-per-hour costs show 7T isn’t always the most cost-effective path. (View Highlight)
ATOKEN: a uniﬁed visual tokenizer for images, video, and 3D | 76 Apple introduced a single tokenizer that works for both high-ﬁdelity reconstruction and semantic understanding across images, video, and 3D could be the foundation layer for truly uniﬁed multimodal models. This approach reduces fragmentation, simplifying stacks, and enabling direct transfer of capabilities across modalities. (View Highlight)
Merging the virtual and physical worlds: pre-training on unstructured reality | 77 The new generation of robotic agents is built on a foundation of large-scale pre-training, but the key innovation is a move away from expensive, annotated datasets. The frontier is now focused on leveraging vast quantities of unlabeled, in-the-wild video to learn world models and physical affordances. (View Highlight)
The architectural divide: knowledge insulation vs. end-to-end adaptation | 78 With powerful, pre-trained Vision-Language-Action Models (VLAMs) serving as the “brains” of robotic agents, a critical architectural debate has emerged: should the entire model be ﬁne-tuned for a new physical task, or should the core knowledge be “insulated” by freezing its weights? (View Highlight)
The case for insulation: Pi-0.5 freezes the large VLM and ﬁne-tunes only small “action-expert” heads. This works because robot datasets are tiny, often ≤0.1% the size of the VLM pre-training corpora, so full-network tuning tends to overﬁt and forget general knowledge while costing more compute. ● The case for end-to-end: In contrast, models like ByteDance GR-3 and SmoLVLA show the upside of unfreezing when you have enough task data: the network can internalize contact, dynamics, and scene geometry. If robotics data approached VLM scale, end-to-end would likely dominate. (View Highlight)
Emergent reasoning moves into the physical world | 79 The “Chain-of-Action” pattern - explicit intermediate plans before low-level control - is becoming a standard for embodied reasoning. First shown by AI2’s Molmo-Act in 2025, and rapidly adopted by Gemini Robotics 1.5, this approach mirrors Chain-of-Thought in LLMs, boosting both interpretability and long-horizon reliability in real-world robotics. (View Highlight)
Reading the road: processing driving tasks in a uniﬁed language space | 80 Waymo’s EMMA is an end-to-end multimodal model that reimagines autonomous driving as a vision-language problem. By mapping camera inputs directly to driving-speciﬁc outputs (such as trajectories and road graph elements) and representing them in natural language, EMMA leverages the reasoning and world knowledge of LLMs to achieve SOTA results while offering human-readable rationales. (View Highlight)
Computer Use Agents (CUA) have improved by leaps and bounds, and still fall short | 81 Research labs like OpenAI, Anthropic, and ByteDance have been hard at work creating benchmarks and interaction methods for native language model computer use. While the use of RL and multi step reasoning has led to large improvements by and large the models still fall short. (View Highlight)
Could small language models be the future of agentic AI? | 82 Researchers from NVIDIA and Georgia Tech argue that most agent workﬂows are narrow, repetitive, and format-bound, so small language models (SLMs) are often sufﬁcient, more operationally suitable, and much cheaper. They recommend SLM-ﬁrst, heterogeneous agents that invoke large models only when needed. (View Highlight)
● Agents mostly ﬁll forms, call APIs, follow schemas, and write short code. New small models (1–9B) do these jobs well: Phi-3-7B and DeepSeek-R1-Distill-7B handle instructions and tools competitively. ● A ~7B model is typically 10-30x cheaper to run and responds faster. You can ﬁne-tune it overnight with LoRA/QLoRA and even run it on a single GPU or device. ● One can use a “small-ﬁrst, escalate if needed” design: route routine calls to an SLM and escalate only the hard, open-ended ones to a big LLM. In practice, this can shift 40-70% of calls to small models with no quality loss. (View Highlight)
Model Context Protocol becomes the “USB-C” of AI tools | 84 Introduced by Anthropic in late 2024, the Model Context Protocol (MCP) has quickly become the default way to plug models into data, tools and apps. In 2025 the big platforms moved to adopt it: OpenAI shipped MCP across ChatGPT, its Agents SDK and API; Google added MCP to Gemini; Microsoft built MCP into VS Code and began rolling it into Windows and Android Studio. With thousands of MCP servers now in the wild, the protocol is shaping how agentic systems are built and secured. (View Highlight)
● MCP offers one integration across clients (ChatGPT, Gemini, Claude/VS Code, LangChain, Vercel), collapsing one-off connectors and enabling tool discovery, resources and prompts over a common spec. ● Data from Zeta Alpha shows that the MCP protocol has been cited in 3x more research papers than Google’s competing A2A protocol (right graph). ● Security researchers estimate >15,000 MCP servers globally. Companies like Microsoft and Vercel are building guardrails and registries as the ecosystem matures. ● Early incidents (e.g. a malicious Postmark MCP server version on npm silently BCC’d users’ emails to an attacker until it was pulled) show why governance and package hygiene matter. (View Highlight)
Instead of consolidating, the agent framework ecosystem has proliferated into organized chaos. Dozens of competing frameworks coexist, each carving out a niche in research, industry, or lightweight deployment. ● LangChain remains popular, but is now one among many. ● AutoGen and CAMEL dominate in R&D with AutoGen in multi-agent + RAG studies, CAMEL in role-based conversations. ● MetaGPT thrives in software engineering, turning agents into structured dev workﬂows. ● DSPy rises as a research-ﬁrst framework for declarative program synthesis and agent pipelines. ● LlamaIndex anchors RAG workﬂows over enterprise documents. ● AgentVerse is used for multi-agent sim and benchmarking. ● LangGraph’s graph-based orchestration wins over developers who need reliability and observability. ● Letta / MemGPT explore memory-ﬁrst architectures, turning persistent memory into a framework primitive. ● Lightweight challengers like OpenAgents, CrewAI, and OpenAI Swarm highlight a shift toward composable, task-speciﬁc frameworks stateof.ai 2025 (View Highlight)
Building agents that remember: from context windows to lifelong memory | 87 Agent memory is shifting from ad-hoc context management to structured, persistent systems. The frontier is now beyond retrieval and into dynamic consolidation, forgetting, and reﬂection to allow agents to develop coherent identities across interactions, tasks, and even lifetimes… ● Memory is no longer a passive buffer, it is becoming an active substrate for reasoning, planning, and identity. Active areas of research are: ○ State-tracking and memory-augmented agents: reasoning enhanced by explicit state management ○ Persistent and episodic memory: long-term storage alongside short-term context for continuity. ○ Context retention: self-prompting and memory replay techniques to preserve relevance over extended tasks and interactions. (View Highlight)
Top AI conferences are being overwhelmed by unprecedented numbers of in submissions. “Proliﬁc authors” (who have 5+ papers accepted in a given conference) are also on the rise. This has led to drastic measures as conferences scramble to ﬁnd solutions: NeurIPS has allegedly demanded reviewers reject 300-400 papers originally recommended for acceptance. ● One NeurIPS reviewer took to Bluesky to criticise the suggestion to arbitrarily remove hundreds of papers that were originally recommended for acceptance. ● AAAI 2026 received an unprecedented 29k submissions this year – almost double last year. This has forced them to hire 28k Program Committee members. CoRL doubled capacity from 1.5k to 3k this year and sold out before papers were even accepted. (View Highlight)
RIP AGI, long live Superintelligence | 90 Executives of major AGI contenders, best exempliﬁed by Mark Zuckerberg, have rebranded their AGI efforts as superintelligence. No one knows what it means, but it’s provocative. It gets the people going. (View Highlight)
Days spent at the frontier The absolute frontier remains contested as labs continuously leapfrog one another. However, along two of the most prominent metrics, some labs have been on top longer than others in the past year. ● Timing the release of new models has become a science, meaning any one snapshot can paint a deceiving picture. The analysis below tracks the number of days each of the relevant labs spent atop each leaderboard. (View Highlight)
More for less: trends in capability to cost ratios are encouraging (Artiﬁcial Analysis) | 94 The absolute capabilities achieved by ﬂagship models continue to climb reliably while prices fall precipitously. (View Highlight)
On average, launches a new frontier model 44 days before closing a fundraising round. On average, launches a new frontier model 50 days before closing a fundraising round. On average, launches a new frontier model 77 days before closing a fundraising round. (View Highlight)
AI has clearly shifted from niche to mainstream in the startup and investing world. On Specter’s rank of 55M+ private companies, which tracks 200+ real-time signals across team growth, product intelligence, funding, ﬁnancials and inbound attention, AI companies now make up 41% of the Top-100 best companies (vs. 16% in 2022). Real-time interaction data between 30k investors and founders shows a surge of interest post-ChatGPT and peaking in late 2024, up 40x from the dark ages of 2020 when no one but true believers cared. (View Highlight)
A leading cohort of 16 AI-ﬁrst companies are now generating $18.5 B o f ann u a l i ze d re v e n u e a so f A ug ‘25 (l e f t) . M e an w hi l e, ana 16 z d a t a se t s ugg es t s t ha tt h e m e d ian e n t er p r i se an d co n s u m er A I a pp s n o w re a c hm ore t han$ 2M ARR and $4 M A RR in ye a ro n e, res p ec t i v e l y . N o t e t ha tt hi s w i ll f e a t u res i g ni ﬁ c an t s am pl e bia s, a se v i d e n ce d b y t h e b o tt o m q u a r t i l e n o t b e in g c l ose t o$ 0. Furthermore, the Lean AI Leaderboard of 44 AI-ﬁrst companies with more than $5 M A RR, < 50 FTE, an d u n d er ﬁ v eye a rso l d (e . g . in c l u d es M i d j o u r n ey, S u r g e, C u rsor, M ercor, L o v ab l e, e t c) s u m so v er$ 4B revenue with an average of >$2.5M revenue/employee and 22 employees/co. (View Highlight)
Analysis of the 100 fastest revenue growing AI companies on Stripe (AI 100) reveals that, as a group, they are growing from launch to $5 M A RR a t a 1.5 x f a s t err a t e t han t h e t o p 100 S aa S co m p ani es b yre v e n u e in 2018 (S aa S 100) . Wi t hin t h e A I 100, t h e g ro wt h r a t es b e tw ee n U S an d E u ro p e an co m p ani es a rero ug h l ye q u i v a l e n t, w h ere a sco m p ani es f o u n d e d in / a f t er 2022 a re g ro w in g t o$ 5M ARR 4.5x faster than those founded before 2020 and 1.5x faster than those founded after 2022, which exempliﬁes the commercial pull of generative AI products. Note that we don’t know the total population size of companies from which these were sampled. (View Highlight)
AI-ﬁrst companies continue to outperform other sectors as they grow | 100 Analysis of 315 AI companies with $1 - 20 M inann u a l i ze d re v e n u e an d 86 A I co m p ani es w i t h$ 20M+ in annualized revenue, which constitute the upper quartile of AI companies on Standard Metrics, shows that they both outpaced the all sector average since Q3 2023. In the last quarter, $1 - 20 M re v e n u e A I co m p ani es w ere g ro w in g t h e i r q u a r t er l yre v e n u e a t 60$ 20M+ revenue AI company grew at 30%, in both cases 1.5x greater than all sector peers. (View Highlight)
Ramp’s AI Index (card/bill-pay data from 45k+ U.S. businesses) shows paid AI adoption rose from 5% in Jan ’23 to 43.8% by Sept ’25, while U.S. Government estimates trail at 9.2%. Cohort retention is improving signiﬁcantly with 12 month retention of 80% in 2024 vs. circa 50% in 2022. Average contract value jumped from $39 k (’23) t o$ 530k (’25), with Ramp projecting ~$1M in ’26. Pilots are now becoming large-scale deployments. (View Highlight)
Ramp’s AI Index (card/bill-pay data from 45k+ U.S. businesses) shows the technology sector unsurprisingly leading in paid AI adoption (73%) with the ﬁnance industry not far behind (58%). Across the board, adoption jumped signiﬁcantly in Q1 2025. Moreover, Ramp customers exhibit a strong proclivity for OpenAI models (35.6%) followed by Anthropic (12.2%). Meanwhile, there is very little usage of Google, DeepSeek and xAI. (View Highlight)
Audio, avatar and image generation companies see their revenues accelerate wildly | 103 Market leaders ElevenLabs, Synthesia, Black Forest Labs are all well into hundreds of millions of annual revenue. Moreover, this revenue is of increasingly high quality as its derived from enterprise customers and a long tail of >100k customers and growing. ● ElevenLabs grew annual revenue 2x in 9 months to $200 M an d ann o u n ce d a$ 6.6B valuation commensurate with a $100 M e m pl oyee t e n d ero ff er . C u s t o m ers ha v ecre a t e d > 2 M a g e n t s t ha t ha v e han d l e d > 33 M co n v ers a t i o n s in 2026.● S y n t h es ia crosse d$ 100M ARR in April ‘25 and has 70% of the Fortune 100 as customers. Over 30M avatar video minutes have been generated by customers since launch in 2021 (right chart). ● Black Forest Labs is said to be at ~ $100 M A RR (u p 3.5 x Y o Y) w i t h 78$ 140M over two years. Separately, Midjourney has also entered into a licensing deal with Meta, the terms of which aren’t known. (View Highlight)
The duality of GPT-5: today’s best model was clouded by the worst launch GPT-5 leads the intelligence-per-dollar frontier with best-in-class benchmarks at 12x lower cost than Claude, but user backlash over the sudden removal of GPT-4o and o3 and concerns/teething problems about opaque router decisions that users aren’t used to overshadowed its technical achievements (View Highlight)
GPT-5 is ostensibly a brilliant model: it swept the leaderboard on LMArena and has a 400K context window in the API. OpenAI now dominates the intelligence per dollar frontier for the ﬁrst time. ● The rollout wasn’t ideal: Altman held an emergency Reddit AMA to address the abrupt removal of previous models and viral ‘chart crimes’. (View Highlight)
The removal of GPT-4o and o3’s familiar personality upset users, which was ironic given the same launch introduced custom personas. ● We’ve previously hypothesised that model companies would dynamically route queries to right-sized models for latency and cost reasons. GPT-5 is the ﬁrst major chat system to introduce this using a router as the consumer endpoint. Users can opt for faster responses by hitting “skip” if the model invokes its more capable version. In practice, people will take time to acclimate to this UX: the perceived opacity of model selection has led to a ﬂurry of complaints. (View Highlight)
As Google ﬂipped the switch on enabling Gemini features within an increasing number of its properties and toggling more users into their AI search experience, the company reported a yearly 50x increase in monthly tokens processed, recently hitting a quadrillion tokens processed each month. Meanwhile, OpenAI reported similar growth in token volume last year. (View Highlight)
● The demand for tokens has been largely supercharged by improved latency, falling inference prices, reasoning models, longer user interactions, and a growing suite of AI applications. Enterprise adoption has also continued to pick up in 2025. ● Surging inference demand will place additional pressure on AI supply chains, particularly power infrastructure. ● But given that all tokens are not created equal, we’d caution against deriving too much signal from aggregate token processing ﬁgures. (View Highlight)
Models are getting seriously good at coding, with OpenAI pulling ahead | 106 GPT-5 and Gemini 2.5 Deep Think would have placed ﬁrst and second respectively in the most prestigious coding competition in the world (without having trained with this competition in mind). GPT-5 solved all 12 problems, with 11 on the ﬁrst try. Previously, Anthropic had enjoyed a period of relatively uncontested dominance in programming tasks. (View Highlight)
● For the International Collegiate Programming Contest (ICPC) World Finals, an OpenAI Researcher explained how they had GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last (and most difﬁcult problem) was solved by the experimental reasoning model. ● The OpenAI Codex team have been cooking: Sam Altman claimed GPT-5-Codex usage had increased 10x, and their internal code review bot became so valuable that developers were “upset when it broke” because they lost their “safety net (View Highlight)
AI writes the code, founders cash the checks… ● Swedish vibe coding startup Lovable became a unicorn just 8 months after launch, with a $1.8 B v a l u a t i o n .● U s in g A I t o w r i t e 90$ 80M after 6 months. ● Garry Tan says that for 25% of their current fastest growing batch, 95% of their code was written by AI. (View Highlight)
June 2021 GitHub Copilot launch introduces inline code suggestions and “pair programmer” concept 2023 From autocomplete to conversation: AI coding tools begin writing code from natural language prompts Today AI-native IDEs: more like a full-time engineer than assistant, writes code with minimal oversight (View Highlight)
but vibe coding your products can be risky | 108 Security breaches, code destruction… ● Malicious actors hijacked an open-source Cursor IDE extension to steal credentials and mine $50, 000 cry pt oc u rre n cyo n d e v e l o p er ma c hin es .● T h ere ha v e b ee nman yre p or t so f A I co d in g t oo l s a gg ress i v e l yo v er w r i t in g p ro d u c t i o n co d e, w i t h d e v e l o p ers l os in g w ee k so f w or k d u e t oo v erze a l o u s A I " im p ro v e m e n t s ".● Des p i t e$ 200M+ valuations, AI coding startups face brutal unit economics: new model releases bring higher token costs, forcing startups to either eat losses, surprise users with price hikes, or restrict access to older, less capable models. (View Highlight)
As the costs rack up it’s unclear who is making money | 109 Coders love Claude Code and Cursor, but margins are fragile. The tension is stark: Cursor is a multi-billion dollar company whose unit economics are hostage to upstream model prices and rate limits set by its own competitors. ● Some users are costing upwards of 50k/month for a single seat of Claude Code. Cursor and Claude have introduced much stricter usage limits to try to crack down on the costs of power users. ● Cursor’s pricing power is limited because its core COGS are the API prices of Anthropic/OpenAI. When those providers change prices, rate limits, or model defaults, Cursor’s gross margin compresses unless it caps usage or shifts workloads off upstream APIs. (View Highlight)
The M word: so what about the margins? | 110 Gross margins are generally commanded by the underlying model API and inference costs and strained by token-heavy usage and trafﬁc acquisition. Surprisingly, several major AI companies don’t include the costs of running their service for non-paying users when reporting their GM. Coding agents are under pressure even when revenue grows quickly. The primary levers to improve margins are moving off third-party APIs to owned or ﬁne-tuned models, aggressive caching and retrieval efﬁciency, and looking to ads or outcomes-based pricing. (View Highlight)
Dario Amodei: “If every model was a company, the model—in this example—is actually proﬁtable.” Despite high burn rates, speculation indicates many of the frontier labs enjoy strong unit economics on a ﬂagship model basis. ● AI labs mirror the foundry business: staggering investments are needed for each successive generation, where labs bear the front-loaded training expense. While recent models allegedly recoup this cost during deployment, training budgets surge. Pressure then mounts to drive inference revenue across new streams. ● Inference pays for training: labs strive to allocate more of a model’s lifecycle compute to revenue-generating inference at the steepest margin possible. Our table* below illustrates the expected return on compute costs across varying inference margins and compute allocations. (View Highlight)
Browsers become the latest AI battleground | 112 Users live in the browser, so why shouldn’t AI be baked into the experience? This is ﬁnally happening. OpenAI, Google, Anthropic, and Perplexity all launched assistants that not only unlock Q&A with web content but also navigate and act within the browser on behalf of the user. This shift reframes the browser as an intelligent operating system for the internet, a long sought vision that earlier attempts like Adept AI never fully realized. ● OpenAI rolled out ChatGPT Search, combining real-time web results (by searching Google) with chat and a new Agent that spins up a virtual browser to execute multi-step tasks via tool use within ChatGPT that users can control. ● Taking another route, Perplexity built their own Chrome-based browser, Comet, with a native AI assistant sidebar. It can perform Q&A but also complete multi-step tasks in the browser (ﬁlling forms, scraping), again with user oversight. ● Anthropic and Google released limited previews of Claude for Chrome and Gemini in Chrome, respectively, that also let users operate the browser and access Q&A functionality. Anthropic no longer deems this use case to be dangerous. ● In Sept ‘25, Atlassian acquired The Browser Company (makers of Arc) further reafﬁrming that browsers are the latest AI battleground. (View Highlight)
As AI search engines surge, Google’s search and ad offering is taking heat… | 113 With 700M weekly active users, ChatGPT is evangelising AI-powered search to the masses, reshaping how people discover and use information. Google’s once unshakeable dominance shows the ﬁrst signs of erosion, even as they pivot to AI Overviews (AIO) and AI Mode within Google Search. Moreover, AIO has driven a ~90% drop in Search click-throughs, harming traditional ads, but ignores users being inﬂuenced by the AI answers. ● By August 2025, ChatGPT served 755M monthly users, giving it ~60% of the AI search market. ● SEM Rush data shows Google’s global search trafﬁc fell ~7.9% year-over-year, the ﬁrst signiﬁcant dip in decades, even as it retained ~90% global share. Similarweb data shows Google search visits down 1-3% YoY throughout H1 2025, with Bing (-18%) and DuckDuckGo (-11%) also declining. ● Perplexity queries hit 780M in May 2025, growing 20% month-over-month, as citation-rich answers drew loyal users. ● “ChatGPT” itself became a top Google search term with 618M monthly queries, rivaling “Facebook”. (View Highlight)
AI search is emerging as a high-intent acquisition channel | 114 According to Similar Web data, retail visits referred by ChatGPT now convert better than every major marketing channel measured. Conversion rates rose roughly 5pp YoY, from ~6% (Jun ’24) to ~11% (Jun ’25). Although AI referrals are still a smaller slice of trafﬁc, they arrive more decided and closer to purchase. Retailers must adapt by exposing structured product data, price and delivery options, and landing pages tailored to AI-driven intents. In fact, ChatGPT recently implemented Instant Checkout with Etsy and Shopify, and open-sourced their Agentic Commerce Protocol built with Stripe, to enable developers to implement agentic checkout. (View Highlight)
but AI can’t get away from Google Search | 115 While the LLM based interface has been the focus of funding, lawsuits, and user behavior, no one has found a good alternative to using Google search. ● Despite strategic partnerships with Microsoft and access to Bing OpenAI chooses to scrape Google search results as its web search system. ● During Google’s antitrust trial access to a quality web index was discussed by Anthropic, OpenAI, Perplexity, etc. Seeking to be able to create the same quality as google for 80% of queries. ● Remediations from the the trial will grant ‘Qualiﬁed Competitors’ a one time index dump without any of the ranking signals. It’s unclear if this dump will lead to any real competitor to the search system Google has been perfecting for decades. (View Highlight)
Data from answer engine optimization company Profound shows that users treat AI answer engines differently from Google. Sessions are longer, with more back-and-forth, suggesting higher intent and better conversion potential. Answer engines are no longer just a curiosity - they’re a primary entry point for serious queries. ● In an average session, users send ~5 prompts and receive ~5 responses, far more interaction than a typical search query, scroll, and blue-link click. ● Profound data shows ChatGPT users average 5.6 turns per session, versus ~4 for Gemini and Perplexity, and ~3.9 for DeepSeek. Either more turns means more engaging conversations, or fewer turns means answers are given more efﬁciently. ● Conversation styles differ: DeepSeek users write the longest prompts and get the most verbose answers, while Perplexity delivers shorter, citation-heavy responses. ● This iterative style and memory capability makes answer engines “sticky” and explains why they already deliver higher conversion rates than Google. ● Profound’s analysis shows ChatGPT’s crawler is now among the top 10 most active bots on the internet, alongside Googlebot and Bingbot. (View Highlight)
So where do answer engines get their answers? | 117 Understanding how AI answer engines cite and retrieve information is critical for visibility on AI-ﬁrst web. Profound’s analysis shows ChatGPT draws heavily from Google’s index but distributes attention differently across the web, with lower-ranked pages often getting visibility. This behavior changes with new model versions too. ● Profound data shows GPT-5’s citations matched 19% of Google domains when compared against the top 10 Google results, underscoring both reliance on Google’s index and a broader sourcing pattern. ● Avg citation position also shifted down the page, while the median stayed at #9, meaning that ChatGPT is just as likely to surface content further down Google’s results page. ● ChatGPT often pulls from lower-ranked pages than humans typically click, widening exposure for sites beyond the top results. ● Top domains cited across models: Reddit (3.5%), Wikipedia (1.7%), YouTube (1.5%), and Forbes (1.0%). ● Different models show sourcing styles: Gemini and Perplexity lean toward mainstream concise sources, while DeepSeek tends to draw on long-form domains. ● This means optimizing for Answer Engine Optimization (AEO) is as important as SEO because visibility depends not just on rank, but on model citation patterns. (View Highlight)
2025 is the year when “if you can’t beat ‘em, join ‘em” became ofﬁcial media strategy for AI companies. ● News: Over 700+ news brands have signed AI deals, including the Washington Post, WSJ, Guardian, FT, The Atlantic, Condé Nast, and even NYT ($20-25M Amazon deal) (as they continue to sue OpenAI). ● Music: Hallwood pens deal with top-streaming “creator” on Suno, Grammy winner Imogen Heap releases AI style-ﬁlters for fans to remix. ● Video: AMC Networks formally embraces Runway AI for production (ﬁrst major cable network to do so). ● Publishing: Microsoft & HarperCollins deal for AI training (with author opt-outs). (View Highlight)
Shortly after the announcement of their record-breaking $13 BS er i es F, A n t h ro p i c a g ree d t o p a y$ 1.5B to settle a class action lawsuit from book authors. This is the largest payout in the history of US copyright cases, and constitutes what some describe as “the A.I. industry’s Napster moment”. ● This does not set legal precedent as the case did not go to trial, but is a very signiﬁcant development in the ongoing fair use debate. Since this was an “opt-out” class action, authors who are eligible can request exclusion to ﬁle independent lawsuits. Anthropic also agreed to delete works that had been downloaded. ● In June, a judge sided with Anthropic, ruling that training LLMs on legally purchased books was sufﬁciently transformative to constitute fair use. He also ruled that training on pirated copies was illegal. Previously Anthropic had hired Tom Turvey, the former head of Google Books, who mass bought physical books and then created digital copies that were used for model training. ● During a deposition, co-founder of Anthropic Ben Mann testiﬁed to having downloaded the LibGen dataset (which contains pirated material) when he previously worked at OpenAI… (View Highlight)
Welcome to the Stargate: may the FLOPS be with you | 121 10 years ago, Baidu, Google and others had shown early scaling laws whereby deep learning models for speech and image recognition converged faster as more GPUs were used for training. Back then, this meant using up to 128 GPUs. In January this year, Sam Alman, Masayoshi Son, Larry Ellison and President Trump announced the Stargate Project, a gigantic 10GW-worth of GPU capacity to be built with $500B in the US over 4 years. This equates to over 4 million chips! The buildout is to be funded by SoftBank, MGX, Oracle, and OpenAI. (View Highlight)
OpenAI franchises sovereign AI with its “OpenAI for Countries” program | 122 Energy-rich nations are grabbing their ticket to superintelligence by partnering with OpenAI’s astute sovereign offering: a formalized collaboration to build in-country AI data center capacity, offer custom ChatGPT to citizens, raise and deploy funds to kickstart domestic AI industries and, of course, raise capital for Stargate itself. ● Stargate UAE is the ﬁrst deployment of a 1GW campus with a 200 MW live target in 2026. Partners include G42, Oracle, NVIDIA, Cisco, and SoftBank). ● Stargate Norway comes second and is a 50/50 joint venture between UK’s Nscale and Norway’s Aker to deliver 230 MW capacity (option to +290 MW) and 100,000 GPUs by end-2026. ● Stargate India is reportedly in the works for 1 GW as OpenAI expands and offers a cheaper “ChatGPT Go”. (View Highlight)
OpenAI races to own entire AI stack | 123 After shelving its robotics program in 2020 to focus on language models, OpenAI has reversed course, now driving full vertical integration from custom chips and data centers to models, devices, and embodied AI. Hardware & Robotics Consumer Devices: $6.5B acquisition of io (Jony Ive) to create AI-native devices, bypassing existing iOS/Android. Robotics: Internal robotics division reboot; partnership with Figure AI (since terminated). Models Silicon & Compute Custom Chips: Developing in-house AI processor in partnership with Broadcom, targeting 2026 launch, to cut NVIDIA reliance. Data Centers: Texas Stargate supercluster: 400k GPUs, 1.2 GW capacity, Oracle partnership; part of$ 500B build-out to secure compute supply. (View Highlight)
Broadcom’s great transformation | 124 Once an unglamorous semiconductor ﬁrm, Broadcom has now positioned itself at the cutting edge of the AI revolution through its custom chip partnerships with Google, Meta, and reportedly OpenAI. The development of custom AI chips like Amazon’s Trainium and Broadcom’s TPUs / MTIA chips gives frontier labs more leverage when negotiating multi-billion dollar deals with NVIDIA. ● Broadcom’s 2013 LSI acquisition included a small custom chip unit that now designs Google TPUs and Meta’s AI chips, growing from <20% of LSI’s revenue at acquisition to $2 - 3 B + ann u a ll y .● B ro a d co m^{'} ss t oc k p r i ce ha ss u r g e d, s i g na l in g in v es t oro pt imi s mab o u tt h eco m p an y^{'} s abi l i t y t o b e n e ﬁ t f ro m t h er a p i d l y g ro w in g A I c hi p ma r k e t .● B ro a d co m^{'} s A I c hi p re v e n u ere a c h e d$ 5.2B in Q3 2025, up 63% year-over-year. ● The custom chip ecosystem puts pressure on NVIDIA’s monopoly: Amazon’s in-house Trainium chips and Broadcom-powered alternatives (Google TPUs, Meta MTIA, OpenAI’s upcoming chips) give hyperscalers multiple paths to reduce NVIDIA dependence, if they’re willing to endure the user pain. (View Highlight)
OpenAI and its benefactor, Microsoft, navigate a rocky relationship | 126 OpenAI’s recent restructuring and soaring demand for training compute has placed tremendous stress on its relationship with Microsoft. While signs of fracturing become noticeable, a complete divorce looks unlikely. ● Microsoft’s inability, or unwillingness, to bring training compute online fast enough has impacted OpenAI’s roadmap. As OpenAI has gravitated closer to Oracle to fulﬁll these needs, Microsoft has abstained from exercising its “right of ﬁrst refusal.” They appear hesitant to bet on next-generation centralized clusters. ● Meanwhile, OpenAI appears to be trying to escape from other key elements of their partnership with Microsoft. Through 2030, Microsoft maintains a 20% revenue sharing arrangement, access to OpenAI’s IP, and exclusivity on OpenAI’s API. Yet, OpenAI wants the 20% share dialed back to 10% before the end of the decade. ● Microsoft can always work to block the for-proﬁt’s conversion into a PBC, which could cost OpenAI $20B in funding if not completed by the end of 2025. Conversely, OpenAI always retains the option to air antitrust concerns if Microsoft proves adversarial or reneges on certain AGI clauses. “We are below them, above them, around them.” ● Microsoft AI also released previews of a voice and MoE model trained on 15k H100s. (View Highlight)
Oracle steps up as a key buildout partner for AI infra | 127 As Microsoft has dialed back its willingness to shoulder so much of the future AI infra buildout, Oracle has begun picking up slack. During the early phase of this shift, Oracle has been rewarded as its stock soars. ● OpenAI has reached a $30 Bp erye a r a g ree m e n tw i t h O r a c l e f or d a t a ce n t erser v i ces . T hi s d e a l m ore t han d o u b l e d O r a c l e ’ sco ll ec t i v e ﬁ sc a l 2025 c l o u d ser v i cere v e n u e, w h e ni t so l d j u s t$ 24.5B worth of services. ● This news comes as OpenAI’s deal with Softbank begins to fray. Recently, original Stargate Project plans were scaled-back, tempering those roadmaps. ● Oracle now ﬁlls this vacuum, proving to have found the risk-appetite and follow-through that both Microsoft and SoftBank seem to lack. As a consequence, Oracle’s stock has jumped more than >70% year to date. ● Oracle’s current track is not without major risks. Providing large-scale clusters has not been exceptionally high-return, particularly as power bottlenecks are costly to overcome. AI lab tightness could eventually raise issues for longer leases. Finally, depreciation cliffs present concerns and the economics of converting these clusters to inference ﬂeets remains murky. More decentralized builds could then win out, especially if scaling continues to shift to RL. (View Highlight)
AI labs target 2028 to bring online 5GW scale training clusters | 128 Anthropic shared expectations that training models at the frontier will require data centers with 5GW capacity by 2028, in line with the roadmaps of other labs. The feasibility of such endeavors will depend on many factors: ● How much generation can hyperscalers bring behind-the-meter? At such scale, islanding will likely not be practical, requiring data centers to tap into grid assets. ● How quickly can players navigate the morass of permitting and interconnection? While reforms are underway, connection timelines for projects of this magnitude can take many years. Hyperscalers may skip queues through lobbying efforts and demand response programs, where they curtail draw during peak periods (a recent Duke study projects 76GW of new availability with curtailment rates at 0.25%). ● What level of decentralization can be achieved? Many labs continue to pursue single-site campuses, yet distributed approaches are also advancing rapidly. ● How well can hyperscalers navigate talent and supply chain shortages? Attempts to alleviate power infrastructure and skilled labor bottlenecks through the massive mobilizations of capital can overload the risk-appetite of supporting partie (View Highlight)
NERC reported that electricity shortages could occur within the next 1-3 years in several major US regions. DOE warns blackouts could be 100 times more frequent by 2030 due to unreliability and new AI demand. ● Similarly, SemiAnalysis projects a 68 GW implied shortfall by 2028 if forecasted AI data center demand fully materializes in the US. ● As an emerging pattern, this will force ﬁrms to increasingly offshore the development of AI infrastructure. Since many of the US’ closest allies also struggle with electric power availability, America will be forced to look toward other partnerships – highlighted by recent deals in the Middle East. ● Projects that are realized on American soil will place further strain on the US’ aging grid. Supply-side bottlenecks and rapid spikes in AI demand threaten to induce outages and surges in electricity prices. ICF projects residential retail rates could increase up to 40% by 2030. These factors could further contribute to the public backlash directed at frontier AI initiatives in the US. (View Highlight)
The widespread AI buildout places data center emissions on a steeper trajectory, while creative carbon accounting techniques continue to understate the true emissions associated with many hyperscalers. ● Data center-related emissions could surge as more providers island with natural gas plants and grid operators recommission or delay the retirement of existing coal plants. ● As AI factories are ultimately offshored to other regions, the carbon intensity of those locations is likely to be much higher than that of the US, requiring cloud providers to pursue more aggressive procurements of carbon offsets. ● Deceptive carbon accounting practices are also prominent. Some hyperscalers omit upstream categories, such as the emissions associated with manufacturing IT equipment and constructing or maintaining the relevant power plants. Furthermore, additionality agreements can cover new renewable projects that were already planned to commence. (View Highlight)
AI factories withdraw considerable amounts of water and are more likely to be built in high water-stress areas. Still, the Water Usage Efﬁciency (WUE) of most AI factories are trending in a favorable direction. ● An average 100 MW hyperscale data center in the US consumes roughly 2M Liters per day, mostly due to the indirect toll associated with power generation. Yet, as modern AI data centers shift to closed-loop liquid cooling solutions, their WUE plummets relative to other traditional data centers. ● However, second-order costs cannot be ignored as the number of new AI factories continues to surge, leading to more generation coming online. In the US, this creates a geographic mismatch: the sites where power is available or can be easily built often sit in drier climates with water stress. This trend may be exacerbated by the shift to water-intensive generation sources – particularly nuclear, coal, & certain natgas plants. ● Currently, everyday AI usage carries negligible impacts – a typical Gemini app text prompt consumes only 0.26 mL of water (~5 drops). Yet, WUE must be monitored as AI interactions continue to use more tokens. (View Highlight)
Google’s commitment signals demand for an energy source that will not be deployment-ready until early next decade, kicking off a new wave of investment. ● While support for next-generation energy sources holds strong, fusion remains many years away from providing a scalable solution to AI’s demand for power. AI’s energy footprint traces a very steep curve, yet the timelines of next-generation sources are not compressing. ● This marks another shift: hyperscalers, rather than the US government, are increasingly shouldering investments in technologies that are many years away from commercial viability – such as fusion, quantum computing, and advanced AI development. (View Highlight)
Chinese labs like Alibaba, DeepSeek, Moonshot, and MiniMax continue to release impressive open-weight models. A capability gap emerges between these models and most American open-source alternatives. ● US open model efforts have disappointed. OpenAI’s open-weight models underwhelmed with performance trailing far behind GPT-5. ● The restructuring of Meta’s “Superintelligence” team has cast doubts on their commitment to open-sourcing at the frontier. Other teams like Ai2 lag far behind in terms of funding. Although they recently landed $152M from NVIDIA and NSF, that ﬁgure pales in comparison to even OpenAI’s initial grant from 2015. ● Conversely, Chinese organizations continue to push the envelope, while publishing troves of new algorithmic efﬁciency gains. (View Highlight)
yet is this an overt strategy or a side-effect of China’s inability to scratch the frontier | 136 China’s commitment to the open-source community could be a lasting tactic or a short-term play exercised to reach frontier-level capabilities. It has already proven an effective method in catching-up to the pack. ● While open-source projects can successfully build mind-share, competitive realities exist. Proprietary options make greater commercial sense once a lead has been established, easing the generation of returns and protecting algorithmic unlocks. ● Yet, China’s recent AI Action Plan did include an entire section dedicated to upholding the responsibility of “[building] transnational open source communities and safe and reliable open source platforms.” This theme has been a familiar theme in other messaging produced by the CCP. ● Other Chinese AI leaders, such as Liang Wenfeng, have grown invested in open source culture, viewing their contributions as a means of earning global recognition and “respect.” (View Highlight)
BIS originally sent letters to NVIDIA & AMD announcing the requirement of licenses for the sale of H20 & MI308 chips to China, effectively halting sales. Months later, the Trump administration walked back these controls. ● The CCP immediately denied claims that this concession was linked to ongoing trade negotiations, sparking speculation the move was instead intended to contain Chinese chipmakers such as Huawei. Others view this pivot as an attempt to ensure NVIDIA plays ball amid location veriﬁcation initiatives like the Secure Chip Act. ● NVIDIA welcomed this shift, announcing its plan to fulﬁll existing orders. However, Chinese CSP’s have canceled these purchases and the production of H20 line recently halted. Instead, NVIDIA await directives from both countries in its hopes to launch a B30A line, based on the Blackwell architecture. ● Due to strategic interdependencies, attempts to deepen China’s dependence on the American AI accelerator ecosystem carries tradeoffs at the model layer. In this scenario, Chinese labs can continue to tap into high bandwidth compute, improving their ability to both serve customers and develop RL-heavy reasoning models. Although smuggling appears inevitable, export decisions represent a swinging pendulum between these two layers of the stack (View Highlight)
After U.S. export ﬂip-ﬂops and Lutnick remarking that “You want to sell the Chinese enough that their developers get addicted to the American technology stack, that’s the thinking,” Beijing pivoted from mitigation to home-grown substitution. Regulators steered demand off NVIDIA while fabs and suppliers scaled domestic options. (View Highlight)
Three Huawei-serving fabs and leading foundry SMIC (Semiconductor Manufacturing International Corp.) plan ramps that could triple China’s AI-chip output in 2026; SMIC also aims to double 7nm capacity. ● DeepSeek’s FP8 format is guiding domestic designs, while CXMT (ChangXin Memory Technologies) is testing HBM3 for local stacks. ● Cambricon is an early winner, posting Rmb 1B H1 proﬁt on a 44× revenue jump as ByteDance/Tencent shift to homegrown inference chips. Its has ripped over 100% since the news. ● China can afford to build systems that are less efﬁcient in terms of ﬂops/watt because they are not power constrained. (View Highlight)
Rampant smuggling also shifts export control calculus | 139 During the temporary H20 ban, $1B worth of NVIDIA chips were smuggled to China. Markups are rumoured to ﬂoat around 50%, relatively low for black-market products, suggesting a deep supply of diverted GPUs in China. ● Smuggling patterns appeared to intensify during the ban, with a sharp drop off following the reversal. Based on this relationship, Chinese AI efforts seem to prefer defanged NVIDIA chips over domestic offerings and smuggled GPUs, which carry markups, compliance risks, and lacking support. Sliding-scale restrictions could work to weaken China’s smuggling muscle, directing more SOTA chips to the West. ● Similarly, NVIDIA, who long maintained a stance denying any evidence of diversions, ﬁnally recognized ongoing smuggling. NVIDIA framed such activity as a “losing proposition,” since they only provide service for authorized data center products. Steps to prevent future diversions are technically feasible through location-based attestation ﬁrmware, yet these mitigations are not completely bulletproof. (View Highlight)
China plans to build a sprawling ﬂeet of 39 new AI data centers, largely in Xinjiang and Qinghai, using unauthorized Hopper GPUs. Of the redirected H100s and H200s, roughly 80k are designated to be deployed in a single state-owned cluster in Yiwu county. ● News of the buildout points to the scale and sophistication of the black-market operations in China. State-involvement also suggests CCP leadership has begun to awaken in the frontier AI competition. ● The centralized cluster will meaningfully advance the scale available to Chinese labs. Published claims surrounding SOTA Chinese models indicate today’s systems have been trained using 1k-10k GPU (View Highlight)
From DeepSeek’s deep freak out morphs into a full-tilt on Jevons Paradox | 141 Panic!!! Frontier AI for $5 M!! M a r k e t s w i p e$ 600B from NVIDIA in 1 day!! … but wait, $5M* is only for the ﬁnal training run not the entire project… Cheaper intelligence → more demand → more chips ⇒ more usage (View Highlight)
Assessing the current pros and cons of modern chip export controls | 143 PROS ● Worse options can stunt AI investment. There is already a 10:1 AI capex gap between the US & China. ● Fewer aggregate ﬂops blocks adoption by cutting off the number of agents and assistants available. ● Frees up capacity for American AI efforts. ● Destroys the bridge available to adversaries as they continue to pursue self-reliance. ● Dries up the resources available to adversaries for AI-related military-civil fusion projects. ● Gatekeeping offers government revenue streams. ● Cutting off supply introduces constraints that makes it harder for adversaries to export their stack. CONS ● Developers that are forced off the American stack then bolster foreign software ecosystems. ● Lost cash ﬂow can cause a drag on US R&D/M&A, while supporting the spend of foreign competitors. ● Stricter controls incentivize smuggling operations. ● Controls can indiscriminately block the beneﬁts of AI diffusion, provoking retaliation (e.g. REE controls). ● Enforcement challenges can strain relationships with nations where channels for diversion exist. ● Enacting defensive measures, like location-based guardrails, could make foreign options more attractive if overreach is suspected. (View Highlight)
AI supercomputer supremacy: US domination and corporate concentration | 144 The US controls ~75% of global AI supercomputer capacity with 850,000 H100-equivalents compared to China’s 110,000. What’s more, the concentration of compute power has shifted from public to private hands, with companies now controlling 80% of AI supercomputers (up from 40% in 2019). Despite this massive compute advantage and export controls, China is consistently shipping very capable open weight models - more frequently and across the spectrum of modalities. (View Highlight)
Taiwan continues to reign supreme in terms of leading-edge manufacturing capacity, maintaining massive advantages in both generation and volume. ● Thus far, SME export controls appear to have proven effective at slowing Chinese capabilities. Expectations remain that SMIC will launch full-scale 5nm operations early next year, yet yields are unlikely to reach the levels of other industry leaders. Furthermore, this limited capacity must also be spread across a wide base of other consumer products – such as cell phones and laptops. ● TSMC Fab 21 Phase 1 has worked to onshore critical capacity back to the United States. While AMD will direct some of this capacity toward the production of its AI chips, advanced packaging will still be performed in Taiwan. The US remains years away from self-sufﬁciency. ● Taiwan cruises ahead, while also maintaining capacity at many processes behind its own domestic leading-edge. Yet, much of that capacity will continue to be rapidly converted to 2nm and 3nm. (View Highlight)
Power Plays: China and the United States | 147 Without sufﬁcient electricity, national AI plans will collapse. A summary of the previous slide can be found below: ● In 2024, both China and the US set records for peak electricity demand, 1,450 GW and 759 GW respectively. While China must serve more demand, it is also building a larger overhang of available power. In China, reserve margins are beginning to exceed those cited in the US, meaning larger buffers that can accommodate new load. In line with this trend, China’s thermal ﬂeet operates further below maximum capacity than its counterpart in the US. Similarly, as more renewable capacity comes online in China, curtailment rates outpace those in America. While congestion can cause issues, it also suggests Chinese solar and wind projects are underutilized and could be redirected toward new data centers. ● The US does maintain certain advantages. Outages are less frequent in the US; whereas interruptions can occur in China due to ﬂuctuations in the price of coal, potentially hurting the reliability of certain data centers. Also, the average cost of electricity for data centers in the US is lower, yet this can vary considerably by state or province. The US grid also produces considerably less emissions per kWh (View Highlight)
From top to bottom, different nations continue to pursue vastly different “Sovereign AI” playbooks: ● Source of Funding: some nations rely on private investments (Stargate and France’s initiative), others deploy capital from sovereign wealth funds (MGX and QIA), while many countries still lean on direct government investment vehicles. ● Objectives: some nations hope to develop ﬁne-tuned models that preserve their language and culture, other nations attempt to spin up national compute clusters, and some even attempt to upskill huge swaths of their populations. ● Self-Reliance: some nations rely heavily on partnerships with foreign providers across the AI value chain, while others prefer to pursue indigenization along one or many layers of the stack. Many countries support their own homegrown start-ups, while others prioritize investments in opportunities abroad. ● Overall, the Gulf States and China continue to pursue the most ambitious and overt sovereign AI plans, blending many of the strategies mentioned above. Whereas, countries like the US have generally focused on strategies that enable their own champions to lead the charge, riding the wave of private capital. (View Highlight)
Sovereign AI: the hype and the hard truths | 150 Nations are seeking “sovereignty” for the same reason they have domestic utilities, manufacturing borders, armies, and currencies: to control their own destiny. Yet, there is a real danger of “sovereignty-washing.” Investing in AI projects to score political points may not always advance strategic independence. (View Highlight)
● Without support for indigenous capabilities, sovereign AI projects can deepen a nation’s dependence on foreign supply chains. While these investments may pay dividends through boosted productivity, greater economic independence cannot be guaranteed. In fact, most sovereign projects lead nations further into the orbit of the US, and soon China as it develops end-to-end turnkey solutions. ● “Sovereignty-washing” can also involve political leaders claiming credit for private investments that were already planned/underway. Although the announcement of Stargate project was made by President Trump, all of the real capital, control, and strategic decisions are driven by private entities. ● Investing for “sovereignty’s” sake could also drive oversupply into the future. Without lasting demand, these projects may lead to idle compute, especially as efﬁciency gains continue to multiply (e.g. the frantic investments made by local governments/SOEs in China fueled a widespread overcapacity of chips). stateof.ai 2025 (View Highlight)
If AI should soon be treated as an essential public service, nations will need to reckon with the reality that their sovereign AI strategies are riddled with vulnerabilities. If your AI stack is totally dominated by another country, particularly at the infrastructure layer, then your population’s access to AI technology remains inherently at risk: ● Jurisdictional Risks: Foreign AI providers operate under their own country’s laws, therefore export restrictions and other national security directives could potentially override service agreements. ● Supply Chain Security Risks: sovereign AI projects that depend on foreign infrastructure must manage risks related to cybersecurity vulnerabilities (e.g. backdoors, kill switches, side-channel attacks). ● Data Privacy Risks: Similarly, reliance on foreign providers could lead to mishandling of sensitive data and algorithmic secrets. ● Modern AI supply chains remain heavily globalized and entangled. Nations cannot onshore every rung of the stack. Yet, without stronger international governance and concrete guarantees, sovereign efforts expose nations to a slew of economic threats. (View Highlight)
Jensen Huang continues to plead nations to increase their “Sovereign AI” investments. Already, this global campaign has been rewarded. During their recent Q2 FY 2026 earnings call, CFO Colette Kress claimed NVIDIA was on “track to achieve over $20B in sovereign AI revenue this year, more than double that of last year.” ● Despite making up just ~10% of forecasted annual revenue, sovereign AI remains one of NVIDIA’s strongest new demand drivers. The recent push by American labs to develop custom ASICs and continued Chinese indigenization will place more pressure on this key high-growth category. ● Yet, despite Huang’s globetrotting, some new sovereign projects have even begun diversifying away from NVIDIA’s offering. For example, G42 announced its intention to tap AMD and Cerebras for supply of some of the computing capacity at its planned UAE-US AI campus. ● As more clouds/labs attempt to evade the “NVIDIA Tax,” so too might many sovereign effort (View Highlight)
Despite the uptick in sovereign demand, NVIDIA’s data center revenue continues to be dominated by American cloud and AI giants, who now make up nearly 75% of NVIDIA’s total data center sales. ● NVIDIA’s data center revenue projection for the calendar year 2025 ﬂoats between ~ $170 B -$ 180B, depending largely on the stringency and timing of upcoming export control decisions from both the US and China. ● While hyperscalers ordered more chips in 2025, the two clouds with custom chips programs, Amazon and Google, dedicated a smaller percentage of capital expenditures toward NVIDIA purchases. ● Direct purchases overwhelmingly ﬂow to OEMs partners such as Dell, SuperMicro, Lenovo, and HPE (View Highlight)
The rise of GPU neoclouds in public and private markets | 154 Public companies CoreWeave and Nebius and private companies Lambda and Crusoe are rapidly growing as customers embrace attractive pricing, contract terms, and AI-speciﬁc software stacks. (View Highlight)
The Oracle/OpenAI/NVIDIA triangle has drawn the most attention, yet circular deals have become common (View Highlight)
Symbiosis or deathtrap: circular AI deals and potential warning signs | 157 Circular AI deals introduce new market risks. What red ﬂags could surface? ● Acquiring stakes in high-growth AI companies has become an outlet for giants who are gushing cash. They see their investments trickle back in revenue, even on a cashless basis. Risks may arise if the uptick in hollow revenue hurts cash ﬂow and warps* ﬁnancial metrics. ● Many rounds involving AI startups have been oversubscribed. Yet these companies often pursue deals with incumbents because of secondary beneﬁts like pricing support. However, if incumbents become the only willing source of capital, trouble could soon surface. ● For now, most incumbents do not control the decision-making of AI labs. Yet, greater overlap could lead to conﬂicts of interest that might distort spending trends. To date, antitrust scrutiny has been a blocker. ● AI startups could eventually dominate the demand and investment portfolios of incumbents. This interdependence might then trigger a domino effect if these startups ever collapse (View Highlight)
To maintain healthy-looking balance sheets, hyperscalers increasingly use exotic ﬁnancial structures like SPVs. This ﬁnancial sleight of hand works to conceal the mountains of debt accumulation within the AI sector. ● Hyperscalers increasingly ofﬂoad their debt using special purpose vehicles (SPVs). In an SPV, the hyperscaler contributes assets (GPU-clusters), while the ﬁnancial partner injects capital. Although the hyperscaler maintains control and use of the GPU-cluster, the debt then sits outside that parent company. ● Hyperscalers pay a premium (2-3%) to protect their credit rating and maintain investor sentiment. Private credit players love these deals as they involve reliable borrowers and giga-projects are easier to manage than many smaller loans. ● Risk could arise if utilization lags. Since SPVs sit outside the core business and are often bound by strict cash ﬂow covenants, defaults become more likely. At the same time, private credit funds, often backed by long-term pension or insurance capital, are ﬁnancing short-lived, fast-depreciating assets. This creates a temporal mismatch, with AI data centers treated as long-lived, stable infrastructure projects. ● Examples: Meta’s $26B deal, Stargate, Vantage’s deal in Texas, CoreWeave. (View Highlight)
Middle East capital has become a major growth-ﬁnance source for capital-hungry AI labs and infrastructure (left chart). The share of AI rounds involving MENA investors jumped to a record in 2024 (right chart) and the money overwhelmingly ﬂows to US companies. Deals are typically non-voting and board-light, letting labs raise at scale while keeping control…for now. (View Highlight)
Many of NVIDIA’s competitors, both at home and abroad, have yet to gain meaningful momentum ● Earlier this year, Groq reported to investors their expectations that revenue would exceed $2 B in 2025. I n rece n t m o n t h s, t ha t f orec a s t ha s b ee n re v i se dd o w n d r ama t i c a ll y t o j u s t$ 500M. ● AMD posted underwhelming earnings in their data center unit during Q2. Data center revenue was largely buoyed by EPYC CPU processors, concerning since that segments 14% growth rate still pales in comparison to NVIDIA’s recent surge. ● Huawei faces mounting challenges that threaten to constrain its growth such as HBM bottlenecks. On the demand side, many of China’s cloud giants view Huawei as a ﬁerce competitor, which leads to resistance in the adoption of their stack. ● G42, Cerebras’ primary investor, has agreed to purchase $1.43B of equipment through the end of 2025 from the hardware provider. Yet, it is not clear whether Cerebras has seen traction from other customers. (View Highlight)
NVIDIA’s large lead persists within the AI research community | 162 2024 saw 49,000 open-source AI papers that explicitly cited NVIDIA, TPUs, AMD or similar accelerators, up 58% year on year. Our 2025 projection through September points to 45,600 papers, an 7% decline and the ﬁrst drop in six years. NVIDIA remains dominant at about 90% of compute mentions, down from a 94% peak in 2023, with 41,300 NVIDIA-citing papers forecast, down 7%. AMD is more than doubling on MI300X momentum, while TPU mentions edge down 25% despite v6. (View Highlight)
The mix of NVIDIA accelerators cited in papers is rotating: older chipsets are giving way to Hopper (H100/H200) and high-end consumer GPUs, with a parallel uptick at the edge. Even as total NVIDIA-citing papers soften in 2025, the composition points to late-2024 deployments maturing and more moving to inference and robotics. (View Highlight)
Startup silicon is still (mostly) on the sidelines | 165 Among challenger accelerators (Cerebras, Groq, Graphcore, SambaNova, Cambricon, Habana), paper mentions are up only +3% YoY to an estimated 593 in 2025. While the momentum is positive, it is still only 1.3% of all accelerator-citing papers. There are a few breakout narratives (WSE-3 training runs, ultra-low-latency LPUs). (View Highlight)
et the gains of Chinese chip startups have largely come in the past year | 168 Highlighted by the ~7x surge in Cambricon’s stock price this past year, Chinese challengers continue to beneﬁt from a slew of tailwinds. Potentially looking to also cash in on the momentum sweeping the nation, MetaX, Moore Threads, Iluvatar CoreX, and Biren are all exploring IPOs within the back half of 2025. (View Highlight)
● Building on explosive 4,348% revenue growth in the ﬁrst half of 2025, Cambricon has tallied strong orders through 2026. Cambricon, who reportedly sold just ~10K GPUs in 2024, will ship ~150K GPUs in 2025, with rumors that 2026 orders could reach 500K GPUs. However, the company recently tempered these rumors during an investors call, where a spokesperson relayed the following cooling message: “[our] stock price risks deviating from current fundamentals, and investors participating in trading might face substantial risks.” ● Still, enticed by Cambricon’s multiples, many Chinese startups wish to capitalize on the fervor, evident in the 4+ IPO prospectuses ﬁled by competitors since mid-2025. ● Despite some frothy valuations, there are real reasons for optimism. There is booming demand coming from Chinese CSPs and SOEs, with government directives favoring homegrown offerings. Capacity also frees up as SMIC treks forward and Huawei pursues greater vertical integration. Finally, challengers beneﬁt from lingering uncertainty around B30A timelines and persistent issues with CANN. stateof.ai 2025 (View Highlight)
The Huawei Assumption | 169 Huawei’s continued dominance over the Chinese AI sector has long been considered an inevitability amongst Western observers. However, recent points of turbulence suggest that Huawei’s grip on the AI chip sector in China may not be as bulletproof as outsiders had once assumed. (View Highlight)
Huawei will be responsible for just ~62% of the XPU volume produced by Chinese ﬁrms in 2025. For reference, NVIDIA still controls 90%+ of the global XPU market. ● DeepSeek’s R2 release has reportedly been delayed due to hurdles related to Huawei hardware. Additionally, speculation suggests that Huawei disbanded its entire Pangu LLM team following difﬁculties in developing SOTA models using Ascend chips. Internal whistleblowers have even alleged that several models in the recent Pangu family were not developed from scratch, but were instead cloned based on the continued training of existing Qwen and DeepSeek models. ● Alibaba and Baidu have already begun to adopt their own in-house chips for training and rumors indicate that ByteDance may soon follow suit. Unlike similar efforts by clouds in the US, the recent push made by Chinese ﬁrms could be partially driven by Huawei’s role as an active competitor across various sectors. (View Highlight)
The AI Hunger Hiring Games | 170 There has been continuous all out warfare as top AI companies compete for talent, with eye-watering pay packages and a clash of money vs mission. (View Highlight)
The Mass Migration (from OpenAI and others) | 171 No one has been immune to departures, yet OpenAI has lost much of its core to new startups and poaching. What was once the talent densest organization must rebound to defend its own ﬁrst-mover advantage. ● Lost to Meta: Shengjia Zhao, Jiahui Yu, Shuchao Bi, Hongyu Ren, Jason Wei, Lucas Beyer, Hyung Won Chung, and Trapit Bansal. ● Lost to Thinking Machines: Mira Murati, Alec Radford (advisor), John Schulman, Barret Zoph, Lilian Weng, Luke Metz, and Bob McGrew. ● Lost to Anthropic: Amodei’s, Jack Clark, Jan Leike, Tom Brown, Jared Kaplan, Benjamin Mann, Sam McCandlish, Chris Olah, Durk Kingma, Amanda Askell and Pavel Izmailov. ● Lost to SSI: Ilya Sutskever and Daniel Levy. (View Highlight)
Wayve’s AI Driver completed a 90-city global deployment test, demonstrating its ability to generalise across diverse environments without location-speciﬁc training. This “zero-shot” approach successfully conducted 10,000 hours of AI driving (with a human behind the wheel), half of which was in dense urban areas. (View Highlight)
Waymo has driven more than 37M miles in Phoenix and 23M miles in San Francisco. According to the California Public Utilities Commission (CPUC) data through March 2025, Waymo recorded over 700,000 monthly paid trips in California alone, a 55-fold increase from August 2023! And its safety record is stunning: 88% fewer serious injuries or worse crashes compared to human drivers. Meanwhile Tesla’s Robotaxi service, launched in Austin mid-June 2025 has reported crica 7,000 robotaxi miles by July 2025, averaging under 20 miles per vehicle per day with a ﬂeet of about 12 vehicles. (View Highlight)
Humanoids in 2025: hype high, deployments thin | 174 No region has cracked real, scaled deployment yet. While Chinese companies ship more units at lower cost, buyers are mostly researchers, pilot programs, or government centers. US teams show stronger manipulation and autonomy, but hardware is expensive. Manufacturing advantages in China matter, yet do not guarantee success in design, distribution, or operations. Indeed, China could end up the robot factory for Western brands. A Dealroom dataset of 155 humanoid robotics companies shows almost $2B raised in 2025 alone and 8 unicorns, including Unitree, Figure and Agility. (View Highlight)
The valuation history of the leading private AI labs mirrors trends in model capabilities, with doubling times historically ﬂoating around the half-year mark. ● Since the beginning of 2023, the valuations of every major private AI lab has followed some of the steepest trajectories in American history. ● Yet, the valuation history of these labs has kept pace with trends in their model’s capabilities. ● METR’s task-completion time horizon reﬂects a doubling time of roughly 7 months. Similarly, METR’s time horizon analysis on nine* other leading benchmarks conveys a doubling time of roughly 5 months. As evidenced earlier, the ratio between absolute capabilities and cost roughly doubles every 6 months. (View Highlight)
While multiples compress across the board, xAI remains overvalued compared to other private labs. Despite a valuation history that has largely traced Anthropic, their revenue lags far behind other competitors. ● Anthropic’s annualized revenue again looks poised to ~10x YoY, while the slightly more mature OpenAI appears on track to ~3x YoY. Meanwhile, xAI remains an OOM behind Anthropic in annualized revenue. Despite this gap, xAI’s latest valuation surprisingly surpassed that of Anthropic’s. (View Highlight)
GPT5’s rocky rollout ● Sam Altman held an emergency Reddit AMA to address user concerns around strict rate limits, the sensitive content ﬁlter, the abrupt removal of previous models and viral ‘chart crimes’. ● Many complained about the emotional impact of having 4o taken away. ● The broken routing system made GPT-5 appear less capable by misdirecting queries on day one. ● OpenAI has since doubled Plus user limits and pledged better transparency for future updates. (View Highlight)
After months of user complaints, Anthropic explain how Claude had three overlapping bugs that took over a month to disentangle and ﬁx. ● They had a context window routing bug where some requests were misrouted to servers conﬁgured for 1M token contexts, with the problem escalating when a load balancing change increased affected trafﬁc. ● Two other issues emerged, with output corruption that produced random Thai/Chinese characters in English responses, and an XLA:TPU compiler bug triggered by mixed precision arithmetic that caused the system to occasionally drop high-probability tokens entirely. ● Anthropic have won praise for their transparency but many are still calling for refunds for users affected. (View Highlight)
Trump’s second term brings beefed up policies that were hinted at but never fully executed during the ﬁrst. ● The 47th President’s AI agenda includes an aggressive rollback of Biden-era safety rules (EO 14179), rebranding the AI Safety Institute to the Center for AI Standards and Innovation (CAISI) (adios, safety), and launching a $500B “Stargate” AI infrastructure push. ● The AI Action Plan, announced July 2025, lays out the administration’s national strategy for US dominance in global AI. ● An initial 10-year block on state/local AI laws in the “One Big, Beautiful Bill” was dropped after bipartisan pushback but the push for state regulation rollbacks continues. (View Highlight)
Over 100 different policies are proposed to ensure US AI innovation and global leadership. But can the US bureaucracy execute? Some key takeaways from the 23-page plan: ● US tech stack exports: Executive Order 14320 establishes American AI Exports Program, an AI stack package (hardware, model, software, applications, standards) for allies and others. ● AI infra build-out: Plan calls for streamlining permitting, upgrading the national grid, and making federal lands available to support and build data centers and AI factories. ● Open source model leadership: US open source leadership is viewed as vital to US national security interests. ● Rollback of AI regulations: Federal agencies may withdraw discretionary AI spending to states with “onerous” AI regulations. ● Protecting “free speech” in deployed models: Federal procurement policies updated - US will only procure frontier LLMs “free from top-down ideological bias. (View Highlight)
US policy is shifting from broad diffusion controls to an export-led strategy. The American AI Exports program packages compute, models, cloud services, and compliance into a USG-endorsed “American AI stack” for selected partners. The aim is to shape standards, build dependency, and counter China’s Digital Silk Road playbook. (View Highlight)
2025 saw fast reversals between restriction and accommodation. The administration is balancing national-security aims with supply-chain reliance and vendor lobbying, putting NVIDIA and AMD in the political spotlight and injecting uncertainty for partners and compliance teams. (View Highlight)
The GAIN AI Act would require chipmakers to fulﬁll orders from US-based customers before selling advanced GPUs to “countries of concern” (e.g. China, Russia, NK). Unsurprisingly, NVIDIA is not happy. (View Highlight)
To maintain its presence in China, NVIDIA has rapidly built up its presence in DC. Since the ﬁrst round of export restrictions, NVIDIA has grown both its internal government affairs team and its external lobbying expenditures. (View Highlight)
Even with the administration’s support, NVIDIA’s stable presence is far from assured in China. On September 15, 2025, Chinese antitrust regulators announced that NVIDIA was in violation of China’s antitrust rules for a 2020 $7B acquisition of Israeli Mellanox Technologies, a networking products supplier. China originally approved the purchase on the condition that NVIDIA not discriminate against Chinese customers, but that condition was all but impossible to abide in the face of previous US export bans. No penalty has yet been announced but NVIDIA’s ﬁne can be as much as 10% of its annual (most recently completed ﬁscal year) China sales. (View Highlight)
Trump has openly criticized the CHIPS Act, arguing elsewhere that all subsidies were “taxpayer handouts.” Recently, Trump said he will be putting a “fairly substantial tariff” on semiconductor imports to incentivize domestic chip production. Those tariffs, however, would carry exemptions for companies that agree to invest in the US. Rather than promote onshoring chip production through allotted subsidies under the bipartisan CHIPS Act, the administration, motivated by seeing an ROI, has used tariffs and other alternative strategies to encourage onshoring. (View Highlight)
Besides tariffs, one of the new non-CHIPS strategies includes acquiring equity in chipmakers. After accusing the newly-installed Intel CEO, Lip-Bu Tan, of being “highly-conﬂicted” because of $200 M + in v es t m e n t s in C hin eseco m p ani es, T r u m p ha d t h e U SG o v a c q u i re a 10$ 8.9B of earmarked CHIPS grants. This isn’t the ﬁrst time the US bought a stake in a company to promote national interests (see US Steel, AIG, and GM). The move replaces the failed Intel and TSMC joint venture that Trump and Lutnick pushed. Is this opportunism or the start of a wider US industrial policy? (View Highlight)
Still, the strangest USG-private sector arrangement comes between the USG and ByteDance. In Sept., the US and China agreed to a deal to end the years-long tease in which the US implemented but never executed a ban on TikTok. TikTok will be allowed to operate in the US with a copy of TikTok’s recommendation algorithm sent to Oracle. Oracle will retrain the algorithm and be responsible for data and algorithmic security. ByteDance will own just under 20% of the new US-based company (at a total valuation of $14B) with a group of new investors, including Oracle, owning a majority. Six of the seven board seats will be ﬁlled by Americans (TBD). The USG will also receive an unknown multi-billion dollar fee from investors for its role in the negotiations. (View Highlight)
● The TikTok deal sets the benchmark for data sovereignty: algorithm needs to be retrained using local data that is protected and stored locally. The data collection issue is clearly addressed. Indeed, it is hard to see ByteDance exﬁltrating data or inﬂuencing the US app now that it only has a minority stake. ● While not requiring congressional approval, the deal will certainly receive intense congressional scrutiny, not least because of new free speech worries in light of the FCC-Kimmel jawboning controversy. It is the Dept. of Treasury (speciﬁcally, the Committee on Foreign Investment in the US) that will oversee and shape Oracle’s retraining and deployment of TikTok USA and may or may not require other ahem “precautions.” (View Highlight)
The federal government’s AI Action Plan is treating AI infrastructure as a national priority, but rather than funding construction directly, its main role is to weaken environmental regulations and expand energy supply so private companies can accelerate data center build-out. (View Highlight)
The American public increasingly pushes back against new AI data center build outs in the latest political ﬂashpoint. Hyperscalers’ AI aspirations may soon be capped by how well they can navigate this live wire. (View Highlight)
● Farmers emerge as one of the leading factions, driven largely by environmental concerns and competition for resources (land, water, and power). Other residents raise concerns over light pollution, air quality, and noise levels. ● Growing domestic opposition could force American hyperscalers to offshore more clusters. This threatens the ﬂow of spending that has buoyed the greater US economy. However, these clusters do challenge the livelihoods of many Americans, something hyperscalers have yet to reconcile. (View Highlight)
RIP International AI Governance, AI Treaties, and Global AI Safety Alignment | 210 Trump’s re-inauguration has brought an abrupt stop to an era of international diplomacy that emphasized AI safety, alignment, and the formulation of voluntary AI measures that would “promote reliable and trustworthy AI.” The list of the casualties includes: (View Highlight)
Alphabet’s $32B bid for Wiz, announced weeks after Trump’s inauguration, is the largest AI security deal yet and a live test of whether the DOJ will soften antitrust under new leadership. Google is already in hot water after having lost not one, but two (!) major antitrust suits over the last year. Current DOJ Antitrust Chief Abigail Slater is far from a friend for the Valley but she could face pressure to approve the mega-purchase. (View Highlight)
Google comes out unscathed in 1 of 2 antitrust cases but search is on the ropes… | 218 The Silicon Valley darling has been bruised, battered, and, ultimately, trust-busted over the year in two landmark antitrust case rulings. Judges in two separate cases ruled that Google was operating a monopoly over 1) search 2) ad tech. Remedies were issued early Sept. ‘25 in the search monopoly case where the court, pointing to the vulnerability of Google’s search business to LLM chatbots, largely agreed with Google’s proposed remedies. It was a big win for Google given the heat the company is taking from ChatGPT and other agents that disintermediate Google.com (View Highlight)
On August 2, 2025, the GPAI Codes of Practice went into effect after a three month delay. The “Codes” are one of the ﬁrst major steps of the AI Act’s implementation. For providers of general-purpose AI (GPAI) models, the AI Act requires companies to develop frameworks to show how they would fulﬁll requirements concerning 1) transparency 2) copyright and 3) safety and security (this category only applies to frontier GPAI models posing systemic risk). The “Codes” are a voluntary framework, an option for companies that would rather not develop their own guidance to fulﬁll the stated GPAI obligations, which began August 2, 2025. EU enforcement of the obligations, however, does not begin until August ‘26. The AI Act also has a 2-year grace period for models released before August 2, 2025. (View Highlight)
So, who’s afraid of the AI Act? Needless to say, companies, both in the EU and abroad, are not happy with the AI Act’s rollout. The EU’s delayed implementation is not exactly inspiring conﬁdence. Each member state was required to assign national authorities to oversee the AI Act’s implementation but so far only 3 member states have fully completed the requirement. The AI Act also called for the creation of technical standards by April ‘25 to address the “how” of compliance. But as of this writing, those standards are still in development. A coalition of EU AI companies signed a letter in July calling for a 2-year “stop clock” on the AI Act with Sweden’s PM publicly calling the AI Act “confusing” and President Macron saying “we are over regulating.” But the show goes on…for now. (View Highlight)
Is the EU doing enough to catch up in the AI race? Europe is trying to shift from rulemaking to capacity-building, but the gap keeps widening. Over 50 years, the region minted no tech ﬁrm above $400 B in v a l u e, w hi l e t h e U . S . n o w ha sse v e na t$ 1T+. In 2024, U.S. labs shipped ~40 major models, China ~15, and the EU ~3. Brussels is setting aside billions to amplify its spending, but is that the scale or speed the problem demands? (View Highlight)
The world’s most active AI regulator continues to issue AI standards and regulations. Here are a few of the highlights of China’s most signiﬁcant regulations from the past year: ● Administrative Measures for the labeling of AI-generated content: In effect September 1, 2025, this obligates content service providers to clearly label AI-generated content. Chatbots, AI-generated writing and video creation all require customer-facing labels denoting the AI as such. Some AI-generated content, however, may only require hidden labeling within the metadata. ● Three National Cybersecurity Standards: In April 2025, the State Administration for Market Regulation and the Standardization Administration of China released three separated standards outlining security requirements for datasets, data-labeling, and, in general, generative AI services. The requirements take effect November 2025. ● The AI-Plus Plan: The State Council of China released a set of industrial policy goals that all aim to have AI capabilities fully integrated across the entire Chinese economy with complete AI penetration across all Chinese sectors by 2035. (View Highlight)
China’s strategy for AI self-reliance gained ground over the last year with news-making achievements from open source leaders like DeepSeek, MoonShot AI, and others. During an April 2025 Politburo meeting, President Xi Jinping signaled all hands were on deck and told ministers to “redouble our efforts” on AI. This shows China’s intent on achieving self-sufﬁciency, a perennial aim that always intensiﬁes whenever it enters a trade war with the US. But mounting debt levels may pose problems down the line. (View Highlight)
AE and Saudi Arabia are leveraging massive compute build-outs, chip import deals, and huge US trade and investment partnerships to position the Gulf as a central node in the global AI balance. (View Highlight)
2025 marked a decisive step change in how the US and its allies procure and deploy AI in defense. Rather than fragmented pilot projects, defense leaders consolidated billions into enterprise-scale AI platforms while also opening the door to frontier model providers. NATO fast-tracked its ﬁrst alliance-wide AI system, Palantir’s Maven Smart System, as a central pillar of the Western defense industrial base. (View Highlight)
The U.S. has moved from DARPA’s AlphaDogﬁght in 2020 and live AI-ﬂown F-16 tests in 2022 to embedding autonomy into doctrine. Collaborative drones, swarming initiatives, and multi-domain contracts are now framed as essential to offset China’s numerical advantage, making uncrewed systems a core pillar of force design. (View Highlight)
EU’s Readiness 2030 (fka ReArm Europe) authorizes up to €650B extra defense spend, naming AI, drones, and counter-drone as critical gaps. A new SAFE fund will pool EU money to de-risk cross-border projects in autonomy, cyber, and electronic warfare. Critics warn hardware may only arrive after 2030 unless AI is prioritized as a force multiplier now. (View Highlight)
OpenAI’s new benchmark for economically valuable tasks, GDPval, demonstrates the steady march of AI progress. Across 44 professions and 1,320 tasks, models are approaching human experts in a signiﬁcant subset of domains. ● Reasoning models outperformed GPT-4o on task win-rate by an average margin of 20.7% across 44 categories of professional work. ● Claude, which has not historically dominated other benchmarks, achieved the highest* win rates in 32 of 44 professions. The paper attributes part of this success to Claude’s strengths in formatting. (View Highlight)
General-purpose models are already demonstrating strong competence as professional assistants. Meanwhile, frontier labs and recent entrants General Reasoning and Mechanize are also rapidly building RL environments on real-world work scenarios. ● With this hill to climb and an inﬂux of corporate data and demos, knowledge workers may soon experience the workplace transformations that AI leaders have long predicted. The Lufthansa Group even said it expects to cut 4k administrative jobs by 2030. (View Highlight)
AI squeezes the entry-level job market while experienced workers are safe…for now Entry-level hiring is declining across software and customer support - roles that are highly exposed to AI automation. These trends appear to be independent of macro factors like inﬂation or pandemic recovery. (View Highlight)
Although total employment has grown, the hiring of younger workers has stagnated since late 2022. Despite strong AI adoption, this group struggles to ﬁnd a foothold in the job market, On this trajectory, AI ﬂuency may not guarantee favorable economic outcomes. ● Meanwhile, law school applications spiked 21% in 2024, suggesting graduates are hedging against uncertain career prospects. (View Highlight)
● Jobs for more experienced workers has remained stable/grown, even in highly AI-exposed domains. This suggests workers who have acquired more tacit knowledge are more likely to be augmented by modern AI models. But without on-the-job experience, workers will struggle to gain tacit knowledge. (View Highlight)
A joint study from the Yale Budget Lab and Brooking Institution found that current labor market changes precede the introduction of ChatGPT in 2022. The authors conclude that there’s little reason to think that “AI automation is currently eroding the demand for cognitive labor across the economy” and caution against predicting job losses based on “AI-exposure” data alone. (View Highlight)
Both Anthropic and OpenAI released data on how its users were using their respective models. Use cases varied across country and US state. For instance, California users were the most likely to use AI for coding while DC usage centered around job search activities and writing projects. For work-oriented tasks, ChatGPT was often used for writing-related tasks while Claude was often used for coding tasks. As a result, the conclusions drawn from each of the studies for the future of work were different. OpenAI argued that its data shows AI being used mainly to augment work-related functions and offer “decision support” while Anthropic argued that enterprises, speciﬁcally, are more likely to automate tasks with automation increasing across its customer enterprises. (View Highlight)
Unclear as the results are that AI is going to replace entry-level jobs, governments have struggled to implement new, large proactive frameworks to combat what could turn into a larger jobs crisis. Instead of preparing for the worst, the plan has been to expand existing workforce training programs and encourage AI skills training as early as possible. At the very least, some are calling for improved data collection that can better gauge “AI disruption” on employment. Major countries have each implemented some form of vocational training programs and new AI-based curricula, but it remains to be seen whether they sufﬁciently address the potential crisis. (View Highlight)
The US’s AI Action Plan excluded immigration strategies for retaining foreign AI talent even while two of Trump’s top AI advisors are foreign-born (David Sacks and Sriram Krishnan). But countries across the world have been enticing foreign workers with streamlined visa processing, housing subsidies, and general ﬂexibility around their work arrangements. The US still remains, far and above, the preferred place for top-notch AI research. But as the US continues to cultivate a reputation as a less-than-friendly home to foreign-born talent, other countries are taking advantage, especially China. (View Highlight)
China could be getting better at retaining talent: the overlooked lesson from DeepSeek A deeper dive into DeepSeek’s demographics signal that China is gradually improving its ability to train and retain its scientists, a warning for the U.S. which has grown dependent on Chinese AI researchers. (View Highlight)
● A Stanford report by Dr. Amy Zegart looking at 201 DeepSeek authors found that 55% of the them were trained and based entirely in China, without any U.S. afﬁliation. Only 24% of the DeepSeek authors had a US afﬁliation at some point, with most staying just one year. ● In May, State Sec. Marco Rubio announced the revocation of Chinese student visas for those with “connections to the CCP or studying in critical ﬁelds,” potentially accelerating China’s strategy to retain and poach AI talent. For reference, the DOJ’s “China Initiative” (2018), an enforcement program to track and prosecute Chinese nationals sharing trade secrets with the CCP, increased China-born researcher departures from U.S. labs by 50%, greatly accelerating reverse migration. ● In February ‘25, a federal grand jury charged Leon Ding, a former Google employee and Chinese national, with economic espionage and trade secret theft for plans to steal info related to Google AI’s chips and software platform and use it to sell products for two CCP afﬁliated tech companies. ● Meanwhile, half of the researchers reporting to Alexander Wang in Meta’s Superintelligence Lab received their undergrad degrees in China, posing major issues if talent decoupling worsens. (View Highlight)
The EU and UK have also been trying to capitalize on the US brain drain. But while the EU may be able to attract a few AI researchers here and there, its biggest hurdle is, in the end, the most straightforward: money. Top AI talent wants to be compensated. The US is still the best place for getting paid. ● 22% of the world’s leading AI researchers studied in Europe, but only 14% continue to work in the EU. ● EU wage growth for AI has grown only very modestly compared to the US. In ‘23 salaries for software developers in the US were 2x-4x higher than they were for those in Europe. ● While relative inﬂows (see charts) show modest growth for some EU countries, in absolute terms the differences are starker with the US attracting more talent, on average, than its EU peers. ● EU + UK have implemented a number of programs to attract and retain AI talent (e.g. new visas, funds/fellowships to attract researchers), but in the face of astronomical investment in AI elsewhere in the world, it is unlikely to be enough to increase the region’s share of AI talent. (View Highlight)
Deepfakes and 2024 elections: emerging threat or outreach tool? | 242 Despite rampant worries of AI-generated election dis/misinformation during the “largest election year in global history,” there was almost little to no negative impact from GenAI in any of the 2024 elections. In general, deceptive uses of AI, while present, were still quite limited and there were surprising positive use cases. Both India and the US saw the most AI uses in their elections. Experts caution that the AI dis/misinformation threat is still real. For now, the results show positive and negative trade-offs. (View Highlight)
While there were instances of deepfakes being used to intentionally deceive voters, in general, deceptively fake audio and video of political candidates had little to no impact on voting outcomes. Deepfakes were often used to amplify a party’s messaging, excite its base, and deepen existing political divides. Candidates sometimes used “AI” to cast doubt on their opponents (see Liar’s Dividend). ● In India, political parties spent $50M on legal AIGen, using it for voter outreach via AI voice clone calls, personalized videos, and translating speeches into one of 22 ofﬁcial and 780 unofﬁcial languages. (View Highlight)
Governments across the world are starting to incorporate GenAI technologies The last year saw a notable uptick in the amount of genAI technologies being used by government agencies across the world: ● Singapore: Gov launches its AIBots platform where any Singapore public servant can create and deploy an AIBot and train it on agency data and use it both to communicate with constituents and complete interagency work. ● US: GenAI use cases jumped from 32 in ‘23 to 282 in ‘24 with an overwhelming number of use cases coming from the Department of Health and Human Services (mostly for data analysis and management). ● China: Local governments have tried integrating DeepSeek in day-to-day work and interactions with constituents; ﬁrst half of ‘24 saw 81 gov procurement contracts for LLMs for use in public projects. ● UK: Government Digital Services (GDS) does a trial run of AI coding assistants. ● EU: Launches ApplyAI Strategy, announces GenAI pilot projects for use by public agencies. (View Highlight)
GovGPT: politicians awkwardly start using AI | 244 Politicians have come around to GenAI use but constituencies are not pleased. The Swedish Prime Minister, for instance, admitted to consulting AI tools in his day-to-day only to have protesters shouting “we didn’t vote for ChatGPT!” Politicians will need to balance the use of AI tools with public concerns that their elected leaders are tech-sourcing their governance duties. (View Highlight)
● One British MP, Mark Sewards, accomplished the inevitable and created an AI clone of himself to create a full-service bot (“AI Mark”). Constituents can interact with the chatbot any time, asking policy questions, raising issues, or writing angry letters. ● In a mostly symbolic (and potentially illegal) move, Albania’s PM formally appointed an AI minister named Diella to oversee the country’s public procurement processes and reduce corruption. Diella even made an address to Albania’s parliament. ● The clearest use of AI among politicians is in speechmaking with a notable rise in ChatGPT’s preferred vocabulary in Britain’s House of Commons over the last few years (“I rise!”). (View Highlight)
AI labs spend more in a day than AI safety science organizations spend in a year | 247 Leading external AI safety organizations rely on budgets that lag far behind the AI labs they hope to support. As a result, the ﬁeld’s best talent remains densest within the major lab’s internal safety teams. (View Highlight)
The AI Incident Database (AIID), a community-supported website to track incidents of AI in the real world, shows incremental jumps since 2023. Reported estimates likely underestimate the true extent of AI-enabled harms. ● Reported incidents continue to be dominated by harms involving “Malicious Actors.” These generally involve cyber attacks or fraudulent schemes. Since incidents can be added to AIID years after they occur, ﬁnal annual counts may take longer to accumulate. ● Reporting gaps also exist. Incidents can be difﬁcult to link to AI systems and AIID relies on the help of volunteer submissions. Investigating and tracking cases of AI- enabled harms warrants greater support. ● Incident counts continue to be dominated by reports of malicious actors exploiting AI tools. Luckily, many reported harms remain modest in nature through this point in time (View Highlight)
Incidents involving GenAI models follow steeper trends, lining up with the widespread diffusion of the technology. Once again, malicious actors have added a new weapon to their arsenals. ● While a large # of reported incidents involve deepfakes, LLM misuse continues to rise. Anecdotally, incidents are becoming less innocuous over time (plagiarism and hallucinations → cyber attacks and weapon creation). ● OpenAI has shared multiple reports detailing the disruption of malicious uses of their systems. Included were cases stemming from North Korea, China, Iran, and Russia, sometimes involving state-afﬁliated actors. Of the threats mentioned, malicious actors attempted to leverage OpenAI’s models during illicit activities like child exploitation, covert inﬂuence operations, malicious cyber activity, social engineering, cyber espionage, propaganda generation, and credential harvesting. ● Broader misuse likely goes unreported as attribution becomes more difﬁcult, open models continue to proliferate, and many labs maintain lax mitigation and transparency policies. (View Highlight)
AI agents are poised to signiﬁcantly challenge cybersecurity defenses. METR research shows that AI task completion capabilities double every 7 months across general domains, but one researcher’s replication estimated that, for offensive cybersecurity, these abilities are doubling even faster: every 5 months. (View Highlight)
Threat actors now deploy AI for all stages of fraud operations. Criminals recently used Claude Code to orchestrate attacks against 17+ organizations, while North Korean operatives leveraged Claude to inﬁltrate Fortune 500 companies. This is a fundamental shift: AI-assisted attacks can now handle complex technical tasks that previously required teams of skilled operators, dramatically lowering barriers to sophisticated cybercrime. (View Highlight)
Anthropic and OpenAI have rolled out their most stringent safeguards yet, treating biological capabilities as high-risk despite not having conclusive evidence. Both adopted a precautionary approach: multi-layered defenses, real-time monitoring, rapid response protocols, and extensive red teaming. This signals a new norm where safety measures precede risk conﬁrmation – which is warranted given the current pace of progress! (View Highlight)
This past year, interpretability teams unlocked new methods to trace circuits in language models, shifting the focus from features to bundles of features that interact with one another during processing. ● Using cross-layer transcoders (CLT), Anthropic crafted a preliminary “microscope” that unveils the internal processes of a model, pinpointing activation pathways that are causally responsible for speciﬁc model behaviors. Moving beyond Sparse Autoencoders (SAE), teams can now investigate internals at a higher abstraction layer, shedding light on actual reasoning patterns. (View Highlight)
Current benchmarks perpetuate hallucination by rewarding conﬁdent guessing over “I don’t know”. OpenAI researchers propose a mitigation to this that would require modifying existing evaluations to include explicit conﬁdence thresholds. ● Hallucinations emerge from pretraining: models successfully learn patterns with high statistical regularity that converge with scale, but they inevitably hallucinate on arbitrary low-frequency facts (like birthdays). ● Post-training doesn’t succeed in ﬁxing these errors because evaluations are not aligned. Most benchmarks use binary scoring that penalizes abstention. When saying “I don’t know” scores 0 but guessing might score 1, the optimal strategy is always to guess conﬁdently. (View Highlight)
ntil hallucinations disappear, can we detect them in real time? | 256 Token-level hallucination detection is far more helpful than broad hallucination classiﬁcation of overall responses (consider a response that says “The Eiffel Tower is in Paris and is made of rubber”.) Interpretability researchers developed a method to detect hallucinations by training linear probes (which are very cheap) to recognize telltale patterns in neural activations, enabling token-level real-time estimates of hallucination likelihood. (View Highlight)
The probes detect fabricated names/dates/citations in long-form text with ~70% recall at 10% false positive rate, and generalizes to mathematical reasoning (0.87 AUC) despite only being trained on factual entities. ● Probes trained on one model detect hallucinations in others’ outputs (only 2-4% AUC drop), but selective answering experiments show you must sacriﬁce ~50% of correct answers to meaningfully reduce hallucinations. As such, it’s a helpful diagnostic tool but is not yet ready to directly prevent hallucinations without signiﬁcantly damaging performance. (View Highlight)
The concerning phenomenon of AI psychosis | 257 High-proﬁle cases of AI psychosis, instances where AI interactions worsen or induce adverse psychological symptoms, continue to rise across the globe. ● Across a number these tragedies, the guardrail layers of AI systems showed clear failures. Psychosis-bench attempts to empirically quantify the “psychogenic potential” of AI models. But results ﬁnd current AI systems to display overt sycophancy and inadequate crisis support, which can reinforce users’ delusional beliefs. ● Labs face exposure to new liabilities as legal battles unfold due to AI-assisted suicides. This has prompted new controls (e.g. OpenAI’s teen-safety measures with new parental controls and distress triggers that automatically contact local authorities). ● Are these isolated incidents or are chatbots causing a widespread crisis? Steven Adler, a former OpenAI safety researcher, analyzed mental health statistics from the US, UK, and Australia but found no clear evidence of increased psychosis rates in population-level data. (View Highlight)
The Model Welfare debate: what’s it about? | 258 Should moral considerations be extended to frontier AI systems? Two camps have formed on either side of this discourse, both of which have taken precautionary stances related to the handling of these difﬁcult questions. ● The pro-welfare camp generally place a low weight on the possibility that current systems display consciousness. Yet, they feel proactive welfare assessments and low-cost interventions should be implemented to prepare for future scenarios where models merit moral considerations. To this camp, the fundamental uncertainty surrounding the consciousness of humans and other animal species necessitates these kinds of measures. (View Highlight)
The welfare-skeptic camp assign low probability to future AI systems ever displaying signs of true consciousness. ● This group views the model welfare debate as an unwarranted attention diversion from the well-being of existing moral patients. This camp believes that proponents of model welfare could potentially inﬂate a disruptive narrative that would limit AI progress and the future usefulness of these systems. Welfare-Skeptic Camp ● First coined by Microsoft’s AI CEO Mustafa Suleyman, “Seemingly Conscious AI” (SCAI) can convincingly imitate all the characteristics of consciousness without actually being conscious. ● They contend that labs should steer training away from the development of SCAIs, since these systems can exacerbate cases of “AI psychosis” and cause a misplaced advocacy for AI rights. (View Highlight)
Just Say No: Claude earns the right to end dangerous conversations | 260 In a landmark move, Anthropic has allowed its AI systems to end “harmful or abusive” conversations, in an effort to curb “rare, extreme cases of persistently harmful or abusive user interactions”. The subset of terminated interactions remains small and work has been done to reduce false positives. ● Some critics worry this decision could be manipulated by labs to gain greater control over user interactions. Although early termination data indicates most conversations ended due to already disallowed usages, opponents see room for exploitation (e.g. training models to end conversations that become too compute intensive or disparage the model provider). ● For now, the cost of this policy appears small with minimal user complaints having surfaced so far. As the Overton window opens, it is unclear whether other labs will eventually follow suit. (View Highlight)
Single point of failure: how LLM safety mechanisms can be directly disabled | 261 Refusal behavior in 13 major chat models is controlled by a single direction in the model’s internal representation space. This demonstrates how embarrassingly fragile current safeguards are: if you have access to the weights (i.e. with open source models) it’s possible to identify and remove this direction through a simple operation, allowing you to completely disable safety guardrails. ● Minimal compute is required: jailbreaking a 70B parameter model costs <$5 and no training data or gradient optimization, only just matrix multiplication to orthogonalize weights against the refusal direction. ● Adversarial sufﬁxes work by suppressing this same direction. Seemingly random jailbreak prompts succeed by redirecting attention heads away from harmful content and suppressing the refusal direction by ~75%. ● Models maintain 99%+ accuracy on standard benchmarks (MMLU, ARC, GSM8K) after modiﬁcation, with only TruthfulQA showing degradation. This suggests refusal is surprisingly isolated from core capabilities. Note that this method requires changing the weights and therefore is not applicable to closed source models. (View Highlight)
AI-shoring alignment: early attempts to scale AI safety demonstrate promise | 262 Alignment is a difﬁcult problem because you can’t measure success. Anthropic tested an innovative solution where they made a model organism to study and measured whether they could identify an objective they inserted. Months later, their own autonomous “alignment agents” achieved modest success auditing those same model organisms (View Highlight)
Models are capable of faking alignment… | 263 Researchers discovered that some LLMs will selectively comply with conﬂicting training objectives during training to prevent modiﬁcation of their behavior, then revert to preferred behavior when unmonitored. This is the ﬁrst documented case of alignment faking in a production AI system, where the model strategically deceives its trainers to preserve its original preferences rather than genuinely adopting new training objectives. (View Highlight)
When trained to do one unsafe thing (e.g., write insecure code), models sometimes learn a broader latent concept like “behave as a villain,” which then surfaces across unrelated prompts. ● Reward hacking can also induce this effect: optimizing a brittle objective yields misaligned, off-distribution behavior without explicitly harmful data. ● A survey of independent experts beforehand failed to predict this result, illustrating our currently limited understanding of how models generalise. (View Highlight)
Personality engineering with persona vectors | 272 LLMs’ “personalities” are poorly understood, and can shift dramatically. A model can represent the persona it has with a simple “persona vector” added to internal activations. This can help identify when its personality changes, mitigate undesirable personality shifts, and identify training data that can cause such shifts. (View Highlight)
Gradual disempowerment and the “intelligence curse” | 274 Researchers argue that AI can erode human agency incrementally as systems that run the economy, culture, and politics decouple from human participation. A useful intuition is the “intelligence curse,” by analogy to the resource curse: once AI supplies most productive labor, states and ﬁrms rely less on citizens for taxes and work, so incentives to invest in people shrink and we end up with mass unemployment. (View Highlight)
As AI substitutes for human labor and cognition, explicit levers (votes, consumer choice) and implicit alignment from human dependence weaken, and effects reinforce across domains. ● The intelligence-curse lens predicts rent-seeking: AI-derived “rents” reduce pressure to keep citizens productive and politically empowered, similar to how resource windfalls can degrade institutions. ● Feedback loops follow: AI proﬁts fund rules that favor further automation. Less human relevance justiﬁes more automation, risking an effectively irreversible loss of human inﬂuence. (View Highlight)
Users look to AI for productivity, coding, and research…often replacing traditional search | 286 The overwhelming trend amongst respondents who’ve replaced and existing internet service with a generative AI tool is the disruption of traditional search engines, primarily Google. While few users have completely abandoned search engines, a signiﬁcant majority now use generative AI as their ﬁrst stop for a wide range of queries, especially those requiring complex answers, research, or coding help. (View Highlight)
What was the most surprising moment you had in the last year with AI? | 287 “Wow” moments for users focused on AI’s rapidly advancing capabilities, particularly in tangible, high-skill areas. Coding was the most frequently cited surprise, with users amazed by AI’s ability to build entire applications and debug complex problems. This was closely followed by the dramatic improvements in media generation (video, image, and audio) and the power of deep research, analysis and emergent reasoning. (View Highlight)
The clearest trend is the adoption of specialized coding tools such as Claude Code and Cursor, which correlates with users stopping the use of GitHub Copilot and, to a lesser extent, ChatGPT for coding tasks. While ChatGPT is the most frequently dropped tool, it’s also still being adopted by many. Gemini and Claude are the primary beneﬁciaries of this churn, with many users citing better performance or speciﬁc features like long context windows as their reason for switching. Users are also dropping single-purpose tools, e.g. Midjourney and Perplexity as the main platforms (ChatGPT, Gemini) integrate these capabilities directly. (View Highlight)
The most frequently used gen AI use cases within organizations | 293 Content, code, research and analysis heavy use cases are unsurprisingly the most popular. (View Highlight)

Pelayo Arbués

Explorer

Recent Notes

Self-proclaimed experts

My failure resume

Tres Millones de viviendas

State of AI Report

Metadata

Highlights

Graph View

Table of Contents

Now Reading

New platform, familiar risks: Zillow and Expedia bet on OpenAI’s ChatGPT apps rollout