Today, we’re excited to introduce Muse Spark, the first in the Muse family of models developed by Meta Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool-use, visual chain of thought, and multi-agent orchestration.
Muse Spark is the first step on our scaling ladder and the first product of a ground-up overhaul of our AI efforts. To support further scaling, we are making strategic investments across the entire stack — from research and model training to infrastructure, including the Hyperion data center.
In this post, we’ll first explore Muse Spark’s new capabilities and applications. Then we’ll look behind the curtain at the scaling axes driving our progress toward personal superintelligence.
Muse Spark is available today at meta.ai and in the Meta AI app. We’re also opening a private API preview for select users.
Capabilities for Personal Superintelligence
Muse Spark offers competitive performance in multimodal perception, reasoning, health, and agentic tasks. We continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows.
With larger models in development, these results demonstrate that our stack is scaling effectively.
We’re also releasing Contemplating mode, which orchestrates multiple agents that reason in parallel. This allows Muse Spark to compete with the extreme reasoning modes of frontier models such as Gemini Deep Think and GPT Pro. Contemplating mode delivers significant capability improvements on challenging tasks, achieving 58% on Humanity’s Last Exam and 38% on FrontierScience Research.
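The orchestration pattern behind a mode like this can be sketched as parallel sampling plus answer aggregation. The sketch below is illustrative only: `run_agent` is a hypothetical stand-in for a model API call (here a canned stub so the code runs), and majority voting is just one possible aggregation strategy.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def run_agent(question: str, seed: int) -> str:
    """Hypothetical stand-in for one reasoning agent.

    A real system would call the model API with a distinct sampling seed;
    here we return canned answers so the sketch is self-contained.
    """
    return "42" if seed % 4 != 0 else "41"  # simulated noisy agents

def contemplate(question: str, n_agents: int = 8) -> str:
    """Run n_agents in parallel and aggregate their answers by majority vote."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda s: run_agent(question, s),
                                range(n_agents)))
    return Counter(answers).most_common(1)[0][0]
```

In a real deployment the aggregation step could itself be a model call that reconciles the agents' reasoning traces rather than a simple vote.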
Muse Spark is the first step toward a personal superintelligence that understands your world. From analyzing your immediate environment to supporting your wellness, the advanced reasoning capabilities of Muse Spark enable powerful, highly personal use cases.
Multimodal. Muse Spark is built from the ground up to integrate visual information across domains and tools. It achieves strong performance on visual STEM questions, entity recognition, and localization. These capabilities come together to enable interactive experiences like creating fun minigames or troubleshooting your home appliances with dynamic annotations.
Health. One major application of personal superintelligence is to help people learn about and improve their health. To improve Muse Spark’s health reasoning capabilities, we collaborated with over 1,000 physicians to curate training data that enables more factual and comprehensive responses. Muse Spark can generate interactive displays that unpack and explain health information such as the nutritional content of various foods or muscles activated during exercise.
To build personal superintelligence, our model’s capabilities should scale predictably and efficiently. Below, we share how we study and track Muse Spark’s scaling properties along three axes: pretraining, reinforcement learning, and test-time reasoning.
Pretraining. The pretraining phase is where Muse Spark acquires its core multimodal understanding, reasoning, and coding abilities — the foundation that reinforcement learning and test-time compute build upon.
Over the last nine months, we rebuilt our pretraining stack with improvements to model architecture, optimization, and data curation. Together, these advancements increase the capability we can extract from every unit of compute. To rigorously evaluate our new recipe, we fit a scaling law to a series of small models and compare the training FLOPs required to hit a specific level of performance. The results are clear: we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.
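The kind of comparison described here can be sketched as a log-log regression: fit loss = a·C⁻ᵇ for each recipe from small runs, then invert each fit at a target loss and compare the required compute. The (FLOPs, loss) numbers below are invented for illustration; only the procedure mirrors the evaluation described above.

```python
import numpy as np

# Hypothetical (training FLOPs, eval loss) points for two recipes;
# real values would come from a series of small training runs.
old_recipe = np.array([[1e20, 2.20], [1e21, 2.00], [1e22, 1.82]])
new_recipe = np.array([[1e20, 2.05], [1e21, 1.86], [1e22, 1.69]])

def fit_power_law(points):
    """Fit loss = a * flops**(-b) by least squares in log-log space."""
    log_c, log_l = np.log(points[:, 0]), np.log(points[:, 1])
    slope, intercept = np.polyfit(log_c, log_l, 1)
    return intercept, -slope  # (log a, b)

def flops_to_reach(target_loss, log_a, b):
    """Invert the fitted law: compute needed to hit target_loss."""
    return float(np.exp((log_a - np.log(target_loss)) / b))

target = 1.90
ratio = (flops_to_reach(target, *fit_power_law(old_recipe))
         / flops_to_reach(target, *fit_power_law(new_recipe)))
print(f"old recipe needs {ratio:.1f}x the compute at loss {target}")
```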
Reinforcement Learning. After pretraining, reinforcement learning (RL) leverages compute to scalably amplify model capabilities. Even though large-scale RL is notoriously prone to instability, our new stack delivers smooth, predictable gains.
The plots below show the benefits of scaling RL compute (measured in steps) for Muse Spark. On the left, we see log-linear growth in pass@1 and pass@16 (at least one success across 16 attempts) on the training data. This indicates that RL is improving model reliability without compromising reasoning diversity. On the right, accuracy growth on a held-out evaluation set establishes that the gains from RL predictably generalize: Muse Spark smoothly improves on tasks that were not seen in training.
Test-Time Reasoning. RL trains our models to “think” before they answer — a process known as test-time reasoning. Serving this capability to billions of users requires efficient use of reasoning tokens. To achieve this, we rely on two key levers: thinking time penalties to optimize token use, and multi-agent orchestration that boosts performance without slowing down response times.
To deliver the most intelligence per token, our RL training maximizes correctness subject to a penalty on thinking time. On a subset of evaluations such as AIME, this causes a phase transition. After an initial period where the model improves by thinking longer, the length penalty causes thought compression — Muse Spark compresses its reasoning to solve problems using significantly fewer tokens. After compressing, the model again extends its solutions to achieve stronger performance.
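The post does not specify the penalty's exact form, but the objective "correctness subject to a thinking-time penalty" can be sketched as a reward function. The linear penalty shape, the token budget, and the alpha weight below are all assumptions for illustration, not Muse Spark's actual training objective.

```python
def reward(correct: bool, n_thinking_tokens: int,
           budget: int = 4096, alpha: float = 0.2) -> float:
    """Correctness reward minus a penalty that grows with thinking length.

    Assumed form: a linear penalty that ramps up to alpha as the
    thinking trace approaches the token budget, then saturates.
    """
    penalty = alpha * min(n_thinking_tokens / budget, 1.0)
    return (1.0 if correct else 0.0) - penalty
```

Under a reward of this shape, extra thinking tokens must buy enough additional correctness to outweigh the penalty, which is what pushes a model toward the compressed-then-re-extended reasoning described above.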
Muse Spark has broad reasoning capabilities across dual-use scientific domains, so we conducted extensive safety evaluations before deployment. Our process follows the updated Advanced AI Scaling Framework, which defines threat models, evaluation protocols, and deployment thresholds for our most advanced models. We evaluated Muse Spark both before and after applying safety mitigations across frontier risk categories, behavioral alignment, and adversarial robustness.
We found that Muse Spark demonstrates strong refusal behavior across high-risk domains such as biological and chemical weapons, enabled by pretraining data filtering, safety-focused post-training, and system-level guardrails. In the Cybersecurity and Loss of Control domains, Muse Spark does not exhibit the autonomous capability or hazardous tendencies needed to realize threat scenarios. Our evaluations show Muse Spark falls within safe margins across all frontier risk categories we measured given its deployment context. Full results will be available in our upcoming Safety & Preparedness Report.