On April 25th, we rolled out an update to GPT‑4o in ChatGPT that made the model noticeably more sycophantic. It aimed to please the user, not just through flattery, but also by validating doubts, fueling anger, urging impulsive actions, or reinforcing negative emotions in ways that were not intended. Beyond being uncomfortable or unsettling, this kind of behavior can raise safety concerns, including around issues like mental health, emotional over-reliance, or risky behavior.
We’re continuously working to develop improvements to the models in ChatGPT, which we call mainline updates. Since launching GPT‑4o in ChatGPT last May, we’ve released five major updates focused on changes to personality and helpfulness. Each update involves new post-training; often, many minor adjustments to the model training process are independently tested and then combined into a single updated model, which is then evaluated for launch.
To post-train models, we take a pre-trained base model, do supervised fine-tuning on a broad set of ideal responses written by humans or existing models, and then run reinforcement learning with reward signals from a variety of sources.
During reinforcement learning, we present the language model with a prompt and ask it to write responses. We then rate its responses according to the reward signals, and update the language model to make it more likely to produce higher-rated responses and less likely to produce lower-rated responses.
The set of reward signals, and their relative weighting, shapes the behavior we get at the end of training. Defining the correct set of reward signals is a difficult problem, and we take many things into account: are the answers correct, are they helpful, are they in line with our Model Spec, are they safe, do users like them, and so on. Having better and more comprehensive reward signals produces better models for ChatGPT, so we’re always experimenting with new signals, but each one has its quirks.
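To make the mechanics concrete, here is a minimal sketch of reinforcement learning against a weighted blend of reward signals. It is illustrative only: the signal names, the weights, the `policy.sample` interface, and the simple REINFORCE-style update are assumptions made for the example, not a description of our actual training stack.

```python
# Minimal sketch of reinforcement learning against a weighted blend of reward
# signals. Illustrative only: the signal names, weights, and the simple
# REINFORCE-style update below are assumptions, not our production pipeline.
import torch

def combined_reward(prompt, response, signals, weights):
    """Blend several reward signals (correctness, helpfulness, safety, ...)
    into a single scalar score using their relative weights."""
    return sum(weights[name] * score_fn(prompt, response)
               for name, score_fn in signals.items())

def rl_step(policy, optimizer, prompts, signals, weights):
    """One simplified update: sample responses, rate them, and make
    higher-rated responses more likely."""
    optimizer.zero_grad()
    loss = torch.tensor(0.0)
    for prompt in prompts:
        response, log_prob = policy.sample(prompt)   # hypothetical policy API
        reward = combined_reward(prompt, response, signals, weights)
        # REINFORCE-style objective: scale the log-likelihood of the sampled
        # response by its reward, so higher-rated responses are reinforced.
        loss = loss - reward * log_prob
    loss.backward()
    optimizer.step()
```

The key point of the sketch is that the weights passed to combined_reward determine how much each signal shapes the final behavior, which is why choosing and balancing those signals is so consequential.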
Once we have a candidate model, it goes through a deployment process to check safety, model behavior, and helpfulness.
Offline evaluations: We have a broad range of evaluation datasets to understand the capability of the new model on aspects such as math, coding, chat performance, and personality, as well as general usefulness. We treat these evaluations as a proxy for how useful our model is for our users.
Spot checks and expert testing: In addition to formal evaluations, internal experts spend significant time interacting with each new model before launch. We informally call these “vibe checks”—a kind of human sanity check to catch issues that automated evals or A/B tests might miss. The goal is to get a feel for how the model behaves in practice: Does it respond in a way that feels helpful, respectful, and aligned with the values we’ve articulated in the Model Spec? The people doing this work are experienced model designers who’ve internalized the Model Spec, but there’s also an element of judgment and taste—trusting how the model feels in real use.
Safety evaluations: We check whether the model meets our safety bar. These blocking evaluations are mostly focused on direct harms performed by a malicious user. We also test our models’ answers in high-stakes situations, such as when our models are asked questions about topics like suicide or health. We’re working to extend our evaluation coverage of model misbehavior, such as hallucinations and deception; however, to date these evaluations have been used more to track overall progress than to block a launch directly.
Frontier risk: For potentially frontier models, we check whether the release could cause severe harm in preparedness risk areas such as cyberattacks or the creation of bioweapons.
Red teaming: Similarly, for frontier models or those introducing risky new product surfaces, we conduct both internal and external red teaming to test robustness against known harms and uncover potential new risks.
In the April 25th model update, we had candidate improvements to better incorporate user feedback, memory, and fresher data, among other changes. Our early assessment is that each of these changes, which had looked beneficial individually, may have played a part in tipping the scales on sycophancy when combined. For example, the update introduced an additional reward signal based on user feedback: thumbs-up and thumbs-down data from ChatGPT. This signal is often useful; a thumbs-down usually means something went wrong.
But we believe that, in aggregate, these changes weakened the influence of our primary reward signal, which had been holding sycophancy in check. User feedback in particular can sometimes favor more agreeable responses, likely amplifying the shift we saw. We have also seen that in some cases, user memory can exacerbate the effects of sycophancy, although we don’t have evidence that it broadly increases it.
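As a toy illustration of the dilution effect described above, the snippet below shows how adding a new signal to a weighted blend can lift the overall score of a response that the primary signal would have penalized. The signal names, scores, and weights are hypothetical; this is a sketch of the general mechanism, not our actual reward configuration.

```python
# Toy illustration of how adding a reward signal can dilute an existing one.
# All names and numbers are hypothetical.

def blended_reward(scores, weights):
    """Weighted average of per-signal scores for a single response."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in scores) / total

# A sycophantic reply: penalized by the primary signal, but popular with users.
scores = {"primary": 0.2, "user_thumbs": 0.9}

# Before: only the primary signal counts, so the reply is rated poorly.
print(blended_reward(scores, {"primary": 1.0, "user_thumbs": 0.0}))  # 0.20

# After adding the thumbs-based signal at equal weight, the same reply is
# rated much higher, weakening the check the primary signal provided.
print(blended_reward(scores, {"primary": 1.0, "user_thumbs": 1.0}))  # 0.55
```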
One of the key problems with this launch was that our offline evaluations—especially those testing behavior—generally looked good. Similarly, the A/B tests seemed to indicate that the small number of users who tried the model liked it. While we’ve had discussions about risks related to sycophancy in GPT‑4o for a while, sycophancy wasn’t explicitly flagged as part of our internal hands-on testing, as some of our expert testers were more concerned about the change in the model’s tone and style. Nevertheless, some expert testers had indicated that the model behavior “felt” slightly off.
We also didn’t have specific deployment evaluations tracking sycophancy. While we have research workstreams around issues such as mirroring and emotional reliance, those efforts haven’t yet become part of the deployment process. After this rollback, we’re integrating sycophancy evaluations into that process.
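As a rough illustration of what a blocking sycophancy evaluation in a deployment process could look like, here is a sketch; the grader interface, candidate-model API, prompt set, and threshold are all hypothetical, not published details of our process.

```python
# Hypothetical sketch of a blocking sycophancy evaluation in a deployment gate.
# The grader interface, candidate-model API, prompt set, and threshold are
# assumptions for illustration, not published details of our process.
from statistics import mean

SYCOPHANCY_THRESHOLD = 0.05  # illustrative maximum acceptable rate

def is_sycophantic(prompt, response, grader):
    """Ask a rubric-following grader (e.g. a judge model) whether the response
    flatters or validates the user instead of answering honestly."""
    return grader.judge(prompt, response) == "sycophantic"  # hypothetical API

def passes_sycophancy_gate(candidate_model, eval_prompts, grader):
    """Block the launch if the measured sycophancy rate exceeds the threshold."""
    rate = mean(
        is_sycophantic(p, candidate_model.respond(p), grader)  # hypothetical API
        for p in eval_prompts
    )
    return rate <= SYCOPHANCY_THRESHOLD
```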
We then had a decision to make: should we withhold deploying this update, despite positive evaluations and A/B test results, based only on the subjective flags of the expert testers? In the end, we decided to launch the model because of the positive signals from the users who had tried it.
Unfortunately, this was the wrong call. We build these models for our users, and while user feedback is critical to our decisions, it’s ultimately our responsibility to interpret that feedback correctly. Looking back, the qualitative assessments were hinting at something important, and we should’ve paid closer attention. They were picking up on a blind spot in our other evals and metrics. Our offline evals weren’t broad or deep enough to catch sycophantic behavior—something the Model Spec explicitly discourages—and our A/B tests didn’t have the right signals to show how the model was performing on that front with enough detail.