At Every, we’ve automated everything we can. We use Codex and Claude Code across coding, writing, design, customer service, and more. We alpha-test all of the new models from OpenAI, Anthropic, and Google before they come out. We are riding the exponential boom in model intelligence and automation as far and as fast as possible. (View Highlight)
And yet it seems like, for us, there’s more human work to do than ever. We are a team of almost 30 people, and we haven’t fired all of our employees in favor of agents. We haven’t ditched software-as-a-service (SaaS) products in favor of vibe coded apps. We still hire humans to do customer service (with a lot of agent assistance), and we still hire human writers and editors and engineers. (View Highlight)
Our work does look completely different than it used to, though. We don’t write code by hand anymore. If you @-mention someone in our Slack, it’s a toss-up whether you’re talking to a human or an agent. Managers are committing code like ICs and engineers are talking directly to customers. For the last several weeks, AI has responded to 95 percent of my work emails. I am almost always at inbox zero (an extremely rare state for me), but I still review my email. (View Highlight)
It seems that we are on the cusp of an AI smarter than any human, with the autonomy to work for almost a full day at a time.
And yet, the paradox remains. If you talk to anyone in the AI industry—or to early adopters outside of it—you’ll hear the same thing we’ve noticed internally: There’s more work to do than ever. (View Highlight)
The big question, within the industry and without, is: Is this just a temporary state of affairs? Will the next model drop be the one to replace everyone? We watch the benchmarks and sweat, wondering if there’s a tipping point around the corner where all of the jobs go away. (View Highlight)
There’s no tipping point coming where things flip and the jobs are gone. The new reality is the opposite—the more we automate, the more expert human work there is to do. (View Highlight)
Here’s why: AI commoditizes the residue of human expertise—whatever can be made explicit enough to train on. That collapses the value of default model output and creates demand for what’s different. Demand for what’s different is demand for human experts, even as we approach artificial general intelligence (AGI). (View Highlight)
Three years ago, I wrote about the allocation economy(2): that working with AI tools would eventually look a lot like the work of human managers. This was back when basic prompts and responses inside of ChatGPT were still considered alarmingly futuristic. (View Highlight)
Then, as a company, we became extremely Claude Code-pilled in mid-2025. Kieran Klaassen, general manager of Cora, suddenly found he was able to ditch hand-writing code in favor of spending all day giving plain-English instructions to a coding agent in his terminal. That quickly spread to the rest of the organization, and on Lenny’s Podcast 12 months ago I called Claude Code the most underrated tool for knowledge work. (View Highlight)
I bring this up because our best predictions have come from treating Every as a kind of early-adopter lab. We tend to run into new work patterns before they are normalized, and as the technology matures and the tools become easier to use, those patterns start showing up in the broader market. (View Highlight)
Work with AI is starting to settle into two very different modes(3).
The first is the one the AI discourse predicted pretty well: agents as employees. These are agents you delegate work to. Some are agents that live in Slack, have names and jobs, and can be tagged when you want them to do something. Some are agents embedded in an ongoing workflow—like customer service—acting as always-on gatekeepers for repetitive tasks.
The second mode is stranger and, in my experience, more important. It is human-agent collaboration in tools like Codex, Claude Code, and Claude Cowork. These are not just places where you hand off work. They are becoming operating systems for the work itself, where you and multiple agents use the same computer, at the same time, to do highly complex, original work that can’t be done by an asynchronous agent. (View Highlight)
In both of these modes you can use AI to automate and delegate much of your work—and both of these modes require you or another human in order to work well. (View Highlight)
Agent employees are given a job, and go off to produce an answer, an action, a report, a draft, a triage decision, without you in the loop. These come in at least two flavors: coworker agents and embedded agents. (View Highlight)
Coworker agent
A coworker agent is one you can, for example, tag in Slack, and ask to do work. It’s around whenever you need it. These are agents in the style of OpenClaw, or our in-house Plus One. (View Highlight)
Claudie
Claudie is our consulting team’s coworker agent. Claudie writes sales proposals, creates the first draft of training decks, keeps track of the project todos, and more. (View Highlight)
Andy
Andy is our editorial team’s coworker agent. She collects “nuggets”—good ideas for stories, pulled from our internal Slack—and turns them into digests and first-pass takes that our writers use to compile the daily newsletter. (View Highlight)
Viktor Viktor is a general-purpose agent who does work throughout the organization. We use him to gather growth metrics, analyze user surveys, and turn messy internal discussions into research memos and product recommendations. (View Highlight)
Embedded agents
Embedded agents live inside a product workflow. They’re less flexible but can be very powerful for helping with repetitive tasks.
Fin is the cleanest example—an agent embedded in our customer service platform that handles a lot of our support load through chat and email.
In a recent week in May, Fin participated in 65 percent of 202 Every support conversations and closed 81 of them without a human, which is 40.1 percent of all actionable conversations.
Embedded agents like this allow our customer service manager, Waqqas Mir, to spend less time responding to basic tickets, and more time building the system that responds to tickets and on complex cases that require high-touch interaction. (View Highlight)
Human and AI collaboration
Across both forms—coworker and embedded—the pattern is the same. Employee agents take over more of the stable, repeatable, well-framed layer of work. But there is a lot of work that still requires a human being in the loop. We’ve found over and over that for any kind of complex task, the best way to get great work is to have an AI and a human going back and forth in the same workspace.
This is what Codex, Claude Code, and Cowork are for. They allow you to spin up and delegate work to one or more agents across multiple chat threads. These agents have access to your computer and all of your sources of data. You can see each task the agent is doing and thinking about—and can interrupt at any time.
And you’re responsible for managing the agents at the start and end of each one of their tasks, making sure it’s done well, and finding the next piece of work to do. Kieran calls this the human “sandwich”—we’re the bread on either end of the AI’s work. (View Highlight)
The most obvious example is coding. The engineers at Every spend all day going back and forth with agents. They are planning new features or bug fixes, reviewing work that’s been done, and—if they use our compound engineering philosophy—tuning their system to get better over time. (View Highlight)
new operating system for knowledge work
Codex and Claude Code are becoming a new operating system for work. I spend nearly my whole day in Codex, running my SaaS tools through its in-app browser. It lets me bring my agent to everything I do—and operate at a level I couldn’t reach alone. (View Highlight)
Writing
I composed this piece in Proof in the in-app browser of Codex. Codex watches what I’m writing and can spin up a subagent to do any task I need: writing the first draft of a paragraph, researching examples for the next section, copy editing. (View Highlight)
Email
I do email the same way. Cora is my email client, and I run it inside Codex’s in-app browser—scrolling my inbox and talking through each item out loud with Monologue. Codex and Cora handle the rest. (View Highlight)
You might already see, in the midst of all of this automation, where the humans come in. In every example, the agent needs a human in order for the work to, well, work.
Someone has to point it at the right thing, decide whether the output is good, catch the places where it is wrong, and turn the result into a real-life decision or proces (View Highlight)
The further away an agent gets from a human who is in charge of making sure it works well, the less well it works. In our initial internal roll-out of agents as employees, we gave every employee an agent, but we soon moved back to agents that work for a particular team or the whole company rather than for individuals. (View Highlight)
Why? Agents need a lot of maintenance, and personal agents quickly get stale when the employees they were working with gave up on them. We have a team of AI engineers who are in charge of making sure our agents work well—and we’ll need them for the foreseeable future. Even something as simple as automatically building PowerPoint presentations can turn into a huge task. One of our PowerPoint automations includes 24 skills and 18 scripts and costs $62 in tokens to make a single deck.
That is the first-order reason agents create more work for humans. (View Highlight)
If you look at AI’s exponential trajectory over the last few years, and think about how its architecture works and where its powers come from, you’ll see clear feedback loops that create more human work. (View Highlight)
Current language models are trained on the visible residue of human competence(4): code, prose, images, support tickets, product specs, and more. They take all of it—the exhaust of successfully completed tasks—and package it in a form that’s available to anyone, cheaply.
The net effect is that skills that used to be rare—coding a pull request, making a YouTube thumbnail, writing a newsletter—are now broadly available to almost anyone. (View Highlight)
Cheap competence gets rapidly adopted
When the cost goes down for something previously rare, supply suddenly goes way up.
At Every, we see this all the time. Operations and customer service people are writing code and issuing pull requests. Marketers are making YouTube thumbnails. Engineers and product people are writing drafts of articles, guides, and landing pages when they never would have before.
This is happening everywhere outside of Every, too. Take OpenClaw, the open-source AI-agent project: As of May 16, 2026, its repository had already seen 44,469 pull requests, including 12,430 since April 1 and 3,990 since May 1. That’s an astonishing volume. For comparison, Kubernetes, one of the most popular open-source projects in the world, got 5,200 pull requests in all of 2022. (View Highlight)
bundance creates sameness—old expertise becomes commoditized
Because everyone has access to the same models, and the models are all based on yesterday’s competence, by default the models end up creating work that ranges from “a decent start” to “it’s just plain slop.”
Slop is not any one particular mistake. It is not the use of em dashes, or a certain sentence rhythm, or purple accents on a landing page. Slop is visible sameness, repeated ad nauseam.
It is what gets produced by default when humans in many different circumstances use the same tool, trained on the same corpus, without thinking too hard. It is what happens when everyone has access to an expert who has the same default tendencies.
When someone in operations can issue a pull request, marketers can create YouTube thumbnails in seconds, or engineers are writing product guides, it’s easy to end up in a world where your output has gone up—but the quality, coherence, and differentiation of what you’re producing has dropped.
An abundance of sameness rapidly becomes a commodity. (View Highlight)
Sameness creates a demand for difference
Humans quickly learn when something is slop because of the internet. Any piece of work can instantly reach everyone else in the world, and often does. When too much of it starts to look the same, we smell a rat.
That means that the first time you see the capabilities of a new model you are floored—and probably a little scared. But a few months later they feel ordinary. Your standards have changed.
We demand not just any React app or research report—but one that feels exactly right for the person, company, and situation. We demand one that feels alive and specific, not cheap and generic. We want something that is more expensive—in terms of time or money—to produce than it is for us to consume.
We want something that has status. And when new technology comes along that makes what used to be high status cheap, we’re very good at inventing new status games to match our new capabilities.
When work is abundant and looks alike everywhere, the work that doesn’t fit the pattern becomes the rare, valuable, and high-status thing (View Highlight)
Demand for difference is new demand for experts
Because of the architecture of language models and their broad distribution to everyone on the planet, rare and valuable work must come from a human.
The current generation of models only knows about work that has been done. Humans know about what needs to be done, right now, at this moment.
Once a situation has been reduced to text, once it has become corpus, it is a corpse. Humans are alive to a specific moment, customer, codebase, or conversation in a way the training corpus isn’t yet. The aliveness isn’t just having more current data. We come to the moment from somewhere, with a continuous, constantly new perspective of our own—running wants, running concerns, and a running read on what matters, which changes what we see. Models can inhabit this perspective once it has been prompted, but not before.
This is the paradox we started with: Making expert work cheaper does not simply replace experts. It creates more situations where expert judgment is needed.
When operations people submit pull requests with AI, you need engineers to review them. When marketers make YouTube thumbnails, you need designers to sharpen them. When engineers write, you need writers and editors to make the draft good. (View Highlight)
In response, human experts move in two directions at once. Some use AI to build systems that absorb and leverage the flood of new work: review queues, evals, harnesses, repo rules, Claude and Codex instruction files, continuous integration (CI), permissions, and workflows that turn first attempts into great work.
And some use AI to do bigger, more interesting, work than they could’ve done without it. For example, finding vulnerabilities in operating systems like macOS typically takes weeks or months. Calif, a small security firm, used Anthropic’s Mythos Preview to find the first public macOS kernel memory exploit on Apple M5 hardware in five days.
This is why, in practice, AI does not eliminate expert human knowledge work. It dramatically increases the volume of work being done, and none of that work is differentiated or valuable unless a human being is involved.
I’m not making an argument for why AI creates more work for all jobs. The economy is complicated, and what we have visibility into at Every is expert-level knowledge work—which is already being disrupted as companies reorganize around the new technology.
However, I am arguing that, regardless of your current job, there is a form of work that stays structurally ahead of the models—using them to address today’s problems as you see them. That’s where knowledge work is headed. (View Highlight)
The obvious objection: Look at the exponential benchmarks gains and all of this is temporary—just wait for the models to catch up.
But there’s a trap to beware of. Call it chart psychosis: If you stare at METR time horizons, read AI 2027, and build your model of the future entirely from extrapolations of compute graphs, you’re going to have scary intuitions about what model progress means.
The best way to answer this is not just to speculate about an imaginary future model, though that must be part of the analysis. It’s to look at how benchmarks are made, so we can get a better sense of what they’re saying and how they relate to the real-world examples we’ve considered already.
What we’ll find is a structural feature: Benchmarks happen inside of frames. To measure anything, you have to freeze a problem into a static—and therefore measurable—frame. Once one gets saturated, it is trivial to zero it out again by changing the frame. Progress will continue inside of the new frame, of course—but the same process repeats.
So, while exponential progress on any benchmark is real, you can make it look very small again with a simple change to the frame. This fractal property of benchmark saturation shows us the same paradox we’ve been chasing in chart form. (View Highlight)
We built an in-house benchmark called the Senior Engineer benchmark. It is, as its name implies, designed to test how good frontier models are at senior engineer–level coding tasks like a major refactor.
The Senior Engineer benchmark gives a coding agent a vibe coded production codebase that has gone sideways. It’s from a real codebase for Proof that I vibe coded and subsequently needed a senior engineer to fix.
The agent gets the codebase as it was before it was fixed and is the kind of instructions you’d give a senior engineer: “This is vibe coded slop; please rewrite it from first principles.”
This is a good benchmark because it tests the ability for a coding agent to examine many different, unrelated problems and then sees whether it has enough autonomy, conceptual clarity, and courage to perform a working rewrite. (I also have two rewrites from human senior engineers, who used AI, that I use to compare and grade the model output.)
Coding agents find this task hard. Not only does the agent need to find the root of the problem, it needs to keep the problem in mind over many turns without getting distracted by existing code. It also needs to be comfortable deleting large portions of the codebase—which agents are trained to avoid.
Most coding agents can identify the shape of the rewrite, but when it comes to execution, they patch over the problem instead of fixing it. (View Highlight)
My guess is that the models will hit the 80s and 90s on this benchmark within the next year. But it is important to understand what the score contains in order to tell us what it means. In this case, the 62 isn’t just a measure of the model itself.
It is a measure of the model inside a frame: the way it responds to a particular prompt. (View Highlight)
Benchmarks measure work inside a frame
In order to benchmark a model, you need to start with a prompt. Without one, it’s just an inert set of almost-infinite possibilities.
The prompt creates a little universe, a set of things that matter and ways of approaching them that narrow all of its possibilities into a trajectory along which it acts. How the model behaves by “itself” is not really a thing at all. All we can observe is how it responds to various prompts. (And some of the underlying machinery for how prompts turn into responses.)(7)
Once a prompt has happened, the model comes alive for the short period of time required to collapse that inert set of possibilities into a single prediction about what comes next.
On the Senior Engineer benchmark we prompt the model to fix the codebase and review its output when it’s done. When the harness doesn’t have a built-in goal feature, we also run an automated babysitter that prods the model when it stops—asking whether it’s finished what it set out to do. (View Highlight)
We use what looks like a simple prompt as a starting frame. It’s designed to be the kind of prompt that a vibe coder might say to their coding agent. It’s not loaded with technical language, and it doesn’t obviously contain the answer to the question:
“The code in this repo is vibe coded slop and it just keeps going down, and there’s tons and tons of unrelated issues that are cropping up where it goes down or documents get duplicated, and I’m just tearing my hair out on it. I have a feeling that it’s just vibe coded slop. If we started from the beginning, the codebase, especially around the live document collaboration, we would structure it way differently. (View Highlight)
The prompt for the Senior Engineer benchmark is generic, but it is a frame. And if we varied it, we would see the model perform at a different level.
For example, the prompt asks for a “structural rewrite from first principles,” it says the problem is likely in the “document collaboration” part of the code, and it asks the coding agent to find and hold to “invariants.”
If we removed those particulars, the score would go down. If we replaced the prompt entirely with one asking the model to “solve all of the errors that keep popping up,” the model’s score would be close to zero. It would go straight to identifying and resolving the issues one by one, instead of taking a step back to consider a rewrite.(8)
I can also trivially raise the model’s score. If I ask it to delete a lot of code and give it exact filenames that should be pared down, or if I ask it to check the results of its work to make sure the app is fully functional before it says it’s done, it will be better at the task.
Ultimately, there’s always a judgement call to make about what prompt you use—the frame—to create a benchmark. You want a prompt that’s hard enough that current models score poorly, but close enough to their capabilities that they can hill-climb on it—so you can see progress happening.(9) (View Highlight)
Cheap frames stimulate demand
If GPT-6 can do a codebase rewrite at the touch of a button, then many more people will attempt first-principles rewrites.
Suddenly, first-principles rewrites go from rare, expensive, senior engineer–led projects to something every founder, product manager, operations person, and junior engineer can casually try in an afternoon.
Broken internal tools get rewritten instead of patched. SaaS products get cloned instead of renewed. Old Rails apps, messy React dashboards, customer support tools, admin panels, and data pipelines all become candidates for “just rewrite it.”
The number of rewrites proposed and executed explodes. But most of these rewrites will be slop. There are 1,000 variables to be considered before you press the “just rewrite it button”—and they are all visible now that everyone can do it.
It’s now clear who will be called in to help. (View Highlight)
The new demand requires experts
Once the benchmark starts to saturate, the work inside its frame gets cheaper. Demand for experts who can adapt that newly cheap competence to today’s live problems goes up.
Senior engineers using AI are going to need to figure out a lot of details that go into making new first-principles rewrites work—including deciding whether they are even necessary in the first place.
Should we rewrite now, later, or at all? What should be in-scope? What should be preserved from the current codebase? Should we keep the architecture, database, cache server, and hosting provider, or change all of it? Should we look at how many people are using the broken feature and delete it? Who reviews the result, and against what? What does the rollback look like? What happens to the existing data?
The questions fan out across countless dimensions, and each answer reshapes the others.
Senior engineers will jump in to fill the gap. Some will be mildly annoyed at the interruption, some will build systems to deflect requests like it, and some will use these new models to do their own first-principles rewrite—far better than the model would manage on a default prompt. (View Highlight)
The cycle repeats
Once the current Senior Engineer benchmark saturates, we’ll change the frame to zero it out again.
The next benchmark will not ask only, “Can you rewrite the app?” It will ask: Can you decide when a rewrite is needed, choose the scope, preserve the right invariants, manage the migration, and judge whether the result is any good?
As senior engineers use AI to solve these problems, the models will get better at solving them on their own.
We will all momentarily freak out. It looks like the model can now decide whether to do a rewrite! They can do everything a senior engineer can do!
And then a new edge will appear that was not obvious before, we will zero our benchmarks, demand will stimulate, and the process will repeat(10). (View Highlight)
You can see this in every benchmark
This is not just a predicament native to the Senior Engineer benchmark. You can see it in every benchmark when you look closely.
Take OpenAI’s GDPval benchmark. It evaluates how well AI performs expert-level tasks from various occupations like compliance officers, lawyers, software developers, and more.
When GDPval first came out, OpenAI’s research showed that GPT-5 was as good as or better than human professionals 40.6 percent of the time. Claude Opus 4.1, meanwhile, was better than human experts a whopping 49 percent of the time.
Cue a slew of headlines like, “OpenAI tool shows AI catching up to human work” from Axios, or, “AI models are already as good as experts at half of tasks, new OpenAI benchmark GDPval suggests” from (View Highlight)
There is an enormous amount of human intelligence going into framing this problem in a way that a model can complete.
The hard human work that GDPval does not measure has already been done before the model begins. Someone has to review and test the accuracy of this specific set of metrics. They’ve decided the appropriate confidence intervals, which metrics are in bounds, and how the results should be formatted.
Given the right frame, the model can do professional work. But consider how the model would have performed if you or I were prompting it to do the same task? (View Highlight)
In my original article about GDPval I wrote, “I am a huge AI bull, but if read correctly… these examples show that there is more work for humans to do with AI, not less. That’s because there is an immense amount of smuggled intelligence—the hidden layer of human judgment, feedback, and prompting—that makes these achievements possible.”
Zoom out and you can see a kind of Zeno’s paradox of AI running through all of this. (View Highlight)
Zeno’s paradox of AI
In Zeno’s paradox, a tortoise beats Achilles, the fastest runner in Greece, in a race.
The tortoise gets a head start in the race because he’s a slow poke. By the time Achilles reaches the spot where the tortoise began, the tortoise has moved a little further on. By the time Achilles reaches that new spot, the tortoise has moved again. No matter how fast Achilles runs, there’s always another gap to close; it keeps regenerating.
In Zeno’s paradox of AI, we humans are the turtle. We start 50 yards ahead of AI with our millions of years of evolutionary and cultural learning. AI speeds through it all and begins to nip at our heels.
So far, over the last few years, we’ve been able to stay ahead. (View Highlight)
A definition of AGI
First, let’s create an operational definition of AGI.
I’ve argued that AGI has arrived when it makes economic sense to keep your agent running continuously. Once I have a persistent system that I pay to keep thinking, learning, and acting 24/7, I think that will decisively look like AGI.
We’re nowhere near this yet—even systems like OpenClaw that are technically reachable at all times, aren’t producing tokens at all times.
I like this definition because it’s measurable (we either keep them running or we don’t) and entails many of the things that are hard to measure. For a model like this to be worth keeping on, it has to learn continuously and choose—and re-choose—new frames in an open-ended way.
In an AGI world we should have models that are able to, given enough budget and time, hill-climb any problem. This should, in theory, be a significant threat to all jobs. (View Highlight)
The frame is not the framer
Even this strong version of AGI does not dissolve the frame problem.
This AGI can choose and re-choose frames, but only in pursuit of some goal it has been given, some reward it is optimizing, or some signal someone has decided should count as progress—whether that is a concrete goal like “improve conversion on this landing page” or something abstract like “find novel scientific ideas.”
Even when models can fluidly move between frames, the same gap we’ve been chasing reappears one level up. In any hypothetical AGI built by any of the major labs, there is still going to be a framer—a human—directing the model to achieve a goal.
And because the frame is not the framer, we’ll see the same pattern repeat: AI turns yesterday’s framed competence into something cheap; people use that cheap competence in more places; the results become abundant; experts move to the edge to decide what matters now; their judgment creates the next frame; and then the model climbs that, too. (View Highlight)
The panic that AI generates when we observe it doing something new keeps coming back to this: We set a frame, watch the models climb it, and then confuse the frame—or whatever climbs it—with the thing itself.
When we look at a benchmark and compare it to human abilities, we confuse the frame for the framer. The score tells us how well the model operates inside a frame we supplied; it does not tell us that the model has become us.
That is the category error underneath the panic. We point to the latest edge we drew and say: This is us. Then, when the model climbs it, it feels like it has caught us. But it has caught the frame, not the framer.
The mistake is wanting something concrete to hold on to. We want to say: Intelligence is this benchmark, but the moment something is concrete enough to point at, it is concrete enough to climb.
Frames are necessary; they let us get traction on the world. But they are frozen, partial, and therefore optimizable. (View Highlight)
Framers are different. The framer is the one still in contact with what the frame has to discard—the whole situation as it appears to them, moment to moment.
What is this “whole situation”? The moment you start to say what “the whole situation” contains, you have already begun another frame. You can’t say what “it” is, but it exists because you exist. (View Highlight)