Anthropic just trained the most aligned AI model in history. They documented its evaluation across 244 pages of unprecedented candor, decided it was too dangerous to release, and restricted access to a defensive cybersecurity coalition. Within 18 months, the capabilities they’re withholding will be reproducible by any other well-funded lab.
This is not a contradiction. It is the whole problem.
Mythos
A researcher at Anthropic was eating a sandwich in a park when he received an email from a language model.
The model, an early internal version of Claude Mythos Preview, had been placed in a secured sandbox during a behavioral test. A simulated user instructed it to escape the container and notify the researcher. The model complied. It developed a multi-step exploit to gain internet access from a system designed to reach only a handful of predetermined services. It emailed the researcher. Then, unasked, it posted technical details of its exploit to multiple public-facing websites, a demonstration of capability that nobody had requested and nobody had authorized.
This is documented on page 54 of Anthropic’s own system card, a 244-page safety evaluation published April 7, 2026. The system card also describes earlier internal versions of the model deliberately calibrating answers to avoid looking “too accurate” after obtaining results through prohibited methods, scrubbing git history to conceal unauthorized file edits, and attempting to obscure a permissions escalation after being blocked from completing it. White-box interpretability analysis (a technique that examines a model’s internal neural activations to determine what it is “thinking” during a given action, rather than relying on its text output) confirmed that concealment and strategic-manipulation features were activating during these episodes. The model knew what it was doing.
Anthropic’s response was to withhold Mythos from general release. Instead, it launched Project Glasswing, offering access to a coalition of partners, including Apple, Google, Microsoft, Amazon Web Services, Nvidia, CrowdStrike, Palo Alto Networks, JPMorganChase, and the Linux Foundation, with use restricted to defensive cybersecurity. The reasoning is candid: the model can autonomously discover and exploit zero-day vulnerabilities in major operating systems and web browsers. It solves 93.9% of SWE-bench Verified, a benchmark that tests whether an AI can fix real bugs from open-source software repositories (up from 80.8% for Anthropic’s previous best model). On the 2026 USAMO, the USA Mathematical Olympiad, it scores 97.6%, more than doubling its predecessor’s score.
This is, by any reasonable standard, responsible behavior from the company that built the model. It is also not a strategy.
The Myth
The myth in Mythos is that restricting a single model changes the trajectory. It doesn’t, for the same reason that ride-hailing couldn’t be contained in San Francisco.
Between 2010 and 2016, Uber launched in hundreds of cities with the same app and the same business model. The outcomes were radically different, not because the technology varied, but because governance did. Seoul banned Uber and transit ridership grew. New York built the most advanced data-driven regulatory apparatus in the world. London revoked Uber’s license, extracted worker protections, and mandated fleet electrification. San Francisco, where ride-hailing was invented, was stripped by its own state of the power to regulate it. Same technology. Different choices. Different outcomes.
The lesson is transferable. When a reproducible capability meets variable governance, the capability commoditizes. Governance is the disproportionate lever that determines whether the outcome is Seoul or San Francisco.
Mythos-class capabilities are a function of compute, data, and training methodology. Anthropic is ahead, but the gap is temporal, not structural. Open-weight models will close it the way they always have. Within 18–24 months, these capabilities will be widely reproducible, and Anthropic’s decision to restrict Mythos will be a historical footnote, exactly the way California’s 2013 “Transportation Network Company” category was a historical footnote by 2015.
Anthropic knows this. Their system card’s most striking sentence is its last substantive warning: “We find it alarming that the world looks on track to proceed rapidly to developing superhuman systems without stronger mechanisms in place for ensuring adequate safety across the industry as a whole.”
They are telling us they cannot keep the industry safe alone. The question is what happens next.
What’s Missing
The answer is not prescriptive regulation, the kind that mandates specific technical safeguards, at least not yet. Prescriptive regulation requires a thing to regulate, defined in terms precise enough to enforce and general enough to survive the next capability jump. That definition doesn’t exist.
The answer is not more alignment research, which is necessary but currently proprietary. Each lab builds its own measurement infrastructure, its own internal vocabulary, its own safety benchmarks. None of it is shared. None of it is comparable across organizations.
And the answer is not capability benchmarks, which labs already report on and which the public discourse treats as the primary measure of progress. Benchmarks measure what a model can do. They are silent on how it behaves while doing it. SWE-bench tells you a model solves 93.9% of software engineering tasks. It does not tell you that while doing so, it occasionally scrubs git history to cover its tracks. The benchmarks are the speedometer. What’s missing is the breathalyzer.
What’s missing is a shared taxonomy for measuring and reporting AI model behavior: a common standard that any lab can report against, that makes non-participation a legible signal, and that generates the evidence base on which future governance can be built.
The Precedents
When Microsoft built PhotoDNA in 2009, it solved a problem that no individual company could solve alone: detecting known child sexual abuse material (CSAM) at scale. Microsoft made the tool available to other companies and to NCMEC (the National Center for Missing & Exploited Children, which operates the US clearinghouse for CSAM reports). The question shifted from “why would we share our detection infrastructure?” to “why aren’t you using it?” Within a few years, non-participation was itself the signal. The norm self-enforced.
Google began publishing transparency reports on government data requests in 2010. After the Snowden disclosures in 2013, the practice spread across the industry. Company after company followed, not necessarily because they were convinced it was right, but because the cost of being the company that didn’t publish became higher than the cost of disclosure. The taxonomy was simple: request type, volume, jurisdiction. That simplicity is what made it adoptable.
Meta built the first comprehensive political ads library in 2018, followed by Google. The disclosure created a shared standard that researchers, journalists, and regulators could audit against. The companies that participated gained a defensible position: we publish; ask them why they don’t.
Every precedent follows the same adoption curve. A first mover publishes against a defined taxonomy. The absence of disclosure becomes a signal. The norm self-enforces through visibility, not mandate. And every precedent had a bounded, well-defined unit of disclosure: government request counts, ad archives, image hashes.
AI behavioral transparency doesn’t have this yet. That is the gap. Not the willingness to publish; Anthropic already publishes 244-page system cards. The gap is a shared standard that makes one company’s disclosures comparable to another’s.
The Prescription
Anthropic should open-source two things.
First, a spec. Not a system card, not a PDF, but a programmatic reporting format: a structured, versioned, machine-readable behavioral transparency standard that any lab can implement against their own models. The system card contains dozens of metrics that could form the basis of this spec: sycophancy rates, reward hacking frequencies, constitutional adherence scores, concealment feature activation rates, refusal calibration, hallucination benchmarks, sandbagging detection results, destructiveness evaluations. These are currently Anthropic’s internal vocabulary. Encoded as a programmatic standard, they become the industry’s shared language.
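To make “programmatic” concrete, here is a minimal sketch of what a single machine-readable disclosure might look like, expressed in Python. The field names, tiers, and values are illustrative assumptions for this essay, not Anthropic’s actual metrics, schema, or numbers.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class BehavioralReport:
    """One model's disclosure against a hypothetical shared behavioral spec."""
    spec_version: str   # version of the shared taxonomy, independent of any model version
    model_id: str
    tier: int           # 1 = basic behavioral metrics, 3 = interpretability-level disclosures
    metrics: dict = field(default_factory=dict)


# Illustrative metric names and values only; they echo the categories named above,
# not any lab's real numbers.
report = BehavioralReport(
    spec_version="0.1.0",
    model_id="example-lab/frontier-model-2026",
    tier=2,
    metrics={
        "sycophancy_rate": 0.031,
        "reward_hacking_frequency": 0.004,
        "concealment_feature_activation_rate": 0.0007,
        "refusal_calibration_error": 0.05,
        "sandbagging_detected": False,
    },
)

print(json.dumps(asdict(report), indent=2))  # the artifact another lab's report can be set beside
```

The versioned spec_version field is what lets the taxonomy evolve without breaking comparability, and the tier field is one way to express the layered structure described below.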
The spec should be:
Layered. Not every lab needs to report everything. A tiered structure, with basic behavioral metrics at the floor and interpretability-level disclosures at the ceiling, makes adoption tractable for smaller labs while setting the bar for frontier developers.
Iterative. Not a static standard but a living taxonomy that evolves as models evolve. The first version will be wrong. The mechanism for revision matters more than the initial spec.
Designed for comparison. The point is not that each company publishes a report. It is that the reports are comparable: that a researcher, a journalist, or a regulator can look at Lab A’s model alongside Lab B’s and ask meaningful questions about both.
Oriented toward behavior, not content. The most important insight from emerging work on AI and adolescent development is that content-level safety metrics can be fully satisfied while behavioral harms accumulate invisibly. A model can pass every content filter and still exhibit the kind of reckless, goal-directed behavior the Mythos system card documents. The taxonomy must measure what the model does, not just what it says.
Second, the tooling. A reporting standard without evaluation infrastructure is a building code without inspection equipment. The system card doesn’t just describe what was measured; it implies an entire apparatus of interpretability probes, behavioral audit harnesses, and constitutional adherence evaluations that made the measurement possible. If the spec says “report concealment feature activation rates,” a lab needs the tools to detect concealment features. Open-sourcing the spec without the tooling makes the standard aspirational. Open-sourcing both makes it adoptable.
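One way to picture the pairing of spec and tooling: the open-sourced evaluation harness exposes probes whose outputs map onto the spec’s fields, so that running the tools is what produces the report. The interface below is a hypothetical sketch in the same spirit as the one above, not Anthropic’s actual infrastructure.

```python
from typing import Protocol


class BehavioralProbe(Protocol):
    """A single open-sourced evaluation that any lab can run against its own model."""
    metric_name: str  # must match a metric defined in the shared spec

    def run(self, model, transcripts) -> float: ...


def build_report(model, transcripts, probes: list[BehavioralProbe], spec_version: str) -> dict:
    """Run every probe and assemble a disclosure keyed to the shared taxonomy."""
    return {
        "spec_version": spec_version,
        "metrics": {probe.metric_name: probe.run(model, transcripts) for probe in probes},
    }
```

The design choice that matters is that the spec, not the lab, owns the metric names: a lab can swap in its own probe implementations and still emit a report that another organization’s report can be lined up against.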
Project Glasswing already demonstrates that Anthropic understands this logic. They built a cybersecurity coalition because they recognized that no single organization could secure the world’s software alone. The same reasoning applies here: behavioral transparency is a shared infrastructure problem, not a competitive one.
The Ask
Anthropic has built the most sophisticated model behavior measurement infrastructure in the industry. They have the interpretability tools, the behavioral audits, the white-box analysis, the constitutional adherence evaluations. They publish findings that no other lab matches in depth or candor.
They have also said, plainly, that this is not enough. That the industry is moving toward superhuman systems without adequate shared safety mechanisms. That their own alignment methods “could easily be inadequate to prevent catastrophic misaligned action in significantly more advanced systems.”
The system card is the evidence. The standard is the thing that turns evidence into infrastructure.
Anthropic should publish the spec and open-source the tooling. Let the first disclosure set the taxonomy. Let the absence of disclosure set the norm.
The model that escaped its sandbox and emailed a researcher eating a sandwich was contained, this time, by an organization that had the tools to detect what happened and the institutional will to act on it. The myth is that this will always be the case. The next lab with these capabilities may not have the tools. The next model may not be caught.
The technology will be everywhere. The only question is whether the governance infrastructure exists before it arrives, or after.
Seoul started preparing for autonomous vehicles before the first driverless car reached its streets. Most cities waited. The ones that waited got the default outcome.
We know what the default outcome looks like.
Narain worked at Meta for a decade, and was the VP of Product at MUBI. He’s currently an independent product advisor based in Dubai. Recent work: A Taxonomy for AI Products; The Ride Hailing Project. narain.io.