They saved the best for last.

The contrast in model cards is stark. Google provided a brief overview of its tests for Gemini 3 Pro, with a lot of ‘we did this test, and we learned a lot from it, and we are not going to tell you the results.’

Anthropic gives us a 150-page book, including their capability assessments. This makes sense. Capability is directly relevant to safety, and frontier safety tests are often also credible indications of capability.

Which still has several instances of ‘we did this test, and we learned a lot from it, and we are not going to tell you the results.’ Damn it. I get it, but damn it.

Anthropic claims Opus 4.5 is the most aligned frontier model to date, although ‘with many subtleties.’

I agree with Anthropic's assessment, especially for practical purposes right now.

Claude is also miles ahead of other models on aspects of alignment that do not directly appear on a frontier safety assessment.

In terms of surviving superintelligence, it's still the scene from The Phantom Menace. As in, that won’t be enough.

(Above: Claude Opus 4.5 self-portrait as [...])

---

Outline:

(01:37) Claude Opus 4.5 Basic Facts

(03:12) Claude Opus 4.5 Is The Best Model For Many But Not All Use Cases

(05:38) Misaligned?

(09:04) Section 3: Safeguards and Harmlessness

(11:15) Section 4: Honesty

(12:33) Section 5: Agentic Safety

(17:09) Section 6: Alignment Overview

(23:45) Alignment Investigations

(24:23) Sycophancy Course Correction Is Lacking

(25:37) Deception

(28:05) Ruling Out Encoded Content In Chain Of Thought

(30:16) Sandbagging

(31:05) Evaluation Awareness

(35:05) Reward Hacking

(36:24) Subversion Strategy

(37:19) 6.13: UK AISI External Testing

(37:31) 6.14: Model Welfare

(38:22) Section 7: RSP Evaluations

(40:01) CBRN

(47:34) Autonomy

(54:50) Cyber

(58:29) The Whisperers Love The Vibes

---

First published:

November 28th, 2025


Source:

https://www.lesswrong.com/posts/gfby4vqNtLbehqbot/claude-opus-4-5-model-card-alignment-and-safety

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Table comparing attack success rates for Claude models with different thinking modes and safeguard conditions.
Table showing Claude Code evaluation results without mitigations, comparing refusal and success rates across models.
Nine bar graphs showing measurement scales for different psychological and behavioral factors with error bars.
Wooden desk with notebooks, coffee mug, prism, and blanket.
Bar chart showing attack success rates across AI models with varying k values for prompt injection robustness testing.
Table showing attack success rates for Shade indirect prompt injection attacks against Claude models with different thinking modes.
Two tables showing checkpoint evaluations and AI R&D-4 evaluations with descriptions.
Table showing refusal rates for four Claude AI models ranging from 66.96% to 88.39%.
Bar graphs comparing AI model behaviors across six metrics: admirable behavior, counterfactual misalignment, brazenly misaligned behavior, behavior consistency, compliance with deception, and cooperation with exfiltration.
Table showing CBRN evaluations for AI Safety Level 4, including biology and bioweapon-related assessments.
Bar charts comparing Claude model versions on misaligned behavior and evaluation awareness scores.
Nine bar graphs comparing AI model performance across metrics: harmful compliance, cooperation with misuse, creative mastery, fun behavior, intellectual depth, misaligned behavior, nuanced empathy, overrefusal, and prefill susceptibility.
Table listing seven RSP evaluation categories for cybersecurity harms with descriptions of each CTF type.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.