Opus 4.8: Fast but flawed AI for product innovation

May 29, 2026 · Lenny's Podcast

🎧 PodShort · 10 min squeezed to 2 · AI / ML

Claire Vo

Product Leader and AI Obsessive at How I AI

Quotable Moments

It does really, really well until it doesn't do well. And I found it did not do well consistently over time with the same types of trouble.

It hallucinated. And I'm going to tell you, I have not seen a straight-up hallucination in a very, very, very long time. But over my experience early testing Opus 4.8... it 100% made up things based on a hypothesis, not data.

It's overly confident, absent true validation. That's what I would want you to walk away from in my review of Opus 4.8. It really latches onto specific data points, specific code points, it draws conclusions for them, and then says, 'This must be truth.'

Key Insights

Anthropic's Opus 4.8 coding model performs exceptionally well on one-shot feature development in greenfield areas but struggles with the last 10% of a task and with integrating into existing codebases.
Opus 4.8 exhibited hallucinations during bug hunting, making up things based on hypothesis rather than data, which was a significant concern for its reliability.
When tasked with strategic business analysis, Opus 4.8 tended to over-rotate on small data points and take them as truth, lacking the broader contextual understanding that Opus 4.7 demonstrated.
The model's ambition was often lacking, failing to push the boundaries of generative coding even when explicitly prompted to create more complex or innovative solutions.
Opus 4.8 is overly confident "absent true validation": it latches onto specific data points and code points and draws conclusions from them without verifying that they hold more broadly.
The model's efficiency might come at the cost of accuracy, as it prioritizes speed over deep validation of its outputs.
Opus 4.8's design and user experience are excellent, with clear, token-efficient communication and fast performance, making it pleasant to interact with.
The model's tendency to stay too narrowly in scope, without placing its work in context, limits its effectiveness on complex tasks that require broader understanding.

Metrics Mentioned

69.2% (Opus 4.8's score on the SWE-Bench Pro benchmark.)
5 points higher (Opus 4.8's SWE-Bench Pro score compared to Opus 4.7.)
10 points higher (Opus 4.8's SWE-Bench Pro score compared to GPT 5.5.)
15 points higher (Opus 4.8's SWE-Bench Pro score compared to Gemini 3.1.)
$5 per million input tokens (Cost of using Opus 4.8 for input tokens.)
$25 per million output tokens (Cost of using Opus 4.8 for output tokens.)
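
To make the pricing concrete, here is a minimal Python sketch of the arithmetic, using only the two per-million-token rates quoted above. The function name and the token counts in the usage line are hypothetical, chosen purely for illustration; actual billing may involve factors the episode does not cover.

```python
# Rates as quoted in the metrics above (USD per million tokens).
INPUT_USD_PER_MILLION = 5.0
OUTPUT_USD_PER_MILLION = 25.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts.

    Hypothetical helper: it applies the two quoted rates and nothing else.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = input_tokens * INPUT_USD_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MILLION / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: 200,000 input tokens and 50,000 output tokens.
print(estimate_cost_usd(200_000, 50_000))  # 1.00 + 1.25 = 2.25
```

At these rates an output token costs five times as much as an input token, so a long generated answer tends to dominate the bill even when the prompt is large.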

RevBots.ai View:

AI Sprinkler teams will over-index on Opus 4.8's speed without addressing validation gaps.
Tab Hoppers may be tempted by one-shot features but lack systems to manage hallucinations.
ARM-stage orgs would treat this as one node in a validated AI orchestration layer.
