Scale AI benchmarked top models (Claude, Gemini, and OpenAI's models) on real multi-file software engineering tasks drawn from production-grade code and found that they solved only 20-30% of the tasks.

tech

Videos

Confidence: 100%

First Seen: 4/8/2026

Last Seen: 4/8/2026

Source Videos (1)

AI Layoffs Have Completely Backfired (here's the proof) - YouTube

Tech With Soleyman

7:06

Related Claims

Anthropic's Claude model was the first agentic AI system capable of productive work like software coding.

tech, 1 video

On SWE-Bench Pro, Claude Mythos achieved a 78% score, while Opus previously scored 53% and GPT 5.4 scored 57.7%.

tech, 1 video

DeepSWE is a new benchmark for coding agents that measures their ability to handle real software engineering work across 91 active open-source repositories, using short, realistic prompts.

tech, 1 video