Claude Opus 4.8, DeepSWE, and Cursor's Dev Report

Together with

Hello nerds

It’s Sloth Bytes time. I hope you had a fun week. Today we've got some interesting news.

Talk to your AI tools the way you'd talk to a colleague.

You don't send a colleague a three-word brief. You explain the context, the constraints, what you've already tried. But typing all that into ChatGPT takes forever — so you don't.

Wispr Flow lets you speak your prompts instead. Talk through your thinking naturally and get clean, paste-ready text. No filler words. No cleanup. Just detailed prompts that actually get you useful answers on the first try.

Millions of users worldwide. Works system-wide on Mac, Windows, and iPhone.

Try Wispr Flow free

Anthropic shipped Opus 4.8

You already know the drill. New Claude, new number, new "it's better at coding" post. Claude Opus 4.8 is here and it’s available everywhere right now:

Price: Same as the old one. $5 per million input tokens, $25 per million output.
Fast mode: Runs ~2.5x faster and is now cheaper. $10 per million input and $50 per million output. Turn it on with /fast in Claude Code.
Coding: SWE-Bench Pro improved from 64.3% → 69.2%. Now ahead of GPT-5.5.
Artificial Analysis now ranks it the #1 model overall, with a 61.4 on its Intelligence Index, 1.2 points ahead of GPT-5.5. They also flag it as expensive for its tier because it’s slower than average and very verbose, so it's not free of tradeoffs.
It’s more honest: It's roughly 4x less likely than 4.7 to let flaws in its own code slide by unmentioned. Basically it’s more likely to tell you it had skill issues.
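If you want to see what those prices mean for an actual workload, here's a quick back-of-the-napkin sketch. The per-million-token prices come from the list above. The token counts and the helper name (estimate_cost) are made up for illustration, so plug in your own numbers.

```python
# Prices in USD per million tokens, taken from the list above.
PRICES = {
    "standard": {"input": 5.00, "output": 25.00},
    "fast": {"input": 10.00, "output": 50.00},
}


def estimate_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> float:
    """Estimate the cost in USD of one workload (hypothetical helper)."""
    price = PRICES[mode]
    input_cost = (input_tokens / 1_000_000) * price["input"]
    output_cost = (output_tokens / 1_000_000) * price["output"]
    return input_cost + output_cost


# Hypothetical workload: 2M input tokens and 500K output tokens.
standard = estimate_cost(2_000_000, 500_000, "standard")
fast = estimate_cost(2_000_000, 500_000, "fast")
print(f"standard: ${standard:.2f}, fast: ${fast:.2f}")
# prints "standard: $22.50, fast: $45.00"
```

With these list prices, fast mode costs twice as much as standard mode for the same tokens, so the real question is whether ~2.5x the speed is worth it for your workload.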

Other stuff they announced:

You can now choose how much effort Claude puts into a response (I’m surprised this wasn’t already a thing)
Dynamic Workflows: Lets Claude do bigger tasks by planning and running hundreds of subagents. Yep, hundreds. RIP your usage.

A new AI benchmark that’s realistic?

For months, top coding models have looked basically identical on public leaderboards. A startup called Datacurve noticed the benchmarks themselves were the problem:

They're contaminated: Tasks are pulled from existing GitHub PRs and commits, so the answers were already on the internet and likely baked into training data
- Claude Opus was even caught passing tasks by running git log --all and pulling the answer straight out of the test environment.
Too easy: SWE-Bench Pro tasks average just 120 lines of code to solve. Not exactly real-world complexity or useful when measuring large tasks.
Unreliable grading: SWE-Bench Pro's automated graders were wrong on ~32% of reviewed trials.

Keep in mind businesses make million-dollar decisions based on these benchmark results, so it’s important they’re useful and reliable.

Which is why they built a new benchmark called DeepSWE that’s meant to be an improvement and more reliable:

Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
Real-world complexity: Prompts are half the length of SWE-Bench Pro's, but the solutions require 5.5x more code and ~2x more output tokens.
Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

Here are the results:

Some interesting findings beyond just the scores:

Claude misses multi-part requirements more than any other model.
GPT reads prompts literally and does exactly what's asked. Precise, consistent, converges on the same answer across runs.
Stronger models (Opus 4.7, GPT-5.4) write their own tests unprompted on 80%+ of runs. Weaker ones don’t write tests (not good…)
Spending more doesn't help: output tokens, runtime, and cost had zero correlation with actually solving tasks.

What people are saying about this:

Some devs say the results match real-world experience: Claude tends to miss multi-part requirements, while GPT reads prompts literally, delivers exactly what's asked, and is very thorough with the implementation.
Skeptics point out this "contamination free" label won’t last long. Once it's public, the companies will optimize for it.

The Cursor Developer Habits Report

Cursor is the AI code editor used by engineering teams at Nvidia, Adobe, Uber, Shopify, Stripe, and OpenAI, and they just published their first-ever Developer Habits Report. It contains useful data about how developers and professionals are using AI to code.

Here's what stood out:

Code output has more than doubled: Developers went from writing ~3,600 lines per week in early 2025 to 8,600 lines per week by May 2026.
PRs are getting massive: Lines of code per pull request are up ~2.5x year-over-year. "Mega PRs" (1,000+ lines changed) now account for nearly 14% of all merged PRs, up from 8% a year ago.
AI code is surviving review more often: In January 2026, about 76% of AI-generated lines were still in the codebase 60 minutes after being accepted. That number is now 81%, meaning the output is getting more useful, not just faster.
The top 1% is shipping like crazy: The top users produce 46x more lines per week and merge 15x more PRs than the median developer, and the gap is widening every month.
- But remember, quantity of code DOES NOT EQUAL quality of code.
Agent sessions are getting deeper: In just the last two months, average tool calls per agent session rose roughly 30%, with agents spending more time reading files, running shell commands, and searching code before producing output.
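If you want to sanity-check those headline claims, the arithmetic is quick. This is just a sketch using the rounded figures quoted above ("~3,600", "nearly 14%"), so treat the outputs as approximate. The helper name growth_factor is made up for illustration.

```python
def growth_factor(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before` (hypothetical helper)."""
    return after / before


# Lines written per week: ~3,600 (early 2025) vs 8,600 (May 2026).
lines_growth = growth_factor(3_600, 8_600)

# Share of merged PRs that are "Mega PRs": 8% a year ago vs nearly 14% now.
mega_pr_growth = growth_factor(8, 14)

# AI-generated lines surviving 60 minutes: 76% -> 81%, in percentage points.
survival_gain = 81 - 76

print(f"lines per week: {lines_growth:.2f}x")     # prints "lines per week: 2.39x"
print(f"mega PR share: {mega_pr_growth:.2f}x")    # prints "mega PR share: 1.75x"
print(f"survival: +{survival_gain} points")       # prints "survival: +5 points"
```

So "more than doubled" checks out at roughly 2.4x, and the share of Mega PRs grew by about 1.75x.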

This data comes directly from Cursor, so take it with a grain of salt. They have an obvious incentive to make these numbers look good. That said, the trends are hard to ignore when so many professional engineering teams use it.

AI-Assisted Engineers Are Burning Out, Is This Fine? - A blog post breaking down why shipping more code with AI can leave you feeling worse, not better, and what to actually do about it.

CI/CD security: how to secure your GitHub ecosystem - Datadog walks through the stuff in your GitHub setup that quietly leaks secrets and grants too much access.

How is Linear so fast? A technical breakdown - A deep look at the tricks behind Linear's almost suspicious snappiness, and what you can steal for your own app.

Engineering metrics for beginners - A helpful intro to which engineering metrics actually mean something and which ones are just for show.

Firecrawl - An API that scrapes and crawls any website into clean, LLM-ready data so you stop writing brittle scrapers at 2am.

API-Security-Checklist - A free checklist of every security thing you were supposed to do before shipping that API and definitely didn't.

ai-engineering-from-scratch - A build-it-yourself repo for learning AI engineering by shipping real things instead of watching 40 hours of tutorials.

Responsively - A browser that shows your site on a bunch of screen sizes at once, so responsive testing stops being a window-resizing rage ritual.


TikToks are still going strong

@thecodingsloth
Build your own terrible version of popular technologies #programming #cs #coding #softwareengineer

If you can’t access TikTok, here’s the Instagram version.

That’s all from me!

Have a great week, be safe, make good choices, and have fun coding.

If I made a mistake or you have any questions, feel free to comment below or reply to the email!

See you all next week.

What'd you think of today's email?

Want to advertise in Sloth Bytes?

If your company is interested in reaching an audience of developers and programming enthusiasts, you may want to advertise with us here.

🦥Claude got caught cheating