Claude was the best SWE model. Now people are asking questions.
For the last few weeks, Claude was pretty clearly the model to beat for software work. That is why the current mood around it stands out.
Overview
There is now a public issue in Anthropic's 'claude-code' repo where one user documented what they believe is a real drop in quality in complex engineering sessions. The write-up is based on logs, not just frustration, and the broad claim is that Claude started behaving in a shallower way: less exploration, more editing before fully understanding the code, and more back-and-forth to get to a correct result.
Anthropic did respond. Boris Cherny said that thinking redaction was a UI-only change and pointed instead to adaptive thinking and lower default effort as more likely reasons users were noticing a shift. That does not settle the question, but it does make one thing clear: even Anthropic seems to accept that people are seeing a different experience.
The Mood
The same tone is showing up across developer discussions too. The wording varies, but the pattern is familiar: Claude feels less thorough, makes more avoidable mistakes, skips over instructions it used to handle fine, and burns more prompts getting to the same outcome.
That does not automatically prove regression. But when a lot of experienced users start describing the same type of friction at roughly the same time, it is worth taking seriously.
My Take
I do not have the same level of proof as the GitHub issue author, so this part is purely anecdotal. A few weeks ago, some tasks that would normally take one or two prompts now take two to five times more iteration. That is not a formal benchmark. It is just what I have been seeing in practice.
That is enough for me to think we should pay attention, especially because switching models or providers is not a small choice for us. We are not deciding for one or two developers. We are deciding for the organization, so the bar for switching should be high.
What Happens Next
I will be running a comparative SWE performance test between Claude Opus 4.6, Claude Sonnet 4.6, and GPT-5.4. The goal is not to chase whatever is getting the most praise this week. The goal is to test which model is actually more reliable for the kind of work we do: reading before editing, following conventions, staying accurate across longer sessions, and reducing prompt churn.
In the meantime, I would like us to gather our own signal internally. If you have noticed a change in Claude’s performance, share it with specifics:
· What were you trying to do?
· How many prompts or retries did it take versus your normal baseline?
· What failed exactly: instruction-following, repo reading, code quality, hallucinated APIs, or something else?
· Did changing effort, prompt style, or model variant improve anything?
This is not about piling on Anthropic. It is about figuring out whether there is a real enough pattern here to affect team-wide tooling decisions.
Sources → GitHub issue, Boris Cherny comment