AI Excellence Newsletter

The Moat Is Shrinking. The Bigger Problem Is Usage Limits.

For a while, the question felt simple: which model is actually best for software work? I think that question matters less now than it did even a few months ago.


AI News

Overview

DeepSeek, Qwen, Kimi, Claude, GPT, Gemini, even Grok. At this point, all of the frontier models can build foundation-level code. They are not identical, and I am not saying the differences are fake. Some are still better at certain things. Some feel sharper in longer sessions. Some are better at following instructions cleanly. But the gap is not wide enough anymore to act like one lab still has some huge technical moat all to itself. That part is shrinking fast.

What Matters Now

If raw model capability is flattening out, then the next battle looks a lot more like ecosystem.

Claude has Cowork. GPT now has similar integrations through Codex. Everyone is trying to become the place where the work actually happens, not just the model you occasionally ask a question.

That feels very familiar. Best specs do not always win. The product that fits into your workflow cleanly usually does. Apple did not win by being a benchmark chart. They won because the whole thing felt seamless. I think that is where this is going too.

What This Means For Us

Honestly, I do not know yet.

We are a software engineering services company, so what we care about most is SWE capacity. Design and QA are secondary. Documentations and maybe even some management work are interesting, but they are mostly adjacent. The real question for us is pretty simple: which setup actually helps engineers ship more work with less friction?

A few months ago, Claude had a very obvious lead here. Opus and Sonnet models were unarguably the best LLM models to use, it felt like the best coding model by enough margin that the tradeoff was easy to accept.

I do not think that lead looks as obvious now. The capability gap has narrowed, and the developer community seems a lot more willing to look elsewhere, partly because the usage limits are just annoying enough to change behaviour. Claude is restrictive. GPT is not exactly generous either. If one five-hour cap eats a big chunk of your weekly allowance, that stops feeling like a product detail and starts feeling like an operational problem.

The Real Problem Right Now

Right now, I think our biggest problem is not model quality.

It is usage limits.

If you are using these tools heavily for SWE work, you burn through headroom fast. Once that happens, the conversation changes from “which model is smartest?” to “how do we stop wasting tokens and stretching context for no reason?”

So the thing I care about most right now is context management. Not because it is exciting. Some of it is literally caveman. But it might be the thing that matters most in practice.

  • caveman. This skill makes the whole conversation happen in caveman-style. Very blunt. Very funny. Also useful. If the model says less, we get more room to work.
  • lean-ctx. This one is getting more interesting. The claim is that it cuts context bloat before the payload even gets sent to the provider. If that holds up, it could be one of the more practical tools in this whole layer.

That is the phase I think we are in now. Less obsessing over who won this week’s model discourse, more trying to keep the workflow usable when the limits show up.

Test. Observe. Adapt. That is probably the most honest answer I have right now.

 

What we are trying → caveman  ·  lean-ctx

Let's talk about everything!

Thank you for contacting us!

We'll be in touch with you shortly.