What is AI actually good for?

I regularly speak to people who have a magical view of AI and want ‘agents everywhere’. LLMs can do incredible things when properly harnessed, but they can also produce the electronic equivalent of ‘hot air’ (total nonsense) whilst burning cash and electricity. And a lot of ‘agents’ are really just good old-fashioned workflows, possibly with some LLM calls in them.
Hopefully this post gives you a clearer way to think about where AI is worth deploying and where it is not.
Start with what an LLM actually is
An LLM is a next-token prediction engine. It reads the text in its context window (the system prompt, your prompts, and its own earlier responses), assigns a probability to each possible next token (a token is roughly 4 characters, or about 0.75 words in English), picks one, and then repeats.
That leads to two important consequences:
- First, output is probabilistic, not deterministic.
  - A deterministic system, like a calculator, gives the same output for the same input every time.
  - A probabilistic system can produce slightly different outputs from the same input, and sometimes be confidently wrong.
- Second, the LLM does not understand what it's saying. It is very good at producing text that sounds right, but whether it is right is a separate question, and the model cannot tell the difference.
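To make the difference between deterministic and probabilistic output concrete, here is a toy sketch. The probability table is hand-written for illustration (a real model computes these numbers from its context), and the function names `greedy_next` and `sample_next` are made up for this example.

```python
import random

# Toy, hand-written probabilities for one context. Purely illustrative:
# a real LLM computes a distribution like this over its whole vocabulary.
NEXT_TOKEN_PROBS = {
    "The capital of France is": {" Paris": 0.90, " Lyon": 0.06, " Berlin": 0.04},
}


def greedy_next(context: str) -> str:
    """Deterministic: always return the single most likely next token."""
    probs = NEXT_TOKEN_PROBS[context]
    return max(probs, key=probs.get)


def sample_next(context: str, rng: random.Random) -> str:
    """Probabilistic: draw the next token in proportion to its probability."""
    probs = NEXT_TOKEN_PROBS[context]
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    context = "The capital of France is"
    rng = random.Random()
    print("greedy :", [greedy_next(context) for _ in range(5)])
    print("sampled:", [sample_next(context, rng) for _ in range(5)])
```

The greedy version behaves like the calculator: same input, same output. The sampled version is usually right, but every so often it will say " Berlin" with exactly the same fluency.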
The compounding error problem
The gap between sounding right and being right hides behind headlines like “the latest model scores 95% on benchmark X”. That sounds strong until you chain five LLM steps together: extract, classify, reason, draft, review. At 95% reliability per step, and assuming each step fails independently, end-to-end reliability is 0.95⁵, or about 77%.
You can improve this, but as with complex human processes, the answer is usually deterministic scaffolding and guardrails such as software tools and reviews.
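The arithmetic is easy to check for yourself. The sketch below assumes every step fails independently of the others, which is a simplification. The `with_retries` helper adds a further assumption that is mine, not a measured result: a deterministic check catches every bad output and the step is simply retried.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds (independent steps)."""
    return per_step ** steps


def with_retries(per_step: float, retries: int) -> float:
    """Per-step success if a perfect validator triggers up to `retries` retries.

    Assumes failures are always detected and attempts are independent,
    which is optimistic for real systems.
    """
    return 1.0 - (1.0 - per_step) ** (retries + 1)


if __name__ == "__main__":
    raw = chain_reliability(0.95, 5)
    guarded = chain_reliability(with_retries(0.95, retries=1), 5)
    print(f"five unguarded steps: {raw:.1%}")        # about 77.4%
    print(f"five validated steps: {guarded:.1%}")    # about 98.8%
```

Even under these generous assumptions, the improvement comes from the validator and the retry logic, not from the model getting smarter.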
Agents, mostly, are workflows with LLM calls in them
Outside coding, where the tooling is genuinely impressive, most so-called "agentic" systems are just workflows with one or more LLM calls. That's not a criticism: a well-designed workflow with a couple of well-placed LLM calls can be hugely valuable. It's just not the autonomous digital employee the demos suggest.
Strip away the marketing and the engineering questions are the same ones we've asked for decades: what's the process, what goes in and out at each step, what happens when something fails, and who's accountable? Process design and error handling matter even more with LLMs, because they're the least predictable part of the chain.
Why this matters for "production-grade" anything
To make an AI feature reliable enough for customers or unattended business use, you need a deterministic wrapper around a probabilistic model. That means:
- Tight prompts with clear constraints.
- Structured outputs (JSON, schemas), not free-form prose.
- Output validation: does it match expectations?
- Deterministic code for the workflow, error paths, and templates.
- Human or multi-LLM review to maintain quality.
The LLM handles what deterministic systems can't: messy language, fluent drafting, and ambiguous classification. Deterministic code does the rest.
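As a rough illustration of that wrapper, here is a minimal sketch of a ticket classifier. Everything named here is hypothetical: `call_llm` stands in for whichever model client you use, and the labels, the prompt, and the JSON shape are invented for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

ALLOWED_LABELS = {"billing", "technical", "other"}  # hypothetical categories


@dataclass
class Classification:
    label: str
    confidence: float


def parse_and_validate(raw: str) -> Optional[Classification]:
    """Return a Classification only if `raw` is JSON in the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    confidence = data.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return Classification(label=label, confidence=float(confidence))


def classify_ticket(
    text: str,
    call_llm: Callable[[str], str],  # hypothetical: prompt in, raw text out
    max_attempts: int = 3,
) -> Optional[Classification]:
    """Ask the model, validate its answer, retry; None means 'send to a human'."""
    prompt = (
        "Classify the support ticket. Reply with JSON only, in the form "
        '{"label": "billing" | "technical" | "other", "confidence": 0.0-1.0}.\n\n'
        f"Ticket: {text}"
    )
    for _ in range(max_attempts):
        result = parse_and_validate(call_llm(prompt))
        if result is not None:
            return result
    return None
```

The model only ever does the fuzzy part (reading the ticket). Parsing, validation, the retry limit, and the fallback to a human are ordinary code with ordinary tests.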
Managing LLMs is not free, and looks suspiciously like managing people
Here's a thought. Managing human talent is one of business's hardest jobs: getting people to do consistent, high-quality work at speed, with sound judgement. LLMs are neural networks, loosely inspired by the brain and trained on vast amounts of human writing, so why expect them to be much easier to manage than the humans they're imitating?
In practice, they aren't. They need clear instructions, guardrails, regular evaluation, and competent oversight. The same skills that make a good manager — clear briefs, attention to detail, and course-correction — also help deploy AI well, with the added need for technical fluency.
The bill, in tokens or in watts
AI has a marginal cost. With hosted, closed models such as those behind ChatGPT, Claude, or Gemini, you pay per token in and out. Multi-step agentic workflows are especially expensive: a query that triggers reasoning, tool calls, and retries can cost 10–20 times more than a one-shot completion. That may be worth it, but the total is hard to model upfront, because you can't know in advance how many retries and corrections compounding probabilistic error will force.
Self-hosting an open-source model doesn't remove the cost; it shifts it to hardware, electricity, and engineering time. Neither option is free, so the idea that AI is now "basically free" is misleading.
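A back-of-the-envelope model shows why multi-step workflows cost a multiple of a single call. The prices below are placeholders, not any provider's real rates, and the assumption that each step re-sends the whole growing conversation is a simplification of how agent loops typically behave.

```python
# Placeholder prices in USD per million tokens. NOT real provider rates.
PRICE_IN_PER_M = 3.0
PRICE_OUT_PER_M = 15.0


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one model call at the placeholder prices above."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000


def agentic_cost(steps: int, base_input_tokens: int, output_tokens_per_step: int) -> float:
    """Cost of a multi-step run where each step re-sends all earlier output."""
    total = 0.0
    context = base_input_tokens
    for _ in range(steps):
        total += call_cost(context, output_tokens_per_step)
        context += output_tokens_per_step  # the next call carries this step's output
    return total


one_shot = call_cost(1_000, 500)
agentic = agentic_cost(steps=6, base_input_tokens=1_000, output_tokens_per_step=500)

if __name__ == "__main__":
    print(f"one-shot: ${one_shot:.4f}")
    print(f"6 steps : ${agentic:.4f} ({agentic / one_shot:.1f}x)")
```

With these made-up numbers the six-step run costs roughly eight times the one-shot call. The exact multiple depends entirely on your step count, context growth, and retries, which is why it is worth measuring on real traffic rather than guessing.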
Jevons' paradox is coming for your hiring plan
William Stanley Jevons observed that when steam engines became more efficient, total coal use rose because coal-powered work became viable in more places. The same pattern may now be emerging with AI and engineering hiring.
The hope was that AI would reduce engineering headcount. In practice, companies using it seriously are hiring more engineers, because configuring, deploying, monitoring, and improving AI systems takes significant technical skill. Prompting, evaluation, retrieval, observability, and guardrails do not build themselves. The productivity gains are real, but they are creating more technical work, not less.
If your AI strategy assumes lower headcount, compare it with what peers are doing: you may cut roles in one area only to add them in another.
Time is the hidden line item
You can do almost anything with AI if you give it enough time. The problem is that almost nobody costs that time honestly.
Every hour spent tweaking prompts, testing outputs, debugging looping agents, or figuring out why a model stopped working is paid for somehow — by an employee whose other work slips, a contractor on day rates, or you at midnight.
That leaves a question many leaders haven't answered: will you train your existing team to use AI well, or hire more technical people who already can? Both are defensible. Doing neither and hoping AI just works is not.
If your business lacks deep technical knowledge, the learning curve is steep. "Anyone can use it" is true for ChatGPT in a browser, but the more complex the task, the more time goes into learning the technical work needed to make it happen.
Why coding is the exception
AI is genuinely strong at coding for a simple reason: code is text with strict syntax rules. Those rules can be learned, and the output can be tested deterministically — does it compile, do the tests pass? The feedback loop is tight, the right answer is knowable, and the model can iterate until it gets there.
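That feedback loop can be sketched in a few lines. In this toy version the "model drafts" are hard-coded strings, and the function under test (`add`) and its test cases are invented for the example. A real setup would run generated code in a sandbox rather than calling `exec` directly.

```python
from typing import Iterable, Optional, Tuple


def passes_tests(source: str) -> bool:
    """Deterministic check: does the draft run, and does add() behave correctly?"""
    namespace: dict = {}
    try:
        exec(source, namespace)  # toy only: never exec untrusted code unsandboxed
        add = namespace["add"]
        return add(2, 3) == 5 and add(-1, 1) == 0
    except Exception:
        # Syntax errors, missing function, or runtime errors all count as failure.
        return False


def first_passing(drafts: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (attempt number, source) for the first draft that passes, else None."""
    for attempt, source in enumerate(drafts, start=1):
        if passes_tests(source):
            return attempt, source
    return None


# Stand-ins for successive model attempts: broken syntax, wrong logic, correct.
DRAFTS = [
    "def add(a, b) return a + b",
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
]

if __name__ == "__main__":
    print(first_passing(DRAFTS))
```

The point is that the loop has an objective stopping condition. There is no equivalent of `passes_tests` for "is this marketing email good?".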
Most other domains are different. A "good" marketing email, a "correct" legal summary, or a "right" strategic recommendation are judgement calls, and the feedback loop is often slow, subjective, or missing. The model can sound convincing, but unless you already know the answer, it's hard to tell if it's right. It's like the colleague who speaks confidently about things they don't understand — fluent, plausible, sometimes correct, and risky if you trust them without checking.
Bottom line
AI is a great tool, but it is not magic, it is not free, and it is no substitute for process design, error handling, and good management. The companies getting value from it treat it as a powerful but inconsistent component to engineer around, and only let it act as an autonomous agent once strict controls are in place, much as they would for a new employee.
If you're a business leader wondering where to start:
- Pick a narrow, well-defined problem where the cost of being wrong is low and the value of being right is clear.
- Be honest about marginal cost: tokens, compute, and the human time to build and maintain what you ship.
- Design the deterministic scaffolding around the LLM, not the other way around.
- Assume you'll need technical skill in-house to do this properly, whether you train it or hire it.
- Don't confuse a demo with a system. A demo looks cool; a system survives real users and customers.
If you’re trying to make sure your team are well set up to take advantage of AI, we can help. Reach out below.👇