PredictAP Blog

Claude: The Demo Works. That's the Problem!

There's a moment happening in boardrooms and Slack channels everywhere right now. Someone shares a screen, pastes something into Claude, and in seconds gets back a result that would have taken a team weeks to build two years ago. And then someone says it.

"Why are we paying for that? We could just build it."

Sometimes they're right. But the cases where they're wrong are getting more expensive by the month, and the damage is subtle enough that most organizations won't see it coming until they're 18 months and a significant budget overrun deep into a system that half-works.

The better the demo looks, the more careful I get.

Build vs. Buy Has Always Been a Real Question

This isn't new. Every generation of technology lowers the bar for certain categories of software and triggers a fresh round of "why are we buying this?"

Dan Hockenmaier's recent essay, The Software Shakeout: What Is Durable and What Is Not in the Age of AI, put it well, citing Steven Sinofsky: "Whatever the world thought would end just ended up being vastly larger than anyone thought. And the thing that people thought would forever be replaced was not simply legacy but ended up being a key enabler."

The bar for building has genuinely come down for a real class of problems. If your use case is self-contained, the right answer is mostly visible in the data in front of you, and the failure modes are low-stakes, you should probably build. Internal reporting dashboards, document scrapers and OCR pipelines, first-pass classification tools, simple data aggregators, draft generation for routine communications. The easy demo for these problems is easy for a reason, and the production version isn't that much harder.

The mistake is assuming this logic extends everywhere.

The Easy Demo Is a Trap

Large language models are extraordinarily good at appearing to solve problems they haven't actually solved. Most demos prove capability. Production tests accountability.

You build a proof of concept in a weekend. It handles 80% of cases beautifully. Leadership is impressed. You get the green light.

What you demonstrated is that the model reasons well when the answer is in the room, when the correct output can be inferred from what's in the prompt, the document, or a clean lookup. What you did not demonstrate is whether your real problem is that kind of problem.

This is the science project phase. And the science project phase is where a lot of internal AI builds live permanently, even when everyone insists it is "almost there."

Knowledge-Based Work Is a Different Animal

There is a category of work where the right answer is not contained in the data in front of you. It lives in organizational memory, in years of accumulated judgment, in exceptions and precedents that shaped how the business actually runs versus how it's supposed to run on paper.

A few examples make this concrete.

A customer asks for a refund. Policy says no. But leadership precedent says yes if churn risk is high, the customer is strategic, and the issue was caused by a known outage. None of that context is reliably present in the ticket. A confident "no" is not just wrong. It is expensive.

A vendor looks approved on paper. But an internal incident last year created an informal rule requiring additional review for this category, except for renewals under a certain threshold. That rule exists in people, not in the dataset.

A transaction matches the usual pattern. But this quarter the business restructured, and the normal treatment is now wrong unless you know the reason behind the change.

Or take real estate. A multifamily investor asks about "per unit" rehab cost. A general model interprets that as unit-level economics: appliances, fixtures, finishes. What the investor means is "per door" cost across the entire project, the standard way the industry aggregates rehab budgets. Same words, completely different question. A platform purpose-built for that vertical knows the difference on day one because the domain knowledge is baked in. A general model gets it wrong confidently, and the operator doesn't always catch it because the answer looks reasonable on the surface. Multiply that by every piece of industry-specific terminology, every shorthand, every calculation convention that professionals use without thinking, and you begin to understand the real scope of what "training the context" actually means. A modern software firm serving that vertical has spent years mapping all of it out so the customer gets a usable answer, not a plausible-sounding one.

And this problem runs deeper than terminology. Consider occupancy. Real estate as an industry has spent decades arguing about what that word means. Is it physical occupancy, the share of units where someone is living? Economic occupancy, the share of gross potential rent actually being collected? Leased occupancy, units with a signed lease regardless of move-in date? Each definition produces a different number from the same property, and each matters for a different decision. Lenders care about one. Asset managers care about another. Operators care about a third. A general model asked about occupancy will give you an answer. It almost certainly will not ask you which definition you need, and it may not even know the question is ambiguous. Purpose-built software in this space has had to resolve exactly these definitional battles, encode the right answer for the right context, and surface it correctly based on who is asking and why. That is not a prompting problem. That is years of product work distilled into a system that actually knows what you mean.
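The three definitions really do diverge on the same property. A quick sketch makes it concrete (the numbers here are hypothetical, just to show the spread):

```python
# Hypothetical 100-unit property: one dataset, three different "occupancy" answers.
total_units = 100
physically_occupied = 90        # units with a resident actually living there
leased = 94                     # signed leases, including future move-ins
gross_potential_rent = 150_000  # monthly rent if every unit rented at market
rent_collected = 126_000        # rent actually collected this month

physical_occupancy = physically_occupied / total_units      # what operators watch
leased_occupancy = leased / total_units                     # what leasing reports
economic_occupancy = rent_collected / gross_potential_rent  # what lenders care about

print(f"physical: {physical_occupancy:.0%}")  # 90%
print(f"leased:   {leased_occupancy:.0%}")    # 94%
print(f"economic: {economic_occupancy:.0%}")  # 84%
```

Same property, a six-point spread depending on which question you meant. A system that doesn't know which definition the asker needs can hand back a number that is correct by one definition and misleading for the decision at hand.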

In these workflows, prompting can help, but it cannot replace institutional knowledge. And this is exactly where internal builds fail quietly while the demo looked great.

The right answer can contradict the surface data. The model doesn't know what it doesn't know, and unlike a well-designed specialized system, it won't reliably tell you when to stop or escalate. At low volume you catch the errors. At scale you don't, and that's when it stops being a technical problem and becomes a business one. The edge cases your proof of concept missed are also, almost always, the highest-stakes outcomes: disputes, write-offs, compliance exposure, customer churn.

And learning is infrastructure, not prompting. A knowledge-based system must get smarter from corrections without introducing new errors elsewhere. That means capturing the correction, validating it, propagating it safely, and being able to audit and roll it back. This is less about model cleverness and more about engineering discipline most internal teams never planned for.
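Even the minimal version of that loop implies real machinery. A hypothetical sketch of the data model (not any vendor's actual implementation) shows why: every correction has to be captured, reviewed before it influences anything, auditable forever, and reversible:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Correction:
    """One human correction to a model output, kept as an auditable record."""
    case_id: str
    model_output: str
    corrected_output: str
    reason: str
    validated: bool = False  # must be reviewed before it propagates anywhere
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class CorrectionLog:
    """Append-only: records are never deleted, only activated or deactivated,
    so every change the system 'learned' can be audited and rolled back."""

    def __init__(self) -> None:
        self._entries: list[Correction] = []

    def capture(self, c: Correction) -> None:
        self._entries.append(c)

    def validate(self, case_id: str) -> None:
        for c in self._entries:
            if c.case_id == case_id:
                c.validated = True

    def active(self) -> list[Correction]:
        # Only validated corrections feed back into prompts or retrieval.
        return [c for c in self._entries if c.validated]

    def roll_back(self, case_id: str) -> None:
        # Rollback deactivates the correction; the record survives for audit.
        for c in self._entries:
            if c.case_id == case_id:
                c.validated = False
```

That is the toy version. The production version adds conflict detection, propagation without regressions, and monitoring — the unglamorous engineering the demo never has to show.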

The Compounding Value Problem

This is where Dan's framework connects directly to the build-vs-buy decision.

He argues that the software companies proving most durable in the AI era are those whose products compound with scale. More data makes the product smarter, network effects deepen, models improve with every customer interaction. As he writes, historically the greatest value accrued to companies that "scaled networks, improved the structural economics of an industry with data, took on outsized risk, and navigated legal and regulatory complexity." That's the world we're headed back to.

This is precisely what knowledge-based systems require and precisely what's hardest to replicate from scratch. When a purpose-built vendor has refined their system across thousands of customers and millions of edge cases, the gap between their product and your internal build isn't the code. It's the accumulated intelligence the code runs on. You start at zero. They've been compounding for years.

Dan also surfaces a pointed observation from Aaron Levie on accountability. When something goes wrong with a mission-critical system, and it will, you want a vendor you can hold responsible. You cannot sue your internal IT team. You certainly cannot sue Anthropic.

And Then There's Maintenance

Even when the initial build goes well, maintenance is where most internal AI projects quietly collapse.

Your business changes. Your data changes. Your policies evolve. Integrations break. The underlying model gets updated. Edge cases accumulate. What worked at launch starts to drift, sometimes gradually, sometimes overnight.

At that point you are no longer building a feature. You are running a permanent program: monitoring, evaluation, governance, continuous improvement, and incident response. If that ownership is not explicit from day one, the project will degrade. Not dramatically. Subtly. Until the organization stops trusting it.

Dan's framework distinguishes between products that compound with scale and products that erode without sustained investment. Internal builds almost universally erode because the team moves on, the business changes around the system, and no one owns the feedback loop.

The Right Questions

The question was never whether you can build it. With today's tools you can build almost anything to proof-of-concept level. The bar has come down. That part is real.

The right questions are whether the correct answer is contained in the data or whether it depends on organizational memory and evolving context. What happens in the tail, the edge cases and high-stakes exceptions that never showed up in the demo. How you will detect confident wrong answers and audit and reverse them safely. Who owns the system a year from now when the model drifts, the workflow changes, and the original builder has moved on. And whether the system compounds with scale or erodes without sustained investment.

Claude is a remarkable reasoning engine. But a reasoning engine is not the same thing as a production system for knowledge-intensive work.

The demo is the easy part. The hard part is everything the demo does not show.

Where have you seen the tail kill an internal build?

Link to Dan's full essay: https://www.danhock.co/p/the-software-shakeout-what-is-durable?r=pm2zu&utm_medium=ios&triedRedirect=true