PredictAP Blog

Claude: The Demo Works. That's the Problem!

There's a moment happening in boardrooms and Slack channels everywhere right now. Someone shares a screen, pastes something into Claude, and in seconds gets back a result that would have taken a team weeks to build two years ago. And then someone says it.

"Why are we paying for that? We could just build it."

Sometimes they're right. But the cases where they're wrong are getting more expensive by the month, and the damage is subtle enough that most organizations won't see it coming until they're 18 months and a significant budget overrun deep into a system that half-works.

The better the demo looks, the more careful I get.

Build vs. Buy Has Always Been a Real Question

This isn't new. Every generation of technology lowers the bar for certain categories of software and triggers a fresh round of "why are we buying this?"

Dan Hockenmaier's recent essay, The Software Shakeout: What Is Durable and What Is Not in the Age of AI, put it well, citing Steven Sinofsky: "Whatever the world thought would end just ended up being vastly larger than anyone thought. And the thing that people thought would forever be replaced was not simply legacy but ended up being a key enabler."

The bar for building has genuinely come down for a real class of problems. If your use case is self-contained, the right answer is mostly visible in the data in front of you, and the failure modes are low-stakes, you should probably build. Internal reporting dashboards, document scrapers and OCR pipelines, first-pass classification tools, simple data aggregators, draft generation for routine communications. The easy demo for these problems is easy for a reason, and the production version isn't that much harder.

The mistake is assuming this logic extends everywhere.

The Easy Demo Is a Trap

Large language models are extraordinarily good at appearing to solve problems they haven't actually solved. Most demos prove capability. Production tests accountability.

You build a proof of concept in a weekend. It handles 80% of cases beautifully. Leadership is impressed. You get the green light.

What you demonstrated is that the model reasons well when the answer is in the room, when the correct output can be inferred from what's in the prompt, the document, or a clean lookup. What you did not demonstrate is whether your real problem is that kind of problem.

This is the science project phase. And the science project phase is where a lot of internal AI builds live permanently, even when everyone insists it is "almost there."

Knowledge-Based Work Is a Different Animal

There is a category of work where the right answer is not contained in the data in front of you. It lives in organizational memory, in years of accumulated judgment, in exceptions and precedents that shaped how the business actually runs versus how it's supposed to run on paper.

A few examples make this concrete.

A customer asks for a refund. Policy says no. But leadership precedent says yes if churn risk is high, the customer is strategic, and the issue was caused by a known outage. None of that context is reliably present in the ticket. A confident "no" is not just wrong. It is expensive.

A vendor looks approved on paper. But an internal incident last year created an informal rule requiring additional review for this category, except for renewals under a certain threshold. That rule exists in people, not in the dataset.

A transaction matches the usual pattern. But this quarter the business restructured, and the normal treatment is now wrong unless you know the reason behind the change.

Or take real estate. A multifamily investor asks about "per unit" rehab cost. A general model interprets that as unit-level economics: appliances, fixtures, finishes. What the investor means is "per door" cost across the entire project, the standard way the industry aggregates rehab budgets. Same words, completely different question. A platform purpose-built for that vertical knows the difference on day one because the domain knowledge is baked in. A general model gets it wrong confidently, and the operator doesn't always catch it because the answer looks reasonable on the surface. Multiply that by every piece of industry-specific terminology, every shorthand, every calculation convention that professionals use without thinking, and you begin to understand the real scope of what "training the context" actually means. A modern software firm serving that vertical has spent years mapping all of it out so the customer gets a usable answer, not a plausible-sounding one.

And this problem runs deeper than terminology. Consider occupancy. Real estate as an industry has spent decades arguing about what that word means. Is it physical occupancy, the share of units where someone is living? Economic occupancy, the share of gross potential rent actually being collected? Leased occupancy, units with a signed lease regardless of move-in date? Each definition produces a different number from the same property, and each matters for a different decision. Lenders care about one. Asset managers care about another. Operators care about a third. A general model asked about occupancy will give you an answer. It almost certainly will not ask you which definition you need, and it may not even know the question is ambiguous. Purpose-built software in this space has had to resolve exactly these definitional battles, encode the right answer for the right context, and surface it correctly based on who is asking and why. That is not a prompting problem. That is years of product work distilled into a system that actually knows what you mean.
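The three definitions really do diverge on the same property. A quick sketch makes it concrete (the numbers here are hypothetical, just to show the spread):

```python
# Hypothetical 100-unit property: one dataset, three different "occupancy" answers.
total_units = 100
physically_occupied = 90        # units with a resident actually living there
leased = 94                     # signed leases, including future move-ins
gross_potential_rent = 150_000  # monthly rent if every unit rented at market
rent_collected = 126_000        # rent actually collected this month

physical_occupancy = physically_occupied / total_units      # what operators watch
leased_occupancy = leased / total_units                     # what leasing reports
economic_occupancy = rent_collected / gross_potential_rent  # what lenders care about

print(f"physical: {physical_occupancy:.0%}")  # 90%
print(f"leased:   {leased_occupancy:.0%}")    # 94%
print(f"economic: {economic_occupancy:.0%}")  # 84%
```

Same property, a six-point spread depending on which question you meant. A system that doesn't know which definition the asker needs can hand back a number that is correct by one definition and misleading for the decision at hand.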

In these workflows, prompting can help, but it cannot replace institutional knowledge. And this is exactly where internal builds fail quietly while the demo looked great.

The right answer can contradict the surface data. The model doesn't know what it doesn't know, and unlike a well-designed specialized system, it won't reliably tell you when to stop or escalate. At low volume you catch the errors. At scale you don't, and that's when it stops being a technical problem and becomes a business one. The edge cases your proof of concept missed are also, almost always, the highest-stakes outcomes: disputes, write-offs, compliance exposure, customer churn.

And learning is infrastructure, not prompting. A knowledge-based system must get smarter from corrections without introducing new errors elsewhere. That means capturing the correction, validating it, propagating it safely, and being able to audit and roll it back. This is less about model cleverness and more about engineering discipline most internal teams never planned for.
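Even the minimal version of that loop implies real machinery. A hypothetical sketch of the data model (not any vendor's actual implementation) shows why: every correction has to be captured, reviewed before it influences anything, auditable forever, and reversible:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Correction:
    """One human correction to a model output, kept as an auditable record."""
    case_id: str
    model_output: str
    corrected_output: str
    reason: str
    validated: bool = False  # must be reviewed before it propagates anywhere
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class CorrectionLog:
    """Append-only: records are never deleted, only activated or deactivated,
    so every change the system 'learned' can be audited and rolled back."""

    def __init__(self) -> None:
        self._entries: list[Correction] = []

    def capture(self, c: Correction) -> None:
        self._entries.append(c)

    def validate(self, case_id: str) -> None:
        for c in self._entries:
            if c.case_id == case_id:
                c.validated = True

    def active(self) -> list[Correction]:
        # Only validated corrections feed back into prompts or retrieval.
        return [c for c in self._entries if c.validated]

    def roll_back(self, case_id: str) -> None:
        # Rollback deactivates the correction; the record survives for audit.
        for c in self._entries:
            if c.case_id == case_id:
                c.validated = False
```

That is the toy version. The production version adds conflict detection, propagation without regressions, and monitoring — the unglamorous engineering the demo never has to show.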

The Compounding Value Problem

This is where Dan's framework connects directly to the build-vs-buy decision.

He argues that the software companies proving most durable in the AI era are those whose products compound with scale. More data makes the product smarter, network effects deepen, models improve with every customer interaction. As he writes, historically the greatest value accrued to companies that "scaled networks, improved the structural economics of an industry with data, took on outsized risk, and navigated legal and regulatory complexity." That's the world we're headed back to.

This is precisely what knowledge-based systems require and precisely what's hardest to replicate from scratch. When a purpose-built vendor has refined their system across thousands of customers and millions of edge cases, the gap between their product and your internal build isn't the code. It's the accumulated intelligence the code runs on. You start at zero. They've been compounding for years.

Dan also surfaces a pointed observation from Aaron Levie on accountability. When something goes wrong with a mission-critical system, and it will, you want a vendor you can hold responsible. You cannot sue your internal IT team. You certainly cannot sue Anthropic.

And Then There's Maintenance

Even when the initial build goes well, maintenance is where most internal AI projects quietly collapse.

Your business changes. Your data changes. Your policies evolve. Integrations break. The underlying model gets updated. Edge cases accumulate. What worked at launch starts to drift, sometimes gradually, sometimes overnight.

At that point you are no longer building a feature. You are running a permanent program: monitoring, evaluation, governance, continuous improvement, and incident response. If that ownership is not explicit from day one, the project will degrade. Not dramatically. Subtly. Until the organization stops trusting it.

Dan's framework distinguishes between products that compound with scale and products that erode without sustained investment. Internal builds almost universally erode because the team moves on, the business changes around the system, and no one owns the feedback loop.

The Right Questions

The question was never whether you can build it. With today's tools you can build almost anything to proof-of-concept level. The bar has come down. That part is real.

The right questions are whether the correct answer is contained in the data or whether it depends on organizational memory and evolving context. What happens in the tail, the edge cases and high-stakes exceptions that never showed up in the demo. How you will detect confident wrong answers and audit and reverse them safely. Who owns the system a year from now when the model drifts, the workflow changes, and the original builder has moved on. And whether the system compounds with scale or erodes without sustained investment.

Claude is a remarkable reasoning engine. But a reasoning engine is not the same thing as a production system for knowledge-intensive work.

The demo is the easy part. The hard part is everything the demo does not show.

Where have you seen the tail kill an internal build?

Link to Dan's full essay: https://www.danhock.co/p/the-software-shakeout-what-is-durable?r=pm2zu&utm_medium=ios&triedRedirect=true