PredictAP Blog

Why Your AI Gives a Different Answer Every Time (And Why That Matters More Than You Think)

Written by David Stifter | Mar 12, 2026 1:14:00 PM

The Word for This: Nondeterministic

There's a technical term for what you're seeing when AI gives different answers to the same question: nondeterministic.

A nondeterministic system can produce different outputs from the same input, even when nothing about the input changed.

That's the plain way of saying: every time you ask the question, you might get a different answer. And sometimes you might get an answer that sounds completely legitimate but is made up, like referencing a policy, a rule, or a "partner integration" that doesn't actually exist.

For brainstorming or writing, that variability is a feature. It gives you options.

But if you're using AI in workflows where the output becomes truth inside your business (invoice coding, compliance checks, customer commitments, financial reporting), nondeterminism isn't a quirky behavior. It's a risk you have to design around.

When Your Industry Speaks Its Own Language, the Risk Multiplies

Before we get into examples, it helps to understand why this problem hits certain industries harder than others.

Commercial real estate is full of words that mean one thing in everyday English and something entirely different inside your accounting reality.

Take "unit." On a supplies invoice, "unit" means unit price. In a multifamily context, "unit" means an apartment. A general-purpose model may pick one interpretation with total confidence and never tell you it was choosing between two different concepts. If it guesses wrong, the invoice gets routed to the wrong property, the wrong cost category, or the wrong budget line. And it still looks perfectly reasonable on the surface.

The same problem shows up with "occupancy," "recoveries," "NNN," "base year," "gross-up," "TI allowance," and dozens more terms that carry precise financial meaning inside CRE but read as generic vocabulary to a language model.

This is the compounding factor. Nondeterminism alone is manageable in some contexts. Nondeterminism combined with domain ambiguity is where things quietly go sideways.

A Simple Example Worth Sitting With

Last Tuesday you asked an LLM to classify an invoice and it answered: "Office Supplies, 6410." On Thursday you ran the exact same invoice and got: "General Administrative, 6500."

Both answers sounded confident. Neither one hesitated. And if you hadn't checked, you never would have known they disagreed.

This isn't a weird bug. This is how large language models work by default. If you're using AI anywhere the answer needs to be consistent, auditable, and repeatable, you need to understand what's happening under the hood.

The Slot Machine Under the Hood

Tools like ChatGPT, Claude, Gemini, or Copilot aren't looking up answers in a database. They aren't applying a fixed decision tree.

They generate output by predicting the most likely next word, then the next, then the next, based on patterns learned during training.

A helpful mental model: think of a weather forecast. A meteorologist doesn't say "it will rain tomorrow." They say "there's an 80% chance of rain." Language models work similarly. When you ask one to categorize an invoice, it's calculating probabilities across many plausible completions and selecting one. Run it again and tiny differences in the sampling process can change the path, even if the invoice is identical.
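To make the slot-machine behavior concrete, here's a toy sketch in Python. The "model" is just an invented probability distribution over GL categories (the numbers are illustrative, not from any real model), and sampling from it with a temperature knob mimics how an LLM picks its next token:

```python
import random

# Toy "model": probabilities over plausible GL categories for one invoice.
# These numbers are invented for illustration only.
CATEGORY_PROBS = {
    "Office Supplies (6410)": 0.55,
    "General Administrative (6500)": 0.35,
    "Repairs & Maintenance (6700)": 0.10,
}

def sample_category(probs, temperature=1.0, rng=random):
    """Sample one category, the way an LLM samples its next token.

    Temperature < 1 sharpens the distribution (more repeatable);
    temperature > 1 flattens it (more varied). As temperature
    approaches 0, sampling approaches greedy decoding: always
    pick the single most likely option.
    """
    labels = list(probs)
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return rng.choices(labels, weights=weights, k=1)[0]

# Same "invoice", two runs: the answer can differ.
print(sample_category(CATEGORY_PROBS))
print(sample_category(CATEGORY_PROBS))

# Greedy decoding is repeatable, but "most likely" is not "correct".
greedy = max(CATEGORY_PROBS, key=CATEGORY_PROBS.get)
print(greedy)  # Office Supplies (6410), every time
```

Run the first two prints a few times and you'll occasionally see them disagree, which is the Tuesday-versus-Thursday invoice in miniature. Note that even the repeatable greedy answer is only the statistically most likely one, not a verified one.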

For creative work, that's the point. For your general ledger, it's a problem.

Where Inconsistency Quietly Breaks Things

The danger is rarely that the AI is obviously wrong. The danger is that it's plausible every time, so inconsistency stays invisible until it has already spread.

The invoice that codes itself differently every time. You process 10,000 invoices a month. You use an LLM to auto-assign GL codes. It works well, until your controller notices the same vendor is landing in three different accounts depending on the day. Department spend reports stop being trustworthy. Variance analysis becomes noise. Someone has to go back and clean months of history. Nothing "blew up." The outputs looked fine. They were just inconsistent. And finance hates inconsistency because it creates reconciliation work that doesn't need to exist.

The customer answer that contradicts itself. A client asks whether a service is covered under contract. On Monday, the AI-assisted tool says yes. On Thursday, it says no. Same question, same contract, different answer. The client screenshots both. That's not a prompt issue. That's a credibility issue.

The compliance check that can't make up its mind. You use an LLM to flag policy violations in expense reports. One borderline expense gets flagged. Another time, the same expense passes. Now your audit trail shows inconsistent enforcement, which is often worse than having no automated check at all.

The most expensive errors are the ones that look reasonable and slip through quietly.

The Confidence Problem: When AI Doesn't Know What It Doesn't Know

Inconsistency has a close cousin that may be even more dangerous: hallucination.

Raw language models don't have a reliable "uncertainty reflex." They don't consistently stop and say "I'm not sure." They generate the most statistically likely output, whether it's correct or not.

In an AP setting, this shows up in ways that sound almost silly until you see the downstream impact. The model references a policy that isn't in your handbook. It claims a vendor has a "partner integration" that doesn't exist. It confidently matches an invoice to the wrong PO because the description is unusual. And because the output sounds polished, people tend to trust it.

Here's a scenario that should keep AP leaders up at night. Your team starts using a general-purpose AI tool to match purchase orders to invoices. Most of the time it works. Then a vendor uses an unusual line item phrasing. The model confidently links the invoice to the wrong PO. The amount is close enough that nobody catches it. Three months later, vendor reconciliation uncovers misapplied payments and approvals, and now you're unwinding a chain of "reasonable-looking" decisions that were wrong from the start.

Why Better Prompts Aren't the Fix (And Why Building Your Own Guardrails Is Harder Than It Looks)

The first instinct is usually to fix this at the prompt level. Better instructions. More definitions. Lower temperature settings to reduce variability.

These help. They're worth doing. But they're dial turns on an architecture problem.

You can stuff prompts with definitions ("when I say unit, I mean apartment," "when I say recovery, I mean CAM recovery") and it works in a demo. It breaks at scale. You cannot anticipate every ambiguity across 10,000 monthly transactions, and the model won't reliably carry that context from invoice #1 to invoice #10,000.

The next instinct, especially for organizations with engineering resources, is to build guardrails in-house. Take a foundation model API (Claude, GPT, Gemini) and layer on validation, structured outputs, confidence routing, and domain-specific logic yourself.

This is a legitimate path. It's also a path that's easy to underestimate.

The validation layer sounds simple until you're maintaining rules against a chart of accounts that shifts quarterly. The structured outputs work until a vendor introduces a line item format you didn't anticipate. The domain context needs to be built from your actual vendor history, lease structures, and property hierarchies, not generic training data. And the feedback loops that make the system improve over time only work when they're connected to a high enough volume of similar transactions to surface patterns, not just your own portfolio in isolation.

None of this is impossible. But organizations that go this route consistently discover the same thing: the first 80% comes together fast and feels like magic. The last 20%, the part that makes it production-grade and audit-ready, takes five times longer than anyone budgeted for. And it never really stops requiring attention.

Prompt engineering is necessary. It's not sufficient. And building your own reliability layer is possible, but it's an ongoing engineering commitment that most finance teams shouldn't have to make.

What Production-Grade Reliability Actually Requires

Teams getting consistent, auditable outcomes aren't just "using LLMs better." They're building, or buying, systems that compensate for what raw LLMs aren't designed to guarantee: consistency, traceability, and domain correctness.

Here's what that looks like in practice.

Validation layers. Outputs get checked against constraints before anything touches a system of record. If your chart of accounts has 200 valid codes, the system rejects anything outside that set, regardless of how confident the model sounded.
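A minimal sketch of what that gate looks like (the function and the three-code chart of accounts are invented for illustration, not PredictAP's actual implementation):

```python
# A validation layer: the model's suggestion is checked against the
# chart of accounts before anything reaches the system of record.

VALID_GL_CODES = {"6410", "6500", "6700"}  # stand-in for ~200 real codes

def validate_gl_code(model_output: str) -> str:
    """Accept a model-suggested GL code only if it exists in the chart."""
    code = model_output.strip()
    if code not in VALID_GL_CODES:
        raise ValueError(f"Model suggested unknown GL code {code!r}; route to review")
    return code

print(validate_gl_code("6410"))  # passes: a real code
try:
    validate_gl_code("9999")     # hallucinated code: rejected, never posted
except ValueError as err:
    print(err)
```

The point isn't the five lines of logic; it's that the rejection happens deterministically, outside the model, no matter how confident the model sounded.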

Structured outputs. Responses get forced into a schema instead of free text. Don't ask "What GL should this be?" Instead: "Select from this controlled set of options and return the result in this exact format."
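Here's a sketch of enforcing that contract (the schema fields and controlled vocabulary are hypothetical examples): the model's response must be valid JSON with exactly these keys, and the category must come from a fixed set.

```python
import json

# A structured-output contract: reject anything that isn't valid JSON
# matching the schema, with the category drawn from a controlled set.

ALLOWED_CATEGORIES = {"Office Supplies", "General Administrative",
                      "Repairs & Maintenance"}

def parse_structured_response(raw: str) -> dict:
    """Parse the model's response and enforce the schema."""
    data = json.loads(raw)  # must be JSON, not free-text prose
    if set(data) != {"category", "gl_code", "confidence"}:
        raise ValueError("response does not match the required schema")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"category {data['category']!r} is outside the controlled set")
    if not 0.0 <= float(data["confidence"]) <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return data

good = '{"category": "Office Supplies", "gl_code": "6410", "confidence": 0.92}'
print(parse_structured_response(good)["gl_code"])  # 6410
```

Many model providers now support schema-constrained output natively; the principle is the same either way: the downstream system never has to interpret prose.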

Deterministic routing. Many cases don't require probabilistic AI at all. If you've seen the same vendor pattern 500 times, the 501st should route through deterministic logic. Save AI for genuinely ambiguous cases.
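As a sketch (the vendor table and fallback are invented stand-ins): known vendor patterns resolve through a plain lookup, and only novel invoices ever touch the probabilistic model.

```python
# Deterministic-first routing: a lookup handles vendors you've seen
# before; only genuinely novel cases fall through to the model.

KNOWN_VENDOR_CODES = {
    "Acme Office Supply": "6410",
    "Metro Elevator Co":  "6700",
}

def route_invoice(vendor: str, llm_classify) -> tuple:
    """Return (gl_code, route), where route records which path was taken."""
    if vendor in KNOWN_VENDOR_CODES:
        return KNOWN_VENDOR_CODES[vendor], "deterministic"
    return llm_classify(vendor), "model"

# Stand-in for a real model call.
fallback = lambda vendor: "6500"

print(route_invoice("Acme Office Supply", fallback))    # ('6410', 'deterministic')
print(route_invoice("Brand New Vendor LLC", fallback))  # ('6500', 'model')
```

Recording which path was taken matters for the audit trail: "this code came from a deterministic rule" and "this code came from a model" are different statements to an auditor.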

Domain-aware context. Specialized language gets resolved using vendor history, property context, lease structures, and your accounting rules, not the model's best statistical guess about what "unit" means today.

Confidence thresholds and workflow design. High confidence auto-processes. Medium confidence queues for review. Low confidence escalates. That's how you get AI speed without AI risk.
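The routing logic itself is deliberately boring. A sketch, with illustrative thresholds (not PredictAP's actual settings):

```python
# Confidence-based workflow routing: the model proposes, a fixed
# policy decides what happens next.

def dispatch(confidence: float) -> str:
    """Map a model confidence score to a workflow action."""
    if confidence >= 0.95:
        return "auto-process"
    if confidence >= 0.70:
        return "queue-for-review"
    return "escalate"

print(dispatch(0.98))  # auto-process
print(dispatch(0.80))  # queue-for-review
print(dispatch(0.40))  # escalate
```

The thresholds are business decisions, not model properties, which is exactly why they belong in deterministic code where finance can set and audit them.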

Feedback loops that compound. Corrections must improve future outcomes concretely. Not "the model will get better someday" but "this vendor's invoices will be handled correctly going forward." And those loops get more powerful when they're informed by patterns across thousands of properties and millions of transactions, not just your own data.

This is the approach we've taken at PredictAP. AP automation doesn't need "usually right." It needs the same answer every time, grounded in your chart of accounts, your business rules, and the real language of commercial real estate, with a clean audit trail behind every decision.

A Framework You Can Apply Monday Morning

Not every AI use case needs this level of rigor. The key is knowing which ones do.

Variation is fine when a human reviews the output before it matters, when the output is a draft or starting point, and when inconsistency doesn't create downstream clean-up. Think brainstorming, writing, summarizing, internal research, first-pass memos.

Variation is dangerous when the output feeds a system of record, when it affects money movement, reporting, or compliance, when different answers across runs create reconciliation work, or when the terminology is industry-specific. Think GL coding, categorization, policy enforcement, audit workflows, customer-facing commitments.

Three questions to pressure-test any AI workflow:

  1. If this AI gave a different answer tomorrow on the same input, would anyone notice?
  2. If they noticed, would it matter?
  3. Does this workflow involve words that mean something different in my industry than in everyday English?

If any of those give you pause, you don't just need a model. You need a system around it.

Looking Ahead

Models are improving fast. Each new generation gets better at reasoning, reduces hallucinations, and increases consistency.

But here's the operational reality that gets lost in the hype cycle: the gap between "great demo" and "production-grade reliability at scale" is still wide. Even a 97% accuracy rate means 3% exceptions. In accuracy-dependent work like accounting and finance, 3% is where trust breaks. It's where audit questions start. It's where someone spends a week unwinding three months of misapplied payments that looked perfectly fine on the surface.

The organizations that win with AI won't be the ones who adopt the newest model fastest. They'll be the ones who know when a general-purpose tool is enough, and when the workflow demands guardrails, domain context, and engineered reliability that took years and millions of transactions to build.

AI isn't magic. It's leverage. And the difference between leverage that builds value and leverage that creates expensive clean-up projects comes down to one thing: whether the system around the model was designed for your reality, or whether you're hoping the model figures it out on its own.