Extraction confidence isn't accuracy: how to actually use the…

Every document AI tool returns a confidence score. Most teams treat it as a stand-in for accuracy. It isn't. The two numbers measure different things, and using one as a proxy for the other is the most common mistake we see in onboarding calls.

This post explains what the extraction confidence score actually represents in Docusift, how it differs from accuracy, and how to pick a threshold without overthinking it.

What the number is

Every document we extract comes back with two confidence scores:

- classification_confidence: how sure the model is that this document is the type we said it is (invoice, receipt, bill_of_lading, etc.).
- extraction_confidence: how sure the model is about the field values it returned, in aggregate.

Both are floats in [0, 1]. Both are the model's own self-rated probability, not a statistical accuracy guarantee.

That distinction matters. The model is saying "given what I see on this page, here's how often I'd expect to be right about my own answer." It is not saying "this is the historical accuracy of extractions on documents like this one." Those are different quantities, and only the second one is what you usually mean when you say "accuracy."

Confidence ≠ accuracy

Imagine 1000 invoices that all come back with extraction_confidence: 0.95. If we manually audit them, the actual accuracy could be 95% — or 88%, or 99%. The model's confidence and the empirical accuracy correlate, but they're not the same thing. A few reasons why:

1. Calibration drift. Models are trained on a data distribution. Your customer's documents are a sample from a different distribution. The numbers stay roughly meaningful but the calibration shifts.
2. Field-weight asymmetry. A 95%-confident extraction that gets the vendor name right but misses one of twelve line items still looks "95% confident." Whether that counts as accurate depends entirely on whether you care about line items.
3. The hard cases lie. When a document is genuinely ambiguous (handwritten in a field nobody's seen before, say), the model is sometimes overconfident and sometimes underconfident. The score is most reliable in the middle of the distribution and least reliable at the tails, which is the opposite of what you want.
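
One way to see the gap on your own documents is to bucket audited extractions by reported confidence and compare each bucket's mean confidence with its audited accuracy. The sketch below assumes you already have (confidence, was_correct) pairs from a manual audit. The function name, the bucket edges, and the sample data are illustrative and not part of Docusift.

```python
from collections import defaultdict

def calibration_table(audited, edges=(0.0, 0.8, 0.9, 0.95, 1.0)):
    """Group audited extractions into confidence buckets.

    audited: iterable of (confidence, was_correct) pairs from a manual audit.
    Returns a list of (low, high, count, mean_confidence, accuracy) rows.
    """
    buckets = defaultdict(list)
    last_index = len(edges) - 2
    for confidence, was_correct in audited:
        # Buckets are [low, high); the last bucket also includes its upper edge.
        for i, (low, high) in enumerate(zip(edges, edges[1:])):
            in_bucket = low <= confidence < high
            at_top_edge = i == last_index and confidence == high
            if in_bucket or at_top_edge:
                buckets[(low, high)].append((confidence, was_correct))
                break

    rows = []
    for (low, high), items in sorted(buckets.items()):
        count = len(items)
        mean_conf = sum(c for c, _ in items) / count
        accuracy = sum(1 for _, ok in items if ok) / count
        rows.append((low, high, count, mean_conf, accuracy))
    return rows

# Made-up audit: 20 documents all reported at 0.95, of which 17 were correct.
sample = [(0.95, True)] * 17 + [(0.95, False)] * 3
for low, high, count, mean_conf, accuracy in calibration_table(sample):
    print(f"{low:.2f}-{high:.2f}  n={count}  "
          f"confidence={mean_conf:.2f}  accuracy={accuracy:.2f}")
```

A bucket whose accuracy sits well below its mean confidence is where the model is overconfident on your documents. With only a handful of audited documents per bucket the comparison is noisy, so treat small buckets with caution.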

How to actually use it

The pragmatic framing is: confidence is a routing signal, not a quality KPI. Use it to decide where a document goes, not to report how well the system is performing.

In Docusift, every workspace sets an auto-approve threshold (default 0.92). Documents above the bar auto-approve and sync to your accounting tool. Documents below the bar land in the Review queue, where a human eyeballs the PDF + editable fields side by side and signs off.
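
The routing rule itself is small. Here is a minimal sketch of it, not Docusift's actual implementation: the Extraction class and route function are hypothetical names, and it assumes the threshold is compared against extraction_confidence and that a score exactly at the threshold auto-approves.

```python
from dataclasses import dataclass

AUTO_APPROVE_THRESHOLD = 0.92  # the default workspace threshold

@dataclass
class Extraction:
    # Hypothetical container for the two scores described earlier.
    document_id: str
    classification_confidence: float
    extraction_confidence: float

def route(extraction, threshold=AUTO_APPROVE_THRESHOLD):
    """Return "auto_approve" or "review" for one extraction.

    Assumption: the comparison uses extraction_confidence, and a score
    exactly equal to the threshold auto-approves.
    """
    if extraction.extraction_confidence >= threshold:
        return "auto_approve"
    return "review"

docs = [
    Extraction("inv-001", 0.99, 0.97),
    Extraction("inv-002", 0.98, 0.81),
]
for doc in docs:
    print(doc.document_id, route(doc))
```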

The threshold is the lever you tune. Three guidelines from what we see across customers:

Start at 0.92. It catches most edge cases for the average workspace without burying ops in review work. Watch the Review queue volume for a week.

If the Review queue is empty, raise the bar a few points. You're under-routing: the model is confident about edge cases that actually need a look.

If the Review queue is overflowing, look at the failures _first_, not the threshold. Often the docs landing in Review have a real systematic issue (poor scans, a malformed vendor template) that the model is correctly flagging. Fix the input, don't lower the bar.

The threshold is a knob, not a strategy.
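
Before moving the knob, it can help to replay a week of scores against a few candidate thresholds and see how much of the volume would land in review at each. This is a rough sketch on made-up scores, and review_share is a hypothetical helper rather than a Docusift function.

```python
def review_share(confidences, threshold):
    """Fraction of documents that would land in review at this threshold."""
    if not confidences:
        return 0.0
    below = sum(1 for c in confidences if c < threshold)
    return below / len(confidences)

# Made-up extraction_confidence scores for one week of documents.
week = [0.99, 0.97, 0.96, 0.94, 0.93, 0.91, 0.90, 0.87, 0.80, 0.62]

for threshold in (0.88, 0.92, 0.95):
    share = review_share(week, threshold)
    print(f"threshold={threshold:.2f}  review share={share:.0%}")
```

The replay only tells you how volume moves. It says nothing about whether the documents changing sides are the ones that needed a look, which is why reading the failures comes first.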

What we report instead of "average confidence"

Internally we track three numbers as the actual quality KPIs:

- Auto-approve rate. Of all extractions in a window, what fraction crossed the threshold? Drift in this number tells you whether incoming document quality is changing.
- Review queue clear rate. Of documents that landed in review, what fraction were edited (vs. approved as-is)? High edit rate means the model's "I'm not sure" was correct: the queue is doing its job. Low edit rate means the threshold is too conservative.
- Correction frequency by field. Per-field counts of how often a reviewer overrode a value. This is the real quality signal: it tells you which fields the model gets wrong, on which document types, with what frequency.
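
All three numbers fall out of a simple per-document log. The sketch below assumes each record notes whether the document auto-approved and which fields a reviewer overrode. The record keys and the quality_kpis function are hypothetical, not a Docusift export format.

```python
from collections import Counter

def quality_kpis(records):
    """Compute the three numbers from a list of per-document records.

    Each record is a dict with hypothetical keys:
      "auto_approved": bool
      "corrected_fields": list of field names a reviewer overrode
                          (empty when the document was approved as-is)
    Returns (auto_approve_rate, review_edit_rate, corrections_by_field).
    """
    total = len(records)
    reviewed = [r for r in records if not r["auto_approved"]]
    edited = [r for r in reviewed if r["corrected_fields"]]

    auto_approve_rate = (total - len(reviewed)) / total if total else 0.0
    edit_rate = len(edited) / len(reviewed) if reviewed else 0.0
    corrections_by_field = Counter(
        field for r in reviewed for field in r["corrected_fields"]
    )
    return auto_approve_rate, edit_rate, corrections_by_field

# Made-up window of five documents.
records = [
    {"auto_approved": True, "corrected_fields": []},
    {"auto_approved": True, "corrected_fields": []},
    {"auto_approved": False, "corrected_fields": ["total", "due_date"]},
    {"auto_approved": False, "corrected_fields": []},
    {"auto_approved": False, "corrected_fields": ["total"]},
]
rate, edits, by_field = quality_kpis(records)
print(f"auto-approve rate: {rate:.0%}")
print(f"review edit rate: {edits:.0%}")
print("corrections by field:", by_field.most_common())
```

To break corrections down by document type as well, the Counter could be keyed on a (doc_type, field) pair instead of the field name alone.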

Average confidence makes a fine sparkline. It does not make a quality dashboard.

What the threshold should never do

A few anti-patterns we've watched customers walk into and walk back out of:

Don't push the threshold down until nearly everything auto-approves just to chase a "no human in the loop" headline. The low-confidence documents you stop reviewing are exactly the ones that need review most. The threshold's job is to find the band where human review pays for itself, not to eliminate review.

Don't tune per doc type without reason. Setting the invoice threshold to 0.95 and the receipt threshold to 0.85 because "receipts are simpler" usually backfires: receipt OCR has more visual noise, not less.

Don't compare your confidence numbers to a competitor's. Different vendors' models are calibrated differently. A 0.92 from one tool is not a 0.92 from another. Compare empirical correction rates on the same document set instead.
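
A like-for-like comparison can be as simple as scoring each tool's output against the same hand-labeled documents. The sketch below is one way to do that under stated assumptions: the field_error_rate function, the exact-match rule, and the sample data are all illustrative, and real comparisons usually need value normalization (dates, currency formats) before matching.

```python
def field_error_rate(extracted, ground_truth):
    """Share of ground-truth fields an extractor got wrong or left out.

    Both arguments map document id -> {field name: value}.
    A field counts as wrong unless the extracted value matches exactly.
    """
    wrong = 0
    total = 0
    for doc_id, truth in ground_truth.items():
        predicted = extracted.get(doc_id, {})
        for field, expected in truth.items():
            total += 1
            if predicted.get(field) != expected:
                wrong += 1
    return wrong / total if total else 0.0

# Made-up hand-labeled values and two tools' outputs for the same documents.
ground_truth = {
    "doc-1": {"vendor": "Acme", "total": "120.00"},
    "doc-2": {"vendor": "Globex", "total": "75.50"},
}
tool_a = {
    "doc-1": {"vendor": "Acme", "total": "120.00"},
    "doc-2": {"vendor": "Globex", "total": "75.00"},
}
tool_b = {
    "doc-1": {"vendor": "Acme Inc", "total": "120.00"},
    "doc-2": {"vendor": "Globex"},
}
print("tool A error rate:", field_error_rate(tool_a, ground_truth))
print("tool B error rate:", field_error_rate(tool_b, ground_truth))
```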

The takeaway

Confidence scores are operational signals, not quality measures. They belong behind a routing decision, not on a status badge. Set a threshold, watch the auto-approve rate and the correction rate, and let the number do its actual job: sending the documents that need eyeballs to people, and the rest to your accounting tool.

If you'd like to see what your real document distribution looks like under our extractor, the free extraction audit returns the JSON, the confidence scores, and the threshold curve for your sample. No setup, no contract.