Structured outputs, vision models, and the boring engineering…

The headline pitch of any document AI tool sounds the same these days: a multimodal model reads the page, returns JSON, done. That is the 20% that demos well. The 80% that decides whether the system survives in production is plumbing — the kind of engineering nobody puts in a sales deck.

This post is about that 80%, drawn from what we have actually built into Docusift's extraction pipeline.

The model is a black box that lies sometimes

A vision-capable model is genuinely good at reading documents. It is not a deterministic compiler. The same PDF, sent twice, can return slightly different JSON — different float precision, different ordering of array elements, occasionally a hallucinated field. The first job of the pipeline is to wrap that nondeterminism in something stable.

Three layers do this work:

1. Strict schema enforcement. Every doc type has a JSON schema. The model's response is parsed against it. Extra fields are stripped. Missing required fields trigger a retry, not a partial response. The output your downstream pipeline sees is the same shape every time, even when the model wobbles.

2. Provider abstraction behind a single interface. A document AI service that ties itself to one model provider is one outage away from a bad afternoon. We route extractions through a stable internal interface, log every call's prompt, response, latency, token counts, and model identifier, and can fail over or rebalance without touching application code.

3. Confidence as a first-class output. The model's own self-reported confidence is captured alongside the extraction. We do not overwrite or "smooth" it — downstream policy decides what to do with it (auto-approve, send to Review, etc.).
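The first layer is the easiest to picture in code. What follows is a minimal sketch, not Docusift's actual validator: the `FieldSpec` type, the `SchemaViolation` exception, and the invoice field names are all made up for illustration, and a real system would more likely use a full JSON Schema validator.

```python
from dataclasses import dataclass


class SchemaViolation(Exception):
    """The model response cannot be made to fit the schema."""


@dataclass(frozen=True)
class FieldSpec:
    type_: type
    required: bool = True


# Hypothetical invoice schema; the field names are illustrative only.
INVOICE_SCHEMA = {
    "invoice_number": FieldSpec(str),
    "total": FieldSpec(float),
    "currency": FieldSpec(str, required=False),
}


def enforce_schema(raw: dict, schema: dict) -> dict:
    """Return only the schema's fields, in schema order, or raise."""
    clean = {}
    for name, spec in schema.items():
        value = raw.get(name)
        if value is None:
            if spec.required:
                raise SchemaViolation(f"missing required field: {name}")
            continue
        # JSON has one number type, so accept ints where a float is expected.
        if spec.type_ is float and type(value) is int:
            value = float(value)
        if not isinstance(value, spec.type_):
            raise SchemaViolation(f"wrong type for field: {name}")
        clean[name] = value
    return clean  # extra keys in `raw` were never copied over
```

Because the output is built by walking the schema rather than the response, a hallucinated key cannot survive and key order does not depend on the model. Raising on a missing required field lets the caller re-ask the model instead of passing a partial result downstream.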

Retry semantics that do not compound failures

Naive retry logic is "if the call fails, try again." In production this is wrong in two directions:

- Retrying a client error such as a 400 or 422 is throwing money away: the request was malformed, and sending it again does not help.
- Retrying a 429 immediately is also wrong: the rate limit is asking for backoff, not signaling a transient fault.

The right retry policy looks something like:

- 5xx and network errors: retry up to 3 times with exponential backoff and jitter.
- 429 rate limit: respect the Retry-After header. If absent, fall back to exponential backoff with a healthy floor.
- 400/422 schema errors: no retry. Capture the validation error, surface it on the document, route to Review.
- Timeouts: retry once. A second timeout is a hard failure; do not pile up zombie requests.

Every retry attempt is recorded with its latency and outcome. Documents that never succeed land in Review with the failure reason attached, not silently dropped.
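The policy above fits in one small decision function. This is a sketch under stated assumptions: the function names, the delay constants, and the cap on 429 retries are ours for illustration, and `retry_after` is assumed to be the Retry-After header already parsed into seconds. Only the 3-retry limit for 5xx and the single retry for timeouts come from the policy itself.

```python
import random
from typing import Optional

MAX_SERVER_RETRIES = 3       # from the policy above
MAX_TIMEOUT_RETRIES = 1      # from the policy above
MAX_RATE_LIMIT_RETRIES = 5   # assumed cap so a 429 loop cannot run forever
BASE_DELAY_S = 0.5           # assumed
MAX_DELAY_S = 30.0           # assumed
RATE_LIMIT_FLOOR_S = 2.0     # assumed "healthy floor"


def backoff_delay(retries_so_far: int, floor: float = 0.0) -> float:
    """Exponential backoff with full jitter, never below `floor`."""
    ceiling = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** retries_so_far)
    return max(floor, random.uniform(0.0, ceiling))


def next_delay(
    status: Optional[int],
    retries_so_far: int,
    timed_out: bool = False,
    retry_after: Optional[float] = None,
) -> Optional[float]:
    """Seconds to wait before the next attempt, or None to give up.

    `status` is the HTTP status, or None for a network error or timeout.
    """
    if timed_out:
        return 0.0 if retries_so_far < MAX_TIMEOUT_RETRIES else None
    if status == 429:
        if retries_so_far >= MAX_RATE_LIMIT_RETRIES:
            return None
        if retry_after is not None:
            return retry_after
        return backoff_delay(retries_so_far, floor=RATE_LIMIT_FLOOR_S)
    if status is None or 500 <= status < 600:
        if retries_so_far < MAX_SERVER_RETRIES:
            return backoff_delay(retries_so_far)
        return None
    return None  # 400, 422 and other client errors are never retried
```

Keeping the decision separate from the HTTP call makes the policy easy to unit test, and gives one place to record each attempt's latency and outcome.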

Idempotency or you regret it later

Webhook delivery, retry queues, and replays all converge on the same need: idempotency. We handle it at three levels:

Document level. Every document is identified by a content hash of the file bytes. A second upload of the same bytes returns the existing document instead of creating a duplicate. This is what protects you from a customer's well-meaning "let me re-upload that, the first one did not seem to go through" pattern.

Webhook level. Outgoing webhook payloads include the document id and an event type. Receivers should dedupe on (id, event). We retry up to 3 times on non-2xx responses, but the receiver should treat retries as duplicate notifications, not fresh events.

Job level. The internal job queue uses idempotency keys, so a job can be enqueued twice and run once. This matters when an upstream service retries an upload and the same document gets two extraction jobs queued — the second one no-ops.
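Two of these levels are small enough to sketch. The code below is illustrative only: the in-memory classes stand in for real tables, SHA-256 is an assumed choice of content hash, and the payload keys `id` and `event` are assumed names for the document id and event type.

```python
import hashlib


class DocumentStore:
    """In-memory stand-in for a document table keyed by content hash."""

    def __init__(self):
        self._id_by_hash = {}

    def upload(self, file_bytes: bytes):
        """Return (document_id, created). Identical bytes reuse the same id."""
        digest = hashlib.sha256(file_bytes).hexdigest()
        existing = self._id_by_hash.get(digest)
        if existing is not None:
            return existing, False
        doc_id = f"doc_{len(self._id_by_hash) + 1}"
        self._id_by_hash[digest] = doc_id
        return doc_id, True


class WebhookReceiver:
    """Receiver-side dedupe on (document id, event type)."""

    def __init__(self):
        self._seen = set()
        self.handled = []

    def receive(self, payload: dict) -> bool:
        """Handle the event once; return False for a duplicate delivery."""
        key = (payload["id"], payload["event"])
        if key in self._seen:
            return False
        self._seen.add(key)
        self.handled.append(key)
        return True
```

In a real deployment the check-then-insert would need to be atomic, for example through a unique constraint on the hash column, so that two concurrent uploads of the same bytes cannot both create a document.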

Cost attribution is a feature, not an accounting concern

Every model call hits a budget. With per-workspace pricing, you have to know which workspace ate which call. For each AI call we record:

- the workspace
- the model and pricing tier used
- prompt and completion token counts
- the resulting cost in cents
- latency
- a request identifier for tracing

The admin analytics view rolls this up to per-workspace cost. Without it, a runaway workspace on a free tier can quietly burn through your model budget for weeks before anyone notices.
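As a rough sketch of the record and the rollup: the model names and prices below are invented, the pricing table is keyed on model alone for brevity (a real one would also key on pricing tier), and `Decimal` is used so that fractions of a cent do not drift.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical prices in cents per million tokens; not real provider rates.
PRICING = {
    "small-vision": {"prompt": Decimal("15"), "completion": Decimal("60")},
    "large-vision": {"prompt": Decimal("300"), "completion": Decimal("1500")},
}


@dataclass(frozen=True)
class AICallRecord:
    workspace_id: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: int
    request_id: str

    @property
    def cost_cents(self) -> Decimal:
        """Cost of this call in cents, derived from token counts."""
        price = PRICING[self.model]
        total = (
            self.prompt_tokens * price["prompt"]
            + self.completion_tokens * price["completion"]
        )
        return total / Decimal(1_000_000)


def cost_by_workspace(records) -> dict:
    """Sum call costs per workspace, as an admin view might."""
    totals = defaultdict(Decimal)
    for record in records:
        totals[record.workspace_id] += record.cost_cents
    return dict(totals)
```

With one row per call, spotting a runaway workspace becomes a query over this rollup rather than a forensic exercise on the provider invoice.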

Classify before you extract

The pipeline order matters more than people think. We classify before we extract, for two reasons:

1. Different doc types have different schemas. Trying to extract invoice fields from a bill of lading wastes a model call and produces garbage.

2. The classification call is cheap. A small vision pass over the page returns the doc type with high confidence in milliseconds. The expensive call (full structured extraction) only runs when we know what we are extracting.

Both calls go through workspace-scoped storage first. Documents are uploaded once, and both classification and extraction read from the workspace's storage backend. Models never receive a URL we do not control.
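The ordering can be sketched as a two-step function. Everything named here is hypothetical: the schema registry, the `classify` and `extract` callables, and the choice to route unrecognized doc types to Review without spending an extraction call are assumptions made for the example.

```python
from typing import Callable

# Hypothetical registry mapping doc type to schema name.
SCHEMAS = {"invoice": "invoice_v1", "bill_of_lading": "bol_v1"}


def process_document(
    storage_key: str,
    classify: Callable[[str], str],
    extract: Callable[[str, str], dict],
) -> dict:
    """Classify first, then extract only when a schema exists for the type.

    Both callables receive a key into workspace storage, never an
    external URL.
    """
    doc_type = classify(storage_key)  # cheap call
    schema_name = SCHEMAS.get(doc_type)
    if schema_name is None:
        # Assumed behavior: no schema means no extraction call at all.
        return {"status": "review", "doc_type": doc_type, "fields": None}
    fields = extract(storage_key, schema_name)  # expensive call
    return {"status": "extracted", "doc_type": doc_type, "fields": fields}
```

Passing the two model calls in as parameters keeps the ordering logic testable without a provider, and makes it obvious that the expensive call sits behind the cheap one.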

Webhook signing is not optional

When Docusift POSTs a webhook to a customer, we sign the body with HMAC-SHA256 keyed off a per-workspace secret. Receivers verify the signature in constant time before trusting the payload. A naive equality comparison returns at the first mismatching byte, so its running time reveals how many leading bytes of a guess are correct, and over enough requests an attacker can recover a valid signature byte by byte. Constant-time comparison closes that door.

The mechanics are documented for integrators in the API reference. The point is that signature verification is not a nice-to-have: it is the difference between accepting events you can attribute to Docusift and accepting events from any unauthenticated source on the public internet.
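For a sense of what receiver-side verification involves, here is a minimal sketch using Python's standard library. The function names and the hex encoding of the signature are assumptions for illustration; the API reference is the authority on the actual header name and encoding.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 of the raw request body, hex encoded (encoding assumed)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_body(secret, body)
    # Compare as bytes: compare_digest rejects non-ASCII str arguments.
    return hmac.compare_digest(expected.encode(), received.encode())
```

Two details matter in practice: verify against the raw bytes of the request body rather than a re-serialized JSON object, and use `hmac.compare_digest` rather than `==`.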

The unglamorous part is the moat

Sales decks talk about model accuracy. The real reason customers stick around is the boring stuff — that retries do not compound, that idempotency keeps duplicates out of their accounting tool, that cost attribution lets ops catch a runaway workspace before the bill arrives, that webhook signatures do not leak timing.

This is the part of document AI that is mostly engineering, not ML. It is also the part most early competitors skip — and the part you only get to skip once.