Number7AI — Docs

Extraction failure modes taxonomy

The critical question is not whether an intelligent document processing (IDP) system ever fails; every system does on some documents. The question is whether failures are silent (wrong output posted as correct) or visible (flagged, routed, corrected). This taxonomy maps the major classes.

Last updated: April 2026

TL;DR

  • Accounts payable (AP) extraction failures split into four classes: structural, contextual/numeric, identity, and boundary.
  • The highest-risk class is silent wrong output — numbers in the right position but wrong semantic field.
  • Vanity accuracy metrics (e.g. "99% accuracy") mask silent error risk. ERP-safe pass rate is the real number.
  • Good systems route every failure to auto-correction, a human review queue, or explicit rejection, and never post it silently.
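
To illustrate why a headline field accuracy can hide document-level risk, here is a minimal sketch. It assumes that "ERP-safe pass rate" means the share of documents on which every posted field is correct (this page does not define the term), that field errors are independent, and that a document carries 20 fields. All three are illustrative assumptions, and `document_pass_rate` is a hypothetical helper.

```python
def document_pass_rate(field_accuracy: float, fields_per_document: int) -> float:
    """Probability that every field on a document is correct.

    Assumes field errors are independent, which is a simplification.
    """
    return field_accuracy ** fields_per_document


# "99% accuracy" per field, 20 fields per invoice (hypothetical figures)
rate = document_pass_rate(0.99, 20)
print(f"{rate:.1%}")  # 81.8%
```

Under these assumptions, roughly one document in five contains at least one wrong field even though the per-field metric reads 99%.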

Why silent failures matter more than obvious ones

When a system clearly fails — blank output, parse error, obvious garbage — a reviewer catches it immediately. The dangerous failures are confident wrong answers: unit price and quantity transposed, tax line dropped, vendor identity mismatched to a ghost record. These pass through automated queues and surface only at period close, audit, or when a vendor chases payment.

Designing for failure visibility is more important than headline accuracy. A system that flags its own uncertainty is safer than one that is confidently wrong.

Structural failures

  • Multi-row line fragmentation

    A single product line spans 2–3 rows; the parser treats each row as a separate item, producing phantom duplicates and wrong totals.

  • Multi-page table continuity

    Table headers appear on page 1 only; continuation rows on page 2 lose column context and are mapped incorrectly or dropped.

  • Nested table flattening

    Sub-tables (e.g. GST breakdown inside a line cell) are flattened into the wrong parent column, corrupting both the line item and the tax field.
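
The multi-row fragmentation case above can be sketched as a post-processing step. This is one possible heuristic, not a description of any particular product: it assumes each parsed row exposes a description plus optional quantity and amount, and that a row carrying neither number is a continuation of the item above it. The `Row` type and `merge_continuation_rows` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    description: str
    quantity: Optional[float] = None
    amount: Optional[float] = None


def merge_continuation_rows(rows: list[Row]) -> list[Row]:
    """Fold rows with no quantity and no amount into the preceding line item."""
    merged: list[Row] = []
    for row in rows:
        is_continuation = row.quantity is None and row.amount is None
        if is_continuation and merged:
            prev = merged[-1]
            prev.description = f"{prev.description} {row.description}".strip()
        else:
            # Copy so the caller's rows are not mutated.
            merged.append(Row(row.description, row.quantity, row.amount))
    return merged


rows = [
    Row("Industrial bearing, sealed,", 10, 4200.0),
    Row("stainless steel housing"),  # wrapped description, no numbers
    Row("Mounting bracket", 5, 750.0),
]
items = merge_continuation_rows(rows)
print(len(items))  # 2
```

A continuation row with nothing before it (for example, the first row on page 2) is kept as its own row here; in practice that case is better flagged for review than merged by guesswork.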

Contextual / numeric failures

  • Indian lakh notation misparse

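    1,00,000 (one lakh, i.e. 100,000 in Western grouping) uses two-digit groups after the last three digits. A parser that assumes a different locale can misread the separators and produce the wrong magnitude. Silent wrong amount reaches ERP.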

  • Format artifact in price fields

    Scanned rupee symbols and suffixes produce artifacts such as 4200/-, or values like 00:80 where OCR misreads the decimal point. The result passes as a valid number.

  • Tax rate without base amount

    Invoice shows CGST 9% but not the taxable base. System either infers the base incorrectly or silently drops the tax line.

  • Mixed date formats

    01/02/26 could be DD/MM/YY or MM/DD/YY depending on vendor origin. Wrong parse leads to wrong posting period.
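
The amount and date cases above can be sketched as strict validators that return "unknown" instead of guessing. This is a minimal illustration under stated assumptions: amounts use either Indian or Western comma grouping with at most two decimals, two-digit years fall in the 2000s, and the names `parse_amount` and `date_candidates` are hypothetical.

```python
import re
from datetime import date
from decimal import Decimal
from typing import Optional

_INDIAN = re.compile(r"\d{1,2}(,\d{2})*,\d{3}(\.\d{1,2})?")   # 1,00,000
_WESTERN = re.compile(r"\d{1,3}(,\d{3})*(\.\d{1,2})?")        # 100,000
_PLAIN = re.compile(r"\d+(\.\d{1,2})?")                       # 100000


def parse_amount(raw: str) -> Optional[Decimal]:
    """Return the amount, or None if the text is not a recognizable number."""
    text = raw.strip()
    text = re.sub(r"^(₹|Rs\.?|INR)\s*", "", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*/-$", "", text)  # trailing "/-" suffix
    if any(p.fullmatch(text) for p in (_INDIAN, _WESTERN, _PLAIN)):
        return Decimal(text.replace(",", ""))
    return None  # e.g. "00:80": route to review rather than post


def date_candidates(raw: str) -> list[date]:
    """Return every valid reading of an NN/NN/YY date (DD/MM and MM/DD)."""
    a, b, yy = (int(part) for part in raw.split("/"))
    year = 2000 + yy  # assumption: two-digit years are 20YY
    candidates = set()
    for day, month in ((a, b), (b, a)):
        try:
            candidates.add(date(year, month, day))
        except ValueError:
            pass  # not a real calendar date under this reading
    return sorted(candidates)


print(parse_amount("1,00,000"))          # 100000
print(parse_amount("₹ 4200/-"))          # 4200
print(parse_amount("00:80"))             # None
print(len(date_candidates("01/02/26")))  # 2
```

Two candidates for a date mean the document alone cannot settle the posting period, so the field should be flagged or resolved from vendor context rather than parsed with a default locale.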

Identity failures

  • Trade name vs legal entity

    Vendor invoice says 'ABC Traders' but the accounting master has 'ABC Traders Pvt Ltd'. Unmatched record blocks posting or creates a ghost vendor.

  • GSTIN format variations

    The GSTIN on the invoice contains spaces, dashes, or lowercase letters, so it fails to match the master record because of formatting, not because of an actual mismatch.

  • Remit-to vs bill-from divergence

    Multi-entity vendors invoice from entity A but payment should go to entity B. No structural signal in the document.
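
The first two identity cases above lend themselves to normalization before matching. The sketch below is illustrative only: the GSTIN check covers the 15-character shape (two-digit state code, ten-character PAN, three trailing alphanumerics) and does not verify the check character, the suffix list is a small assumed sample, and the GSTIN value shown is a sample string. `normalize_gstin`, `looks_like_gstin`, and `name_key` are hypothetical helpers.

```python
import re

# Shape only: 2 digits, 5 letters, 4 digits, 1 letter, 3 alphanumerics.
_GSTIN_SHAPE = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][0-9A-Z]{3}")

# Assumed sample of legal-form suffixes; a real list would be longer.
_LEGAL_SUFFIXES = re.compile(
    r"\b(private limited|pvt\.? ltd\.?|limited|ltd\.?|llp)\s*$",
    re.IGNORECASE,
)


def normalize_gstin(raw: str) -> str:
    """Drop spaces and dashes, then uppercase."""
    return re.sub(r"[\s\-]", "", raw).upper()


def looks_like_gstin(value: str) -> bool:
    """Structural check only; the check character is not validated."""
    return _GSTIN_SHAPE.fullmatch(value) is not None


def name_key(name: str) -> str:
    """Comparison key: legal suffix removed, lowercase, punctuation collapsed."""
    stripped = _LEGAL_SUFFIXES.sub("", name.strip())
    return re.sub(r"[^a-z0-9]+", " ", stripped.lower()).strip()


gstin = normalize_gstin("27 aapfu-0939f1zv")
print(gstin, looks_like_gstin(gstin))  # 27AAPFU0939F1ZV True
print(name_key("ABC Traders") == name_key("ABC Traders Pvt Ltd"))  # True
```

A name-key match is best treated as a candidate to confirm (ideally alongside a matching GSTIN), not as proof that two records are the same legal entity. The remit-to case has no document-level signal, so it cannot be handled by normalization like this.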

Boundary failures (bulk PDFs)

  • Cover page treated as invoice

    Email print-to-PDF includes a cover page before the invoice. System treats it as the first page of a document, shifting all fields.

  • Same-vendor invoices merged

    Two invoices from the same vendor land on adjacent pages. Weak boundary detection merges them into one record with combined totals.

  • Multi-invoice PDFs out of order

    40-page PDF with 12 invoices in arbitrary order. Boundary detection must identify start/end without relying on consistent headers.
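
One way to avoid relying on page adjacency or vendor headers is to group pages by a per-page invoice number. The sketch below assumes an upstream step has already extracted an invoice number for each page (or `None` when it found none); that step, the invoice numbers, and the `group_pages` name are all hypothetical.

```python
from typing import Optional


def group_pages(
    page_invoice_numbers: list[Optional[str]],
) -> tuple[dict[str, list[int]], list[int]]:
    """Group 0-based page indexes by invoice number.

    Pages with no invoice number are returned separately instead of being
    attached to a neighbour by guesswork.
    """
    groups: dict[str, list[int]] = {}
    unassigned: list[int] = []
    for index, number in enumerate(page_invoice_numbers):
        if number is None:
            unassigned.append(index)
        else:
            groups.setdefault(number, []).append(index)
    return groups, unassigned


# Cover page, then two same-vendor invoices with pages out of order.
pages = [None, "INV-101", "INV-102", "INV-101"]
groups, unassigned = group_pages(pages)
print(groups)      # {'INV-101': [1, 3], 'INV-102': [2]}
print(unassigned)  # [0]
```

Keeping unlabeled pages in a separate list makes the cover-page case visible: the page is reviewed or discarded explicitly rather than shifting the fields of the first invoice.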

Observed residual failure rates

Rates from production AP workflows on Indian documents. "Residual" means after IDP processing — these are the failures that reach the exception queue or (worst case) posting.

Failure class                 Observed residual range    Primary trigger
Multi-row table continuity    ~1.5–2%                    Long descriptions, uneven scan quality
Nested table semantics        ~3–4%                      3+ level nested structures
Locale / format numeric       <0.5% → ~5%                Missing GSTIN or legacy format artifacts
Tax amount inference          ~8%                        Rate present but base amount unclear
Vendor identity mismatch      ~2–5%                      Trade name vs legal entity, GSTIN formatting
PDF boundary confusion        ~1–3%                      Cover pages, same-vendor adjacency

How AIdaptIQ routes exceptions

Every document outcome falls into one of three paths — nothing is silently posted if a failure is detected:

  • Auto-corrected

    Known patterns (lakh notation, rupee symbols, GSTIN formatting) normalized and logged in the audit trail with correction reason.

  • Flagged for review

    Low-confidence fields highlighted in the review UI with context. Human corrects and approves before any ERP push.

  • Rejected with context

    Document returned with a specific error explanation — not a generic failure. Reviewer knows exactly what to fix on resubmit.
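
The three paths above can be sketched as a small routing function. This is not AIdaptIQ's implementation: the `FieldResult` shape, the 0.9 review threshold, and the precedence (reject, then review, then auto-correct) are assumptions made for illustration. The fourth enum value only marks documents with no detected failure, which fall outside the three exception paths.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_CORRECTED = "auto_corrected"
    REVIEW = "flagged_for_review"
    REJECTED = "rejected_with_context"
    NO_EXCEPTION = "no_exception"


@dataclass
class FieldResult:
    name: str
    confidence: float        # 0.0 to 1.0, hypothetical extractor score
    known_fix: bool = False  # a known pattern was normalized
    readable: bool = True    # False if nothing usable was extracted


def route(fields: list[FieldResult], review_threshold: float = 0.9) -> tuple[Route, str]:
    """Pick one path per document and return it with a specific reason."""
    unreadable = [f.name for f in fields if not f.readable]
    if unreadable:
        return Route.REJECTED, "unreadable fields: " + ", ".join(unreadable)
    low = [f.name for f in fields if f.confidence < review_threshold]
    if low:
        return Route.REVIEW, "low confidence: " + ", ".join(low)
    fixed = [f.name for f in fields if f.known_fix]
    if fixed:
        return Route.AUTO_CORRECTED, "normalized: " + ", ".join(fixed)
    return Route.NO_EXCEPTION, ""


fields = [FieldResult("total", 0.98, known_fix=True), FieldResult("gstin", 0.72)]
print(route(fields))  # (<Route.REVIEW: 'flagged_for_review'>, 'low confidence: gstin')
```

Returning the reason string alongside the route is what lets a rejection or review item say exactly which field needs attention instead of reporting a generic failure.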