Number7AI — Docs

The 50,000 invoice problem

Before building AIdaptIQ we spent a year trying to make existing intelligent document processing (IDP) products work on real production data. This is the story of what broke, why it broke, and what we had to build instead.

Last updated: April 2026

TL;DR

  • 50,000 invoices, 5,000+ layout variants — template-first approaches became maintenance debt, not product.
  • Every vendor we tested failed on some class of real documents: Indian GST tables, mixed-language lines, bulk PDFs.
  • The gap was not OCR accuracy — it was semantic structure, business validation, and format anomaly handling.
  • AIdaptIQ is the product we had to build because nothing else covered the full cycle: extraction → validation → posting → audit.

The context

What we were actually trying to do

The team behind AIdaptIQ came from a services business handling accounts payable (AP) for mid-market Indian companies and BPO clients. The work was invoice-heavy: hundreds of vendors, many formats, and a mix of scanned paper, emailed PDFs, and forwarded attachments.

At around 5,000 invoices per month, manual data entry became the ceiling on growth. We started evaluating IDP products. We were genuinely looking for something that worked — we wanted to buy, not build.

What broke and why

We tested every platform we could access on our own document set — not sanitised demos. The failures fell into a consistent pattern:

  • Template fragility

    Layout-first tools needed a template per vendor layout. With 5,000+ layout variants, that meant an unmanageable maintenance queue, and any change to a vendor's format broke the existing template.

  • Structural misparses

    Indian GST invoices with multi-row product descriptions confused column detection. Header rows were treated as data rows. Catalog fragments landed in quantity or model fields.

  • Script boundary failures

    On lines mixing English and Hindi, column boundary detection failed at script transitions. The OCR text could be correct character for character yet come from the wrong cell, which makes it a silent error.

  • Silent wrong output

    Numbers appeared in plausible positions but in the wrong semantic fields — unit price in the quantity column, line total as unit price. The output looked clean; the ERP import was wrong.
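
    To make the last failure mode concrete, here is a minimal sketch of an arithmetic check that catches some field swaps. It is an illustration, not AIdaptIQ's implementation: the function name check_line_item, the 0.01 tolerance, and the issue messages are all assumptions.

```python
from decimal import Decimal

def check_line_item(quantity, unit_price, line_total, tolerance=Decimal("0.01")):
    """Return a list of issues for one extracted line item (empty if consistent)."""
    if abs(quantity * unit_price - line_total) <= tolerance:
        return []
    issues = ["quantity * unit_price does not match line_total"]
    # If the identity holds after exchanging two values, the extractor
    # probably put the right numbers into the wrong fields.
    if abs(quantity * line_total - unit_price) <= tolerance:
        issues.append("unit_price and line_total may be swapped")
    if abs(unit_price * line_total - quantity) <= tolerance:
        issues.append("quantity and line_total may be swapped")
    return issues

# A consistent line item, then the same numbers with two fields exchanged.
good = check_line_item(Decimal("12"), Decimal("250.00"), Decimal("3000.00"))
bad = check_line_item(Decimal("12"), Decimal("3000.00"), Decimal("250.00"))
print(good)
print(bad)
```

    Because multiplication is commutative, this check alone cannot see a unit price sitting in the quantity column. That case passes the arithmetic and needs a second signal, such as the column header or the expected value range for the field.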

The gap

What was missing from every product

OCR accuracy was not the problem. Every platform could "read" the document. The failures were upstream of posting and audit — in the layer between raw text and an ERP-safe record.

  • Zero-template handling

    New vendor layouts should not require engineering time. That calls for context-aware structure inference rather than format-specific parsing paths.

  • Business validation

    Math checks, tax cross-references, and format anomaly detection (Indian lakh notation, trailing "/-" suffixes as in 4200/-, colons in price fields as in 00:80), all run before any ERP push.

  • Bulk PDF boundary detection

    Multi-invoice PDFs in arbitrary page order are the norm, not the exception. Boundary detection needs to be reliable, not bolt-on.

  • ERP-ready output

    Not a raw text dump. Validated, field-mapped, audit-traceable records ready for Tally, QuickBooks, or SAP — with clear exception routing for the rest.
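
    The format anomalies named under business validation can be sketched in a few lines. This is a hedged example rather than the product's logic: parse_amount, the flag names, and the decision to send a colon-bearing value to review instead of guessing a correction are all assumptions made for illustration.

```python
import re
from decimal import Decimal

# Indian grouping: last three digits, then groups of two (e.g. 1,25,000.00).
LAKH = re.compile(r"\d{1,2}(,\d{2})*,\d{3}(\.\d+)?")
# Western grouping, or no grouping at all (e.g. 125,000.00 or 4200).
WESTERN = re.compile(r"\d{1,3}(,\d{3})*(\.\d+)?|\d+(\.\d+)?")

def parse_amount(raw):
    """Return (value, flags); value is None when the text needs human review."""
    flags = []
    text = raw.strip()
    if text.endswith("/-"):
        # "4200/-" style suffix: drop it, but record that we did.
        text = text[:-2].strip()
        flags.append("stripped_slash_dash")
    if ":" in text:
        # "00:80" in a price field: do not guess, route to an exception queue.
        flags.append("colon_in_amount")
        return None, flags
    is_lakh = LAKH.fullmatch(text) is not None
    is_western = WESTERN.fullmatch(text) is not None
    if not (is_lakh or is_western):
        flags.append("unrecognised_format")
        return None, flags
    if is_lakh and not is_western:
        flags.append("lakh_grouping")
    return Decimal(text.replace(",", "")), flags

for sample in ["1,25,000.00", "4200/-", "00:80"]:
    print(sample, parse_amount(sample))
```

    Returning flags alongside the value keeps the record audit-traceable: a downstream step can post clean values and route flagged ones to exception handling.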

What AIdaptIQ became

AIdaptIQ is not a generic IDP platform. It is focused on the specific failure modes we documented across real Indian AP workloads:

  • Indian business documents — GST tables, lakh notation, Tally-relevant output, mixed-language lines — as core design, not afterthought.
  • Multi-client and BPO economics: reduce per-vendor configuration drag so teams can scale without scaling headcount.
  • Template-light paths so new vendor layouts onboard without engineering queues.
  • A full AP operations direction: intake and assignment through validation, exception collaboration, audit trail, and analytics, rather than extraction alone.