Number7AI — Docs

Multi-invoice boundary detection

AP teams routinely scan and upload batches of invoices as single PDFs. Boundary detection is the pre-extraction step that determines where one invoice ends and the next begins.

Last updated: April 2026

TL;DR

  • Boundary detection runs before extraction — it determines which pages belong to the same invoice.
  • Four signal types: invoice-start cues, invoice-end cues, page-type classification, vendor consistency.
  • Handles shuffled pages, multi-page invoices, cover pages, and identical-template vendors.
  • Low-confidence groupings go to a manual split/merge review queue rather than being silently processed.

Why boundary detection matters for AP accuracy

Incorrect boundary decisions compound downstream. A missed boundary means two invoices are extracted as one: totals merge, vendor identities blur, and duplicate detection is confused. A false boundary means a multi-page invoice is extracted as two separate documents, often with line items missing from each half. Both errors are harder to detect than a single-field extraction error, because they don't show up in per-field confidence scores.

Missed boundary (under-split)

Two invoices extracted as one. Grand total is wrong by definition. Vendor fields may be from the second invoice only.

False boundary (over-split)

Single invoice split into two partial extractions. Line items incomplete on both. Reconciliation fails on both.

Detection signals

  • Invoice-start cues

    Visual and structural signals that indicate a new document begins: header layouts, vendor logo zones, invoice number positions, and 'INVOICE' / 'TAX INVOICE' keyword anchors.

  • Invoice-end cues

    Total-block detection (grand total, bank details, signature blocks) signals a document terminus. Pages that follow a terminal block are treated as the start of a new document unless other signals indicate otherwise.

  • Page-type classification

    Each page is classified independently as one of five types: cover, body, continuation, attachment, or unrelated. Classification informs boundary decisions before grouping runs.

  • Vendor consistency checks

    Consecutive pages within one invoice should share consistent vendor identity signals (name zone, GSTIN position, template style). A vendor break triggers a boundary candidate.
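
To make the interplay of the four signals concrete, here is a minimal sketch of how per-page signals might be combined into boundary candidates. It is not the product's implementation: the `PageSignals` structure, the `boundary_score` and `split_points` functions, the weights, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageSignals:
    """Hypothetical per-page signals; field names are assumptions."""
    page_type: str             # "cover", "body", "continuation", "attachment", "unrelated"
    has_start_cue: bool        # e.g. 'INVOICE' keyword anchor or header layout found
    has_end_cue: bool          # e.g. grand-total or signature block found
    vendor_key: Optional[str]  # vendor identity read from the page, None if unreadable


def boundary_score(prev: PageSignals, cur: PageSignals) -> float:
    """Score in [0, 1] that a new invoice starts at `cur`. Weights are illustrative."""
    score = 0.0
    if cur.has_start_cue:
        score += 0.4
    if prev.has_end_cue:
        score += 0.3
    # A vendor break only counts when both pages have a readable vendor identity.
    if prev.vendor_key and cur.vendor_key and prev.vendor_key != cur.vendor_key:
        score += 0.3
    # A continuation page argues against a split.
    if cur.page_type == "continuation":
        score -= 0.4
    return max(0.0, min(1.0, score))


def split_points(pages: List[PageSignals], threshold: float = 0.5) -> List[int]:
    """Indices of pages at which a new invoice is predicted to start."""
    return [
        i for i in range(1, len(pages))
        if boundary_score(pages[i - 1], pages[i]) >= threshold
    ]
```

With a three-page batch (a body page and its continuation from one vendor, followed by a body page from another vendor), `split_points` returns a single split at the third page. A production system would more likely learn such weights than hard-code them.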

Edge cases handled

  • Shuffled page order

    Pages from two different invoices interleaved. Detected via vendor-break and layout-discontinuity signals. Pages are re-grouped by predicted document membership.

  • Multi-page single invoice

    Invoices with continuation pages (line-item overflow). Page continuity signals prevent false splits between page 1 and page 2 of the same invoice.

  • Cover pages and attachments

    Scanned batches often include PO covers, remittance slips, and delivery challan pages. These are classified as non-invoice and excluded from extraction.

  • Identical-template vendors

    When two vendors use identical ERP-generated templates, GSTIN and header-zone content are used to maintain correct boundary assignments.
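
The edge cases above can be illustrated with a small regrouping sketch. It assumes, purely for illustration, that each page already carries a page type, a GSTIN, and a reference string read from the header zone; the `Page` tuple, the `regroup` function, and the choice of grouping key are hypothetical and stand in for whatever document-membership prediction the system actually uses.

```python
from typing import Dict, List, NamedTuple, Tuple

# Page types excluded from extraction (cover pages, attachments, unrelated pages).
NON_INVOICE_TYPES = {"cover", "attachment", "unrelated"}


class Page(NamedTuple):
    """Hypothetical page record; field names are assumptions."""
    index: int        # position of the page in the scanned batch
    page_type: str    # output of page-type classification
    gstin: str        # vendor GSTIN read from the page
    header_ref: str   # header-zone content, e.g. an invoice reference


def regroup(pages: List[Page]) -> Dict[Tuple[str, str], List[int]]:
    """Group invoice pages by (GSTIN, header reference), keeping scan order."""
    groups: Dict[Tuple[str, str], List[int]] = {}
    for page in pages:
        if page.page_type in NON_INVOICE_TYPES:
            continue  # non-invoice pages never reach extraction
        groups.setdefault((page.gstin, page.header_ref), []).append(page.index)
    return groups
```

Because the key includes the GSTIN, two vendors that share an identical template and even the same invoice reference still land in separate groups, and interleaved pages are pulled back together regardless of scan order.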

Low-confidence fallback

When boundary confidence falls below the threshold (typically because of very low scan quality, missing vendor headers, or unusual document structures), the batch is held in a manual split/merge queue. Operators see a page-level thumbnail view and can drag pages between invoice groups before releasing the batch to extraction. This keeps low-confidence groupings from reaching the extraction layer unreviewed.
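
The routing rule can be sketched in a few lines. The `InvoiceGroup` structure, the `route_batch` function, the destination names, and the 0.8 threshold are assumptions made for this example; the actual threshold is not stated here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InvoiceGroup:
    """Hypothetical predicted invoice: its pages and a boundary confidence."""
    pages: List[int]
    confidence: float


def route_batch(
    groups: List[InvoiceGroup], threshold: float = 0.8
) -> Tuple[str, List[InvoiceGroup]]:
    """Hold the whole batch for review if any grouping is below threshold.

    Returns the destination and the low-confidence groups, which a review
    screen could highlight for the operator.
    """
    low = [g for g in groups if g.confidence < threshold]
    destination = "manual_review" if low else "extraction"
    return destination, low
```

Holding the entire batch, rather than only the weak group, reflects the behaviour described above: an operator may need to move pages between any two groups, so no group is released until the split is confirmed.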