Number7AI — Docs
Multi-invoice boundary detection
AP teams routinely scan and upload batches of invoices as single PDFs. Boundary detection is the pre-extraction step that determines where one invoice ends and the next begins.
Last updated: April 2026
TL;DR
- Boundary detection runs before extraction and determines which pages belong to the same invoice.
- Four signal types: invoice-start cues, invoice-end cues, page-type classification, vendor consistency.
- Handles shuffled pages, multi-page invoices, cover pages, and identical-template vendors.
- Low-confidence groupings go to a manual split/merge review queue rather than being silently processed.
Why boundary detection matters for AP accuracy
Incorrect boundary decisions compound downstream. A missed boundary means two invoices are extracted as one: totals merge, vendor identities blur, and duplicate detection becomes unreliable. A false boundary means a single multi-page invoice is extracted as two partial documents, often with line items missing from each half. Both errors are harder to detect than a single-field extraction error because they do not show up in per-field confidence scores.
Missed boundary (under-split)
Two invoices extracted as one. Grand total is wrong by definition. Vendor fields may be from the second invoice only.
False boundary (over-split)
Single invoice split into two partial extractions. Line items incomplete on both. Reconciliation fails on both.
Detection signals
Invoice-start cues
Visual and structural signals that indicate a new document begins: header layouts, vendor logo zones, invoice number positions, and 'INVOICE' / 'TAX INVOICE' keyword anchors.
Invoice-end cues
Total-block detection (grand total, bank details, signature blocks) signals the end of a document. The page that follows a terminal block is treated as a candidate start of a new document.
Page-type classification
Each page is classified independently as: cover, body, continuation, attachment, or unrelated. Classification informs boundary decisions before grouping runs.
Vendor consistency checks
Consecutive pages must share consistent vendor identity signals (name zone, GSTIN position, template style). Vendor breaks trigger a boundary candidate.
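The four signals above can be combined in many ways. The following is a minimal sketch, not the product's actual logic, assuming each page has already been reduced to a few boolean cues and a normalized vendor key. The names `Page`, `is_boundary`, and `group_pages`, and the rule ordering, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class PageType(Enum):
    COVER = "cover"
    BODY = "body"
    CONTINUATION = "continuation"
    ATTACHMENT = "attachment"
    UNRELATED = "unrelated"


@dataclass
class Page:
    index: int
    page_type: PageType
    vendor_key: str       # hypothetical normalized vendor identity (e.g. a GSTIN)
    has_start_cue: bool   # header layout or 'INVOICE' keyword anchor found
    has_end_cue: bool     # grand total, bank details, or signature block found


def is_boundary(prev: Page, curr: Page) -> bool:
    """Return True if a new invoice is assumed to start at `curr`."""
    if curr.page_type is PageType.CONTINUATION:
        return False              # continuation pages never open a document
    if prev.vendor_key != curr.vendor_key:
        return True               # vendor break
    if prev.has_end_cue:
        return True               # previous page closed its document
    return curr.has_start_cue     # otherwise rely on the start cue alone


def group_pages(pages: list[Page]) -> list[list[int]]:
    """Walk pages in scan order and open a new group at each boundary."""
    groups: list[list[int]] = []
    for i, page in enumerate(pages):
        if i == 0 or is_boundary(pages[i - 1], page):
            groups.append([])
        groups[-1].append(page.index)
    return groups
```

A real system would likely weight these cues and produce a confidence score rather than apply hard rules; the fixed ordering here is only meant to show how the signals interact.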
Edge cases handled
Shuffled page order
Pages from two different invoices interleaved. Detected via vendor-break and layout-discontinuity signals. Pages are re-grouped by predicted document membership.
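As an illustration of re-grouping by predicted document membership, here is a minimal sketch. It assumes an upstream step has already assigned each page a predicted document identifier; the function name `regroup` and the identifier strings are hypothetical.

```python
def regroup(pages: list[tuple[int, str]]) -> list[list[int]]:
    """Group (page_index, predicted_document_id) pairs by document.

    Groups are returned in the order each document first appears in the
    scan, and pages keep their scan order within a group.
    """
    groups: dict[str, list[int]] = {}
    for page_index, doc_id in pages:
        groups.setdefault(doc_id, []).append(page_index)
    return list(groups.values())


# Two invoices scanned with their pages interleaved.
interleaved = [(0, "doc-A"), (1, "doc-B"), (2, "doc-A"), (3, "doc-B")]
regrouped = regroup(interleaved)
```

This sketch only restores membership. Recovering the correct page order inside each invoice would need additional cues, such as printed page numbers, which are not covered here.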
Multi-page single invoice
Invoices with continuation pages (line-item overflow). Page continuity signals prevent false splits between page 1 and page 2 of the same invoice.
Cover pages and attachments
Scanned batches often include PO covers, remittance slips, and delivery challan pages. These are classified as non-invoice and excluded from extraction.
Identical-template vendors
When two vendors use identical ERP-generated templates, GSTIN and header-zone content are used to maintain correct boundary assignments.
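A minimal sketch of a GSTIN-based vendor check follows. It assumes the input text comes from the vendor header zone only, since an invoice can also carry the buyer's GSTIN elsewhere on the page. The regular expression follows the standard 15-character GSTIN layout; the function names are hypothetical, and the checksum character is not validated.

```python
import re
from typing import Optional

# 2-digit state code, 10-character PAN, entity code, literal 'Z', check character.
GSTIN_RE = re.compile(r"\b\d{2}[A-Z]{5}\d{4}[A-Z][1-9A-Z]Z[0-9A-Z]\b")


def extract_gstin(header_text: str) -> Optional[str]:
    """Return the first GSTIN-shaped token in the header text, if any."""
    match = GSTIN_RE.search(header_text.upper())
    return match.group(0) if match else None


def vendor_break(prev_header: str, curr_header: str) -> bool:
    """True only when both pages carry a GSTIN and the two differ."""
    prev_id = extract_gstin(prev_header)
    curr_id = extract_gstin(curr_header)
    return prev_id is not None and curr_id is not None and prev_id != curr_id
```

Returning `False` when either GSTIN is missing is a deliberate choice in this sketch: a missing identifier is treated as no evidence rather than as a break, leaving other signals to decide.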
Low-confidence fallback
When boundary confidence falls below threshold, typically because of very low scan quality, missing vendor headers, or unusual document structures, the batch is held in a manual split/merge queue. Operators see a page-level thumbnail view and can drag pages between invoice groups before releasing the batch to extraction. This keeps low-confidence groupings from reaching the extraction layer unreviewed.
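The routing decision can be sketched as follows. The threshold value, the `Grouping` type, and the queue names are assumptions made for illustration; the actual threshold and confidence scale are not documented here.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # hypothetical value, not a documented default


@dataclass
class Grouping:
    pages: list[int]
    confidence: float  # assumed to be a boundary confidence in [0, 1]


def route_batch(groupings: list[Grouping],
                threshold: float = REVIEW_THRESHOLD) -> str:
    """Hold the whole batch for review if any grouping is below threshold."""
    if any(g.confidence < threshold for g in groupings):
        return "manual_review"
    return "extraction"
```

Holding the entire batch, rather than only the uncertain grouping, matches the description above: a wrong boundary in one place can shift which pages belong to the neighbouring invoices.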