Number7AI — Docs

How to evaluate AP automation

Most AP automation platforms look identical in a 30-minute demo. The differences only appear when you test them on your real documents, your real edge cases, and your real exception volume. This guide gives you the framework to evaluate what matters.

Last updated: April 2026

TL;DR

  • Test on your own documents, not vendor-curated samples. Results on clean PDFs don't predict your production straight-through processing (STP) rate.
  • Ask for STP rate from live deployments, not OCR accuracy from benchmarks.
  • Evaluate four dimensions: document complexity coverage, validation depth, exception handling, integration model.
  • The red flags list below will disqualify most demo-only platforms before you spend time on a POC.

The wrong question most buyers ask

"What is your accuracy?" is the wrong starting question for AP automation evaluation. OCR accuracy on clean documents is high for every platform. What matters is the straight-through processing rate on your real document mix — the percentage of invoices that complete the full workflow (extract → validate → post) without any human intervention. A platform with 99% field-level accuracy on clean PDFs might have a 40% STP rate on your actual scanned invoice batches.

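One way to see why high field-level accuracy can coexist with a low STP rate: an invoice only goes straight through if every field it depends on is right. The sketch below assumes field errors are independent and equally likely, which is a simplification, and the field counts are illustrative rather than measured. The function name `expected_stp_ceiling` is hypothetical.

```python
def expected_stp_ceiling(field_accuracy: float, fields_per_invoice: int) -> float:
    """Upper bound on the STP rate if each field is independently correct
    with probability `field_accuracy` (a simplifying assumption)."""
    if not 0.0 <= field_accuracy <= 1.0:
        raise ValueError("field_accuracy must be between 0 and 1")
    return field_accuracy ** fields_per_invoice


# Illustrative field counts: header-only, a typical invoice, a long multi-line invoice.
for n_fields in (10, 40, 90):
    ceiling = expected_stp_ceiling(0.99, n_fields)
    print(f"{n_fields} fields at 99% each -> STP ceiling {ceiling:.1%}")
```

Under these assumptions, an invoice with around 90 extracted values at 99% each would have a ceiling of roughly 40%, before any validation failures or posting errors are counted. Real errors are correlated (a bad scan tends to break many fields at once), so treat this as intuition, not a prediction.
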
Wrong question

"What is your extraction accuracy?" — vendors will say 95–99%+ and show curated demos.

Right question

"What is the STP rate from a deployment processing documents similar to mine?"

Four evaluation dimensions

Document complexity coverage

  • Does it handle multi-page invoices correctly (boundary detection)?
  • What happens with low-quality scans — does it fail loudly or silently?
  • Does it handle your specific document types (GST invoices, GRNs, mixed formats)?
  • How many distinct vendor templates has it been tested on, and how does it handle layouts it has not seen before?

Validation depth

  • Does it verify invoice math (quantity × unit price = line total, line totals sum to the subtotal, subtotal + tax = total)?
  • Does it check vendor identity against a master, or just extract the name?
  • Is three-way matching (PO/GRN/invoice) supported natively or via API?
  • What happens when a validation rule fails — halt, flag, or silent pass?
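
The invoice-math check in the list above can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the `LineItem` type, the `check_invoice_math` function, and the 0.01 rounding tolerance are all assumptions. Failures are returned explicitly so the caller can halt or flag the invoice instead of passing it silently.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class LineItem:
    quantity: Decimal
    unit_price: Decimal
    line_total: Decimal


def check_invoice_math(lines, subtotal, tax, total, tolerance=Decimal("0.01")):
    """Return a list of failure messages; an empty list means the math reconciles."""
    failures = []
    for i, line in enumerate(lines, start=1):
        # Each line: quantity x unit price should equal the stated line total.
        if abs(line.quantity * line.unit_price - line.line_total) > tolerance:
            failures.append(f"line {i}: quantity x unit price != line total")
    # Line totals should add up to the stated subtotal.
    line_sum = sum((line.line_total for line in lines), Decimal("0"))
    if abs(line_sum - subtotal) > tolerance:
        failures.append("sum of line totals != subtotal")
    # Subtotal plus tax should equal the invoice total.
    if abs(subtotal + tax - total) > tolerance:
        failures.append("subtotal + tax != total")
    return failures
```

`Decimal` is used rather than `float` so that currency amounts compare exactly. When evaluating a platform, the useful question is what it does with a non-empty failure list.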

Exception handling

  • Is there a structured exception queue or just a list of 'failed' documents?
  • Can exceptions be triaged, prioritised, and resolved with audit log entries?
  • What is the average exception rate in production deployments (ask for data)?
  • Can exception resolution patterns be used to improve extraction over time?

Integration and deployment

  • Is integration offered via REST API, file drop, or email only?
  • What is the typical time-to-production for your ERP or accounting system?
  • Is reprocessing and rollback supported if a posting error is discovered?
  • What is the support model for integration issues in production?

Red flags to walk away from

  • Accuracy demonstrated only on clean, single-page PDFs.
  • No distinction made between extraction accuracy and STP rate.
  • Exception handling described as 'manual review' with no queue tooling shown.
  • Integration offered only via CSV export or manual copy-paste.
  • Audit trail limited to 'activity log' with no field-level history.
  • Pricing based on users, not document volume — often a signal of low automation confidence.
  • No live data from production deployments cited in benchmarks.

Running a meaningful POC

  1. Supply your own documents

    Provide 100–200 real invoices from your actual vendor mix, including the messiest ones. Refuse to run a POC on vendor-supplied samples.

  2. Measure STP, not accuracy

    Track how many invoices complete the full workflow without human touch. This is the number that maps to ROI.

  3. Test exception volume

    Count how many exceptions are raised and how long they take to resolve. High exception rates eliminate the productivity gain from automation.

  4. Evaluate the audit trail

    Ask to see the full history of a specific invoice — from upload to posting. If this takes more than 2 minutes to produce, the audit trail is insufficient.

  5. Ask for a reference customer

    Speak to a customer with a similar document mix and volume. Ask about the first 3 months, not just steady state.
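
The measurements in steps 2 and 3 can be kept in a simple scorecard during the POC. The sketch below is one possible way to tally them, assuming you log one record per invoice. The `PocResult` type and `poc_scorecard` function are hypothetical names, not part of any platform.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class PocResult:
    invoice_id: str
    posted_without_touch: bool                 # completed extract -> validate -> post untouched
    exception_minutes: Optional[float] = None  # time to resolve; None if no exception was raised


def poc_scorecard(results):
    """Summarise STP rate, exception rate, and median exception resolution time."""
    total = len(results)
    if total == 0:
        raise ValueError("no POC results to score")
    stp_count = sum(r.posted_without_touch for r in results)
    resolution_times = [r.exception_minutes for r in results if r.exception_minutes is not None]
    return {
        "stp_rate": stp_count / total,
        "exception_rate": len(resolution_times) / total,
        "median_resolution_minutes": median(resolution_times) if resolution_times else None,
    }
```

Running the same scorecard on each shortlisted platform, with the same 100–200 invoices, gives numbers you can compare directly.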