How modern OCR + LLMs extract structured data from invoices, contracts, and forms — and what to watch out for.
Classic OCR returns text. Modern document AI returns structured fields — invoice number, vendor, line items, totals — typed and validated. The difference comes from layout-aware models (LayoutLM, Donut, Pix2Struct) plus a small LLM that maps extracted text to a schema.
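To make "typed and validated" concrete, here is a minimal sketch of what a target schema might look like. The class and field names are illustrative assumptions, not DocFila's actual schema; the point is that typed output lets you run checks (like totals arithmetic) that raw OCR text cannot support.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: Decimal

@dataclass
class Invoice:
    invoice_number: str
    vendor: str
    line_items: list
    total: Decimal

    def validate(self) -> bool:
        # A check raw text can't give you: line items must sum to the total.
        computed = sum(i.quantity * i.unit_price for i in self.line_items)
        return computed == self.total

inv = Invoice(
    invoice_number="INV-1042",
    vendor="Acme GmbH",
    line_items=[LineItem("Widget", 3, Decimal("19.99")),
                LineItem("Shipping", 1, Decimal("5.00"))],
    total=Decimal("64.97"),
)
```

Using `Decimal` rather than `float` for money avoids rounding surprises when validating totals.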
1. Pre-process: deskew, denoise, binarize. DocFila does this on-device before anything leaves the camera.
2. OCR: extract characters, words, and bounding boxes. Apple Vision on iOS, ML Kit on Android.
3. Layout understanding: a small vision-language model groups text into headers, tables, totals.
4. Schema mapping: an LLM (or a fine-tuned classifier) maps the layout output into your target schema (invoice, receipt, ID, contract).
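The last step above can be sketched in miniature. In production the mapper is an LLM or fine-tuned classifier; here a rule-based stand-in (with a hypothetical layout-model output format of `(text, region_label, bbox)` spans) shows the shape of the transformation from layout output to schema fields.

```python
import re

# Hypothetical layout-model output: each span is (text, region_label, bbox).
layout_spans = [
    ("Invoice No: INV-1042", "header", (40, 30, 300, 55)),
    ("Acme GmbH", "header", (40, 60, 200, 85)),
    ("Total: $64.97", "totals", (400, 700, 560, 725)),
]

def map_to_schema(spans):
    """Step 4 in miniature: a rule-based stand-in for the LLM mapper."""
    out = {"invoice_number": None, "vendor": None, "total": None}
    for text, label, bbox in spans:
        if m := re.search(r"Invoice No:\s*(\S+)", text):
            out["invoice_number"] = m.group(1)
        elif m := re.search(r"Total:\s*\$?([\d.]+)", text):
            out["total"] = float(m.group(1))
        elif label == "header" and out["vendor"] is None:
            # First unmatched header span is assumed to be the vendor name.
            out["vendor"] = text
    return out

result = map_to_schema(layout_spans)
```

An LLM replaces the regexes with a prompt plus the schema definition, but the input (labeled spans with boxes) and output (a typed record) are the same.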
Hallucination on partially obscured fields — an LLM will confidently invent a plausible value for a smudged or cropped field. Always show provenance bounding boxes and confidence scores so a human can verify each value against the source pixels.
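One way to act on that advice is a review gate: any field whose confidence falls below a threshold gets routed to a human, along with its bounding box for the highlighted crop. The field layout and the 0.90 cutoff below are assumptions for illustration; thresholds are usually tuned per field type.

```python
# Hypothetical extraction result: each field carries its source bounding
# box and model confidence, so a reviewer can inspect the exact crop.
fields = {
    "invoice_number": {"value": "INV-1042", "bbox": (40, 30, 300, 55), "conf": 0.98},
    "total":          {"value": "64.97", "bbox": (400, 700, 560, 725), "conf": 0.61},
}

REVIEW_THRESHOLD = 0.90  # assumed cutoff; tune per field and per document type

def needs_review(fields, threshold=REVIEW_THRESHOLD):
    """Return the fields a human should verify against the highlighted crop."""
    return {k: v for k, v in fields.items() if v["conf"] < threshold}

flagged = needs_review(fields)
```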
Schema drift across vendors — no two vendors label fields the same way, and a rigid schema silently drops anything it doesn't recognize. Keep a 'free-text fallback' so unknown fields are still captured.
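A free-text fallback can be as simple as routing every field the schema doesn't recognize into an `unmapped` bucket instead of discarding it. The field names here are hypothetical:

```python
# Assumed known schema fields; anything else lands in the fallback bucket.
KNOWN_FIELDS = {"invoice_number", "vendor", "total"}

def apply_fallback(extracted: dict) -> dict:
    """Split extraction into typed fields plus a free-text fallback bucket,
    so vendor-specific fields (PO numbers, tax IDs, ...) are never dropped."""
    typed = {k: v for k, v in extracted.items() if k in KNOWN_FIELDS}
    typed["unmapped"] = {k: v for k, v in extracted.items() if k not in KNOWN_FIELDS}
    return typed

doc = apply_fallback({
    "invoice_number": "INV-1042",
    "vendor": "Acme GmbH",
    "customs_reference": "DE-8841",  # vendor-specific field with no schema slot
})
```

The `unmapped` bucket also doubles as a signal: fields that show up there repeatedly are candidates for promotion into the typed schema.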
Privacy — process on-device whenever possible. DocFila's free tier runs every extraction model locally on your phone.