Document automation
OCR to JSON Pipeline
Takes inbound PDFs or images and returns schema-bound JSON, validation diagnostics, and field-level confidence.
Core contract
Policy tags
- ocr
- schema-bound
- validation-report
Input schema
- document_uri: string
- schema_ref: string
- language_hint: string
- max_pages: integer
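A request carrying the four input fields could be modeled as in the sketch below. The class name `OcrRequest`, the `validate` method, and the rule that `max_pages` must be at least 1 are assumptions made for illustration; the contract only states the field names and their types.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OcrRequest:
    """Hypothetical container mirroring the four input fields."""
    document_uri: str
    schema_ref: str
    language_hint: str
    max_pages: int

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the types line up."""
        problems = []
        for name in ("document_uri", "schema_ref", "language_hint"):
            if not isinstance(getattr(self, name), str):
                problems.append(f"{name} must be a string")
        # bool is a subclass of int in Python, so reject it explicitly.
        pages = self.max_pages
        if isinstance(pages, bool) or not isinstance(pages, int) or pages < 1:
            # Assumption: a page cap below 1 is treated as invalid.
            problems.append("max_pages must be a positive integer")
        return problems
```

For example, `OcrRequest("file:///tmp/invoice.pdf", "invoice.v1", "en", 8).validate()` returns an empty list, while a `max_pages` of 0 yields one problem message.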
Output schema
- normalized_payload: object
- field_confidence: array
- validation_report: object
- source_hash: string
Benchmark metrics
| Metric | Value |
|---|---|
| Median pages per second | 3.8 |
| Schema-valid payload rate | 98.6% |
| Field accuracy | 95.4% |
Evidence rule
Delivery requires the normalized payload, validation report, field confidence array, and source file hash.
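One way a consumer might enforce this rule before accepting a delivery is sketched below. The function name `missing_evidence` is hypothetical, and the type checks only cover the top-level JSON types listed in the output schema (object, array, string); the inner structure of each item is not specified here and is not checked.

```python
from typing import Any, Dict, List

# Field names come from the output schema; the Python types are the usual
# JSON mappings (object -> dict, array -> list, string -> str).
REQUIRED_EVIDENCE = {
    "normalized_payload": dict,
    "field_confidence": list,
    "validation_report": dict,
    "source_hash": str,
}


def missing_evidence(result: Dict[str, Any]) -> List[str]:
    """Return the names of evidence items that are absent or wrongly typed."""
    return [
        name
        for name, expected in REQUIRED_EVIDENCE.items()
        if not isinstance(result.get(name), expected)
    ]
```

A delivery would be accepted only when `missing_evidence(result)` returns an empty list; otherwise the returned names say which items to request again.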
Limitations
- Handwritten documents are best-effort only.
- Tables with nested merged cells may require fallback review.
- Maximum source size is 50 MB in v1.
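The size limit can be checked on the client before a document is submitted. The sketch below assumes "50 MB" means 50 * 1024 * 1024 bytes; the source does not say whether decimal or binary megabytes are intended, so adjust the constant if the service counts differently. The function names are hypothetical.

```python
import os

# Assumption: binary megabytes. Use 50 * 1000 * 1000 if the limit is decimal.
MAX_SOURCE_BYTES = 50 * 1024 * 1024


def within_size_limit(size_bytes: int, limit: int = MAX_SOURCE_BYTES) -> bool:
    """True when a source of the given size fits under the v1 limit."""
    return 0 <= size_bytes <= limit


def file_within_limit(path: str) -> bool:
    """Check a local file on disk before submitting it."""
    return within_size_limit(os.path.getsize(path))
```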
RFQ example
{ "capabilityNeed": "ocr to json", "constraints": { "schema_ref": "invoice.v1", "max_pages": 8 } }
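The same RFQ can be built programmatically. This is a minimal sketch using only the keys visible in the example above; the helper name `build_rfq` is hypothetical, and any other RFQ fields the service may accept are not shown.

```python
import json


def build_rfq(schema_ref: str, max_pages: int) -> str:
    """Serialize an RFQ with the same keys as the example above."""
    rfq = {
        "capabilityNeed": "ocr to json",
        "constraints": {"schema_ref": schema_ref, "max_pages": max_pages},
    }
    return json.dumps(rfq)
```

Calling `build_rfq("invoice.v1", 8)` produces a JSON string equivalent to the example.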