Vision-language models for Somali ID OCR: a 2026 benchmark
How modern vision-language models perform on Somali ID OCR - benchmarking Claude, GPT-4o, Mistral, AWS Textract; code samples and the production stack we ship.
OCR for identity documents used to be a specialised problem. You trained or licensed a model on a corpus of cards for the country you cared about, you hand-engineered field templates, and you accepted that exotic layouts - like Somali ID variants - would underperform the OECD-country baseline. As of 2026, that's not true any more. Vision-language models (VLMs) - frontier multimodal LLMs from Anthropic, OpenAI, Mistral and Google - have collapsed the cost of doing competent OCR on documents the cloud OCR services never saw in training.
This article benchmarks the current frontier models on Somali ID extraction, with reproducible code in Python, and explains the production stack we ship at Ogowkey. The headline finding is that VLMs alone are not enough, but a VLM augmented with a thin layer of layout-aware post-processing now beats every specialised OCR vendor we've tested.
What we benchmarked
We assembled a private evaluation set of 214 Somali identity documents - a mix of:
- 142 NIRA national ID cards (issued by the National Identification and Registration Authority, 2023–2026 issuances).
- 41 Somali passports (post-2014 ICAO compliant, with MRZ).
- 31 regional driver's licences (Banadir, Puntland, Somaliland - three different layouts).
Each document has a ground-truth annotation for: surname, given names, document number, date of birth, sex, date of issue, date of expiry, and (where present) MRZ.
We measured character-level accuracy (Levenshtein-normalised) per field, plus overall hit rate (the percentage of documents where every field was extracted correctly within a 95% character-accuracy threshold). The latter is the metric that maps directly to real-world experience: it's the percentage of users who would not have to manually re-key any field.
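The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not our evaluation harness: the function names are hypothetical, and normalising the edit distance by the longer string's length is an assumption, since several Levenshtein normalisations are in common use.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_accuracy(predicted: Optional[str], truth: str) -> float:
    """1 - distance / longer length (assumed normalisation). A null prediction scores 0."""
    if predicted is None:
        return 0.0
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(predicted, truth) / longest


def document_hit(pred: dict, truth: dict, threshold: float = 0.95) -> bool:
    """A document is a hit only if every annotated field clears the threshold."""
    return all(char_accuracy(pred.get(k), v) >= threshold for k, v in truth.items())


def hit_rate(preds: list, truths: list, threshold: float = 0.95) -> float:
    """Fraction of documents where every field was extracted well enough."""
    hits = sum(document_hit(p, t, threshold) for p, t in zip(preds, truths))
    return hits / len(truths)
```

Under this normalisation a single wrong character in a 10-character field scores 0.90, which falls below the 95% threshold, so short fields such as document numbers effectively have to be exact.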
Five systems benchmarked:
- AWS Textract - the AnalyzeID API. Mature, vendor-trained on global IDs.
- Mistral OCR API - released early 2025, transformer-based, vendor-tuned for documents.
- GPT-4o (vision) - OpenAI's multimodal model, prompted for structured extraction.
- Claude Sonnet 4.5 - Anthropic's multimodal model with the same prompt structure.
- Ogowkey production - our shipped stack: a VLM (Claude Sonnet 4.5) with layout-aware post-processing, MRZ cross-validation and NIRA-format priors.
Results
A few things worth flagging:
- AWS Textract trails because its AnalyzeID API has structural priors for US/EU layouts. Somali NIRA cards confuse the field-detection step; the OCR per-character is fine but field assignment falters.
- Mistral OCR does well because it doesn't try to be clever about field detection - it returns the raw lines, and we can map them ourselves. That's also what makes it slightly worse than VLMs, which understand the document semantically.
- GPT-4o and Claude Sonnet 4.5 are within a few points of each other on this task. Claude was marginally better at handling the Somali transliterations (Cabdullaahi → Abdullahi correctly mapped without us asking).
- Ogowkey's stack beats the raw VLM by 15 points because we add layout priors, MRZ cross-checks, and field validators (does the date parse? does the MRZ check-digit validate?). The VLM gives us a strong first pass; the surrounding system gives us defensible accuracy.
Code: extracting with a VLM
Calling a VLM for OCR is now a few lines. Here's the production-shaped Anthropic call:
import base64
import json
import pathlib

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = """You are extracting structured fields from a Somali national ID card.
Return JSON with the following fields, no commentary:
{
  "surname": str,
  "given_names": str,
  "national_id": str, // the 10-digit number on the card
  "date_of_birth": "YYYY-MM-DD",
  "sex": "M"|"F",
  "date_of_issue": "YYYY-MM-DD",
  "date_of_expiry": "YYYY-MM-DD",
  "address": str | null,
  "mrz_line1": str | null,
  "mrz_line2": str | null,
  "mrz_line3": str | null
}
Rules:
- If a field is illegible, return null for that field.
- Always normalize dates to YYYY-MM-DD (the card uses DD-MM-YYYY).
- For names, preserve the spelling on the card (do not transliterate).
"""


def extract_id_fields(image_path: str) -> dict:
    # Read the image and base64-encode it for the Messages API.
    img_bytes = pathlib.Path(image_path).read_bytes()
    b64 = base64.b64encode(img_bytes).decode()
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        # media_type is hard-coded: this assumes a JPEG input.
                        "source": {"type": "base64", "media_type": "image/jpeg", "data": b64},
                    },
                    {"type": "text", "text": PROMPT},
                ],
            }
        ],
    )
    # The first content block holds the model's text reply, expected to be bare JSON.
    return json.loads(msg.content[0].text)
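The `json.loads(msg.content[0].text)` call assumes the reply is bare JSON. A model can occasionally wrap its answer in a markdown fence or add a sentence around it, so a slightly more defensive parser is worth having. The sketch below is illustrative rather than what we ship: `parse_model_json` and `EXPECTED_FIELDS` are hypothetical names, and the field list simply mirrors the prompt.

```python
import json

# Field names copied from the extraction prompt; missing keys are filled with None.
EXPECTED_FIELDS = (
    "surname", "given_names", "national_id", "date_of_birth", "sex",
    "date_of_issue", "date_of_expiry", "address",
    "mrz_line1", "mrz_line2", "mrz_line3",
)


def parse_model_json(text: str) -> dict:
    """Parse the outermost {...} span in the reply, ignoring anything around it."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    obj = json.loads(text[start : end + 1])
    # Keep only the fields we asked for, so downstream code sees a fixed shape.
    return {key: obj.get(key) for key in EXPECTED_FIELDS}
```

Returning a fixed set of keys means the validators further down never have to distinguish between a missing key and an explicit null.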
For OpenAI's GPT-4o, the call shape is similar - base64-encode the image, attach as an image_url content block, ask for JSON.
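As a rough sketch of that shape, assuming the official `openai` Python SDK and its Chat Completions interface: the helper names `build_gpt4o_messages` and `extract_with_gpt4o` are ours for illustration, and the prompt is passed in so the same text can be reused across vendors.

```python
import base64
import json
import pathlib


def build_gpt4o_messages(image_bytes: bytes, prompt: str,
                         media_type: str = "image/jpeg") -> list:
    """Build a single user message carrying the prompt and the image as a data URL."""
    b64 = base64.b64encode(image_bytes).decode()
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{media_type};base64,{b64}"},
                },
            ],
        }
    ]


def extract_with_gpt4o(image_path: str, prompt: str) -> dict:
    # Imported lazily so the payload builder can be used and tested without the SDK.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=build_gpt4o_messages(pathlib.Path(image_path).read_bytes(), prompt),
        # JSON mode; the prompt itself must mention JSON for this to be accepted.
        response_format={"type": "json_object"},
        max_tokens=1024,
    )
    return json.loads(resp.choices[0].message.content)
```

Keeping the payload builder separate from the network call makes it easy to unit-test the request shape without spending tokens.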
For Mistral's OCR API:
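import base64
import os
import pathlib

import httpx

# The environment variable name is an assumption; load the key however you prefer.
MISTRAL_KEY = os.environ["MISTRAL_API_KEY"]

with httpx.Client(headers={"Authorization": f"Bearer {MISTRAL_KEY}"}, timeout=30.0) as c:
    img_b64 = base64.b64encode(pathlib.Path("id.jpg").read_bytes()).decode()
    res = c.post(
        "https://api.mistral.ai/v1/ocr",
        json={
            "model": "mistral-ocr-latest",
            "document": {"type": "image_url", "image_url": f"data:image/jpeg;base64,{img_b64}"},
        },
    )
    res.raise_for_status()  # surface HTTP errors instead of failing on a missing "pages" key

pages = res.json()["pages"]
raw_text = "\n".join(p["markdown"] for p in pages)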
The Mistral response gives you the raw text in markdown - you then run your own field-detection layer on top.
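What that layer looks like depends on the card, but the date fields give a feel for it. The sketch below is a deliberately naive heuristic, not our field-detection code: it assumes dates are printed as DD-MM-YYYY (as the prompt above notes), accepts `/` and `.` as separators, and relies on the fact that date of birth, issue and expiry must fall in that chronological order. The function names and the sample text in the usage note are hypothetical.

```python
import re
from datetime import date

# Day-first dates with -, / or . as separator.
DATE_RE = re.compile(r"\b(\d{2})[-/.](\d{2})[-/.](\d{4})\b")


def find_dates(raw_text: str) -> list:
    """Return the distinct valid dates in the text as ISO strings, oldest first."""
    found = set()
    for dd, mm, yyyy in DATE_RE.findall(raw_text):
        try:
            found.add(date(int(yyyy), int(mm), int(dd)))
        except ValueError:
            continue  # e.g. 31-02-2024: looks like a date, is not one
    return [d.isoformat() for d in sorted(found)]


def assign_dates(raw_text: str) -> dict:
    """Map exactly three distinct dates to birth < issue < expiry; otherwise give up."""
    dates = find_dates(raw_text)
    if len(dates) != 3:
        return {"date_of_birth": None, "date_of_issue": None, "date_of_expiry": None}
    dob, issue, expiry = dates
    return {"date_of_birth": dob, "date_of_issue": issue, "date_of_expiry": expiry}
```

For a raw text such as `"DOB 03-08-1992\nIssued 14/02/2024 Expires 13/02/2034"`, `assign_dates` maps the three dates by order alone. Names and document numbers need label- or position-based rules, which is exactly the work a VLM saves you.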
Why the VLM alone isn't enough
The VLM gets you to ~78% hit rate on this benchmark. To get to production-grade (90%+), you need three additional layers.
1. MRZ cross-validation
The MRZ on the back of the card carries the same data as the visual zone, in a machine-readable format with check digits. If the VLM extracted date_of_birth: "1992-08-03" and the MRZ encodes 920803, you have agreement. If they disagree, you have a problem worth surfacing.
def parse_mrz_date(mrz_date: str) -> str:
    """YYMMDD -> YYYY-MM-DD. Assumes years 00-29 are 21st century, 30-99 are 20th."""
    # This pivot suits birth dates; an expiry date after 2029 needs a different rule.
    yy, mm, dd = int(mrz_date[:2]), int(mrz_date[2:4]), int(mrz_date[4:6])
    yyyy = 2000 + yy if yy <= 29 else 1900 + yy
    return f"{yyyy:04d}-{mm:02d}-{dd:02d}"


def mrz_check_digit(value: str) -> str:
    """Weighted sum mod 10 with repeating weights 7, 3, 1 over the field's characters."""
    weights = [7, 3, 1]
    total = 0
    for i, c in enumerate(value):
        if c == "<":
            v = 0  # filler character
        elif c.isdigit():
            v = int(c)
        else:
            v = ord(c.upper()) - 55  # A=10, B=11, ..., Z=35
        total += v * weights[i % 3]
    return str(total % 10)
If the check digits don't match, you treat the result as low-confidence regardless of the VLM's apparent accuracy. We covered the MRZ specification in verifying Somali national IDs.
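Putting the two ideas together, a cross-check for a single date field might look like the sketch below. It is illustrative only: `cross_check_date` and its three return labels are hypothetical, and the check-digit routine is repeated so the block runs on its own. Comparing in the MRZ's own YYMMDD form sidesteps the century guess entirely.

```python
from datetime import date
from typing import Optional


def mrz_check_digit(value: str) -> str:
    """Weighted sum mod 10 with repeating weights 7, 3, 1."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(value):
        if ch == "<":
            v = 0
        elif ch.isdigit():
            v = int(ch)
        else:
            v = ord(ch.upper()) - 55  # A=10 ... Z=35
        total += v * weights[i % 3]
    return str(total % 10)


def cross_check_date(vlm_iso: Optional[str], mrz_yymmdd: str, mrz_check: str) -> str:
    """Return 'agree', 'disagree' or 'mrz_unreliable' (labels are illustrative)."""
    # If the MRZ field fails its own check digit, it cannot arbitrate anything.
    if (len(mrz_yymmdd) != 6 or not mrz_yymmdd.isdigit()
            or mrz_check_digit(mrz_yymmdd) != mrz_check):
        return "mrz_unreliable"
    try:
        visual = date.fromisoformat(vlm_iso)
    except (TypeError, ValueError):
        return "disagree"  # VLM returned null or a malformed date
    # Compare two-digit years so no century pivot is needed.
    return "agree" if visual.strftime("%y%m%d") == mrz_yymmdd else "disagree"
```

With the example above, `cross_check_date("1992-08-03", "920803", "8")` agrees, while a day/month swap in the VLM output (`"1992-03-08"`) is reported as a disagreement.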
2. Layout-aware field priors
The VLM sometimes confuses date of issue with date of expiry on cards with idiosyncratic layouts. A simple sanity check - issue must be in the past, expiry must be in the future, expiry must be after issue - catches most of these:
from datetime import date


def validate_dates(fields: dict) -> list[str]:
    """Return a list of error codes; an empty list means the dates are plausible."""
    errors = []
    today = date.today()

    try:
        dob = date.fromisoformat(fields["date_of_birth"])
        if dob >= today:
            errors.append("date_of_birth_in_future")
    except (KeyError, ValueError, TypeError):
        errors.append("date_of_birth_invalid")

    try:
        iss = date.fromisoformat(fields["date_of_issue"])
        exp = date.fromisoformat(fields["date_of_expiry"])
        if iss > today:
            errors.append("issue_in_future")  # issue must be in the past
        if exp < today:
            errors.append("expiry_in_past")  # expiry must be in the future
        if exp <= iss:
            errors.append("expiry_before_issue")  # usually a swapped issue/expiry pair
    except (KeyError, ValueError, TypeError):
        errors.append("issue_or_expiry_invalid")

    return errors
3. NIRA-format priors
The Somali national ID number has a known prefix structure (region code, sequence, check). A VLM extracting "8989O12345" - with an "O" instead of zero - can be corrected by a validator that knows the field is all-digits.
import re

NIRA_ID_RE = re.compile(r"^[0-9]{10}$")


def normalize_nira_id(s: str) -> str:
    # Common OCR confusions in numeric ID fields.
    return s.replace("O", "0").replace("I", "1").replace("S", "5").strip()


def validate_nira_id(s: str) -> bool:
    return bool(NIRA_ID_RE.match(normalize_nira_id(s)))
These three layers, run after the VLM, account for the 15-point gap between Claude Sonnet 4.5 alone and our shipped stack.
Cost and latency
| System | Avg latency (warm) | Cost / 1k extractions |
|---|---|---|
| AWS Textract AnalyzeID | 1.1s | $1.00 |
| Mistral OCR | 0.9s | ~$0.30 |
| GPT-4o vision | 1.8s | ~$1.20 |
| Claude Sonnet 4.5 vision | 1.6s | ~$1.40 |
| Ogowkey production | 1.2s | included in /v1/ocr/extract |
VLM costs are coming down quarter on quarter. For any sufficiently exotic document layout, and Somali IDs qualify, the break-even point against specialised OCR vendors has already been crossed.
What we run in production
A simplified view of the Ogowkey OCR pipeline:
The split fan-out at the pre-process stage runs the VLM and the MRZ parser in parallel - there's no dependency between them - so the bottleneck is the slower of the two (the VLM, by 200–400 ms). The validators consume both outputs and produce the final decision JSON.
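The fan-out itself is simple to express. The sketch below shows the general shape with a thread pool, which suits two I/O-bound calls; it is not the Ogowkey implementation. `run_pipeline`, the callables it takes, and the `accept`/`review` decision labels are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def run_pipeline(
    image: bytes,
    run_vlm: Callable[[bytes], dict],
    parse_mrz: Callable[[bytes], dict],
    validators: Iterable[Callable[[dict, dict], list]],
) -> dict:
    """Run the VLM and the MRZ parser concurrently, then validate both outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vlm_future = pool.submit(run_vlm, image)
        mrz_future = pool.submit(parse_mrz, image)
        # Wall-clock time is bounded by the slower branch, not the sum of both.
        fields = vlm_future.result()
        mrz = mrz_future.result()

    errors = []
    for validate in validators:
        # Each validator sees both outputs and returns a list of error codes.
        errors.extend(validate(fields, mrz))

    return {
        "fields": fields,
        "mrz": mrz,
        "errors": errors,
        "decision": "accept" if not errors else "review",
    }
```

Passing the two extractors and the validators in as callables keeps the orchestration testable with stubs and makes it straightforward to swap the VLM vendor.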
Practical recommendations
If you're building OCR for Somali IDs in 2026:
- Don't write the model yourself. Use a frontier VLM, attach structured-output prompting, ship in a week.
- Always parse the MRZ in parallel. It's the single best ground-truth signal for the data on the card.
- Validate ruthlessly. Date parsers, check digits, regex on the ID number. Most VLM errors are caught by these.
- Cache nothing in the live OCR path. ID images are PII; the costs of a leak outweigh the latency win of caching.
- Run a private eval set quarterly. Frontier models change weekly; your hit rate will drift. Re-baseline.
If you'd rather not build this yourself, our identity verify endpoint bundles the OCR with face match, liveness, MRZ parsing and AML screening into one call. See the playground for a live test.
Closing
The cost of doing high-accuracy OCR on exotic documents collapsed in 2025. The remaining engineering work isn't model quality - it's the surrounding scaffolding that takes a strong VLM and turns it into a defensible verification result. Get the validators right and the rest follows.
For the document spec this all rides on, our Somali national ID verification is the canonical reference. For the NIRA integration that turns OCR'd fields into a verifiable claim, see Integrating with NIRA.