Face match accuracy across skin tones: a practical guide for fintechs

If you operate a face-match-based KYC flow and you have not specifically tested it across the demographic you actually serve, you have a problem you don't know about yet. Face-recognition accuracy is not uniform across skin tones, ages, lighting conditions or capture devices. For a fintech serving East African customers, the gap between "vendor-claimed accuracy" and "accuracy on your real users" can be the difference between a smooth onboarding and a sign-up funnel that haemorrhages legitimate users.

This guide is the practical version. We cover what the research actually shows in 2026, why the gaps exist, how to test for them, and what to do operationally when you find one.

What the research shows

The headline finding from NIST's FRVT programme (renamed FRTE in 2023) - which has run as an ongoing evaluation of commercial face-recognition algorithms since 2017 - is that modern algorithms have closed most of the demographic gap, but not all of it. Specifically:

Top-tier algorithms (the leading commercial vendors) now show single-digit percent differences in false-non-match rates across demographic groups, where five years ago the differences were 10–100×.
The gap persists most stubbornly in low-quality images: poor lighting, low resolution, motion blur. Demographic effects compound with capture quality.
False non-match - failing to match the same person across two photos, measured as the false-non-match rate (FNMR) - is the harder fairness problem in KYC, because it locks legitimate users out.
False match - wrongly matching two different people, measured as the false-match rate (FMR) - is the harder fairness problem in surveillance, because it implicates the wrong person.

For a fintech, the operational problem is almost always FNMR. A genuine customer presents their NIRA card, takes a selfie, and the system fails to confirm they're the same person - so they can't open an account. If that failure rate is 1% across your customer base but 4% specifically for darker-skinned women in low light, you are unintentionally building a discriminatory product.

Why the gap exists

Three causes, in roughly descending order of importance for current algorithms:

Training-data composition

If the algorithm was trained on a dataset where African faces are 1% of the images and European faces are 80%, the model learns to tell the latter apart better. Most commercial models are now trained on more balanced datasets than they were in 2018, but no public benchmark fully reflects East African demographics specifically. Algorithms tuned on FERET and LFW are over-fitted to a different population than yours.

Imaging conditions

This is the most under-appreciated of the three. The signal-to-noise ratio for a darker face captured by a CMOS sensor with auto-exposure tuned to a different reference is genuinely worse - there's less differentiation across the face, less detail in shadow regions, more compression artefacts in dark areas after JPEG encoding. The algorithm isn't biased per se; it's working from a noisier input.

The fix isn't entirely algorithmic. It's also capture-side: exposure compensation that targets a face-detected region of interest, higher-bit-depth capture pipelines, and explicit guidance to the user during the capture moment.
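As a rough illustration of face-targeted exposure, the sketch below estimates how many stops of compensation would bring the mean luma of a detected face region to a chosen target. The function name, the target of 118 (approximately mid-grey in 8-bit sRGB), the gamma of 2.2 and the clamp are all assumptions made for illustration, not values from any camera API. A real pipeline would tune them on its own devices and users.

```python
import math


def exposure_compensation_stops(face_luma_mean: float,
                                target_luma: float = 118.0,
                                gamma: float = 2.2,
                                max_stops: float = 2.0) -> float:
    """Estimate exposure compensation (in stops) for a detected face region.

    face_luma_mean: mean 8-bit luma (0-255) inside the face box.
    All defaults are illustrative assumptions and need tuning per device.
    """
    # Guard against log(0) on a fully black region.
    face = max(face_luma_mean, 1.0) / 255.0
    target = target_luma / 255.0
    # Undo the display gamma so the ratio is taken in approximately linear
    # light, where one stop corresponds to a doubling of exposure.
    stops = math.log2((target ** gamma) / (face ** gamma))
    # Clamp so a bad detection cannot request an extreme correction.
    return max(-max_stops, min(max_stops, stops))
```

A positive result means the face region is darker than the target and the capture should be brightened; the scene average plays no part in the calculation.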

Document portrait quality

Many KYC face-match flows compare a fresh selfie against the portrait on an ID card. The portrait was captured years earlier, in a government office, often with poor lighting, sometimes with low-quality scanning afterwards. The reference image is the weak point, not the live selfie. Algorithms designed for selfie-to-selfie comparison (e.g. social-media de-duplication) perform worse on selfie-to-document-portrait than benchmark numbers suggest.

What "accuracy" actually means

When a vendor says "99.6% accuracy," they almost always mean a specific operating point on a specific benchmark. The two numbers that actually matter for KYC are:

False-non-match rate (FNMR) at a fixed false-match rate (FMR). A typical KYC operating point is FMR = 1 in 10,000 (so the system wrongly approves 0.01% of impostor attempts) and FNMR = 1–2% (so 1–2% of genuine customers are wrongly rejected and have to retry or escalate).
Threshold setting - the similarity score above which you accept a match. Set it too high and FNMR balloons (legitimate users locked out); too low and FMR spikes (impostors get in).

Demand from your vendor the FNMR-at-fixed-FMR for your specific population, ideally with subgroup breakdowns. If they can't provide it, run the test yourself.
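To make "FNMR at a fixed FMR" concrete, here is a minimal sketch that picks the lowest threshold whose empirical FMR stays at or below a target, then reads off the FNMR at that threshold. It assumes higher scores mean more similar and that a pair is accepted when its score is at or above the threshold; the function names and the toy scores are illustrative, not any vendor's API.

```python
import math
from bisect import bisect_left
from typing import Sequence


def fmr(impostor_scores: Sequence[float], threshold: float) -> float:
    """Fraction of different-person pairs wrongly accepted."""
    return sum(s >= threshold for s in impostor_scores) / len(impostor_scores)


def fnmr(genuine_scores: Sequence[float], threshold: float) -> float:
    """Fraction of same-person pairs wrongly rejected."""
    return sum(s < threshold for s in genuine_scores) / len(genuine_scores)


def threshold_at_fmr(impostor_scores: Sequence[float], target_fmr: float) -> float:
    """Lowest threshold whose empirical FMR is <= target_fmr."""
    ranked = sorted(impostor_scores)
    # Candidate thresholds: every observed score, plus one just above the max
    # (which accepts no impostor at all).
    candidates = sorted(set(ranked)) + [math.nextafter(ranked[-1], math.inf)]
    for t in candidates:
        accepted = len(ranked) - bisect_left(ranked, t)  # impostor scores >= t
        if accepted / len(ranked) <= target_fmr:
            return t
    return candidates[-1]


# Toy scores, made up for illustration only.
impostor = [0.10, 0.20, 0.30, 0.40, 0.55]
genuine = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.45, 0.85, 0.75, 0.65]

t = threshold_at_fmr(impostor, target_fmr=0.2)
print(t, fmr(impostor, t), fnmr(genuine, t))
```

With real data the target would be 1 in 10,000 rather than the 0.2 used here, which is only workable on a five-pair toy set.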

Testing on your own population

The single most useful exercise is to benchmark your face-match service on a sample of your real or representative users. Here is a tractable test design:

Collect a sample dataset. Aim for 200–500 genuine pairs (same person, two captures). Get explicit consent for biometric processing under whatever data-protection regime applies. Include the demographic axes you care about: skin tone (use the Fitzpatrick scale, but be aware of its limitations), age bands, and gender. For Somali users, ensure adequate coverage of women in head coverings - a real-world variable that affects matching.
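Collect an impostor sample. Aim for at least 2,000–5,000 pairs of different people from the same dataset. The impostor sample has to be large because you're measuring small false-match rates: with zero false matches observed in n independent comparisons, the 95% upper bound on FMR is roughly 3/n, so 5,000 pairs can only bound FMR at about 6 in 10,000, and confirming FMR = 1 in 10,000 takes on the order of 30,000 comparisons. Cross-pairing gets you there cheaply: 500 people with a selfie and a document portrait each yield 500 × 499 = 249,500 different-person pairs, although these are not fully independent.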
Score every pair with your face-match API.
Plot the ROC curve - true-match rate against false-match rate as you sweep the threshold. Repeat per subgroup.
Compare the operating-point FNMR across subgroups. Differences greater than ~1.5× should make you uncomfortable; greater than 3× is a serious fairness problem.

This test takes about a week and tells you more about your real product than any vendor brochure ever will.
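The subgroup comparison in the last step can be sketched as follows. This is an illustrative outline, not a prescribed method: it assumes each genuine pair carries a subgroup label and a similarity score, and it attaches a 95% Wilson score interval to each FNMR, because subgroup samples of a few hundred pairs leave wide uncertainty around rates of 1–2%. The function names and the data layout are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def wilson_interval(failures: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the proportion failures / n (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def fnmr_by_subgroup(genuine: Iterable[Tuple[str, float]],
                     threshold: float) -> Dict[str, dict]:
    """genuine: (subgroup_label, similarity_score) for same-person pairs."""
    counts = defaultdict(lambda: [0, 0])  # label -> [rejections, total]
    for label, score in genuine:
        counts[label][1] += 1
        if score < threshold:
            counts[label][0] += 1
    return {
        label: {"fnmr": rej / n, "n": n, "ci95": wilson_interval(rej, n)}
        for label, (rej, n) in counts.items()
    }


def disparity_ratio(report: Dict[str, dict]) -> float:
    """Worst subgroup FNMR divided by best subgroup FNMR."""
    rates = [row["fnmr"] for row in report.values()]
    if max(rates) == 0:
        return 1.0  # nobody rejected anywhere: no measurable disparity
    return math.inf if min(rates) == 0 else max(rates) / min(rates)


# Toy data: group A rejects 1 of 100 genuine pairs, group B rejects 4 of 100.
pairs = ([("A", 0.9)] * 99 + [("A", 0.1)] +
         [("B", 0.9)] * 96 + [("B", 0.1)] * 4)
report = fnmr_by_subgroup(pairs, threshold=0.5)
print(report, disparity_ratio(report))
```

In the toy data the ratio is about 4×, but the interval for group B runs from under 2% to nearly 10%, which is a reminder to read the ratio alongside the sample size behind it.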

Operational mitigations

What do you do when you find a gap? Three levers.

Improve the capture

A surprisingly large amount of the gap comes from imaging conditions. Concrete improvements:

Pre-capture guidance. Real-time face detection, exposure feedback, blur detection. Refuse to submit images that fail quality gates and prompt the user to retry.
Targeted exposure compensation. When your face detector sees a face, override the device's auto-exposure to target the face region, not the scene average. This is dramatically more important for darker skin tones.
Multi-frame capture. Capture 3–5 frames and pick the best one. Software-side super-resolution can help here.
Active guidance. "Move into better light" is OK; "tilt your face slightly to the right" is sometimes better.
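A minimal sketch of capture-time quality gating and best-frame selection, under stated assumptions: frames are 8-bit greyscale images stored as nested lists, a face detector has already supplied one bounding box shared by all frames, and the gate limits are placeholders to be calibrated on your own captures. Variance of the Laplacian is used here as a common blur proxy; none of this is a specific SDK's behaviour.

```python
from statistics import fmean, pstdev, pvariance
from typing import List, Optional, Sequence, Tuple

Gray = List[List[int]]            # 8-bit luma image, row-major
Box = Tuple[int, int, int, int]   # x, y, width, height from a face detector


def crop(img: Gray, box: Box) -> Gray:
    x, y, w, h = box
    return [row[x:x + w] for row in img[y:y + h]]


def laplacian_variance(img: Gray) -> float:
    """Variance of a 4-neighbour Laplacian; low values suggest blur."""
    responses = [
        img[r - 1][c] + img[r + 1][c] + img[r][c - 1] + img[r][c + 1] - 4 * img[r][c]
        for r in range(1, len(img) - 1)
        for c in range(1, len(img[0]) - 1)
    ]
    return pvariance(responses) if responses else 0.0


def face_quality(img: Gray, box: Box) -> dict:
    """Brightness, contrast and sharpness measured inside the face box only."""
    face = crop(img, box)
    pixels = [p for row in face for p in row]
    return {
        "brightness": fmean(pixels),
        "contrast": pstdev(pixels),
        "sharpness": laplacian_variance(face),
    }


# Placeholder (min, max) limits: assumptions, not recommended values.
LIMITS = {"brightness": (70.0, 220.0), "contrast": (20.0, None), "sharpness": (50.0, None)}


def passes_gate(quality: dict, limits: dict = LIMITS) -> bool:
    for name, (lo, hi) in limits.items():
        if quality[name] < lo or (hi is not None and quality[name] > hi):
            return False
    return True


def best_frame(frames: Sequence[Gray], box: Box) -> Optional[int]:
    """Index of the sharpest frame passing every gate, or None to prompt a retry."""
    scored = [(face_quality(frame, box), i) for i, frame in enumerate(frames)]
    passing = [(q["sharpness"], i) for q, i in scored if passes_gate(q)]
    return max(passing)[1] if passing else None
```

Returning None rather than the least-bad frame is deliberate: it keeps unusable captures away from the matcher and gives the UI a clear signal to show guidance.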

Tune your threshold per subgroup, carefully

This is operationally tricky and ethically fraught. Adjusting your match threshold based on user demographics raises questions about disparate treatment. But operating a single threshold also produces disparate outcomes if accuracy is uneven.

A pragmatic middle ground: use a quality-aware threshold. The threshold floats on image-quality signals (face-region brightness, contrast, blur, pose), never on demographic labels. Because demographics correlate with capture quality, this can narrow the outcome gap without treating users differently by group, which makes it far easier to defend.
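One way to read "floats on quality" is to calibrate a separate threshold per image-quality bin so that every bin holds the same target FMR. The sketch below does that under assumptions of our own: a single scalar quality score per pair, hand-picked bin edges, and a fallback to the global threshold for empty bins. It is an illustration of the idea, not a description of any particular product.

```python
import math
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Tuple


def threshold_at_fmr(impostor_scores: Sequence[float], target_fmr: float) -> float:
    """Lowest threshold whose empirical FMR (score >= threshold) is <= target."""
    ranked = sorted(impostor_scores)
    candidates = sorted(set(ranked)) + [math.nextafter(ranked[-1], math.inf)]
    for t in candidates:
        accepted = len(ranked) - bisect_left(ranked, t)
        if accepted / len(ranked) <= target_fmr:
            return t
    return candidates[-1]


def calibrate(impostor: Sequence[Tuple[float, float]],
              edges: Sequence[float],
              target_fmr: float) -> List[float]:
    """impostor: (quality, score) per different-person pair.

    Returns one threshold per quality bin; len(edges) + 1 bins in total.
    """
    global_t = threshold_at_fmr([s for _, s in impostor], target_fmr)
    bins: List[List[float]] = [[] for _ in range(len(edges) + 1)]
    for quality, score in impostor:
        bins[bisect_right(edges, quality)].append(score)
    # Empty bins fall back to the global threshold (an assumption of this sketch).
    return [threshold_at_fmr(b, target_fmr) if b else global_t for b in bins]


def accept(score: float, quality: float,
           edges: Sequence[float], thresholds: Sequence[float]) -> bool:
    """Decide using the threshold of the bin this capture's quality falls into."""
    return score >= thresholds[bisect_right(edges, quality)]


# Toy calibration data, made up for illustration; real direction and size of
# the shift depend entirely on your matcher and your captures.
impostor = ([(0.2, s) for s in (0.1, 0.2, 0.3, 0.4)] +
            [(0.8, s) for s in (0.3, 0.4, 0.5, 0.6)])
edges = [0.5]
thresholds = calibrate(impostor, edges, target_fmr=0.25)
print(thresholds)
```

Note that the decision function never sees a demographic label. Small bins give noisy thresholds, so in practice each bin needs enough impostor pairs to support the target FMR on its own.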

Provide a fallback path

No biometric system is perfect, and you don't want your matching threshold to be the load-bearing piece of your customer-facing reliability. Build a fallback:

If face-match fails, allow the user to retry with better lighting / a fresh capture.
If face-match fails repeatedly, route the customer to an enhanced verification step - a video call with an agent, a re-issued document upload, an in-person visit at an agent location.
Treat the fallback as a normal, dignified part of the flow, not a failure. A 1.5% fallback rate is fine; a 1.5% rejection rate is not.
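The routing above can be sketched as a small decision function. The step names and the default of two retries are hypothetical, chosen only to show the shape of the flow.

```python
from enum import Enum


class Step(Enum):
    APPROVED = "approved"
    RETRY_CAPTURE = "retry_capture"
    ENHANCED_VERIFICATION = "enhanced_verification"


def next_step(matched: bool, failed_attempts: int, max_retries: int = 2) -> Step:
    """Decide what happens after a face-match attempt.

    failed_attempts counts failures so far, including the current one.
    There is deliberately no "rejected" outcome: repeated failure routes the
    customer to another verification path instead of ending the flow.
    """
    if matched:
        return Step.APPROVED
    if failed_attempts <= max_retries:
        return Step.RETRY_CAPTURE
    return Step.ENHANCED_VERIFICATION


# A customer who fails three times in a row is escalated, not turned away.
print([next_step(False, n).value for n in (1, 2, 3)])
```

Logging which step each session ends on gives you the fallback rate directly, per subgroup if you hold consented labels from your benchmark sample.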

What good looks like in 2026

A face-match operation tuned for East African KYC in 2026 should:

Be benchmarked on a representative sample of your real users.
Show an FNMR ratio under 1.5× between the worst- and best-performing subgroups on the demographic axes that matter.
Run capture-time quality gates that bring 95% of submitted captures into a usable range before they hit the model.
Operate a quality-aware threshold, not a fixed one.
Have a documented fallback path for legitimate users who fail the biometric.
Re-benchmark every six months as the model is updated and your user base evolves.

If your face-match is operating without these - and most KYC stacks we audit are - you're carrying invisible risk.

Ogowkey's approach

Ogowkey's face-match is benchmarked specifically on Somali demographic data. Our operating point targets FMR = 1 in 10,000 with FNMR under 2% across the demographic subgroups we test. Our capture SDK runs targeted-exposure capture and per-frame quality gating before submission, so the model sees a clean signal. And our API returns the similarity score and quality features, not just an accept/reject - so you can audit and tune.

For the broader operational picture, see our KYC in Somalia guide. For the document side of verification, verifying Somali national IDs covers the portrait you'll be matching against.

Closing

Biometric accuracy is fairness in practice. The fintech that ships a flow that locks out 5% of its real customers because the off-the-shelf vendor was tuned for a different population is not just losing customers - it is building a discriminatory product, knowingly or not. The cost of testing is small; the cost of not testing accrues silently in your sign-up funnel for years.

If you want a second pair of eyes on your face-match operating point, send a sample through the playground or [get in touch](mailto:olow304@gmail.com?subject=Ogowkey%20-%20face%20match).