Benchmarks

Tarjumi v7 — Translation Quality

Evaluated on standard public benchmarks across four domains: religious text (Masakhane-MT), news (MAFAND-MT), health (TICO-19), and general (FLORES-200). All scores are corpus-level, EN→target direction, computed with sacrebleu.

Standard Benchmark Results

Tarjumi v7 is NLLB-200 (distilled 600M) fine-tuned on 4.42M tier-rebalanced parallel pairs across 16 African languages, with a Levelt speech-production pipeline for beam reranking, grammar validation, and register correction.

Benchmark Language Domain N BLEU chrF++
Masakhane-MT Swahili Religious 2,718 35.9 56.9
Masakhane-MT Luo Religious 2,702 22.9 44.6
Masakhane-MT Kikuyu Religious 2,688 22.5 43.7
Masakhane-MT Kamba Religious 2,586 13.3 27.4
TICO-19 Swahili Health 3,071 23.3 51.9
TICO-19 Somali Health 3,071 7.6 30.3
MAFAND-MT Swahili News 1,801 20.0 48.2
MAFAND-MT Luo News 1,464 6.8 31.8
FLORES-200 Swahili General 1,012 21.7 50.5
FLORES-200 Somali General 1,012 8.3 37.2
FLORES-200 Luo General 1,012 5.8 30.2
FLORES-200 Kikuyu General 1,012 4.6 28.3
FLORES-200 Kamba General 1,012 2.4 19.2

Key finding: domain overfitting is the dominant problem. Swahili ranges from 35.9 BLEU (religious) to 21.7 (general) to 20.0 (news). Kikuyu drops from 22.5 (religious) to 4.6 (general) — an 18-point collapse. The model translates religious text well because that's what it was trained on. General domain capability was lost during fine-tuning, especially for low-resource languages.

Swahili retained general capability (21.7 on FLORES) because it had the most diverse training data. For Kikuyu, Luo, and Kamba, the training corpus was too narrow — almost entirely religious text. This is catastrophic forgetting: fine-tuning improved in-domain performance at the cost of the base model's general knowledge.

Reading BLEU Scores

BLEU measures n-gram overlap between a machine translation and a human reference. For morphologically rich African languages, chrF++ (character-level F-score) is often more informative since it handles agglutination better (see the toy comparison after the bands below). Context for the numbers:

30+ (High Quality): Understandable with minor errors. Suitable for gisting, internal documents, and assisted workflows.

20–30 (Usable): Gets the meaning across. Benefits from native-speaker review for important content.

10–20 (Developing): Captures core meaning but may miss nuance. Under active improvement through training-data collection.

<10 (Early Stage): Limited training data for this domain. Useful for basic understanding but needs human verification.
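
For intuition on the BLEU/chrF++ gap, here is a toy sacrebleu comparison. The Swahili pair is invented for illustration: a contracted but valid surface form scores near zero on word-level BLEU while keeping high character-level overlap.

```python
# Toy illustration of why chrF++ is kinder to agglutinative morphology.
from sacrebleu import sentence_bleu, sentence_chrf

ref = ["ninakupenda"]   # Swahili "I love you", full subject prefix
hyp = "nakupenda"       # contracted but valid surface form

print(sentence_bleu(hyp, ref).score)                # ~0: no exact word match
print(sentence_chrf(hyp, ref, word_order=2).score)  # high: characters overlap
```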

Convergent QA

Standard metrics like BLEU require human-written reference translations. Convergent QA is our reference-free quality metric: it measures whether a translation preserves meaning by checking whether that meaning survives a round trip back to the source language.

Source (EN) → translate EN→X → back-translate X→EN → Ξ = distance(source, reconstructed)

The round trip reconstructs the original from two independent translation passes. When source and reconstruction agree (Ξ → 0), meaning was preserved; when they diverge, something was lost in translation. No gold references are needed, just the source text and the translation engine.
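
A minimal driver for the round trip, with translate as a hypothetical stand-in for the engine's translation call (not the actual Tarjumi API; the language codes follow NLLB's convention):

```python
# `translate` is a hypothetical callable: translate(text, src=..., tgt=...) -> str.
def round_trip(source_en: str, lang: str, translate) -> str:
    forward = translate(source_en, src="eng_Latn", tgt=lang)  # EN -> X
    return translate(forward, src=lang, tgt="eng_Latn")       # X -> EN
```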

How Ξ is Computed

Ξsemantic: Cosine distance between multilingual E5 embeddings of the source and reconstructed text. Captures whether the meaning survived, even if the words changed.

Ξlexical: 1 − sentence BLEU between the source and reconstructed text. Captures whether the exact words and structure survived the round-trip.

Ξcomposite: Weighted blend of 60% semantic + 40% lexical. A sentence passes when Ξcomposite < 0.30; below this threshold, meaning is reliably preserved.
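
A sketch of the scoring under stated assumptions: sentence-transformers for the E5 embeddings, sacrebleu's sentence BLEU rescaled from 0–100 to 0–1, and E5's "query: " prefix convention. The weights and the 0.30 threshold come from the definitions above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sacrebleu import sentence_bleu

model = SentenceTransformer("intfloat/multilingual-e5-small")

def xi_composite(source: str, reconstructed: str) -> float:
    # E5 models expect a "query: " prefix for short texts (usage convention).
    emb = model.encode([f"query: {source}", f"query: {reconstructed}"],
                       normalize_embeddings=True)
    xi_sem = 1.0 - float(np.dot(emb[0], emb[1]))        # cosine distance
    xi_lex = 1.0 - sentence_bleu(reconstructed, [source]).score / 100.0
    return 0.6 * xi_sem + 0.4 * xi_lex

# Identical round trip -> distance ~0, comfortably under the 0.30 pass threshold.
print(xi_composite("Wash your hands often.", "Wash your hands often.") < 0.30)
```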

Convergent QA Results

200 sentences per language, sampled from each benchmark. Lower Ξ = better preservation. Pass rate = percentage of sentences where meaning survives the round-trip.

Benchmark Language Ξsem Ξlex Ξcomp Pass Rate
Masakhane-MT Swahili 0.038 0.473 0.212 63.5%
Masakhane-MT Luo 0.039 0.502 0.224 59.0%
Masakhane-MT Kikuyu 0.048 0.573 0.258 53.0%
Masakhane-MT Kamba 0.047 0.593 0.266 49.0%
MAFAND-MT Swahili 0.047 0.702 0.309 42.0%
MAFAND-MT Luo 0.083 0.870 0.398 8.0%
TICO-19 Swahili 0.053 0.764 0.337 35.0%
TICO-19 Somali 0.056 0.774 0.343 32.0%

Levelt Pipeline Effect

Comparing the baseline (greedy decoding) with Tarjumi v7 (Levelt pipeline with beam reranking + grammar validation) shows the Levelt pipeline consistently reduces Ξ and improves pass rates:

Ξsem: −5 to −10%. The Levelt pipeline reduces semantic distance by 5–10% across all languages, meaning more of the original meaning survives the round-trip.

Pass rate: +1.5 to +10pp. Swahili gains the most: +10pp on TICO-19 (25% → 35%), +2pp on Masakhane-MT. Luo gains +2–2.5pp consistently. Gains are largest on out-of-domain text, where the grammar kernel catches more errors.

Domain gap: religious > news > health. Swahili pass rate: 63.5% (religious) → 42% (news) → 35% (health). The domain gap visible in the BLEU scores is confirmed by CQA. Training-data diversity, not model size, is the binding constraint.

Key insight: semantic preservation is strong across all domains (Ξsem < 0.06 everywhere except MAFAND-MT Luo). The round-trip gap is primarily lexical: the meaning survives but the exact words change. This is expected for morphologically rich languages, where multiple valid surface forms express the same meaning.

How Tarjumi Compares

All systems evaluated on FLORES-200 devtest (1,012 general-domain sentences, EN→target). This is the standard comparison benchmark — same eval set, same direction, same metric.

System Swahili Somali Luo Kikuyu Kamba
Tarjumi v7 21.7 8.3 5.8 4.6 2.4
Khaya (NLP Ghana) — — 20 11 —
NLLB-200 baseline ~18 — 15 9 —
Google Translate ~15–20 — n/a n/a n/a
(— = no published score; n/a = language not supported.)

Honest assessment: On general-domain text, Tarjumi v7 leads on Swahili (21.7 vs ~18 NLLB baseline) but trails on Kikuyu (4.6 vs 9 baseline, 11 Khaya) and Luo (5.8 vs 15 baseline, 20 Khaya). Fine-tuning on religious-domain data improved in-domain performance but caused catastrophic forgetting of general capability for low-resource languages.

Google Translate does not support Kikuyu, Luo, or Kamba. DeepL covers ~30 languages, none of them Kenyan. For these languages, Tarjumi still provides coverage where no commercial alternative exists — but general-domain quality needs improvement through diverse training data.

Khaya and NLLB-200 scores sourced from NLP Ghana (2023) and the State of NLP in Kenya survey (2024).

Where Tarjumi Leads

Tarjumi v7 excels on domain-specific text matching its training distribution. On religious/general text (Masakhane-MT), Kikuyu reaches 22.5 BLEU and Luo 22.9, scores no other system has published for these languages in this domain. For use cases like translating health guides, religious materials, and community documents, Tarjumi delivers strong results. The gap is on general-domain text (Wikipedia, news, technical content), where diverse training data is needed.

Methodology

Benchmarks Used

Masakhane-MT

JW300 test splits preserved by the Masakhane community after the original corpus takedown. Religious/general domain. CC-BY. Languages: Kikuyu, Luo, Kamba, Swahili + 7 others.

MAFAND-MT

News domain parallel text from BBC and VOA, curated by Masakhane. CC-BY-SA-4.0. Languages: Swahili, Luo + 13 others across Africa.

TICO-19

COVID-19 and health domain translation memories. CC0 public domain. Languages: Swahili, Somali + 7 others. Directly relevant to Tarjumi's humanitarian use case.

FLORES-200

General-domain sentences from Wikipedia, curated by Meta for NLLB evaluation. CC-BY-SA. 1,012 devtest sentences per language. The standard benchmark for cross-system comparison.

Scoring

All scores computed with sacrebleu for reproducibility. BLEU uses default tokenization (13a). chrF++ uses word order 2. Convergent QA embeddings use multilingual-e5-small (118M parameters, ~471MB at fp32; 109 languages).
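
A minimal sketch of that scoring setup (the sentence pair is illustrative; the real evaluation lives in scripts/benchmark_tarjumi.py):

```python
import sacrebleu

hyps = ["Watoto wanasoma shuleni."]           # system translations
refs = [["Watoto wanasoma shuleni leo."]]     # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)                 # default 13a tokenizer
chrf = sacrebleu.corpus_chrf(hyps, refs, word_order=2)   # chrF++ (word_order=2)
print(f"BLEU {bleu.score:.1f}  chrF++ {chrf.score:.1f}")
```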

Model

Tarjumi v7: NLLB-200 distilled-600M, fine-tuned bidirectionally on 4.42M tier-rebalanced parallel pairs across 16 African languages. Inference via CTranslate2 INT8 quantization. Levelt pipeline adds beam reranking (5 hypotheses) with Grammar Kernel validation (42 morphosyntactic features for Bantu and Nilotic), ArticulationLM fluency scoring, and register-aware post-editing.
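
A loading-and-decoding sketch under stated assumptions: the tarjumi-v7-ct2 model path and the grammar/fluency scoring functions are hypothetical, while the tokenizer recipe follows CTranslate2's standard NLLB usage and the beam size comes from the text above.

```python
import ctranslate2
import transformers

# Load the INT8 CTranslate2 export (path hypothetical) and the NLLB tokenizer.
translator = ctranslate2.Translator("tarjumi-v7-ct2", device="cpu",
                                    compute_type="int8")
tok = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn")

# Generate the 5 beam hypotheses that the Levelt pipeline reranks.
source = tok.convert_ids_to_tokens(tok.encode("Vaccines are safe."))
results = translator.translate_batch(
    [source],
    target_prefix=[["swh_Latn"]],   # NLLB language code for Swahili
    beam_size=5,
    num_hypotheses=5,
)
candidates = [tok.decode(tok.convert_tokens_to_ids(h[1:]))  # drop lang token
              for h in results[0].hypotheses]

# Hypothetical reranking step: grammar_score / fluency_score stand in for the
# Grammar Kernel and ArticulationLM; their real interfaces are not shown here.
def rerank(cands, grammar_score, fluency_score, w=0.5):
    return max(cands, key=lambda c: w * grammar_score(c)
                                    + (1 - w) * fluency_score(c))
```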

Benchmark run: May 2026. Model: Tarjumi v7 (NLLB fine-tuned, Levelt pipeline). Hardware: Apple M4 Pro CPU (no GPU). All benchmark data is publicly available. Scripts: scripts/benchmark_tarjumi.py, scripts/convergent_qa.py.

BLEU is an imperfect metric for morphologically rich languages — a translation can be correct and fluent while scoring low because it uses different but valid morphological forms. chrF++ and Convergent QA provide complementary signal. We report all metrics across all domains for transparency — including results where Tarjumi underperforms.