Tarjumi v7 — Translation Quality
Evaluated on standard public benchmarks across four domains: religious text (Masakhane-MT), news (MAFAND-MT), health (TICO-19), and general text (FLORES-200). All scores are corpus-level, EN→target direction, computed with sacrebleu.
Standard Benchmark Results
Tarjumi v7 is NLLB-200 (distilled 600M) fine-tuned on 4.42M tier-rebalanced parallel pairs across 16 African languages, with a Levelt speech-production pipeline for beam reranking, grammar validation, and register correction.
| Benchmark | Language | Domain | N | BLEU | chrF++ |
|---|---|---|---|---|---|
| Masakhane-MT | Swahili | Religious | 2,718 | 35.9 | 56.9 |
| Masakhane-MT | Luo | Religious | 2,702 | 22.9 | 44.6 |
| Masakhane-MT | Kikuyu | Religious | 2,688 | 22.5 | 43.7 |
| Masakhane-MT | Kamba | Religious | 2,586 | 13.3 | 27.4 |
| TICO-19 | Swahili | Health | 3,071 | 23.3 | 51.9 |
| TICO-19 | Somali | Health | 3,071 | 7.6 | 30.3 |
| MAFAND-MT | Swahili | News | 1,801 | 20.0 | 48.2 |
| MAFAND-MT | Luo | News | 1,464 | 6.8 | 31.8 |
| FLORES-200 | Swahili | General | 1,012 | 21.7 | 50.5 |
| FLORES-200 | Somali | General | 1,012 | 8.3 | 37.2 |
| FLORES-200 | Luo | General | 1,012 | 5.8 | 30.2 |
| FLORES-200 | Kikuyu | General | 1,012 | 4.6 | 28.3 |
| FLORES-200 | Kamba | General | 1,012 | 2.4 | 19.2 |
Key finding: domain overfitting is the dominant problem. Swahili ranges from 35.9 BLEU (religious) to 21.7 (general) to 20.0 (news). Kikuyu drops from 22.5 (religious) to 4.6 (general) — an 18-point collapse. The model translates religious text well because that's what it was trained on. General domain capability was lost during fine-tuning, especially for low-resource languages.
Swahili retained general capability (21.7 on FLORES) because it had the most diverse training data. For Kikuyu, Luo, and Kamba, the training corpus was too narrow — almost entirely religious text. This is catastrophic forgetting: fine-tuning improved in-domain performance at the cost of the base model's general knowledge.
Reading BLEU Scores
BLEU measures n-gram overlap between machine translation and human reference. For morphologically rich African languages, chrF++ (character-level F-score) is often more informative since it handles agglutination better. Context for the numbers:
High Quality
Understandable with minor errors. Suitable for gisting, internal documents, and assisted workflows.
Usable
Gets the meaning across. Benefits from native speaker review for important content.
Developing
Captures core meaning but may miss nuance. Active improvement through training data collection.
Early Stage
Limited training data for this domain. Useful for basic understanding but needs human verification.
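To make the BLEU numbers above concrete, here is a deliberately simplified sentence-level sketch of what the metric computes: clipped n-gram precision up to 4-grams, combined by geometric mean with a brevity penalty. The reported scores use sacrebleu's corpus-level BLEU, not this illustration.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty. Illustration only; real scores
    come from corpus-level sacrebleu with 13a tokenization."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c[g], r[g]) for g in c)  # clip to reference counts
        total = max(sum(c.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor avoids log(0)
    bp = 1.0 if len(cand) >= len(ref) else exp(1 - len(ref) / len(cand))
    return 100 * bp * exp(sum(log(p) for p in precisions) / max_n)
```

A perfect match scores 100; a single substituted word already drops every n-gram order that touches it, which is why morphologically rich languages (many valid surface forms) are penalized.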
Convergent QA
Standard metrics like BLEU require human-written reference translations. Convergent QA is our reference-free quality metric: it measures whether a translation preserves meaning by checking whether a round trip back to the source language reconstructs the original.
The source is translated and then back-translated; when the reconstruction agrees with the original (Ξ → 0), meaning was preserved. When they diverge, something was lost in translation. No gold references are needed, only the source text and the translation engine.
How Ξ is Computed
Ξsemantic
Cosine distance between multilingual E5 embeddings of the source and reconstructed text. Captures whether the meaning survived, even if the words changed.
Ξlexical
1 − sentence BLEU between source and reconstructed text. Captures whether the exact words and structure survived the round-trip.
Ξcomposite
Weighted blend: 60% semantic + 40% lexical. A sentence passes when Ξcomposite < 0.30. Below this threshold, meaning is reliably preserved.
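The three definitions above reduce to simple arithmetic once the embeddings and the round-trip text are in hand. A minimal sketch (the E5 embedding and translation calls are elided; only the scoring is shown):

```python
from math import sqrt

def cosine_distance(u, v):
    """1 − cosine similarity between two embedding vectors.
    Ξ_semantic uses multilingual E5 embeddings of the source and the
    round-trip reconstruction; any same-length vectors work here."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def xi_composite(xi_sem, xi_lex, w_sem=0.6, w_lex=0.4):
    """Weighted blend: 60% semantic + 40% lexical."""
    return w_sem * xi_sem + w_lex * xi_lex

def passes(xi_sem, xi_lex, threshold=0.30):
    """A sentence passes Convergent QA when Ξ_composite < 0.30."""
    return xi_composite(xi_sem, xi_lex) < threshold
```

For example, the Masakhane Swahili row (Ξsem 0.038, Ξlex 0.473) blends to 0.212, under the 0.30 threshold, while MAFAND Luo (0.083, 0.870) blends to 0.398 and fails.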
Convergent QA Results
200 sentences per language, sampled from each benchmark. Lower Ξ means better preservation. Pass rate is the percentage of sentences whose Ξcomposite falls below the 0.30 threshold, i.e. where meaning survives the round-trip.
| Benchmark | Language | Ξsem | Ξlex | Ξcomp | Pass Rate |
|---|---|---|---|---|---|
| Masakhane-MT | Swahili | 0.038 | 0.473 | 0.212 | 63.5% |
| Masakhane-MT | Luo | 0.039 | 0.502 | 0.224 | 59.0% |
| Masakhane-MT | Kikuyu | 0.048 | 0.573 | 0.258 | 53.0% |
| Masakhane-MT | Kamba | 0.047 | 0.593 | 0.266 | 49.0% |
| MAFAND-MT | Swahili | 0.047 | 0.702 | 0.309 | 42.0% |
| MAFAND-MT | Luo | 0.083 | 0.870 | 0.398 | 8.0% |
| TICO-19 | Swahili | 0.053 | 0.764 | 0.337 | 35.0% |
| TICO-19 | Somali | 0.056 | 0.774 | 0.343 | 32.0% |
Levelt Pipeline Effect
Compared with a greedy-decoding baseline, Tarjumi v7's Levelt pipeline (beam reranking + grammar validation) consistently reduces Ξ and improves pass rates:
Ξsem −5 to −10%
The Levelt pipeline reduces semantic distance by 5–10% across all languages, meaning more of the original meaning survives the round-trip.
+1.5 to +10pp
Swahili gains the most: +10pp on TICO-19 (25% → 35%), +2pp on Masakhane. Luo gains +2–2.5pp consistently. Largest gains on out-of-domain text where the grammar kernel catches more errors.
Religious > News > Health
Swahili pass rate: 63.5% (religious) → 42% (news) → 35% (health). The domain gap visible in BLEU scores is confirmed by CQA. Training data diversity, not model size, is the binding constraint.
Key insight: semantic preservation is strong across all domains (Ξsem < 0.06 everywhere except MAFAND Luo). The round-trip gap is primarily lexical — the meaning survives but the exact words change. This is expected for morphologically rich languages where multiple valid surface forms express the same meaning.
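The reranking step of the Levelt pipeline can be sketched as a weighted combination of beam score, grammar validation, and fluency. The `Hypothesis` fields, scorer ranges, and weights below are illustrative assumptions, not the production Grammar Kernel or ArticulationLM interfaces:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    model_score: float    # length-normalized beam log-probability
    grammar_score: float  # morphosyntactic validation in [0, 1] (hypothetical)
    fluency_score: float  # ArticulationLM-style fluency in [0, 1] (hypothetical)

def rerank(hypotheses, w_model=1.0, w_grammar=0.5, w_fluency=0.3):
    """Pick the beam hypothesis with the best combined score.
    Weights here are placeholders, not the tuned production values."""
    def combined(h):
        return (w_model * h.model_score
                + w_grammar * h.grammar_score
                + w_fluency * h.fluency_score)
    return max(hypotheses, key=combined)
```

The effect this sketch captures: a hypothesis the beam ranks slightly lower can still win if it validates grammatically and reads fluently, which is where the out-of-domain pass-rate gains come from.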
How Tarjumi Compares
All systems evaluated on FLORES-200 devtest (1,012 general-domain sentences, EN→target). This is the standard comparison benchmark — same eval set, same direction, same metric.
| System | Swahili | Somali | Luo | Kikuyu | Kamba |
|---|---|---|---|---|---|
| Tarjumi v7 | 21.7 | 8.3 | 5.8 | 4.6 | 2.4 |
| Khaya (NLP Ghana) | — | — | 20 | 11 | — |
| NLLB-200 baseline | ~18 | — | 15 | 9 | — |
| Google Translate | ~15–20 | — | n/a | n/a | n/a |
Honest assessment: On general-domain text, Tarjumi v7 leads on Swahili (21.7 vs ~18 NLLB baseline) but trails on Kikuyu (4.6 vs 9 baseline, 11 Khaya) and Luo (5.8 vs 15 baseline, 20 Khaya). Fine-tuning on religious-domain data improved in-domain performance but caused catastrophic forgetting of general capability for low-resource languages.
Google Translate does not support Kikuyu, Luo, or Kamba. DeepL covers ~30 languages, none of them Kenyan. For these languages, Tarjumi still provides coverage where no commercial alternative exists — but general-domain quality needs improvement through diverse training data.
Khaya and NLLB-200 scores sourced from NLP Ghana (2023) and the State of NLP in Kenya survey (2024).
Where Tarjumi Leads
Tarjumi v7 excels on domain-specific text matching its training distribution. On religious/general text (Masakhane-MT), Kikuyu reaches 22.5 BLEU and Luo 22.9, scores that no other system has published for these languages on this domain. For use cases like translating religious materials and community documents, Tarjumi delivers strong results; health content is solid for Swahili (23.3 BLEU on TICO-19) but weaker for Somali. The gap is on general-domain text (Wikipedia, news, technical content), where more diverse training data is needed.
Methodology
Benchmarks Used
Masakhane-MT
JW300 test splits preserved by the Masakhane community after the original corpus takedown. Religious/general domain. CC-BY. Languages: Kikuyu, Luo, Kamba, Swahili + 7 others.
MAFAND-MT
News domain parallel text from BBC and VOA, curated by Masakhane. CC-BY-SA-4.0. Languages: Swahili, Luo + 13 others across Africa.
TICO-19
COVID-19 and health domain translation memories. CC0 public domain. Languages: Swahili, Somali + 7 others. Directly relevant to Tarjumi's humanitarian use case.
FLORES-200
General-domain sentences from Wikipedia, curated by Meta for NLLB evaluation. CC-BY-SA. 1,012 devtest sentences per language. The standard benchmark for cross-system comparison.
Scoring
All scores computed with sacrebleu for reproducibility. BLEU uses default tokenization (13a). chrF++ uses word order 2. Convergent QA embeddings use multilingual-e5-small (≈118M parameters, ~471 MB fp32 checkpoint, 109 languages).
Model
Tarjumi v7: NLLB-200 distilled-600M, fine-tuned bidirectionally on 4.42M tier-rebalanced parallel pairs across 16 African languages. Inference via CTranslate2 INT8 quantization. Levelt pipeline adds beam reranking (5 hypotheses) with Grammar Kernel validation (42 morphosyntactic features for Bantu and Nilotic), ArticulationLM fluency scoring, and register-aware post-editing.
Benchmark run: May 2026. Model: Tarjumi v7 (NLLB fine-tuned, Levelt pipeline). Hardware: Apple M4 Pro CPU (no GPU). All benchmark data is publicly available. Scripts: scripts/benchmark_tarjumi.py, scripts/convergent_qa.py.
BLEU is an imperfect metric for morphologically rich languages — a translation can be correct and fluent while scoring low because it uses different but valid morphological forms. chrF++ and Convergent QA provide complementary signal. We report all metrics across all domains for transparency — including results where Tarjumi underperforms.