ML profile

Lossy, opt-in. Alongside its 9 lossless encoding-repair steps, this profile applies 2 lossy linguistic-folding steps that discard information, so it is never the default (ADR-0004).

It assembles these steps, in order:

| #  | Step                  | Safety             | Lossless?                                  |
|----|-----------------------|--------------------|--------------------------------------------|
| 1  | NormalizeUnicode      | encoding_repair    | ✓ lossless                                 |
| 2  | StripBidi             | encoding_repair    | ✓ lossless                                 |
| 3  | FoldPresentationForms | encoding_repair    | ✓ lossless                                 |
| 4  | RemoveTatweel         | encoding_repair    | ✓ lossless                                 |
| 5  | UnifyLookalikes       | encoding_repair    | ✓ lossless                                 |
| 6  | CollapseWhitespace    | encoding_repair    | ✓ lossless                                 |
| 7  | NormalizeUnicode      | encoding_repair    | ✓ lossless                                 |
| 8  | RemoveTashkeel        | linguistic_folding | ✗ lossy: discards a linguistic distinction |
| 9  | ReduceElongation      | linguistic_folding | ✗ lossy: discards a linguistic distinction |
| 10 | CollapseWhitespace    | encoding_repair    | ✓ lossless                                 |
| 11 | NormalizeUnicode      | encoding_repair    | ✓ lossless                                 |

See the API reference for what each step does and how to configure it.
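
To illustrate why the two `linguistic_folding` steps count as lossy, the sketch below shows distinct inputs collapsing to the same output, so the original cannot be recovered. The functions are simplified stand-ins, not the library's implementations: it is assumed here that tashkeel covers only U+064B to U+0652 and that elongation is reduced by capping any run of a repeated character at two. The real steps may use different ranges and rules.

```python
import re

# Assumption: tashkeel approximated as the harakat range U+064B..U+0652.
_TASHKEEL = re.compile("[\u064B-\u0652]")

# Assumption: elongation means 3 or more repeats of one character, capped at 2.
_ELONGATION = re.compile(r"(.)\1{2,}")


def remove_tashkeel(text: str) -> str:
    """Illustrative stand-in for RemoveTashkeel: strip short-vowel marks."""
    return _TASHKEEL.sub("", text)


def reduce_elongation(text: str) -> str:
    """Illustrative stand-in for ReduceElongation: cap repeated runs at two."""
    return _ELONGATION.sub(r"\1\1", text)


# Two different vocalisations of the same consonant skeleton (ain, lam, meem):
# the first carries a fatha on the ain, the second a damma.
active = "\u0639\u064E\u0644\u0650\u0645\u064E"
passive = "\u0639\u064F\u0644\u0650\u0645\u064E"

# The same word typed with four and with six repeated yeh letters.
stretched_4 = "\u062C\u0645" + "\u064A" * 4 + "\u0644"
stretched_6 = "\u062C\u0645" + "\u064A" * 6 + "\u0644"

# Distinct inputs map to one output, so the step cannot be undone.
print(active != passive, remove_tashkeel(active) == remove_tashkeel(passive))
print(stretched_4 != stretched_6,
      reduce_elongation(stretched_4) == reduce_elongation(stretched_6))
```

Under these assumptions each pair differs before folding and is identical after it, which is the sense in which a linguistic distinction is discarded.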