SEARCH profile
Lossy and opt-in. Alongside 9 lossless encoding-repair steps, it applies 9 lossy linguistic-folding steps that discard information, so it is never the default (ADR-0004).
It assembles these steps, in order:
| # | Step | Safety | Lossless? |
|---|---|---|---|
| 1 | NormalizeUnicode | encoding_repair | ✓ lossless |
| 2 | StripBidi | encoding_repair | ✓ lossless |
| 3 | FoldPresentationForms | encoding_repair | ✓ lossless |
| 4 | RemoveTatweel | encoding_repair | ✓ lossless |
| 5 | UnifyLookalikes | encoding_repair | ✓ lossless |
| 6 | CollapseWhitespace | encoding_repair | ✓ lossless |
| 7 | NormalizeUnicode | encoding_repair | ✓ lossless |
| 8 | FoldTanweenAlef | linguistic_folding | ✗ lossy: a linguistic distinction |
| 9 | RemoveTashkeel | linguistic_folding | ✗ lossy: a linguistic distinction |
| 10 | FoldAlef | linguistic_folding | ✗ lossy: a linguistic distinction |
| 11 | FoldHamza | linguistic_folding | ✗ lossy: a linguistic distinction |
| 12 | FoldTehMarbuta | linguistic_folding | ✗ lossy: a linguistic distinction |
| 13 | FoldAlefMaqsura | linguistic_folding | ✗ lossy: a linguistic distinction |
| 14 | MapDigits | linguistic_folding | ✗ lossy: a linguistic distinction |
| 15 | MapPunctuation | linguistic_folding | ✗ lossy: a linguistic distinction |
| 16 | ReduceElongation | linguistic_folding | ✗ lossy: a linguistic distinction |
| 17 | CollapseWhitespace | encoding_repair | ✓ lossless |
| 18 | NormalizeUnicode | encoding_repair | ✓ lossless |
See the API reference for what each step does and how to configure it.
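
To make the "ordered steps with a safety label" idea concrete, here is a minimal Python sketch of a much smaller pipeline built only on the standard library. It is not the library's API: `Step`, `SKETCH_PROFILE`, and `run` are hypothetical names, only five of the eighteen steps are imitated, and the character mappings (NFC normalization, stripping U+0640, stripping U+064B to U+0652, folding hamza-carrying alef forms to bare alef) are common conventions assumed for illustration, so the real steps may behave differently.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in for one row of the table above."""
    name: str
    safety: str  # "encoding_repair" or "linguistic_folding"
    apply: Callable[[str], str]

    @property
    def lossless(self) -> bool:
        # Mirrors the table: encoding repair is marked lossless, folding lossy.
        return self.safety == "encoding_repair"


def normalize_unicode(text: str) -> str:
    # Assumption: NFC; the real step may use a different normalization form.
    return unicodedata.normalize("NFC", text)


def remove_tatweel(text: str) -> str:
    # U+0640 ARABIC TATWEEL is a purely typographic elongation character.
    return text.replace("\u0640", "")


def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def remove_tashkeel(text: str) -> str:
    # Assumption: only the core diacritics U+064B..U+0652 are removed.
    return re.sub("[\u064B-\u0652]", "", text)


def fold_alef(text: str) -> str:
    # Assumption: alef with madda / hamza above / hamza below -> bare alef.
    return re.sub("[\u0622\u0623\u0625]", "\u0627", text)


# Same shape as the real profile: repair first, fold, then tidy up again.
SKETCH_PROFILE: Sequence[Step] = (
    Step("NormalizeUnicode", "encoding_repair", normalize_unicode),
    Step("RemoveTatweel", "encoding_repair", remove_tatweel),
    Step("CollapseWhitespace", "encoding_repair", collapse_whitespace),
    Step("RemoveTashkeel", "linguistic_folding", remove_tashkeel),
    Step("FoldAlef", "linguistic_folding", fold_alef),
    Step("CollapseWhitespace", "encoding_repair", collapse_whitespace),
    Step("NormalizeUnicode", "encoding_repair", normalize_unicode),
)


def run(text: str, steps: Sequence[Step] = SKETCH_PROFILE) -> str:
    """Apply every step in order, feeding each output into the next step."""
    for step in steps:
        text = step.apply(text)
    return text


if __name__ == "__main__":
    # Vocalized spelling with a tatweel and a hamza-carrying alef ...
    vocalized = "\u0623\u064E\u062D\u0652\u0645\u064E\u0640\u062F"
    # ... and the bare spelling a user is likely to type into a search box.
    bare = "\u0627\u062D\u0645\u062F"
    print(run(vocalized) == run(bare))  # True
```

Both spellings collapse to the same key, which is what makes this kind of folding useful for matching queries against text. It is also why the folding is lossy: the vocalized form cannot be recovered from the key, so the output suits an index or a comparison rather than a replacement for the stored text.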