SOCIAL profile
Lossy, opt-in. Alongside 10 lossless encoding-repair steps, it applies 4 cleaning steps and 2 linguistic-folding steps, all of which discard information, so it is never the default (ADR-0004).
It assembles these steps, in order:
| # | Step | Safety | Lossless? |
|---|---|---|---|
| 1 | NormalizeUnicode | encoding_repair | ✓ lossless |
| 2 | StripBidi | encoding_repair | ✓ lossless |
| 3 | FoldPresentationForms | encoding_repair | ✓ lossless |
| 4 | RemoveTatweel | encoding_repair | ✓ lossless |
| 5 | UnifyLookalikes | encoding_repair | ✓ lossless |
| 6 | CollapseWhitespace | encoding_repair | ✓ lossless |
| 7 | NormalizeUnicode | encoding_repair | ✓ lossless |
| 8 | CleanURLs | cleaning | ✗ lossy: non-linguistic noise |
| 9 | CleanMentions | cleaning | ✗ lossy: non-linguistic noise |
| 10 | CleanHashtags | cleaning | ✗ lossy: non-linguistic noise |
| 11 | CleanHTML | cleaning | ✗ lossy: non-linguistic noise |
| 12 | RemoveTashkeel | linguistic_folding | ✗ lossy: a linguistic distinction |
| 13 | ReduceElongation | linguistic_folding | ✗ lossy: a linguistic distinction |
| 14 | HandleEmoji | encoding_repair | ✓ lossless |
| 15 | CollapseWhitespace | encoding_repair | ✓ lossless |
| 16 | NormalizeUnicode | encoding_repair | ✓ lossless |
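
The step order and safety classes listed above can be mirrored as plain data to check the counts quoted in the introduction. This is only an illustrative sketch: `Step`, `SOCIAL_STEPS`, `is_lossless`, and `profile_is_lossy` are hypothetical names invented for the example, not the library's API, and the rule that only `encoding_repair` counts as lossless is read off the table rather than taken from the implementation.

```python
from collections import Counter
from typing import NamedTuple


class Step(NamedTuple):
    """Hypothetical record for one table row (not a library class)."""
    name: str
    safety: str  # "encoding_repair", "cleaning", or "linguistic_folding"


# Plain-data copy of the table, in the documented order.
SOCIAL_STEPS = [
    Step("NormalizeUnicode", "encoding_repair"),
    Step("StripBidi", "encoding_repair"),
    Step("FoldPresentationForms", "encoding_repair"),
    Step("RemoveTatweel", "encoding_repair"),
    Step("UnifyLookalikes", "encoding_repair"),
    Step("CollapseWhitespace", "encoding_repair"),
    Step("NormalizeUnicode", "encoding_repair"),
    Step("CleanURLs", "cleaning"),
    Step("CleanMentions", "cleaning"),
    Step("CleanHashtags", "cleaning"),
    Step("CleanHTML", "cleaning"),
    Step("RemoveTashkeel", "linguistic_folding"),
    Step("ReduceElongation", "linguistic_folding"),
    Step("HandleEmoji", "encoding_repair"),
    Step("CollapseWhitespace", "encoding_repair"),
    Step("NormalizeUnicode", "encoding_repair"),
]

# Assumption taken from the table: only encoding_repair rows are marked lossless.
LOSSLESS_CLASSES = {"encoding_repair"}


def is_lossless(step: Step) -> bool:
    """Return True when the step's safety class is marked lossless in the table."""
    return step.safety in LOSSLESS_CLASSES


def profile_is_lossy(steps: list[Step]) -> bool:
    """A profile is lossy as soon as it contains one lossy step."""
    return any(not is_lossless(step) for step in steps)


counts = Counter(step.safety for step in SOCIAL_STEPS)

if __name__ == "__main__":
    print(len(SOCIAL_STEPS))             # 16
    print(counts["encoding_repair"])     # 10
    print(counts["cleaning"])            # 4
    print(counts["linguistic_folding"])  # 2
    print(profile_is_lossy(SOCIAL_STEPS))  # True
```

A single lossy step is enough to make the whole profile lossy, which is why the tally treats the profile as opt-in even though most of its steps are lossless.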
See the API reference for what each step does and how to configure it.