
SOCIAL profile

Lossy, opt-in. Alongside 10 lossless encoding-repair steps, this profile applies 4 cleaning steps and 2 linguistic-folding steps, all of which discard information, so it is never the default (ADR-0004).

It assembles these steps, in order:

| #  | Step                  | Safety            | Lossless?                         |
|----|-----------------------|-------------------|-----------------------------------|
| 1  | NormalizeUnicode      | encoding_repair   | ✓ lossless                        |
| 2  | StripBidi             | encoding_repair   | ✓ lossless                        |
| 3  | FoldPresentationForms | encoding_repair   | ✓ lossless                        |
| 4  | RemoveTatweel         | encoding_repair   | ✓ lossless                        |
| 5  | UnifyLookalikes       | encoding_repair   | ✓ lossless                        |
| 6  | CollapseWhitespace    | encoding_repair   | ✓ lossless                        |
| 7  | NormalizeUnicode      | encoding_repair   | ✓ lossless                        |
| 8  | CleanURLs             | cleaning          | ✗ lossy: non-linguistic noise     |
| 9  | CleanMentions         | cleaning          | ✗ lossy: non-linguistic noise     |
| 10 | CleanHashtags         | cleaning          | ✗ lossy: non-linguistic noise     |
| 11 | CleanHTML             | cleaning          | ✗ lossy: non-linguistic noise     |
| 12 | RemoveTashkeel        | linguistic_folding | ✗ lossy: a linguistic distinction |
| 13 | ReduceElongation      | linguistic_folding | ✗ lossy: a linguistic distinction |
| 14 | HandleEmoji           | encoding_repair   | ✓ lossless                        |
| 15 | CollapseWhitespace    | encoding_repair   | ✓ lossless                        |
| 16 | NormalizeUnicode      | encoding_repair   | ✓ lossless                        |

See the API reference for what each step does and how to configure it.
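
To make the idea of an ordered, safety-tagged step list concrete, here is a minimal standard-library sketch of a subset of the steps listed above. It is not the library's implementation: every function name, the choice of NFC, the character ranges, the URL pattern, and the "collapse runs of three or more to two" rule are assumptions made for illustration only. The step names and safety labels are copied from the table.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical stand-ins for some of the steps in the table above.
# None of this is the library's actual code or API.
BIDI_CONTROLS = re.compile("[\u061c\u200e\u200f\u202a-\u202e\u2066-\u2069]")
TASHKEEL = re.compile("[\u064b-\u0652]")   # assumed range: fathatan..sukun
ELONGATION = re.compile(r"(\w)\1{2,}")     # a letter repeated 3+ times
URL = re.compile(r"https?://\S+")          # deliberately simplistic


def normalize_unicode(text: str) -> str:
    return unicodedata.normalize("NFC", text)  # NFC is an assumption


def strip_bidi(text: str) -> str:
    return BIDI_CONTROLS.sub("", text)


def remove_tatweel(text: str) -> str:
    return text.replace("\u0640", "")


def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


def clean_urls(text: str) -> str:
    return URL.sub(" ", text)


def remove_tashkeel(text: str) -> str:
    return TASHKEEL.sub("", text)


def reduce_elongation(text: str) -> str:
    return ELONGATION.sub(r"\1\1", text)  # keep two copies (assumed rule)


# (step name, safety class, function), in the same relative order as the table.
SOCIAL_SKETCH = [
    ("NormalizeUnicode", "encoding_repair", normalize_unicode),
    ("StripBidi", "encoding_repair", strip_bidi),
    ("RemoveTatweel", "encoding_repair", remove_tatweel),
    ("CollapseWhitespace", "encoding_repair", collapse_whitespace),
    ("CleanURLs", "cleaning", clean_urls),
    ("RemoveTashkeel", "linguistic_folding", remove_tashkeel),
    ("ReduceElongation", "linguistic_folding", reduce_elongation),
    ("CollapseWhitespace", "encoding_repair", collapse_whitespace),
    ("NormalizeUnicode", "encoding_repair", normalize_unicode),
]


def run_pipeline(text: str, steps=SOCIAL_SKETCH) -> str:
    """Apply each step in order, feeding each output into the next step."""
    for _name, _safety, step in steps:
        text = step(text)
    return text


def safety_counts(steps=SOCIAL_SKETCH) -> Counter:
    """Tally how many steps fall into each safety class."""
    return Counter(safety for _name, safety, _step in steps)


if __name__ == "__main__":
    print(run_pipeline("sooooo   cool\u200f https://example.com/a now"))
    # prints: soo cool now
```

Tagging each step with its safety class, as `safety_counts` relies on here, is one simple way a profile could report whether it is lossy before any text is processed.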