API reference¶
The whole public surface: the one-call normalize facade, the Pipeline it assembles, the
configuration boundary, and every Step. Each Arabic term below is glossed to English on hover
(ADR-0007); searching the English name (e.g. diacritics) finds the Arabic-primary step
(RemoveTashkeel). For the araclean shell command, see the CLI reference.
The normalize facade¶
normalize ¶
normalize(
    text: str,
    *,
    profile: str | Profile | None = None,
    config: NormalizeConfig | None = None,
    **overrides: object,
) -> str
Normalize Arabic text with a named profile (default LIGHT — lossless encoding repair).
profile=None applies LIGHT. Pass profile="search" (etc.) for a named preset, a Profile
object for a fully custom pipeline, or per-knob **overrides to tune a named profile:
normalize(text, profile="ml", map_digits=True) folds digits, and
normalize(text, profile="social", emoji="strip") drops emoji. Overrides are validated
against NormalizeConfig, so an unknown knob or a bad value is rejected here. A prebuilt
config=NormalizeConfig(...) may be passed instead, but not together with profile or overrides.
Pipelines & profiles¶
Pipeline ¶
An ordered, serializable sequence of Steps, callable like a single str -> str.
steps
property
¶
The ordered steps, for inspection (e.g. the safety-class audit, story 41).
batch ¶
Normalize each text lazily — a streaming generator, so a corpus larger than memory (or an unbounded stream) processes without materializing the input (story 13).
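A minimal sketch of the lazy behavior described above, assuming nothing about araclean's internals: the function name and the stand-in normalizer (str.strip) are illustrative only. Because the result is a generator, only the items a caller actually consumes are processed, which is what lets an unbounded stream work.

```python
from itertools import count, islice


def batch(normalize, texts):
    """Yield each normalized text in turn; the input is never materialized."""
    for text in texts:
        yield normalize(text)


# An unbounded input stream (hypothetical data): nothing runs until consumed.
stream = (f"  line {i}  " for i in count())
first_three = list(islice(batch(str.strip, stream), 3))
```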
select ¶
Build a NEW pipeline holding exactly the named steps, in the order given (story 16).
One primitive covers both adapting operations: name a subset to filter, or name every
step in a different order to reorder. Steps are addressed by _step_name (the registry
name, or the class name for a custom step). This pipeline is left unchanged. Raises
KeyError for an unknown name, or if a name matches more than one step with differing
configs (genuinely ambiguous). EQUAL duplicates are interchangeable, not ambiguous —
every profile runs its NormalizeUnicode NFC bookends as identical value objects, so
naming "NormalizeUnicode" (once, or once per copy you want) just works; only a name
whose duplicates differ (SEARCH's two differently-configured CollapseWhitespace) is
rejected, because a name cannot say which one you meant.
drop ¶
Build a NEW pipeline WITHOUT every step matching each name (the subtractive adapter).
The common profile adaptation — "SEARCH minus MapDigits" — is subtraction, which
select cannot express on a built-in profile without re-naming every kept step. drop
removes ALL steps carrying a name (removing every match is well-defined, so duplicates
need no disambiguation) and keeps the rest in order. This pipeline is left unchanged.
Raises KeyError for a name no step carries, so a typo is never a silent no-op.
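The select and drop rules above can be sketched over plain (name, config) tuples. This is an illustration under stated assumptions, not the library's implementation: steps are modeled as tuples, and the step names and configs in the sample pipeline are hypothetical.

```python
def select(steps, *names):
    """Return a new list holding exactly the named steps, in the order given."""
    picked = []
    for name in names:
        matches = [s for s in steps if s[0] == name]
        if not matches:
            raise KeyError(f"unknown step: {name}")
        # Equal duplicates are interchangeable; differing ones are ambiguous.
        if any(m != matches[0] for m in matches):
            raise KeyError(f"ambiguous step: {name}")
        picked.append(matches[0])
    return picked


def drop(steps, *names):
    """Return a new list without any step carrying one of the names."""
    present = {s[0] for s in steps}
    for name in names:
        if name not in present:
            raise KeyError(f"no step named: {name}")  # a typo is never silent
    return [s for s in steps if s[0] not in names]


# Hypothetical pipeline: equal NFC bookends, two differing CollapseWhitespace.
pipeline = [
    ("NormalizeUnicode", ("NFC",)),
    ("CollapseWhitespace", (False,)),
    ("MapDigits", ("ascii",)),
    ("CollapseWhitespace", (True,)),
    ("NormalizeUnicode", ("NFC",)),
]
```

Both helpers build a new list, so the input pipeline is left unchanged, mirroring the contract stated above.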
audit ¶
Audit this pipeline's safety: is it lossless, and if not, what it loses (story 41).
Pure in-process computation: it reads each step's declared safety (fixed at construction)
and buckets the step names by class, so an auditor can verify a pipeline is lossless or
enumerate exactly the lossy steps it carries. The buckets preserve pipeline order.
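The bucketing described above might look like the following sketch. The enum values, the Report class and its field layout are assumptions made for illustration; only the three safety-class names come from this reference.

```python
from dataclasses import dataclass
from enum import Enum


class SafetyClass(Enum):
    ENCODING_REPAIR = "encoding_repair"        # values are assumed
    LINGUISTIC_FOLDING = "linguistic_folding"
    CLEANING = "cleaning"


@dataclass(frozen=True)
class Report:  # hypothetical stand-in for SafetyReport
    linguistic_folding: tuple
    cleaning: tuple

    @property
    def lossless(self) -> bool:
        # Lossless iff no step belongs to either lossy class.
        return not self.linguistic_folding and not self.cleaning


def audit(steps):
    """steps: ordered (name, SafetyClass) pairs; buckets keep pipeline order."""
    buckets = {cls: [] for cls in SafetyClass}
    for name, cls in steps:
        buckets[cls].append(name)
    return Report(
        tuple(buckets[SafetyClass.LINGUISTIC_FOLDING]),
        tuple(buckets[SafetyClass.CLEANING]),
    )
```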
apply_aligned ¶
Normalize text while tracking how every position maps back to the original.
Returns (normalized, offset_map) where offset_map.to_original(span) projects
any span in the normalized string back to the corresponding span in text.
Raises AlignmentNotSupportedError for any custom step that does not implement
apply_aligned() — the error names the offending step so the caller can add the hook.
to_dict ¶
Serialize to a plain, JSON-friendly dict; raises if a step can't serialize itself.
from_dict
classmethod
¶
Rehydrate a pipeline from to_dict() output via the step registry.
from_profile
classmethod
¶
Build a pipeline from a named profile (e.g. "light") or a Profile object.
Profile ¶
Bases: BaseModel
A named preset: an ordered list of step specs that assemble into a Pipeline.
Offset maps¶
Returned by Pipeline.apply_aligned / a step's apply_aligned — see
Offset-preserving normalization.
OffsetMap ¶
Alignment from normalized-text positions back to original-text positions.
Internally stores a flat list[int] of length 2 * len(normalized): entries
[2*i, 2*i+1] are (orig_start, orig_end) for normalized character i.
identity
staticmethod
¶
Build identity map: normalized char i maps to original [i, i+1).
from_translate
staticmethod
¶
Build offset map from original string + its str.translate table.
Handles the three operation kinds, plus the identity case:
- None value → 1→0 deletion (no normalized chars emitted)
- int value → 1→1 replacement (one normalized char)
- str value → 1→N expansion (each result char maps back to the one original char)
- no entry → identity (one normalized char = one original char)
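A minimal sketch of how the cases listed above translate into offset entries, using the flat [orig_start, orig_end] layout described for OffsetMap. The function name and return shape are illustrative assumptions, not the library's signature.

```python
def offsets_from_translate(original, table):
    """Apply a str.translate-style table while recording offsets.

    Returns (normalized, flat) where flat[2*i:2*i+2] is the original span
    of normalized character i.
    """
    out, flat = [], []
    for i, ch in enumerate(original):
        mapped = table.get(ord(ch), ch)   # no entry: identity
        if mapped is None:                # deletion: emits nothing
            continue
        if isinstance(mapped, int):       # 1-to-1 replacement by code point
            mapped = chr(mapped)
        for c in mapped:                  # 1-to-N: each char maps to [i, i+1)
            out.append(c)
            flat += (i, i + 1)
    return "".join(out), flat


table = {ord("a"): ord("A"), 0x0640: None, ord("b"): "BB"}
normalized, flat = offsets_from_translate("a\u0640b", table)
```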
from_regex_sub
staticmethod
¶
from_regex_sub(
    original: str,
    match_spans: Sequence[tuple[int, int]],
    replacement_lens: Sequence[int],
) -> OffsetMap
Build offset map from match spans collected during a regex substitution.
Each match at original[start:end] was replaced by replacement_len normalized
chars that all map back to [start, end). Non-matched chars are identity.
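The rule above (replacement characters map to the whole match, everything else is identity) can be sketched as follows. The helper is a hypothetical stand-in that returns the flat list directly rather than an OffsetMap.

```python
import re


def offsets_from_regex_sub(original, match_spans, replacement_lens):
    """Build the flat offset list for a substitution with known match spans."""
    flat, pos = [], 0
    for (start, end), n in zip(match_spans, replacement_lens):
        for i in range(pos, start):      # unmatched chars: identity
            flat += (i, i + 1)
        flat += (start, end) * n         # n replacement chars -> [start, end)
        pos = end
    for i in range(pos, len(original)):  # tail after the last match
        flat += (i, i + 1)
    return flat


original = "a   b"
pattern = re.compile(r" +")
spans = [m.span() for m in pattern.finditer(original)]
normalized = pattern.sub(" ", original)
flat = offsets_from_regex_sub(original, spans, [1] * len(spans))
```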
compose ¶
Chain two maps: self maps norm1→orig; other maps norm2→norm1.
Returns a new map that maps norm2→orig, so a caller accumulates after each step:
running = running.compose(step_map) starting from the first step's map.
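A sketch of the composition over the flat layout, under the assumption that each map is a plain list of (start, end) pairs. The zero-width branch is an illustrative choice for the degenerate case, not documented behavior.

```python
def compose(first, second):
    """first maps norm1 -> orig; second maps norm2 -> norm1.

    Returns the flat map norm2 -> orig.
    """
    out = []
    for j in range(0, len(second), 2):
        s, e = second[j], second[j + 1]
        if s < e:
            # Union of the original spans of norm1 chars s .. e-1.
            out += (first[2 * s], first[2 * (e - 1) + 1])
        else:
            # Zero-width span in norm1: project to a point (assumed policy).
            point = first[2 * s] if 2 * s < len(first) else (first[-1] if first else 0)
            out += (point, point)
    return out


# orig "a\u0640  b": step 1 deletes the tatweel, step 2 collapses the spaces.
step1 = [0, 1, 2, 3, 3, 4, 4, 5]   # "a  b" -> orig
step2 = [0, 1, 1, 3, 3, 4]         # "a b"  -> "a  b"
running = compose(step1, step2)
```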
to_original ¶
Map half-open normalized span [start, end) to original span.
For an empty span (start == end) returns a zero-width point in the original.
to_normalized ¶
Map half-open original span [start, end) to normalized span (best-effort).
For a span that was entirely deleted, returns a zero-width insertion point at the position in normalized text just before where the deleted content appeared.
Configuration¶
The validated override surface behind normalize(..., profile=…, **overrides) — see
Tuning profiles for the task-oriented view.
NormalizeConfig ¶
Bases: BaseModel
A validated normalize call: a profile plus optional per-knob overrides (stories 39/40).
Frozen and extra="forbid", so a typo'd knob (map_digit= for map_digits=) or an unknown
option value fails loudly at construction instead of silently doing nothing — a reproducibility
footgun the trust boundary exists to close. Each override is None by default, meaning "use the
profile's own default for that step"; setting one rewrites exactly that step when resolve()
assembles the effective Profile.
The override surface is the profile name plus per-knob scalars (the shape issue 0016 fixes):
map_digits is ML's optional digit fold (story 6) and is valid ONLY with ML — it appends a
lossy step, which on any other profile would silently break that profile's contract (LIGHT/
CLASSICAL are lossless; SEARCH already folds digits). remove_stopwords is SEARCH's optional
stopword removal — valid only there, because the folded list requires exactly SEARCH's letter
folds before it (the RemoveStopwords ordering contract). The remaining knobs patch a step the
profile carries; an override that names a step the chosen profile does not contain is rejected
by resolve(), so it can never be a silent no-op.
resolve ¶
Assemble the effective Profile: the named preset with this config's overrides applied.
Pure construction (no per-string work): it patches the matching step specs, appends ML's
optional MapDigits and inserts SEARCH's optional RemoveStopwords. Raises ValueError
if an override names a step the profile lacks (e.g. emoji= on LIGHT) or is used with a
profile that does not own it (map_digits outside ML, remove_stopwords outside SEARCH), so
an override is never a silent no-op.
ProfileName ¶
Bases: StrEnum
The closed set of named profiles normalize accepts (story 39).
A StrEnum so an unknown profile name is rejected at the config boundary with a clear error,
rather than only when the pipeline is assembled.
Safety classes (the lossless / lossy split)¶
SafetyClass ¶
Bases: StrEnum
What kind of information a Step may discard.
ENCODING_REPAIR is lossless and default-on (the LIGHT profile); the other two are lossy
and opt-in, so only an all-ENCODING_REPAIR pipeline is lossless. The two lossy classes name
what is discarded so the audit (story 41) can report it precisely: LINGUISTIC_FOLDING
discards a linguistic distinction within the Arabic text (dediacritization, alef/hamza/teh-
marbuta/maqsura folding, digit/punctuation mapping); CLEANING removes non-linguistic noise
around it (URLs, mentions, HTML, emoji). The two are siblings, not synonyms — Cleaning is a
distinct concern from Normalization (CONTEXT.md), so a URL strip is not "linguistic folding".
See ADR-0011.
SafetyReport
dataclass
¶
The safety-class audit of a Pipeline: is it lossless, and if not, what does it lose?
Story 41 / ADR-0004. Each field lists the names of the steps in that safety class, in pipeline
order, so the report does not merely say that a pipeline is lossy but enumerates which steps
lose information and of what kind — linguistic_folding (a distinction within the Arabic
text) vs cleaning (non-linguistic noise removal). A pipeline is lossless iff it carries no
step of either lossy class, i.e. every step is ENCODING_REPAIR.
Steps¶
Every step is a pure str -> str transform that precomputes its table or regex at construction and
declares its safety class. Steps are grouped here by what they do.
Each step class also exists as a bare function with the same options as keyword arguments
(RemoveTatweel() ↔ remove_tatweel(s), FoldTehMarbuta(target="teh") ↔
fold_teh_marbuta(s, target="teh")) for one-off, validation-free use — Layer 1 of the API
(see Architecture).
Encoding repair (lossless)¶
NormalizeUnicode
dataclass
¶
Compose to a Unicode normalization form (default NFC) — lossless encoding repair.
English: Unicode normalization. Composing to NFC is the canonical first step so visually identical text compares equal regardless of how it was encoded.
StripBidi
dataclass
¶
Remove bidi controls, zero-width characters and the BOM — lossless encoding repair.
English: bidi/zero-width stripping. RLM/LRM/ALM and the embedding/isolate controls, the zero-width non-joiner/space/word-joiner, and the BOM are invisible: they carry no Arabic letter content yet break equality and tokenization, so they are deleted outright.
The zero-width JOINER U+200D is the one CONTEXTUAL case: inside an emoji sequence (👨👩👧,
👨⚕️) the joiner is content — deleting it would split the sequence into its component emoji
(and alter what a later HandleEmoji sees), so a ZWJ flanked by emoji is KEPT and every other
ZWJ is stripped. Residual: a joiner between an emoji and an Arabic letter still goes. That one
rule is a regex pass, so unlike the other LIGHT repairs this step is contextual and stays its
own pass — it does not join the 0018 fused-translate engine (ADR-0006).
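The contextual joiner rule can be sketched with a pair of lookarounds. The emoji character class below is a rough approximation chosen for illustration (an assumption, not the set the library uses), and the function name is hypothetical.

```python
import re

# Approximate "emoji-ish" class: main pictograph blocks, misc symbols and
# dingbats, plus the VS16 selector. An assumption, not a full emoji set.
_EMOJI = "\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F"

# Strip a ZWJ unless an emoji-ish character sits on BOTH sides of it.
_LONE_ZWJ = re.compile(
    "(?<![" + _EMOJI + "])\u200d|\u200d(?![" + _EMOJI + "])"
)


def strip_lone_zwj(text):
    return _LONE_ZWJ.sub("", text)


family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man ZWJ woman ZWJ girl
```

A joiner between an emoji and an Arabic letter fails the "both sides" test, so it is stripped, matching the residual noted above.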
FoldPresentationForms
dataclass
¶
Fold Arabic presentation forms back to base letters — lossless encoding repair.
English: presentation-form folding. OCR, legacy encodings and copy-paste leave letters as their contextual presentation glyphs (Forms-A/-B); folding them to the base letters lets such text match normally. The lam-alef ligatures decompose to lam + their matching alef variant (ﻷ → لأ), and combining marks keep their order (a per-character fold, not whole-string NFKC).
translate_table
property
¶
The static str.translate table this step applies — the fused-engine seam (0018).
RemoveTatweel
dataclass
¶
Strip tatweel ـ (U+0640) — lossless encoding repair.
English: tatweel / kashida removal. Tatweel only stretches a word visually for justification; deleting it collapses elongated spellings (محـــمد → محمد) without touching any letter or vocalization mark.
translate_table
property
¶
The static str.translate table this step applies — the fused-engine seam (0018).
UnifyLookalikes
dataclass
¶
Unify look-alike kaf/yeh/heh to Arabic letters — lossless encoding repair.
English: look-alike unification. Under the Arabic-language assumption, letters from other Arabic-script orthographies (Persian keheh ک, Farsi yeh ی, the heh-family forms) are encoding artifacts and fold to the Arabic letter (ک→ك, ی→ي, ھ/ہ/ە→ه). One accepted residual: a Persian yeh used word-finally merges على→علي (U+06CC is indistinguishable from alef maqsura).
translate_table
property
¶
The static str.translate table this step applies — the fused-engine seam (0018).
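As an illustration of the mappings named above, a translate table covering just those five code points might look like this. It is a sketch of the technique, not the step's actual table, which may cover more forms.

```python
LOOKALIKES = {
    0x06A9: 0x0643,  # keheh       -> Arabic kaf
    0x06CC: 0x064A,  # Farsi yeh   -> Arabic yeh
    0x06BE: 0x0647,  # heh doachashmee -> heh
    0x06C1: 0x0647,  # heh goal    -> heh
    0x06D5: 0x0647,  # ae          -> heh
}


def unify_lookalikes(text):
    return text.translate(LOOKALIKES)


# Keheh-spelled "kitab" becomes the kaf-spelled form.
unified = unify_lookalikes("\u06A9\u062A\u0627\u0628")
```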
CollapseWhitespace
dataclass
¶
Collapse whitespace runs — keeping line breaks by default — lossless encoding repair.
English: whitespace collapse. Each whitespace run collapses to a single character, so equality
and tokenization stop depending on how many (or which) spaces a source used: a horizontal run
becomes one ASCII space, and a run containing a line break becomes a single "\n". Line
structure is preserved by default — flattening it to spaces is lossy, not lossless, so it is
opt-in via collapse_lines=True (the recall-oriented behavior SEARCH wants). See ADR-0010.
Runs collapse but are not trimmed, so the step stays a fixed point.
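A minimal sketch of the collapse rule, assuming that "\n" and "\r" are what mark a run as containing a line break (the library's exact definition is not stated here). The function name mirrors the step but is not its implementation.

```python
import re

_RUN = re.compile(r"\s+")


def collapse_whitespace(text, collapse_lines=False):
    def repl(match):
        run = match.group()
        # A run holding a line break keeps the break unless flattening is on.
        if not collapse_lines and ("\n" in run or "\r" in run):
            return "\n"
        return " "
    return _RUN.sub(repl, text)
```

Edge runs collapse in place rather than being trimmed, so applying the function twice gives the same result as applying it once.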
Trim
dataclass
¶
Strip leading and trailing whitespace — lossless encoding repair.
English: trimming. CollapseWhitespace deliberately does NOT trim — collapsing an edge run
in place is what keeps it a fixed point — so edge whitespace survives every profile. This
separate, explicit step removes it (str.strip(), so every Unicode whitespace counts),
keeping both contracts clean: collapse stays a fixed point, trim is its own idempotent
operation a caller composes when wanted. Edge whitespace carries no linguistic signal, so
safety is ENCODING_REPAIR. Positional (start/end), hence contextual — its own pass, not a
0018 fusion candidate.
Linguistic folding (lossy)¶
RemoveTashkeel
dataclass
¶
Remove tashkeel — diacritics / vocalization marks — by class — lossy linguistic folding.
English: dediacritization. The first lossy step and araclean's headline differentiator: which
mark classes to remove is chosen independently (story 26), so a caller can strip harakat while
keeping a meaningful shadda, drop only tanween, etc. Removal deletes the marks alone and never a
carrier letter (a tanween over an alef goes; the alef stays). safety is LINGUISTIC_FOLDING,
so it never runs under LIGHT: it is opt-in via a lossy profile or an explicit step (ADR-0004).
classes defaults to every MarkClass. Sukun rides with HARAKAT (it is the absence of a
vowel, not a haraka, but stripping the vowels while leaving a bare sukun is never wanted). The
orthographic combining madda U+0653 is removed with MADDA; the alef-with-madda letter آ U+0622
is letter folding (issue 0007), kept here.
position selects WHERE the chosen marks are removed: "all" (the default) everywhere via
one str.translate pass; "final" only a WORD-FINAL run of them — the i3rab fold (drop the
case vowel, keep the word-internal vocalization: كِتَابٌ → كِتَاب), PyArabic's
strip_lastharaka parity. A trailing run followed by an unselected mark counts as
word-internal and is kept. "final" is a contextual regex rule, so in that mode the step
stays its own pass and does not join the 0018 fused-translate engine (its translate_table
raises AttributeError, which the planner reads as "not fusible").
translate_table
property
¶
The precomputed str.translate deletion table — the fused-engine seam (0018).
Only position="all" IS one translate pass; "final" is contextual, so this raises
AttributeError and the fusion planner leaves the step as its own pass.
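The two position modes can be sketched as one translate pass versus one contextual regex. This simplification selects every mark in U+064B..U+0652 and ignores per-class selection; the names are illustrative, not the library's.

```python
import re

_MARKS = "\u064B-\u0652"  # tanween, harakat, shadda, sukun
_DELETE_ALL = {cp: None for cp in range(0x064B, 0x0653)}
# A run of marks NOT followed by a word character or another mark is final.
_FINAL_RUN = re.compile("[" + _MARKS + "]+(?![\\w" + _MARKS + "])")


def remove_tashkeel(text, position="all"):
    if position == "all":
        return text.translate(_DELETE_ALL)   # one table pass
    if position == "final":
        return _FINAL_RUN.sub("", text)      # contextual rule
    raise ValueError(f"unknown position: {position!r}")


# kaf+kasra, teh+fatha, alef, beh+dammatan (the vocalized example above)
word = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"
```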
FoldAlef
dataclass
¶
Fold the alef variants أ إ آ ٱ to bare alef ا — lossy linguistic folding.
English: alef folding. The hamza-/madda-bearing alef letters, alef-wasla, and the wavy-hamza
alefs collapse to the plain alef (أ/إ/آ/ٱ/ٲ/ٳ → ا), so spelling variation in how an initial alef
was written stops splitting otherwise-identical words. It discards a real orthographic
distinction, so safety is LINGUISTIC_FOLDING: opt-in via a lossy profile or an explicit
step, never under LIGHT. (Historical/manuscript alefs that are not contemporary Arabic — e.g.
the high-hamza alef U+0675, the Extended-B annotation alefs — are deliberately left alone.)
translate_table
property
¶
The static str.translate table this step applies — the fused-engine seam (0018).
FoldAlefMaqsura
dataclass
¶
Fold alef maqsura ى to yeh ي — lossy linguistic folding.
English: alef-maqsura folding. The dotless final ى (a long-alef sound) folds to yeh ي so the
two spellings stop splitting a word. This merges على and علي, a genuine distinction, so the fold
is LINGUISTIC_FOLDING and never runs under LIGHT: it is opt-in for recall (SEARCH).
translate_table
property
¶
The static str.translate table this step applies — the fused-engine seam (0018).
FoldHamza
dataclass
¶
Fold hamza off its carriers ؤ→و, ئ→ي — separate and configurably aggressive — lossy folding.
English: hamza folding. A toggle kept separate from FoldAlef so hamza can be neutralized on
the waw/yeh carriers (ؤ→و, ئ→ي) without folding alef. Folding lightly (the default) folds the
carriers and deletes the combining hamza marks U+0654/U+0655 (hamza seated on a carrier — the
letter content issue 0006 routes here, not to RemoveTashkeel). Folding heavily
(drop_standalone_hamza=True) also drops the standalone hamza ء U+0621 and the high hamza
ٴ U+0674. The precomposed alef-hamza letters أ/إ are alef variants, left to FoldAlef.
safety is LINGUISTIC_FOLDING.
translate_table
property
¶
The precomputed str.translate table — the fused-engine seam (0018).
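A sketch of the light and heavy tables described above, built from the code points the text names. The helper name is hypothetical; only the drop_standalone_hamza knob comes from this reference.

```python
def hamza_table(drop_standalone_hamza=False):
    table = {
        0x0624: 0x0648,  # waw with hamza -> waw
        0x0626: 0x064A,  # yeh with hamza -> yeh
        0x0654: None,    # combining hamza above
        0x0655: None,    # combining hamza below
    }
    if drop_standalone_hamza:
        table[0x0621] = None  # standalone hamza
        table[0x0674] = None  # high hamza
    return table


light = hamza_table()
heavy = hamza_table(drop_standalone_hamza=True)
```

The precomposed alef-hamza letters (U+0623, U+0625) are deliberately absent from both tables, since they are left to FoldAlef.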
FoldTehMarbuta
dataclass
¶
Fold teh marbuta ة to a configurable target (heh by default) — lossy linguistic folding.
English: teh-marbuta folding. The word-final "tied taa" ة (and its goal form ۃ) folds to
TehMarbutaTarget.HEH ه (the common search fold, default), TEH ت (its underlying value), or
is left in place with KEEP. ة marks a real grammatical ending, so the fold discards
information: safety is LINGUISTIC_FOLDING, never run under LIGHT.
translate_table
property
¶
The precomputed str.translate table (empty for KEEP) — fused-engine seam (0018).
FoldTanweenAlef
dataclass
¶
Drop the word-final tanween-fath carrier alef: كتاباً → كتاب — lossy linguistic folding.
English: tanween-alef folding. The adverbial-accusative ending writes its tanween-fath on a
carrier alef (كتاباً, or the same pair typed tanween-first as كتابًا); for recall (SEARCH) the
whole ending folds away so the inflected spelling matches the bare كتاب. RemoveTashkeel
alone cannot do this — it strips only the mark, leaving كتابا, a different spelling — so this
step MUST RUN BEFORE dediacritization, while the tanween still marks which alef is a carrier
(the SEARCH ordering). A tanween seated directly on a letter (خطأً، مدرسةً) has no carrier and
is left to RemoveTashkeel; only the standard fathatan U+064B participates.
It discards a real grammatical ending, so safety is LINGUISTIC_FOLDING: opt-in via SEARCH
or an explicit step, never under LIGHT. A contextual re rule (word-final anchoring), so it
stays its own pass and is not a candidate for the 0018 fused-translate engine (ADR-0006).
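To show why the ordering matters, here is a regex sketch of the word-final carrier rule next to plain fathatan deletion. The pattern is an assumption about how such a rule could be written, not the step's actual expression.

```python
import re

# Bare alef + fathatan (either typing order), after a letter, at word end.
_CARRIER = re.compile(
    "(?<=\\w)(?:\u0627\u064B|\u064B\u0627)(?![\\w\u064B-\u0652])"
)


def fold_tanween_alef(text):
    return _CARRIER.sub("", text)


alef_first = "\u0643\u062A\u0627\u0628\u0627\u064B"     # ...alef, fathatan
tanween_first = "\u0643\u062A\u0627\u0628\u064B\u0627"  # ...fathatan, alef
mark_only = alef_first.translate({0x064B: None})        # dediacritize alone
```

Deleting only the mark leaves the carrier alef behind, which is the different spelling the text warns about.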
MapDigits
dataclass
¶
Convert digits among Arabic-Indic / Extended / ASCII to a target — lossy linguistic folding.
English: digit mapping. Every digit — Arabic-Indic ٠-٩, Extended (Persian/Urdu) ۰-۹, or ASCII
0-9 — is rewritten to the chosen DigitTarget by numeric value, so numbers parse and match
consistently regardless of how they were typed (story 31). The default target is ASCII. The
map erases which script a digit was written in, so safety is LINGUISTIC_FOLDING: opt-in via
a lossy profile or an explicit step, never under LIGHT.
The dedicated Arabic number separators (decimal ٫ U+066B, thousands ٬ U+066C) are NOT digits,
so by default they stay — ١٢٫٥ becomes the mixed-script 12٫5. The opt-in
map_separators=True also rewrites a separator when digit-flanked on BOTH sides (٫ → .,
٬ → ,; the inverse of MapPunctuation's guard), giving 12.5; a stray separator outside a
number is never touched. That guard is a contextual regex, so with the knob on the step stays
its own pass and does not join the 0018 fused-translate engine (its translate_table raises
AttributeError, which the planner reads as "not fusible").
translate_table
property
¶
The precomputed str.translate table this step applies — fused-engine seam (0018).
Only the pure digit map IS one translate pass; with map_separators=True the
digit-flanked guard is contextual, so this raises AttributeError and the fusion planner
leaves the step as its own pass.
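A sketch of the ASCII-target behavior, with the optional digit-flanked separator guard applied after the digit pass. Function and variable names are illustrative assumptions; only map_separators is a documented knob.

```python
import re

_TO_ASCII = {}
for d in range(10):
    _TO_ASCII[0x0660 + d] = str(d)  # Arabic-Indic digits
    _TO_ASCII[0x06F0 + d] = str(d)  # Extended Arabic-Indic digits

# Separators are rewritten only when a digit sits on both sides.
_DECIMAL = re.compile(r"(?<=[0-9])\u066B(?=[0-9])")
_THOUSANDS = re.compile(r"(?<=[0-9])\u066C(?=[0-9])")


def map_digits(text, map_separators=False):
    out = text.translate(_TO_ASCII)
    if map_separators:
        out = _DECIMAL.sub(".", out)
        out = _THOUSANDS.sub(",", out)
    return out


number = "\u0661\u0662\u066B\u0665"  # Arabic-Indic 12, decimal separator, 5
```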
MapPunctuation
dataclass
¶
Map Arabic punctuation ، ؛ ؟ to Latin , ; ? — number-separator-safe — lossy folding.
English: punctuation mapping. The Arabic comma ،, semicolon ؛ and question mark ؟ fold to
their Latin equivalents so one tokenizer/sentence-splitter works on Arabic text (story 32). A
mark sitting between two digits is a numeric separator (e.g. a thousands-grouped number) and is
preserved, not turned into sentence punctuation; the dedicated decimal/thousands/date separators
are never touched. The fold erases the script of the punctuation, so safety is
LINGUISTIC_FOLDING, never run under LIGHT.
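The number-separator guard can be sketched as a pattern that matches a mark unless digits flank it on both sides. This is an illustrative reading of the rule above, with hypothetical names.

```python
import re

_LATIN = {"\u060C": ",", "\u061B": ";", "\u061F": "?"}
_MARKS = "[\u060C\u061B\u061F]"
# Match when the mark is NOT preceded by a digit, or NOT followed by one.
_MARK = re.compile("(?<!\\d)" + _MARKS + "|" + _MARKS + "(?!\\d)")


def map_punctuation(text):
    return _MARK.sub(lambda m: _LATIN[m.group()], text)
```

Python's \d matches Arabic-Indic digits as well as ASCII ones, so the guard holds for numbers typed in either system.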
RemovePunctuation
dataclass
¶
Delete every Unicode punctuation character (category P*) — lossy linguistic folding.
English: punctuation removal. The bag-of-words / classification staple every incumbent
ships: for token-frequency features, punctuation is noise. One stated principle: a code point
is removed iff its Unicode general category is P (Po/Pd/Ps/Pe/Pi/Pf/Pc) — which covers the
Arabic marks ، ؛ ؟ ٪ ۔ as much as ASCII and every other script's punctuation, re-derived from
the live UCD so it tracks Unicode releases. Symbols (S: $ + = ~), digits and letters are
not punctuation and pass through. keep carves out characters to preserve (each entry one
character).
Distinct from MapPunctuation (which REWRITES the three Arabic sentence marks to their Latin
equivalents for tokenizer uniformity): this step DELETES, so the two compose — map first if
you want the Latin marks, or just remove everything. Deleting sentence structure is lossy, so
safety is LINGUISTIC_FOLDING: opt-in, never under LIGHT. The whole behavior is one
str.translate, so it is fusible (0018).
translate_table
property
¶
The precomputed str.translate deletion table — the fused-engine seam (0018).
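The "category P" principle can be sketched with the standard unicodedata module, deriving the deletion table from whatever Unicode database the running Python ships. The helper name is hypothetical; keep mirrors the documented knob.

```python
import sys
import unicodedata


def punctuation_table(keep=""):
    """Deletion table for every code point whose general category is P*."""
    return {
        cp: None
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith("P") and chr(cp) not in keep
    }


table = punctuation_table()
# Arabic comma and question mark go; the symbols $ and + stay.
cleaned = "\u0645\u0631\u062D\u0628\u0627\u060C \u0643\u064A\u0641\u061F $5+5!".translate(table)
```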
MapQuotes
dataclass
¶
Fold typographic quotation marks to the straight ASCII pair — lossy linguistic folding.
English: quote normalization. Arabic text quotes with guillemets «», and word processors
emit the curly/low-9 variants; folding them all to " / ' (by visual family — double
to double, single to single) gives a tokenizer one quote vocabulary. It erases the quote
style, so safety is LINGUISTIC_FOLDING: opt-in via an explicit step, never under LIGHT
and in no built-in profile. One str.translate pass, so it is fusible (0018).
translate_table
property
¶
The static str.translate table this step applies — the fused-engine seam (0018).
ReduceElongation
dataclass
¶
Collapse runs of >= min_run repeated Arabic letters to cap copies — lossy folding.
English: elongation reduction. Word-lengthening repeats a letter for emphasis (جمييييل, راااائع); this collapses such a run so emphatic spellings stop exploding the vocabulary. Two knobs, because the TRIGGER and the TARGET are different decisions:
- min_run: what counts as elongation. Only a run of at least min_run copies collapses. Defaults to max(cap + 1, 3): Arabic spells true doubled letters constantly (the assimilated definite article الله/اللغة, verb prefixes تتكلم, lexical doubles ممكن/مما), while a TRIPLED letter is virtually nonexistent in real spelling, so 3+ is the safe elongation signal (the literature's standard rule) and a double is presumed legitimate. A 2-copy emphatic spelling is indistinguishable from a legitimate double without a lexicon, so it is deliberately left alone.
- cap: what a run reduces to. cap=1 (the default) collapses to the canonical single letter, so جمييييل merges with جميل (what ML/SEARCH want); cap=2 keeps a doubled letter so emphasis survives as a feature (what SOCIAL wants; its trigger is already 3, so its behavior is identical with the default min_run).
Only contemporary Arabic letters are capped; digits are never touched (a repeated digit is a
number, not emphasis, so 1000 stays 1000), nor are tashkeel marks or tatweel. The fold discards
the emphasis, so safety is LINGUISTIC_FOLDING: opt-in via a lossy profile or an explicit
step, never under LIGHT. It is a contextual re rule, so it stays its own pass and is not a
candidate for the 0018 fused-translate engine (ADR-0006).
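A regex sketch of the two knobs, assuming the basic Arabic letter range U+0621..U+064A minus tatweel as "contemporary Arabic letters" (the library's exact letter set is not given here).

```python
import re

# Basic Arabic letters, skipping U+063B..U+0640 (includes tatweel).
_LETTER = "[\u0621-\u063A\u0641-\u064A]"


def reduce_elongation(text, cap=1, min_run=None):
    if min_run is None:
        min_run = max(cap + 1, 3)
    # One letter followed by at least min_run - 1 repeats of itself.
    pattern = re.compile("(" + _LETTER + ")\\1{" + str(min_run - 1) + ",}")
    return pattern.sub(lambda m: m.group(1) * cap, text)


yeh = "\u064A"
stretched = "\u062C\u0645" + yeh * 4 + "\u0644"  # emphatic spelling, 4 yehs
```

With the defaults, a doubled letter never reaches the trigger, so legitimate doubles such as the lam pair in الله pass through untouched.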
RemoveStopwords
dataclass
¶
Remove curated Arabic stopwords — function-word filtering — lossy linguistic folding.
English: stopword removal. Deletes whole-token occurrences of the bundled, versioned Arabic
stopword list (araclean.stopwords) — prepositions, pronouns, demonstratives, relative
pronouns, neutral conjunctions and particles — so high-frequency function words stop drowning
out content words (IR/retrieval, bag-of-words features). It discards linguistic content from the
Arabic text, so safety is LINGUISTIC_FOLDING: opt-in via a lossy profile or an explicit
step, never under LIGHT (it is content removal, not non-linguistic-noise cleaning — ADR-0011).
Two deliberate properties (story 37): the list is flat, not clitic-aware (ADR-0001), so a
prefixed/suffixed form like والكتاب / فيها is kept — only a standalone token is removed;
and it is negation-safe — the polarity particles ما / لا / لم / لن / ليس
are excluded so removal can never flip a sentence's polarity. A removed token leaves its
whitespace as written (a gap), like the other delete-style steps (CleanURLs); a later
CollapseWhitespace tidies the gaps. The list version is serialized so a Profile pins it
reproducibly.
ORDERING CONTRACT (enforced): the list ships in FOLDED form (araclean.stopwords), so this
step must run AFTER dediacritization and the letter folds — requires_before names them, and
Pipeline rejects at construction any pipeline where they do not precede this step. Folding
first is what makes matching robust: real typed Arabic routinely omits hamza (انا، الى) and
vocalized text never matches a bare list, but after RemoveTashkeel + FoldAlef +
FoldAlefMaqsura + FoldHamza every spelling variant lands on the one folded form the list
carries. The folds are idempotent and cheap, so a pipeline over already-normalized text simply
includes them as no-ops.
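A sketch of whole-token, negation-safe removal that leaves whitespace as written. The candidate list is a tiny hypothetical sample, and splitting on whitespace is a simplification of whatever tokenization the step really applies.

```python
import re

# Hypothetical candidates in folded form; the bundled list is far larger.
CANDIDATES = frozenset({"في", "من", "الي", "علي", "لا"})
NEGATIONS = frozenset({"ما", "لا", "لم", "لن", "ليس"})
STOPWORDS = CANDIDATES - NEGATIONS  # polarity particles are never removed

_TOKEN = re.compile(r"\S+")


def remove_stopwords(text, stopwords=STOPWORDS):
    # Only a standalone token equal to a list entry is deleted; its
    # surrounding whitespace stays, leaving a gap.
    return _TOKEN.sub(lambda m: "" if m.group() in stopwords else m.group(), text)


result = remove_stopwords("ذهب في البيت")
```

The gap left behind (two adjacent spaces) is what a later CollapseWhitespace would tidy.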
Cleaning (lossy)¶
CleanURLs
dataclass
¶
Remove URLs or replace them with a placeholder token — cleaning (non-linguistic noise).
English: URL cleaning. A scheme- (http/https) or www.-prefixed run is metadata noise, not
Arabic content, so it is DELETEd (the default) or, in PLACEHOLDER mode, replaced by the
placeholder token — the AraBERT [رابط]/[URL] expectation, kept first-class. The
default token is the English [URL]; pass placeholder="[رابط]" for the Arabic one.
Matching is conservative (only http(s):// and www. anchor it), so ordinary prose is safe.
safety is CLEANING: it discards non-linguistic noise, a sibling of linguistic folding, so it
never runs under LIGHT — opt-in via a lossy profile (SOCIAL) or an explicit step (ADR-0011).
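A conservative matcher in the spirit described above, anchored only on http(s):// and www. prefixes. The pattern and function name are illustrative assumptions; the step's real expression may differ.

```python
import re

_URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def clean_urls(text, placeholder=None):
    """Delete URLs, or swap in a placeholder token when one is given."""
    token = "" if placeholder is None else placeholder
    return _URL.sub(lambda m: token, text)


deleted = clean_urls("see https://example.com/x now")
tagged = clean_urls("see www.example.com now", placeholder="[URL]")
```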
CleanMentions
dataclass
¶
Remove @mentions or replace them with a placeholder token — cleaning (non-linguistic noise).
English: mention cleaning. An @-handle is metadata noise, so it is DELETEd (the default)
or, in PLACEHOLDER mode, replaced by the placeholder token (the AraBERT [مستخدم]/
[MENTION] expectation, kept first-class; the default token is the English [MENTION]). A
handle is @ plus Unicode word characters, so an Arabic handle @محمد is matched as readily as
@user; a bare @ with no following word character is left alone. An EMAIL ADDRESS is
recognized before the mention shape and kept verbatim — user@example.com is an address,
not a mention to rewrite into user[MENTION].com. The email shape requires a dotted domain,
so the dotless user@example still has its host read as a mention (documented residual).
safety is CLEANING: it discards non-linguistic noise, never run under LIGHT (ADR-0011).
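The "email first, then mention" precedence can be sketched as one alternation whose first branch wins. The email shape below is a simplified assumption (local part, @, dotted domain), not the step's actual pattern.

```python
import re

_EMAIL_OR_MENTION = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"  # kept verbatim
    r"|@\w+"                                    # a mention: @ plus word chars
)


def clean_mentions(text, placeholder=""):
    def repl(match):
        return match.group() if match.group("email") else placeholder
    return _EMAIL_OR_MENTION.sub(repl, text)
```

Because the email branch requires a dotted domain, a dotless user@example still loses its host to the mention branch, which reproduces the documented residual.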
CleanHashtags
dataclass
¶
Segment, remove, or replace #hashtags — cleaning (social-metadata markup), no-op when kept.
English: hashtag handling. A #tag is social metadata wrapping real words — in Arabic
social text often a full phrase (#اليوم_الوطني_السعودي). The default SEGMENT mode applies
the entrenched AraBERT recipe: drop the #, map _ to a space, so the words stay in the
text as content (what SOCIAL pins). DELETE removes the tag outright; PLACEHOLDER swaps in
the placeholder token (default the English [HASHTAG]; pass an Arabic one explicitly);
KEEP leaves tags untouched, so a config override can pin "do not touch hashtags".
A tag is # plus Unicode word characters (Arabic matches as readily as Latin; _ is a
word character, so multi-word tags match whole). In SOCIAL, CleanURLs runs FIRST, so a URL
fragment (…/page#section) is gone before this step could read it as a tag. safety is
mode-dependent, like HandleEmoji: KEEP is a lossless no-op (ENCODING_REPAIR); the
rewriting modes are CLEANING (ADR-0011).
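The SEGMENT recipe (drop the #, turn underscores into spaces) fits in a few lines. This is a sketch of that one mode only, with a hypothetical function name.

```python
import re

_TAG = re.compile(r"#(\w+)")  # \w covers Arabic letters and the underscore


def segment_hashtags(text):
    return _TAG.sub(lambda m: m.group(1).replace("_", " "), text)


segmented = segment_hashtags("#اليوم_الوطني_السعودي")
```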
CleanHTML
dataclass
¶
Strip HTML tags and unescape entities — cleaning (non-linguistic noise).
English: HTML cleaning. Markup is noise around the text: each tag is DELETEd (the default,
so you keep the inner text) or, in PLACEHOLDER mode, replaced by the placeholder token, and
HTML entities are always unescaped (&amp;amp; → &amp;, &amp;lt; → &lt;), which a tag-only
strip would miss. Tags are removed BEFORE unescaping, so an intentionally escaped &amp;lt;b&amp;gt;
stays literal text instead of being decoded into a &lt;b&gt; tag and then stripped.
safety is CLEANING: it discards non-linguistic noise, never run under LIGHT (ADR-0011).
Strict idempotence does not hold over arbitrary text: html.unescape decodes only one level,
so a multiply-encoded entity (&amp;amp;amp; → &amp;amp; → &amp;) changes on each pass. On
realistic single-encoded markup, however, the step is a fixed point.
SCOPE BOUNDARY: this is a tag stripper, not an HTML parser. The content of a container
element survives even when its tags go — including <script> and <style>, whose
JavaScript/CSS text is kept as text. Fine for the social-snippet case this step serves; for
web-scrape corpora, strip script/style containers with a real HTML parser before araclean.
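A minimal sketch of the strip-then-unescape order using only the standard library. The tag pattern is a deliberately simple assumption, in keeping with the "tag stripper, not parser" scope.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")


def clean_html(text):
    # Tags go first, so escaped markup survives as literal text afterwards.
    return html.unescape(_TAG.sub("", text))


plain = clean_html("<p>A &amp; B</p>")
escaped = clean_html("&lt;b&gt;bold&lt;/b&gt;")
once = clean_html("&amp;amp;")
twice = clean_html(once)
```

The last two values show the one-level decoding: a doubly encoded ampersand changes on each pass, while container content such as script text would pass through as ordinary text.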
HandleEmoji
dataclass
¶
Keep, strip, or demojize emoji — cleaning (non-linguistic noise), or a no-op when kept.
English: emoji handling. Social text carries affective signal in emoji, so the default KEEP
leaves them untouched (a lossless no-op). STRIP removes them; DEMOJIZE rewrites each to its
text alias (😍 → :smiling_face_with_heart_eyes:) so the signal survives as words a tokenizer
can read. safety is therefore mode-dependent: KEEP is ENCODING_REPAIR (lossless, safe
under LIGHT); STRIP/DEMOJIZE are CLEANING (opt-in noise removal — ADR-0011).
DEMOJIZE needs the optional emoji library (the [emoji] extra), resolved once at
construction so the per-string call stays setup-free; building a DEMOJIZE step without the
extra raises EmojiSupportNotInstalledError. KEEP/STRIP need no dependency — STRIP
recognizes emoji from a built-in Unicode set, so the lean core covers it (it strips a whole
ZWJ sequence, leaving a standalone joiner — invisible formatting owned by StripBidi — alone).
RemoveForeign
dataclass
¶
Remove non-Arabic-script letter spans or replace them with a placeholder token — cleaning.
English: foreign-span removal. The standard Arabic corpus-prep filter: for an Arabic corpus,
embedded foreign words are noise, so a maximal run of non-Arabic-script LETTERS (category L*
outside the Arabic blocks, with any combining marks riding along — a decomposed café
travels whole) is DELETEd (default) or, in PLACEHOLDER mode, replaced by the placeholder
token (default the English [FOREIGN]; pass [أجنبي] explicitly). A span must START with
a letter, so a lone combining mark — the VS16 after an emoji, a stray accent — never opens a
span and emoji are untouched. Digits, punctuation, whitespace and symbols pass through: this
filters foreign WORDS, not structure (RemovePunctuation / MapDigits own those concerns).
safety is CLEANING: for the Arabic-corpus contract, non-Arabic-script content is
surrounding noise like a URL, not an Arabic-internal distinction (ADR-0011) — and like every
cleaning step it is opt-in, never under LIGHT. Deletion leaves whitespace gaps; a later
CollapseWhitespace tidies them. The step is a contextual rule over a UCD-derived span pattern
(built lazily, once per process), so it stays its own pass (ADR-0006).
Step options¶
The closed option sets the configurable steps accept (as enum members or their string values).
MarkClass ¶
Bases: StrEnum
A class of tashkeel marks RemoveTashkeel can remove independently (story 26).
English: diacritic class. The vocalization-mark taxonomy (GLOSSARY: Tashkeel) split into the
units a caller selects between. SUKUN is not a member — it is the vowelless mark (the
absence of a vowel, not a haraka), removed together with HARAKAT for convenience and not
selectable on its own (GLOSSARY: Harakat).
TehMarbutaTarget ¶
Bases: StrEnum
What FoldTehMarbuta rewrites the teh marbuta ة to (story 29).
English: teh-marbuta target. HEH (the common search fold, default) and TEH (its underlying
value) are the standard targets; KEEP leaves ة in place so a profile can pin "do not fold".
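As a rough illustration of the three targets, the fold reduces to a single-character replacement. `fold_teh_marbuta` and the lowercase target strings are assumptions made for this sketch; the real step and enum values may differ.

```python
TEH_MARBUTA = "\u0629"  # ة

# Assumption: lowercase names mirror the HEH / TEH / KEEP members.
TARGETS = {
    "heh": "\u0647",      # ه, the common search fold
    "teh": "\u062A",      # ت, the underlying value
    "keep": TEH_MARBUTA,  # pin "do not fold"
}


def fold_teh_marbuta(text: str, target: str = "heh") -> str:
    """Hypothetical stand-in: rewrite every teh marbuta to the chosen target."""
    return text.replace(TEH_MARBUTA, TARGETS[target])
```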
DigitTarget ¶
Bases: StrEnum
Which digit system MapDigits converts every digit to (story 31).
English: digit target. ASCII (default) makes numbers parse and match consistently; the two
Arabic systems are ARABIC_INDIC (Eastern ٠-٩) and EXTENDED_ARABIC_INDIC (Persian/Urdu ۰-۹).
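Converting between the three digit systems is a table lookup, which can be sketched with `str.translate`. The names `digit_table` and the module-level constants are hypothetical; only the three code-point ranges are taken from Unicode.

```python
ASCII = "0123456789"
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))           # ٠-٩
EXTENDED_ARABIC_INDIC = "".join(chr(0x06F0 + i) for i in range(10))  # ۰-۹

SYSTEMS = (ASCII, ARABIC_INDIC, EXTENDED_ARABIC_INDIC)


def digit_table(target: str) -> dict[int, str]:
    """Build a translate table mapping every known digit to the target system."""
    table: dict[int, str] = {}
    for system in SYSTEMS:
        for src, dst in zip(system, target):
            if src != dst:
                table[ord(src)] = dst
    return table


# Built once, then reused for every string.
TO_ASCII = digit_table(ASCII)
TO_ARABIC_INDIC = digit_table(ARABIC_INDIC)
```

With these tables, `"٢٠٢٤".translate(TO_ASCII)` gives `"2024"`, and mixed input converges on the single target system.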
CleanMode ¶
Bases: StrEnum
Whether a cleaning step deletes the matched noise or replaces it with a placeholder token.
English: cleaning mode. DELETE (default) removes the span outright; PLACEHOLDER swaps in a
fixed token (e.g. [URL]) so a model keeps "a link was here" as a feature without a noisy
unique value. That is the entrenched AraBERT expectation, so placeholder replacement is a
first-class mode rather than deletion being the only option.
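The difference between the two modes can be sketched on URLs. The regex and the `clean_urls` function are simplifying assumptions for illustration; only the `[URL]` token comes from the text above.

```python
import re

# Assumption: a deliberately loose URL pattern, good enough for a sketch.
URL = re.compile(r"https?://\S+")


def clean_urls(text: str, mode: str = "delete", placeholder: str = "[URL]") -> str:
    """Hypothetical stand-in for a cleaning step driven by a clean mode."""
    if mode == "delete":
        return URL.sub("", text)
    if mode == "placeholder":
        return URL.sub(placeholder, text)
    raise ValueError(f"unknown mode: {mode!r}")
```

In placeholder mode every link collapses to the same token, so a downstream model sees one stable feature instead of thousands of unique strings.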
EmojiMode ¶
Bases: StrEnum
How HandleEmoji treats emoji (story 35).
English: emoji handling. KEEP (default) leaves emoji in place so affective signal survives;
STRIP removes them; DEMOJIZE replaces each with its text alias (😍 → :heart_eyes:),
which needs the optional emoji library (the [emoji] extra).
HashtagMode ¶
Bases: StrEnum
How CleanHashtags treats a #hashtag (roadmap Phase 1).
English: hashtag handling. SEGMENT (default — the entrenched AraBERT recipe) drops the
# and maps _ to a space, so the tag's words survive as text; DELETE removes the tag;
PLACEHOLDER swaps in a fixed token; KEEP leaves it untouched (the no-op a config override
can pin).
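A small sketch of the four modes, assuming a hashtag is `#` followed by word characters. The `clean_hashtags` function and the `[HASHTAG]` token are hypothetical choices for this example, not values confirmed by the reference.

```python
import re

# Assumption: "#" followed by one or more Unicode word characters.
HASHTAG = re.compile(r"#(\w+)")


def clean_hashtags(text: str, mode: str = "segment",
                   placeholder: str = "[HASHTAG]") -> str:
    """Hypothetical stand-in for the SEGMENT / DELETE / PLACEHOLDER / KEEP modes."""
    if mode == "keep":
        return text
    if mode == "delete":
        return HASHTAG.sub("", text)
    if mode == "placeholder":
        return HASHTAG.sub(placeholder, text)
    if mode == "segment":
        # Drop the "#" and turn "_" into a space so the words survive as text.
        return HASHTAG.sub(lambda m: m.group(1).replace("_", " "), text)
    raise ValueError(f"unknown mode: {mode!r}")
```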
Extension & protocol types¶
Step ¶
Bases: Protocol
A single normalization transform: a safety class plus a pure str -> str call.
The reserved, optional alignment hook apply_aligned(s) -> (str, OffsetMap) (ADR-0005)
is intentionally absent from this contract; a Pipeline detects it when present and raises
a clear error otherwise. Steps precompute any table/regex at construction so __call__ does
no setup and no validation (ADR-0006).
A step satisfies the contract by exposing a readable safety attribute — the natural idiom is
a class-level safety = SafetyClass.… assignment (what built-in and custom steps both use).
It is a read-only member: a step's safety class is an intrinsic trait, never reassigned, so a
class variable, a frozen field (when the class varies it by config, like HandleEmoji), or a
property all satisfy it.
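To make the contract concrete, here is a self-contained sketch of a protocol with the same shape and a custom step that satisfies it. `SafetyClass` and `Step` below are local stand-ins written for this example (the real types live in araclean and may carry more members), and `DropMentions` is a hypothetical custom step.

```python
import re
from enum import Enum
from typing import Protocol, runtime_checkable


class SafetyClass(Enum):
    # Stand-in enum: only the two classes named in this reference.
    ENCODING_REPAIR = "encoding_repair"
    CLEANING = "cleaning"


@runtime_checkable
class Step(Protocol):
    @property
    def safety(self) -> SafetyClass: ...

    def __call__(self, s: str) -> str: ...


class DropMentions:
    """Hypothetical custom step: removes @mentions."""

    safety = SafetyClass.CLEANING  # class-level assignment, never reassigned

    def __init__(self) -> None:
        # Precompute the regex at construction so __call__ does no setup.
        self._pattern = re.compile(r"@\w+")

    def __call__(self, s: str) -> str:
        return self._pattern.sub("", s)
```

The custom class does not inherit from the protocol; exposing `safety` and `__call__` is enough for structural typing, and for a runtime `isinstance` check in this sketch.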
Stopword data¶
The curated, versioned list behind RemoveStopwords — see the
stopwords guide. STOPWORDS (the matching set), STOPWORDS_VERSION,
STOPWORDS_LICENSE, and NEGATION_PARTICLES (the polarity particles deliberately excluded) are
importable from the module:
stopwords ¶
The curated, versioned Arabic stopword list backing RemoveStopwords (issue 0017).
Provenance — this list is freshly authored for araclean from common Modern Standard Arabic
function words (prepositions, pronouns, demonstratives, relative pronouns, conjunctions, and
neutral particles). It is not derived from the GPL-licensed Arabic-Stopwords package or any
other copyleft source, so it does not encumber araclean's MIT core; the list itself is dedicated to
the public domain under CC0-1.0 (STOPWORDS_LICENSE). It carries a STOPWORDS_VERSION so a
Profile can pin an exact list and stopword removal stays reproducible across releases.
Design properties (surfaced in the docs, issue 0023):
- FOLDED FORM: run AFTER the letter folds (list version 2). Every entry is stored in its
letter-folded spelling (no hamza-bearing alef أ/إ, no alef maqsura ى, no hamza carrier ؤ/ئ; that
is, the output of FoldAlef + FoldAlefMaqsura + FoldHamza), and RemoveStopwords requires those
folds plus RemoveTashkeel to run before it (Pipeline enforces this at construction). The
pipeline itself is the spelling-variant generator: the canonical إلى, the routinely-typed
hamza-less الى, and the vocalized إلَى all fold to the one entry الى→الي, so matching is robust
without enumerating variants. _validated enforces fold-stability per entry at import.
- Flat, not clitic-aware (ADR-0001: no morphology). Each entry is a whole bare token; the list
does not know about proclitics/enclitics, so والكتاب (and-the-book) and فيها (in-it) are kept;
only a standalone و / في token would be removed.
- Negation-safe by default. The polarity-bearing particles ما / لا / لم / لن / ليس
(NEGATION_PARTICLES) are deliberately excluded, so RemoveStopwords can never silently flip the
sentiment of a sentence by deleting its negation (story 37).
- Homograph policy: one stated principle. An entry is kept iff its FUNCTION-word reading
dominates its content readings by token frequency in running text (in the folded spelling,
since folding merges words). Dropped under that principle: أم ("or", but commonly "mother":
أم محمد), كم ("how many", but also "km"), نفس ("same", but commonly "soul/self/breath"), كأن
("as if"; folds onto كان "was", one of the most frequent verbs), أية ("any", fem.; rare
standalone, and its folded ايه is the dialectal "what?"). Kept despite a known collision: علي
(the fold of the preposition على; the preposition dwarfs the name Ali in running text, and the
residual is that a bare standalone علي is removed), لان (the fold of لأن "because"; the verb
reading "softened" is rare), ان (the shared fold of إن and أن; the rare آن "time" also lands
there and is accepted).
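The flat, negation-safe matching described in these properties can be sketched in a few lines. The four-entry set is an illustrative subset built from entries mentioned above, not the shipped list, and `remove_stopwords` is a hypothetical function that assumes its input has already been letter-folded and stripped of tashkeel.

```python
# Assumption: a tiny folded subset, not the real versioned list.
STOPWORDS = frozenset({"الي", "علي", "في", "و"})

# The polarity particles that are deliberately never treated as stopwords.
NEGATION_PARTICLES = frozenset({"ما", "لا", "لم", "لن", "ليس"})


def remove_stopwords(folded_text: str) -> str:
    """Drop whole whitespace-delimited tokens that are stopwords.

    Matching is flat: a token is compared as-is, so a stopword attached
    to a word as a clitic is not recognized and the word is kept.
    """
    kept = [tok for tok in folded_text.split() if tok not in STOPWORDS]
    return " ".join(kept)
```

A sentence such as `"لم اذهب الي البيت"` loses only the standalone preposition and keeps its negation, while `"والكتاب فيها"` is returned unchanged because neither token is a bare stopword.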
DataFrame accessors¶
Installed by importing araclean.pandas / araclean.polars (each needs its extra) — see
pandas & polars.
AracleanAccessor ¶
The .araclean Series accessor: series.araclean.normalize(profile=..., **overrides).
pandas instantiates this with the Series the accessor was reached through.
normalize ¶
Normalize each value in the Series with a named profile (default LIGHT) + overrides.
Equivalent to series.map(lambda x: normalize(x, profile=..., **overrides)) but builds
the pipeline once. profile and the per-knob **overrides (e.g. map_digits=True,
emoji="strip") are validated through the config trust boundary, so an unknown profile,
knob, or value raises the same clear error as the normalize facade. Missing values
(NaN/None) pass through unchanged (na_action="ignore"); empty strings normalize
to empty strings.
AracleanNamespace ¶
The .araclean Series namespace: series.araclean.normalize(profile=..., **overrides).
polars instantiates this with the Series the namespace was reached through.
normalize ¶
Normalize each value in the Series with a named profile (default LIGHT) + overrides.
Equivalent to mapping normalize(x, profile=..., **overrides) element-wise but builds
the pipeline once. profile and the per-knob **overrides (e.g. map_digits=True,
emoji="strip") are validated through the config trust boundary, so an unknown profile,
knob, or value raises the same clear error as the normalize facade. Null values pass
through unchanged (map_elements skips nulls); empty strings normalize to empty strings.
The result is a String Series, matching the pandas accessor value-for-value.
Errors¶
AlignmentNotSupportedError ¶
Bases: NotImplementedError
Raised when offsets/alignment are requested through a step that lacks apply_aligned.
Offset tracking is reserved but not implemented in v1 (ADR-0005). Subclasses
NotImplementedError so callers probing for the capability can fall back.
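The fallback this enables can be sketched as follows. Everything here is a local stand-in: the exception class is redefined for the example, `run_aligned` and `with_offsets` are hypothetical helpers, and a plain list of integers stands in for `OffsetMap`.

```python
class AlignmentNotSupportedError(NotImplementedError):
    """Stand-in for the library's exception, defined locally for this sketch."""


def run_aligned(step, s: str):
    # Detect the optional hook; raise a clear error when it is absent.
    hook = getattr(step, "apply_aligned", None)
    if hook is None:
        raise AlignmentNotSupportedError(
            f"{type(step).__name__} has no apply_aligned hook"
        )
    return hook(s)


def with_offsets(step, s: str):
    """Return (text, offsets), falling back to (text, None) without the hook."""
    try:
        return run_aligned(step, s)
    except NotImplementedError:  # also catches the subclass
        return step(s), None


class Upper:
    def __call__(self, s: str) -> str:
        return s.upper()


class AlignedUpper(Upper):
    def apply_aligned(self, s: str):
        # Assumption: a list of source indices stands in for OffsetMap.
        return s.upper(), list(range(len(s)))
```

Catching the broad `NotImplementedError` is what lets a caller probe for the capability without importing the specific exception type.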
EmojiSupportNotInstalledError ¶
Bases: ImportError
Raised when HandleEmoji(mode="demojize") is built without the optional emoji extra.
KEEP/STRIP need no dependency; only DEMOJIZE requires the emoji library, kept out of
the lean MIT core (ADR-0003). Subclasses ImportError so a caller probing for the capability
can catch it; the message says how to install the extra.
CLIExtraNotInstalledError ¶
Bases: ImportError
Raised when the CLI is built without the optional [cli] extra (Typer) installed.
Subclasses ImportError (so a caller probing for the capability can catch it); the message
says how to install it. Mirrors EmojiSupportNotInstalledError for the [emoji] extra.
PandasExtraNotInstalledError ¶
Bases: ImportError
Raised when the pandas accessor is used without the optional [pandas] extra installed.
Subclasses ImportError (so a caller probing for the capability can catch it); the message
says how to install it. Mirrors EmojiSupportNotInstalledError / CLIExtraNotInstalledError.
PolarsExtraNotInstalledError ¶
Bases: ImportError
Raised when the polars namespace is used without the optional [polars] extra installed.
Subclasses ImportError (so a caller probing for the capability can catch it); the message
says how to install it. Mirrors PandasExtraNotInstalledError / CLIExtraNotInstalledError.
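The four extra-related errors share one pattern: wrap the failed import in an `ImportError` subclass whose message names the extra. A minimal sketch of that pattern, with `ExtraNotInstalledError`, `load_optional`, and `demojize_or_keep` as hypothetical names (the message wording is also an assumption):

```python
import importlib
from types import ModuleType


class ExtraNotInstalledError(ImportError):
    """Stand-in for the per-extra error classes listed above."""


def load_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or raise an error naming the extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ExtraNotInstalledError(
            f"{module_name!r} is not installed; install the [{extra}] extra"
        ) from exc


def demojize_or_keep(text: str, module_name: str = "emoji") -> str:
    """Probe for the capability and fall back to leaving the text unchanged."""
    try:
        emoji = load_optional(module_name, "emoji")
    except ImportError:  # the subclass is caught by the base class
        return text
    return emoji.demojize(text)
```

Because the custom class subclasses `ImportError`, existing `except ImportError` probes keep working, and the chained `from exc` preserves the original failure for debugging.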