
API reference

The whole public surface: the one-call normalize facade, the Pipeline it assembles, the configuration boundary, and every Step. Each Arabic term below is glossed to English on hover (ADR-0007); searching the English name (e.g. diacritics) finds the Arabic-primary step (RemoveTashkeel). For the araclean shell command, see the CLI reference.

The normalize facade

normalize

normalize(
    text: str,
    *,
    profile: str | Profile | None = None,
    config: NormalizeConfig | None = None,
    **overrides: object,
) -> str

Normalize Arabic text with a named profile (default LIGHT — lossless encoding repair).

profile=None applies LIGHT. Pass profile="search" (etc.) for a named preset, a Profile object for a fully custom pipeline, or per-knob **overrides to tune a named profile: normalize(text, profile="ml", map_digits=True) folds digits, and normalize(text, profile="social", emoji="strip") drops emoji. Overrides are validated against NormalizeConfig, so an unknown knob or a bad value is rejected here. A prebuilt config=NormalizeConfig(...) may be passed instead, but not together with profile/overrides.

Pipelines & profiles

Pipeline

An ordered, serializable sequence of Steps, callable like a single str -> str.

steps property

steps: tuple[Step, ...]

The ordered steps, for inspection (e.g. the safety-class audit, story 41).

batch

batch(texts: Iterable[str]) -> Iterator[str]

Normalize each text lazily — a streaming generator, so a corpus larger than memory (or an unbounded stream) processes without materializing the input (story 13).
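The lazy behavior described above can be sketched with a plain generator. This is a minimal illustration, not araclean's implementation: batch_sketch and normalize_one are hypothetical names, and str.strip stands in for a real pipeline.

```python
from collections.abc import Callable, Iterable, Iterator
from itertools import count, islice


def batch_sketch(
    normalize_one: Callable[[str], str], texts: Iterable[str]
) -> Iterator[str]:
    """Yield one normalized text at a time; nothing is read ahead or stored."""
    for text in texts:
        yield normalize_one(text)


# An unbounded input stream (made-up data): only the items that are actually
# consumed are ever produced or normalized.
endless = (f"  line {i}  " for i in count())
first_three = list(islice(batch_sketch(str.strip, endless), 3))
print(first_three)
```

Because the generator pulls one item per step, the input can be a file iterator or a network stream just as well as a list.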

select

select(*names: str) -> Pipeline

Build a NEW pipeline holding exactly the named steps, in the order given (story 16).

One primitive covers both adapting operations: name a subset to filter, or name every step in a different order to reorder. Steps are addressed by _step_name (the registry name, or the class name for a custom step). This pipeline is left unchanged. Raises KeyError for an unknown name, or if a name matches more than one step with differing configs (genuinely ambiguous). EQUAL duplicates are interchangeable, not ambiguous — every profile runs its NormalizeUnicode NFC bookends as identical value objects, so naming "NormalizeUnicode" (once, or once per copy you want) just works; only a name whose duplicates differ (SEARCH's two differently-configured CollapseWhitespace) is rejected, because a name cannot say which one you meant.

drop

drop(*names: str) -> Pipeline

Build a NEW pipeline WITHOUT every step matching each name (the subtractive adapter).

The common profile adaptation — "SEARCH minus MapDigits" — is subtraction, which select cannot express on a built-in profile without re-naming every kept step. drop removes ALL steps carrying a name (removing every match is well-defined, so duplicates need no disambiguation) and keeps the rest in order. This pipeline is left unchanged. Raises KeyError for a name no step carries, so a typo is never a silent no-op.
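The select and drop rules above can be sketched over plain tuples. Everything here is a stand-in under stated assumptions: a step is modeled as a hypothetical (name, config) pair, select_sketch and drop_sketch are illustrative names, and the example pipeline is invented to show one equal duplicate and one differing duplicate.

```python
Step = tuple[str, tuple]  # (name, frozen config): a stand-in for a real Step


def select_sketch(steps: tuple[Step, ...], *names: str) -> tuple[Step, ...]:
    """Keep exactly the named steps, in the order the names are given."""
    picked = []
    for name in names:
        matches = [s for s in steps if s[0] == name]
        if not matches:
            raise KeyError(f"unknown step: {name}")
        # Equal duplicates are interchangeable; differing ones are ambiguous.
        if any(m != matches[0] for m in matches):
            raise KeyError(f"ambiguous step: {name}")
        picked.append(matches[0])
    return tuple(picked)


def drop_sketch(steps: tuple[Step, ...], *names: str) -> tuple[Step, ...]:
    """Remove every step carrying one of the names; keep the rest in order."""
    present = {s[0] for s in steps}
    for name in names:
        if name not in present:
            raise KeyError(f"no step named: {name}")
    return tuple(s for s in steps if s[0] not in names)


# Hypothetical pipeline: identical NormalizeUnicode bookends, and two
# CollapseWhitespace steps whose configs differ.
pipeline = (
    ("NormalizeUnicode", (("form", "NFC"),)),
    ("CollapseWhitespace", (("collapse_lines", False),)),
    ("MapDigits", ()),
    ("CollapseWhitespace", (("collapse_lines", True),)),
    ("NormalizeUnicode", (("form", "NFC"),)),
)

reordered = select_sketch(pipeline, "MapDigits", "NormalizeUnicode")
without_digits = drop_sketch(pipeline, "MapDigits")

try:
    select_sketch(pipeline, "CollapseWhitespace")
    ambiguous_rejected = False
except KeyError:
    ambiguous_rejected = True
```

Both functions return a new tuple and leave the input untouched, mirroring the "this pipeline is left unchanged" guarantee.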

audit

audit() -> SafetyReport

Audit this pipeline's safety: is it lossless, and if not, what it loses (story 41).

Pure in-process computation: it reads each step's declared safety (fixed at construction) and buckets the step names by class, so an auditor can verify a pipeline is lossless or enumerate exactly the lossy steps it carries. The buckets preserve pipeline order.
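The bucketing described above amounts to one pass over the steps. The sketch below is an assumption-laden illustration: SafetySketch, ReportSketch and audit_sketch are hypothetical names, the encoding_repair field is assumed (the source names only the two lossy fields), and steps are modeled as (name, safety) pairs.

```python
from dataclasses import dataclass
from enum import Enum


class SafetySketch(str, Enum):
    ENCODING_REPAIR = "encoding_repair"
    LINGUISTIC_FOLDING = "linguistic_folding"
    CLEANING = "cleaning"


@dataclass(frozen=True)
class ReportSketch:
    encoding_repair: tuple[str, ...]  # assumed field name
    linguistic_folding: tuple[str, ...]
    cleaning: tuple[str, ...]

    @property
    def lossless(self) -> bool:
        # Lossless iff neither lossy bucket holds a step.
        return not (self.linguistic_folding or self.cleaning)

    @property
    def lossy_steps(self) -> tuple[str, ...]:
        # Linguistic folds first, then the cleaning steps.
        return self.linguistic_folding + self.cleaning


def audit_sketch(steps: list[tuple[str, SafetySketch]]) -> ReportSketch:
    buckets: dict[SafetySketch, list[str]] = {cls: [] for cls in SafetySketch}
    for name, safety in steps:  # appending in order preserves pipeline order
        buckets[safety].append(name)
    return ReportSketch(
        tuple(buckets[SafetySketch.ENCODING_REPAIR]),
        tuple(buckets[SafetySketch.LINGUISTIC_FOLDING]),
        tuple(buckets[SafetySketch.CLEANING]),
    )


report = audit_sketch(
    [
        ("NormalizeUnicode", SafetySketch.ENCODING_REPAIR),
        ("CleanURLs", SafetySketch.CLEANING),
        ("RemoveTashkeel", SafetySketch.LINGUISTIC_FOLDING),
        ("FoldAlef", SafetySketch.LINGUISTIC_FOLDING),
    ]
)
light_like = audit_sketch([("NormalizeUnicode", SafetySketch.ENCODING_REPAIR)])
```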

apply_aligned

apply_aligned(text: str) -> tuple[str, OffsetMap]

Normalize text while tracking how every position maps back to the original.

Returns (normalized, offset_map) where offset_map.to_original(span) projects any span in the normalized string back to the corresponding span in text.

Raises AlignmentNotSupportedError for any custom step that does not implement apply_aligned() — the error names the offending step so the caller can add the hook.

to_dict

to_dict() -> PipelineDict

Serialize to a plain, JSON-friendly dict; raises if a step can't serialize itself.

from_dict classmethod

from_dict(data: Mapping[str, Any]) -> Pipeline

Rehydrate a pipeline from to_dict() output via the step registry.

from_profile classmethod

from_profile(profile: str | Profile) -> Pipeline

Build a pipeline from a named profile (e.g. "light") or a Profile object.

Profile

Bases: BaseModel

A named preset: an ordered list of step specs that assemble into a Pipeline.

Offset maps

Returned by Pipeline.apply_aligned / a step's apply_aligned — see Offset-preserving normalization.

OffsetMap

Alignment from normalized-text positions back to original-text positions.

Internally stores a flat list[int] of length 2 * len(normalized): entries [2*i, 2*i+1] are (orig_start, orig_end) for normalized character i.

identity staticmethod

identity(n: int) -> OffsetMap

Build identity map: normalized char i maps to original [i, i+1).

from_translate staticmethod

from_translate(
    original: str, table: Mapping[int, str | int | None]
) -> OffsetMap

Build offset map from original string + its str.translate table.

Handles all three operation kinds, plus the no-entry default:

- None value → 1→0 deletion (no normalized chars emitted)
- int value → 1→1 replacement (one normalized char)
- str value → 1→N expansion (each result char maps back to the one original char)
- no entry → identity (one normalized char = one original char)
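The four cases can be sketched as a single loop that emits two offsets per normalized character, matching the flat layout OffsetMap is described as storing. This is an illustrative stand-in, not the library's code: offsets_from_translate is a hypothetical function that returns the normalized string together with a plain list instead of an OffsetMap.

```python
from collections.abc import Mapping
from typing import Union


def offsets_from_translate(
    original: str, table: Mapping[int, Union[str, int, None]]
) -> tuple[str, list[int]]:
    """Return (normalized, flat) where flat[2*i:2*i+2] is the original span
    of normalized character i."""
    out_chars: list[str] = []
    flat: list[int] = []
    for i, ch in enumerate(original):
        code_point = ord(ch)
        if code_point not in table:
            replacement = ch  # no entry: identity
        else:
            value = table[code_point]
            if value is None:
                continue  # 1→0 deletion: nothing emitted
            # int is a 1→1 replacement, str a 1→N expansion
            replacement = chr(value) if isinstance(value, int) else value
        for out_ch in replacement:
            out_chars.append(out_ch)
            flat.extend((i, i + 1))  # every emitted char maps to original char i
    return "".join(out_chars), flat


# ASCII stand-ins so each case is easy to see: delete "-", expand "x", replace "a".
table = {ord("-"): None, ord("x"): "yz", ord("a"): ord("A")}
normalized, flat = offsets_from_translate("a-xb", table)
```

The normalized string is the same one str.translate would produce, so the map can be built alongside the real transform without changing its result.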

from_regex_sub staticmethod

from_regex_sub(
    original: str,
    match_spans: Sequence[tuple[int, int]],
    replacement_lens: Sequence[int],
) -> OffsetMap

Build offset map from match spans collected during a regex substitution.

Each match at original[start:end] was replaced by replacement_len normalized chars that all map back to [start, end). Non-matched chars are identity.
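A minimal sketch of that rule, assuming the match spans are sorted and non-overlapping (as re.finditer yields them). offsets_from_regex_sub is a hypothetical helper that returns the flat offset list rather than an OffsetMap.

```python
import re
from collections.abc import Sequence


def offsets_from_regex_sub(
    original: str,
    match_spans: Sequence[tuple[int, int]],
    replacement_lens: Sequence[int],
) -> list[int]:
    flat: list[int] = []
    position = 0
    for (start, end), replacement_len in zip(match_spans, replacement_lens):
        for i in range(position, start):  # unmatched chars: identity
            flat.extend((i, i + 1))
        for _ in range(replacement_len):  # each replacement char maps to the match
            flat.extend((start, end))
        position = end
    for i in range(position, len(original)):  # tail after the last match
        flat.extend((i, i + 1))
    return flat


# Collapse runs of two or more spaces to one space, recording the spans.
original = "a   b  c"
pattern = re.compile(r" {2,}")
spans = [m.span() for m in pattern.finditer(original)]
normalized = pattern.sub(" ", original)
flat = offsets_from_regex_sub(original, spans, [1] * len(spans))
```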

compose

compose(other: OffsetMap) -> OffsetMap

Chain two maps: self maps norm1→orig; other maps norm2→norm1.

Returns a new map that maps norm2→orig, so a caller accumulates after each step: running = running.compose(step_map) starting from the first step's map.

to_original

to_original(span: tuple[int, int]) -> tuple[int, int]

Map half-open normalized span [start, end) to original span.

For an empty span (start == end) returns a zero-width point in the original.
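Composition and span projection can be sketched over the flat offset lists directly. The functions below are hypothetical free functions, not the OffsetMap methods, and the choice of which point an empty span maps to is an assumption (the source only says it is zero-width).

```python
def to_original(flat: list[int], span: tuple[int, int]) -> tuple[int, int]:
    """Project a half-open normalized span onto the original text."""
    start, end = span
    if start == end:
        # Assumed convention: point at the start of the next character,
        # or at the end of the last one when the span sits at the very end.
        if 2 * start < len(flat):
            point = flat[2 * start]
        else:
            point = flat[-1] if flat else 0
        return (point, point)
    # Start of the first covered char, end of the last covered char.
    return (flat[2 * start], flat[2 * (end - 1) + 1])


def compose(first: list[int], second: list[int]) -> list[int]:
    """first maps norm1→orig, second maps norm2→norm1; result maps norm2→orig."""
    composed: list[int] = []
    for i in range(len(second) // 2):
        composed.extend(to_original(first, (second[2 * i], second[2 * i + 1])))
    return composed


# Step 1 (made-up): "a-xb" -> "Ayzb" (delete "-", expand "x" to "yz").
step1 = [0, 1, 2, 3, 2, 3, 3, 4]
# Step 2 (made-up): "Ayzb" -> "Azb" (delete "y").
step2 = [0, 1, 2, 3, 3, 4]

running = compose(step1, step2)
span_in_original = to_original(running, (1, 3))  # "zb" in the final text
```

Here the final "zb" projects back to "xb" in the original input, because "z" came from the expansion of "x".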

to_normalized

to_normalized(span: tuple[int, int]) -> tuple[int, int]

Map half-open original span [start, end) to normalized span (best-effort).

For a span that was entirely deleted, returns a zero-width insertion point at the position in normalized text just before where the deleted content appeared.

Configuration

The validated override surface behind normalize(..., profile=…, **overrides) — see Tuning profiles for the task-oriented view.

NormalizeConfig

Bases: BaseModel

A validated normalize call: a profile plus optional per-knob overrides (stories 39/40).

Frozen and extra="forbid", so a typo'd knob (map_digit= for map_digits=) or an unknown option value fails loudly at construction instead of silently doing nothing — a reproducibility footgun the trust boundary exists to close. Each override is None by default, meaning "use the profile's own default for that step"; setting one rewrites exactly that step when resolve() assembles the effective Profile.

The override surface is the profile name plus per-knob scalars (the shape issue 0016 fixes). map_digits is ML's optional digit fold (story 6) and is valid ONLY with ML: it appends a lossy step, which on any other profile would silently break that profile's contract (LIGHT/CLASSICAL are lossless; SEARCH already folds digits). remove_stopwords is SEARCH's optional stopword removal and is valid only there, because the folded list requires exactly SEARCH's letter folds before it (the RemoveStopwords ordering contract). The remaining knobs patch a step the profile carries; an override that names a step the chosen profile does not contain is rejected by resolve(), so it can never be a silent no-op.

resolve

resolve() -> Profile

Assemble the effective Profile: the named preset with this config's overrides applied.

Pure construction (no per-string work): it patches the matching step specs, appends ML's optional MapDigits and inserts SEARCH's optional RemoveStopwords. Raises ValueError if an override names a step the profile lacks (e.g. emoji= on LIGHT) or a profile that does not own it (map_digits off ML, remove_stopwords off SEARCH), so an override is never a silent no-op.

ProfileName

Bases: StrEnum

The closed set of named profiles normalize accepts (story 39).

A StrEnum so an unknown profile name is rejected at the config boundary with a clear error, rather than only when the pipeline is assembled.

Safety classes (the lossless / lossy split)

SafetyClass

Bases: StrEnum

What kind of information a Step may discard.

ENCODING_REPAIR is lossless and default-on (the LIGHT profile); the other two are lossy and opt-in, so only an all-ENCODING_REPAIR pipeline is lossless. The two lossy classes name what is discarded so the audit (story 41) can report it precisely: LINGUISTIC_FOLDING discards a linguistic distinction within the Arabic text (dediacritization, alef/hamza/teh-marbuta/maqsura folding, digit/punctuation mapping); CLEANING removes non-linguistic noise around it (URLs, mentions, HTML, emoji). The two are siblings, not synonyms: Cleaning is a distinct concern from Normalization (CONTEXT.md), so a URL strip is not "linguistic folding". See ADR-0011.

SafetyReport dataclass

The safety-class audit of a Pipeline: is it lossless, and if not, what does it lose?

Story 41 / ADR-0004. Each field lists the names of the steps in that safety class, in pipeline order, so the report does not merely say that a pipeline is lossy but enumerates which steps lose information and of what kind: linguistic_folding (a distinction within the Arabic text) vs cleaning (non-linguistic noise removal). A pipeline is lossless iff it carries no step of either lossy class, i.e. every step is ENCODING_REPAIR.

lossless property

lossless: bool

True iff no lossy step is present (every step is ENCODING_REPAIR).

lossy_steps property

lossy_steps: tuple[str, ...]

Every lossy step — the linguistic folds followed by the cleaning steps.

Steps

Every step is a pure str -> str transform that precomputes its table or regex at construction and declares its safety class. Steps are grouped here by what they do.

Each step class also exists as a bare function with the same options as keyword arguments (RemoveTatweel() → remove_tatweel(s), FoldTehMarbuta(target="teh") → fold_teh_marbuta(s, target="teh")) for one-off, validation-free use. This is Layer 1 of the API (see Architecture).

Encoding repair (lossless)

NormalizeUnicode dataclass

Compose to a Unicode normalization form (default NFC) — lossless encoding repair.

English: Unicode normalization. Composing to NFC is the canonical first step so visually identical text compares equal regardless of how it was encoded.

StripBidi dataclass

Remove bidi controls, zero-width characters and the BOM — lossless encoding repair.

English: bidi/zero-width stripping. RLM/LRM/ALM and the embedding/isolate controls, the zero-width non-joiner/space/word-joiner, and the BOM are invisible: they carry no Arabic letter content yet break equality and tokenization, so they are deleted outright.

The zero-width JOINER U+200D is the one CONTEXTUAL case: inside an emoji sequence (👨‍👩‍👧, 👨‍⚕️) the joiner is content — deleting it would split the sequence into its component emoji (and alter what a later HandleEmoji sees), so a ZWJ flanked by emoji is KEPT and every other ZWJ is stripped. Residual: a joiner between an emoji and an Arabic letter still goes. That one rule is a regex pass, so unlike the other LIGHT repairs this step is contextual and stays its own pass — it does not join the 0018 fused-translate engine (ADR-0006).

FoldPresentationForms dataclass

Fold Arabic presentation forms back to base letters — lossless encoding repair.

English: presentation-form folding. OCR, legacy encodings and copy-paste leave letters as their contextual presentation glyphs (Forms-A/-B); folding them to the base letters lets such text match normally. The lam-alef ligatures decompose to lam + their matching alef variant (ﻷ → لأ), and combining marks keep their order (a per-character fold, not whole-string NFKC).

translate_table property

translate_table: dict[int, str]

The static str.translate table this step applies — the fused-engine seam (0018).

RemoveTatweel dataclass

Strip tatweel ـ (U+0640) — lossless encoding repair.

English: tatweel / kashida removal. Tatweel only stretches a word visually for justification; deleting it collapses elongated spellings (محـــمد → محمد) without touching any letter or vocalization mark.

translate_table property

translate_table: dict[int, None]

The static str.translate table this step applies — the fused-engine seam (0018).
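One way a static table can serve as a fusion seam is that two tables applied back to back can be merged into one equivalent table. The sketch below is an assumption about how such fusion could work, not a description of the 0018 engine: fuse is a hypothetical helper, and the two tables are small hand-written stand-ins for the tatweel and look-alike steps.

```python
from typing import Optional, Union

Table = dict[int, Union[str, int, None]]

TATWEEL: Table = {0x0640: None}  # delete tatweel
LOOKALIKES: Table = {0x06A9: "\u0643", 0x06CC: "\u064A"}  # keheh→kaf, Farsi yeh→yeh


def fuse(first: Table, second: Table) -> Table:
    """Build one table equivalent to translate(first) then translate(second)."""
    fused: Table = {}
    for code_point, value in first.items():
        if value is None:
            fused[code_point] = None  # deleted by the first pass; second never sees it
        else:
            text = chr(value) if isinstance(value, int) else value
            fused[code_point] = text.translate(second)  # push output through second
    for code_point, value in second.items():
        # Chars the first table leaves alone reach the second table unchanged.
        fused.setdefault(code_point, value)
    return fused


# "mhmd" with two tatweels, a space, then "ktab" typed with a Persian keheh.
text = "\u0645\u062D\u0640\u0640\u0645\u062F \u06A9\u062A\u0627\u0628"
two_passes = text.translate(TATWEEL).translate(LOOKALIKES)
one_pass = text.translate(fuse(TATWEEL, LOOKALIKES))
```

A contextual regex step has no such table, which is why the steps described as contextual stay their own pass.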

UnifyLookalikes dataclass

Unify look-alike kaf/yeh/heh to Arabic letters — lossless encoding repair.

English: look-alike unification. Under the Arabic-language assumption, letters from other Arabic-script orthographies (Persian keheh ک, Farsi yeh ی, the heh-family forms) are encoding artifacts and fold to the Arabic letter (ک→ك, ی→ي, ھ/ہ/ە→ه). One accepted residual: a Persian yeh used word-finally merges على→علي (U+06CC is indistinguishable from alef maqsura).

translate_table property

translate_table: dict[int, str]

The static str.translate table this step applies — the fused-engine seam (0018).

CollapseWhitespace dataclass

Collapse whitespace runs — keeping line breaks by default — lossless encoding repair.

English: whitespace collapse. Each whitespace run collapses to a single character, so equality and tokenization stop depending on how many (or which) spaces a source used: a horizontal run becomes one ASCII space, and a run containing a line break becomes a single "\n". Line structure is preserved by default — flattening it to spaces is lossy, not lossless, so it is opt-in via collapse_lines=True (the recall-oriented behavior SEARCH wants). See ADR-0010. Runs collapse but are not trimmed, so the step stays a fixed point.
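The collapse rule can be sketched with one regex substitution. This is a minimal sketch under assumptions: collapse_whitespace is a hypothetical function, and the set of characters treated as line breaks is a guess, since the source does not enumerate it.

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")
_LINE_BREAKS = frozenset("\n\r\u2028\u2029")  # assumed line-break set


def collapse_whitespace(text: str, collapse_lines: bool = False) -> str:
    def replace_run(match: re.Match) -> str:
        run = match.group()
        if not collapse_lines and any(ch in _LINE_BREAKS for ch in run):
            return "\n"  # a run containing a line break keeps one newline
        return " "  # a horizontal run becomes one ASCII space

    return _WHITESPACE_RUN.sub(replace_run, text)


messy = "x \t y\r\n\r\nz  "
once = collapse_whitespace(messy)
twice = collapse_whitespace(once)
flattened = collapse_whitespace(messy, collapse_lines=True)
```

Edge runs collapse to a single character but are not removed, so a second application changes nothing.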

Trim dataclass

Strip leading and trailing whitespace — lossless encoding repair.

English: trimming. CollapseWhitespace deliberately does NOT trim — collapsing an edge run in place is what keeps it a fixed point — so edge whitespace survives every profile. This separate, explicit step removes it (str.strip(), so every Unicode whitespace counts), keeping both contracts clean: collapse stays a fixed point, trim is its own idempotent operation a caller composes when wanted. Edge whitespace carries no linguistic signal, so safety is ENCODING_REPAIR. Positional (start/end), hence contextual — its own pass, not a 0018 fusion candidate.

Linguistic folding (lossy)

RemoveTashkeel dataclass

Remove tashkeel — diacritics / vocalization marks — by class — lossy linguistic folding.

English: dediacritization. The first lossy step and araclean's headline differentiator: which mark classes to remove is chosen independently (story 26), so a caller can strip harakat while keeping a meaningful shadda, drop only tanween, etc. Removal deletes the marks alone and never a carrier letter (a tanween over an alef goes; the alef stays). safety is LINGUISTIC_FOLDING, so it never runs under LIGHT: it is opt-in via a lossy profile or an explicit step (ADR-0004).

classes defaults to every MarkClass. Sukun rides with HARAKAT (it is the absence of a vowel, not a haraka, but stripping the vowels while leaving a bare sukun is never wanted). The orthographic combining madda U+0653 is removed with MADDA; the alef-with-madda letter آ U+0622 is letter folding (issue 0007), kept here.

position selects WHERE the chosen marks are removed: "all" (the default) everywhere via one str.translate pass; "final" only a WORD-FINAL run of them — the i3rab fold (drop the case vowel, keep the word-internal vocalization: كِتَابٌ → كِتَاب), PyArabic's strip_lastharaka parity. A trailing run followed by an unselected mark counts as word-internal and is kept. "final" is a contextual regex rule, so in that mode the step stays its own pass and does not join the 0018 fused-translate engine (its translate_table raises AttributeError, which the planner reads as "not fusible").
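The difference between the two positions can be sketched as a translate pass versus an anchored regex. This is an illustration under assumptions, not the step itself: remove_tashkeel_sketch is hypothetical, it always selects the marks U+064B through U+0652 (no per-class choice), and "word-final" is approximated as "not followed by an Arabic letter or mark".

```python
import re

# Tanween, harakat, shadda and sukun: U+064B..U+0652.
MARKS = "".join(chr(cp) for cp in range(0x064B, 0x0653))

_DELETE_ALL = dict.fromkeys(map(ord, MARKS))  # {code point: None}
# A run of marks that is NOT followed by an Arabic letter, another selected
# mark, or the (unselected) superscript alef U+0670.
_FINAL_RUN = re.compile(f"[{MARKS}]+(?![\u0621-\u0652\u0670])")


def remove_tashkeel_sketch(text: str, position: str = "all") -> str:
    if position == "all":
        return text.translate(_DELETE_ALL)  # one context-free pass
    if position == "final":
        return _FINAL_RUN.sub("", text)  # contextual: needs the lookahead
    raise ValueError(f"unknown position: {position!r}")


# kaf + kasra, teh + fatha, alef, beh + dammatan (the vocalized word "kitabun").
word = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"
final_only = remove_tashkeel_sketch(word, "final")
everything = remove_tashkeel_sketch(word, "all")
```

In "final" mode only the closing dammatan goes and the word-internal kasra and fatha stay, which is the i3rab fold from the description.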

translate_table property

translate_table: dict[int, None]

The precomputed str.translate deletion table — the fused-engine seam (0018).

Only position="all" IS one translate pass; "final" is contextual, so this raises AttributeError and the fusion planner leaves the step as its own pass.

FoldAlef dataclass

Fold the alef variants أ إ آ ٱ to bare alef ا — lossy linguistic folding.

English: alef folding. The hamza-/madda-bearing alef letters, alef-wasla, and the wavy-hamza alefs collapse to the plain alef (أ/إ/آ/ٱ/ٲ/ٳ → ا), so spelling variation in how an initial alef was written stops splitting otherwise-identical words. It discards a real orthographic distinction, so safety is LINGUISTIC_FOLDING: opt-in via a lossy profile or an explicit step, never under LIGHT. (Historical/manuscript alefs that are not contemporary Arabic — e.g. the high-hamza alef U+0675, the Extended-B annotation alefs — are deliberately left alone.)

translate_table property

translate_table: dict[int, str]

The static str.translate table this step applies — the fused-engine seam (0018).

FoldAlefMaqsura dataclass

Fold alef maqsura ى to yeh ي — lossy linguistic folding.

English: alef-maqsura folding. The dotless final ى (a long-alef sound) folds to yeh ي so the two spellings stop splitting a word. This merges على and علي, a genuine distinction, so the fold is LINGUISTIC_FOLDING and never runs under LIGHT: it is opt-in for recall (SEARCH).

translate_table property

translate_table: dict[int, str]

The static str.translate table this step applies — the fused-engine seam (0018).

FoldHamza dataclass

Fold hamza off its carriers ؤ→و, ئ→ي — separate and configurably aggressive — lossy folding.

English: hamza folding. A toggle kept separate from FoldAlef so hamza can be neutralized on the waw/yeh carriers (ؤ→و, ئ→ي) without folding alef. Folding lightly (the default) folds the carriers and deletes the combining hamza marks U+0654/U+0655 (hamza seated on a carrier — the letter content issue 0006 routes here, not to RemoveTashkeel). Folding heavily (drop_standalone_hamza=True) also drops the standalone hamza ء U+0621 and the high hamza ٴ U+0674. The precomposed alef-hamza letters أ/إ are alef variants, left to FoldAlef. safety is LINGUISTIC_FOLDING.

translate_table property

translate_table: dict[int, str | None]

The precomputed str.translate table — the fused-engine seam (0018).

FoldTehMarbuta dataclass

Fold teh marbuta ة to a configurable target (heh by default) — lossy linguistic folding.

English: teh-marbuta folding. The word-final "tied taa" ة (and its goal form ۃ) folds to TehMarbutaTarget.HEH ه (the common search fold, default), TEH ت (its underlying value), or is left in place with KEEP. ة marks a real grammatical ending, so the fold discards information: safety is LINGUISTIC_FOLDING, never run under LIGHT.

translate_table property

translate_table: dict[int, str]

The precomputed str.translate table (empty for KEEP) — fused-engine seam (0018).

FoldTanweenAlef dataclass

Drop the word-final tanween-fath carrier alef: كتاباً → كتاب — lossy linguistic folding.

English: tanween-alef folding. The adverbial-accusative ending writes its tanween-fath on a carrier alef (كتاباً, or the same pair typed tanween-first as كتابًا); for recall (SEARCH) the whole ending folds away so the inflected spelling matches the bare كتاب. RemoveTashkeel alone cannot do this — it strips only the mark, leaving كتابا, a different spelling — so this step MUST RUN BEFORE dediacritization, while the tanween still marks which alef is a carrier (the SEARCH ordering). A tanween seated directly on a letter (خطأً، مدرسةً) has no carrier and is left to RemoveTashkeel; only the standard fathatan U+064B participates.

It discards a real grammatical ending, so safety is LINGUISTIC_FOLDING: opt-in via SEARCH or an explicit step, never under LIGHT. A contextual re rule (word-final anchoring), so it stays its own pass and is not a candidate for the 0018 fused-translate engine (ADR-0006).
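The ordering constraint is easy to see in a small sketch. The regex below is an assumption about how such a rule could be written, not the step's pattern: fold_tanween_alef_sketch is hypothetical, and "word-final" is approximated as "not followed by an Arabic letter or mark".

```python
import re

ALEF = "\u0627"
FATHATAN = "\u064B"

# The carrier pair in either typing order, after a letter, at the end of a word.
_CARRIER = re.compile(
    f"(?<=[\u0621-\u064A])(?:{ALEF}{FATHATAN}|{FATHATAN}{ALEF})(?![\u0621-\u0652])"
)


def fold_tanween_alef_sketch(text: str) -> str:
    return _CARRIER.sub("", text)


bare = "\u0643\u062A\u0627\u0628"  # the uninflected word
alef_first = bare + ALEF + FATHATAN  # pair typed alef, then tanween
tanween_first = bare + FATHATAN + ALEF  # pair typed tanween, then alef
seated_on_letter = "\u062E\u0637\u0623" + FATHATAN  # tanween on alef-hamza, no carrier

folded_a = fold_tanween_alef_sketch(alef_first)
folded_b = fold_tanween_alef_sketch(tanween_first)

# Wrong order: strip the mark first and the carrier alef can no longer be found.
stripped_first = alef_first.translate({ord(FATHATAN): None})
too_late = fold_tanween_alef_sketch(stripped_first)
```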

MapDigits dataclass

Convert digits among Arabic-Indic / Extended / ASCII to a target — lossy linguistic folding.

English: digit mapping. Every digit — Arabic-Indic ٠-٩, Extended (Persian/Urdu) ۰-۹, or ASCII 0-9 — is rewritten to the chosen DigitTarget by numeric value, so numbers parse and match consistently regardless of how they were typed (story 31). The default target is ASCII. The map erases which script a digit was written in, so safety is LINGUISTIC_FOLDING: opt-in via a lossy profile or an explicit step, never under LIGHT.

The dedicated Arabic number separators (decimal ٫ U+066B, thousands ٬ U+066C) are NOT digits, so by default they stay — ١٢٫٥ becomes the mixed-script 12٫5. The opt-in map_separators=True also rewrites a separator when digit-flanked on BOTH sides (٫ → ., ٬ → ,; the inverse of MapPunctuation's guard), giving 12.5; a stray separator outside a number is never touched. That guard is a contextual regex, so with the knob on the step stays its own pass and does not join the 0018 fused-translate engine (its translate_table raises AttributeError, which the planner reads as "not fusible").
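A sketch of the two behaviors, restricted to the ASCII target. map_digits_sketch is a hypothetical function, and running the separator guard after the digit map (so the lookarounds only need ASCII digits) is a simplification chosen for this example.

```python
import re

# Arabic-Indic (U+0660..) and Extended (U+06F0..) digits to ASCII by value.
_TO_ASCII = {
    base + value: str(value) for base in (0x0660, 0x06F0) for value in range(10)
}
_SEPARATORS = {"\u066B": ".", "\u066C": ","}  # decimal, thousands
# A separator counts only when a digit sits on BOTH sides.
_FLANKED = re.compile(r"(?<=[0-9])([\u066B\u066C])(?=[0-9])")


def map_digits_sketch(text: str, map_separators: bool = False) -> str:
    out = text.translate(_TO_ASCII)  # context-free digit pass
    if map_separators:
        out = _FLANKED.sub(lambda m: _SEPARATORS[m.group(1)], out)
    return out


number = "\u0661\u0662\u066B\u0665"  # 12, Arabic decimal separator, 5
default_result = map_digits_sketch(number)
with_separators = map_digits_sketch(number, map_separators=True)
stray = map_digits_sketch("\u066B", map_separators=True)
```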

translate_table property

translate_table: dict[int, str]

The precomputed str.translate table this step applies — fused-engine seam (0018).

Only the pure digit map IS one translate pass; with map_separators=True the digit-flanked guard is contextual, so this raises AttributeError and the fusion planner leaves the step as its own pass.

MapPunctuation dataclass

Map Arabic punctuation ، ؛ ؟ to Latin , ; ? — number-separator-safe — lossy folding.

English: punctuation mapping. The Arabic comma ،, semicolon ؛ and question mark ؟ fold to their Latin equivalents so one tokenizer/sentence-splitter works on Arabic text (story 32). A mark sitting between two digits is a numeric separator (e.g. a thousands-grouped number) and is preserved, not turned into sentence punctuation; the dedicated decimal/thousands/date separators are never touched. The fold erases the script of the punctuation, so safety is LINGUISTIC_FOLDING, never run under LIGHT.
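The digit guard can be sketched with a replacement callback that inspects both neighbors. map_punctuation_sketch is a hypothetical function, and using str.isdigit for the neighbor test is an assumption about what counts as a digit.

```python
import re

_LATIN = {"\u060C": ",", "\u061B": ";", "\u061F": "?"}  # comma, semicolon, question mark
_ARABIC_MARKS = re.compile("[\u060C\u061B\u061F]")


def map_punctuation_sketch(text: str) -> str:
    def replace_mark(match: re.Match) -> str:
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        if before.isdigit() and after.isdigit():
            return match.group()  # between two digits: a numeric separator, keep it
        return _LATIN[match.group()]

    return _ARABIC_MARKS.sub(replace_mark, text)


grouped_number = "1\u060C000"
sentence = "\u0646\u0639\u0645\u060C \u0644\u0627\u061F"  # two words, comma, question mark
mapped_sentence = map_punctuation_sketch(sentence)
```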

RemovePunctuation dataclass

Delete every Unicode punctuation character (category P*) — lossy linguistic folding.

English: punctuation removal. The bag-of-words / classification staple every incumbent ships: for token-frequency features, punctuation is noise. One stated principle: a code point is removed iff its Unicode general category is P (Po/Pd/Ps/Pe/Pi/Pf/Pc) — which covers the Arabic marks ، ؛ ؟ ٪ ۔ as much as ASCII and every other script's punctuation, re-derived from the live UCD so it tracks Unicode releases. Symbols (S: $ + = ~), digits and letters are not punctuation and pass through. keep carves out characters to preserve (each entry one character).

Distinct from MapPunctuation (which REWRITES the three Arabic sentence marks to their Latin equivalents for tokenizer uniformity): this step DELETES, so the two compose — map first if you want the Latin marks, or just remove everything. Deleting sentence structure is lossy, so safety is LINGUISTIC_FOLDING: opt-in, never under LIGHT. The whole behavior is one str.translate, so it is fusible (0018).
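The "general category P" principle can be sketched with the standard unicodedata module. punctuation_table is a hypothetical helper; it scans every code point once, so a real step would build the table a single time rather than per call.

```python
import sys
import unicodedata


def punctuation_table(keep: str = "") -> dict[int, None]:
    """Deletion table for every code point whose general category starts with P."""
    kept = {ord(ch) for ch in keep}
    return {
        code_point: None
        for code_point in range(sys.maxunicode + 1)
        if code_point not in kept
        and unicodedata.category(chr(code_point)).startswith("P")
    }


sample = "a, b! $5 + 1\u061F"  # ends with the Arabic question mark
no_punctuation = sample.translate(punctuation_table())
keep_bang = sample.translate(punctuation_table(keep="!"))
```

The dollar and plus signs survive because their categories are Sc and Sm, not P.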

translate_table property

translate_table: dict[int, None]

The precomputed str.translate deletion table — the fused-engine seam (0018).

MapQuotes dataclass

Fold typographic quotation marks to the straight ASCII pair — lossy linguistic folding.

English: quote normalization. Arabic text quotes with guillemets «», and word processors emit the curly/low-9 variants; folding them all to " / ' (by visual family — double to double, single to single) gives a tokenizer one quote vocabulary. It erases the quote style, so safety is LINGUISTIC_FOLDING: opt-in via an explicit step, never under LIGHT and in no built-in profile. One str.translate pass, so it is fusible (0018).

translate_table property

translate_table: dict[int, str]

The static str.translate table this step applies — the fused-engine seam (0018).

ReduceElongation dataclass

Collapse runs of >= min_run repeated Arabic letters to cap copies — lossy folding.

English: elongation reduction. Word-lengthening repeats a letter for emphasis (جمييييل, راااائع); this collapses such a run so emphatic spellings stop exploding the vocabulary. Two knobs, because the TRIGGER and the TARGET are different decisions:

  • min_run — what counts as elongation: only a run of at least min_run copies collapses. Defaults to max(cap + 1, 3): Arabic spells true doubled letters constantly (the assimilated definite article الله/اللغة, verb prefixes تتكلم, lexical doubles ممكن/مما), while a TRIPLED letter is virtually nonexistent in real spelling — so 3+ is the safe elongation signal (the literature's standard rule) and a double is presumed legitimate. A 2-copy emphatic spelling is indistinguishable from a legitimate double without a lexicon, so it is deliberately left alone.
  • cap — what a run reduces to: cap=1 (the default) collapses to the canonical single letter, so جمييييل merges with جميل (what ML/SEARCH want); cap=2 keeps a doubled letter so emphasis survives as a feature (what SOCIAL wants — its trigger is already 3, so its behavior is identical with the default min_run).

Only contemporary Arabic letters are capped; digits are never touched (a repeated digit is a number, not emphasis, so 1000 stays 1000), nor are tashkeel marks or tatweel. The fold discards the emphasis, so safety is LINGUISTIC_FOLDING: opt-in via a lossy profile or an explicit step, never under LIGHT. It is a contextual re rule, so it stays its own pass and is not a candidate for the 0018 fused-translate engine (ADR-0006).
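The trigger/target split maps onto a backreference pattern and a replacement length. This is a sketch under assumptions: reduce_elongation_sketch is hypothetical, and "contemporary Arabic letters" is approximated as U+0621..U+063A plus U+0641..U+064A, which skips tatweel (U+0640) and the marks.

```python
import re
from typing import Optional

_LETTER = "[\u0621-\u063A\u0641-\u064A]"  # assumed letter set, tatweel excluded


def reduce_elongation_sketch(
    text: str, cap: int = 1, min_run: Optional[int] = None
) -> str:
    if min_run is None:
        min_run = max(cap + 1, 3)  # the default trigger from the description
    # One letter followed by at least (min_run - 1) more copies of itself.
    pattern = re.compile(f"({_LETTER})\\1{{{min_run - 1},}}")
    return pattern.sub(lambda m: m.group(1) * cap, text)


stretched = "\u062C\u0645\u064A\u064A\u064A\u064A\u0644"  # jeem, meem, yeh x4, lam
to_single = reduce_elongation_sketch(stretched)
to_double = reduce_elongation_sketch(stretched, cap=2)
true_double = "\u0627\u0644\u0644\u0647"  # alef, lam, lam, heh: a legitimate double
```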

RemoveStopwords dataclass

Remove curated Arabic stopwords — function-word filtering — lossy linguistic folding.

English: stopword removal. Deletes whole-token occurrences of the bundled, versioned Arabic stopword list (araclean.stopwords) — prepositions, pronouns, demonstratives, relative pronouns, neutral conjunctions and particles — so high-frequency function words stop drowning out content words (IR/retrieval, bag-of-words features). It discards linguistic content from the Arabic text, so safety is LINGUISTIC_FOLDING: opt-in via a lossy profile or an explicit step, never under LIGHT (it is content removal, not non-linguistic-noise cleaning — ADR-0011).

Two deliberate properties (story 37): the list is flat, not clitic-aware (ADR-0001), so a prefixed/suffixed form like والكتاب / فيها is kept — only a standalone token is removed; and it is negation-safe — the polarity particles ما / لا / لم / لن / ليس are excluded so removal can never flip a sentence's polarity. A removed token leaves its whitespace as written (a gap), like the other delete-style steps (CleanURLs); a later CollapseWhitespace tidies the gaps. The list version is serialized so a Profile pins it reproducibly.

ORDERING CONTRACT (enforced): the list ships in FOLDED form (araclean.stopwords), so this step must run AFTER dediacritization and the letter folds — requires_before names them, and Pipeline rejects at construction any pipeline where they do not precede this step. Folding first is what makes matching robust: real typed Arabic routinely omits hamza (انا، الى) and vocalized text never matches a bare list, but after RemoveTashkeel + FoldAlef + FoldAlefMaqsura + FoldHamza every spelling variant lands on the one folded form the list carries. The folds are idempotent and cheap, so a pipeline over already-normalized text simply includes them as no-ops.

Cleaning (lossy)

CleanURLs dataclass

Remove URLs or replace them with a placeholder token — cleaning (non-linguistic noise).

English: URL cleaning. A scheme- (http/https) or www.-prefixed run is metadata noise, not Arabic content, so it is DELETEd (the default) or, in PLACEHOLDER mode, replaced by the placeholder token — the AraBERT [رابط]/[URL] expectation, kept first-class. The default token is the English [URL]; pass placeholder="[رابط]" for the Arabic one. Matching is conservative (only http(s):// and www. anchor it), so ordinary prose is safe.

safety is CLEANING: it discards non-linguistic noise, a sibling of linguistic folding, so it never runs under LIGHT — opt-in via a lossy profile (SOCIAL) or an explicit step (ADR-0011).

CleanMentions dataclass

Remove @mentions or replace them with a placeholder token — cleaning (non-linguistic noise).

English: mention cleaning. An @-handle is metadata noise, so it is DELETEd (the default) or, in PLACEHOLDER mode, replaced by the placeholder token (the AraBERT [مستخدم]/[MENTION] expectation, kept first-class; the default token is the English [MENTION]). A handle is @ plus Unicode word characters, so an Arabic handle @محمد is matched as readily as @user; a bare @ with no following word character is left alone. An EMAIL ADDRESS is recognized before the mention shape and kept verbatim: user@example.com is an address, not a mention to rewrite into user[MENTION].com. The email shape requires a dotted domain, so the dotless user@example still has its host read as a mention (documented residual).

safety is CLEANING: it discards non-linguistic noise, never run under LIGHT (ADR-0011).
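The "email first, then mention" rule can be sketched as a single alternation whose first branch wins. The pattern is an assumption made for illustration, not the step's regex, and clean_mentions_sketch is a hypothetical function.

```python
import re
from typing import Optional

# Branch 1: an address with a dotted domain. Branch 2: @ plus word characters.
_EMAIL_OR_MENTION = re.compile(r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)|@\w+")


def clean_mentions_sketch(text: str, placeholder: Optional[str] = None) -> str:
    def replace(match: re.Match) -> str:
        if match.group("email"):
            return match.group()  # an address is kept verbatim
        return placeholder or ""  # a mention is deleted or replaced

    return _EMAIL_OR_MENTION.sub(replace, text)


deleted = clean_mentions_sketch("hi @user!")
address = clean_mentions_sketch("mail user@example.com")
dotless = clean_mentions_sketch("user@example", "[MENTION]")
arabic_handle = clean_mentions_sketch("@\u0645\u062D\u0645\u062F ok", "[MENTION]")
```

The dotless case reproduces the documented residual: without a dotted domain the host is read as a mention.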

CleanHashtags dataclass

Segment, remove, or replace #hashtags — cleaning (social-metadata markup), no-op when kept.

English: hashtag handling. A #tag is social metadata wrapping real words — in Arabic social text often a full phrase (#اليوم_الوطني_السعودي). The default SEGMENT mode applies the entrenched AraBERT recipe: drop the #, map _ to a space, so the words stay in the text as content (what SOCIAL pins). DELETE removes the tag outright; PLACEHOLDER swaps in the placeholder token (default the English [HASHTAG]; pass an Arabic one explicitly); KEEP leaves tags untouched, so a config override can pin "do not touch hashtags".

A tag is # plus Unicode word characters (Arabic matches as readily as Latin; _ is a word character, so multi-word tags match whole). In SOCIAL, CleanURLs runs FIRST, so a URL fragment (…/page#section) is gone before this step could read it as a tag. safety is mode-dependent, like HandleEmoji: KEEP is a lossless no-op (ENCODING_REPAIR); the rewriting modes are CLEANING (ADR-0011).
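The four modes can be sketched around one pattern. clean_hashtags_sketch and its string mode names are hypothetical stand-ins for the step and its option enum.

```python
import re

_TAG = re.compile(r"#(\w+)")  # "_" is a word character, so multi-word tags match whole


def clean_hashtags_sketch(
    text: str, mode: str = "segment", placeholder: str = "[HASHTAG]"
) -> str:
    if mode == "keep":
        return text  # lossless no-op
    if mode == "segment":
        # Drop the "#", turn "_" into a space, keep the words as content.
        return _TAG.sub(lambda m: m.group(1).replace("_", " "), text)
    if mode == "delete":
        return _TAG.sub("", text)
    if mode == "placeholder":
        return _TAG.sub(placeholder, text)
    raise ValueError(f"unknown mode: {mode!r}")


arabic_tag = "#\u0627\u0644\u064A\u0648\u0645_\u0627\u0644\u0648\u0637\u0646\u064A"
segmented = clean_hashtags_sketch(arabic_tag)
latin = clean_hashtags_sketch("see #big_data now")
```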

CleanHTML dataclass

Strip HTML tags and unescape entities — cleaning (non-linguistic noise).

English: HTML cleaning. Markup is noise around the text: each tag is DELETEd (the default, so you keep the inner text) or, in PLACEHOLDER mode, replaced by the placeholder token, and HTML entities are always unescaped (&amp; → &, &lt; → <), which a tag-only strip would miss. Tags are removed BEFORE unescaping, so an intentionally escaped &lt;b&gt; stays literal text instead of being decoded into a <b> tag and then stripped.

safety is CLEANING: it discards non-linguistic noise, never run under LIGHT (ADR-0011). Strict idempotence does not hold over arbitrary text: html.unescape decodes only one level, so a multiply-encoded entity (&amp;amp; → &amp; → &) changes on each pass. On realistic single-encoded markup, however, the step is a fixed point.

SCOPE BOUNDARY: this is a tag stripper, not an HTML parser. The content of a container element survives even when its tags go — including <script> and <style>, whose JavaScript/CSS text is kept as text. Fine for the social-snippet case this step serves; for web-scrape corpora, strip script/style containers with a real HTML parser before araclean.
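The order of operations, the one-level unescape, and the scope boundary all show up in a two-line sketch. The tag pattern is an assumption (the source does not give the step's regex), and clean_html_sketch is a hypothetical function covering only the delete mode.

```python
import html
import re

_TAG = re.compile(r"<[^<>]+>")  # assumed tag shape: angle brackets with no nesting


def clean_html_sketch(text: str) -> str:
    # Strip tags first, then unescape, so escaped markup survives as text.
    return html.unescape(_TAG.sub("", text))


inner_text = clean_html_sketch("<b>hi</b> &amp; bye")
escaped_markup = clean_html_sketch("&lt;b&gt;bold&lt;/b&gt;")
first_pass = clean_html_sketch("&amp;amp;")
second_pass = clean_html_sketch(first_pass)
script_body = clean_html_sketch("<script>x=1</script>")
```

The last call shows the scope boundary: the script tags go, but the JavaScript between them stays as text.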

HandleEmoji dataclass

Keep, strip, or demojize emoji — cleaning (non-linguistic noise), or a no-op when kept.

English: emoji handling. Social text carries affective signal in emoji, so the default KEEP leaves them untouched (a lossless no-op). STRIP removes them; DEMOJIZE rewrites each to its text alias (😍 → :smiling_face_with_heart_eyes:) so the signal survives as words a tokenizer can read. safety is therefore mode-dependent: KEEP is ENCODING_REPAIR (lossless, safe under LIGHT); STRIP/DEMOJIZE are CLEANING (opt-in noise removal — ADR-0011).

DEMOJIZE needs the optional emoji library (the [emoji] extra), resolved once at construction so the per-string call stays setup-free; building a DEMOJIZE step without the extra raises EmojiSupportNotInstalledError. KEEP/STRIP need no dependency: STRIP recognizes emoji from a built-in Unicode set, so the lean core covers it. It strips a whole ZWJ sequence but leaves a standalone joiner alone, since that is invisible formatting owned by StripBidi.

RemoveForeign dataclass

Remove non-Arabic-script letter spans or replace them with a placeholder token — cleaning.

English: foreign-span removal. The standard Arabic corpus-prep filter: for an Arabic corpus, embedded foreign words are noise, so a maximal run of non-Arabic-script LETTERS (category L* outside the Arabic blocks, with any combining marks riding along — a decomposed café travels whole) is DELETEd (default) or, in PLACEHOLDER mode, replaced by the placeholder token (default the English [FOREIGN]; pass [أجنبي] explicitly). A span must START with a letter, so a lone combining mark — the VS16 after an emoji, a stray accent — never opens a span and emoji are untouched. Digits, punctuation, whitespace and symbols pass through: this filters foreign WORDS, not structure (RemovePunctuation / MapDigits own those concerns).

safety is CLEANING: for the Arabic-corpus contract, non-Arabic-script content is surrounding noise like a URL, not an Arabic-internal distinction (ADR-0011), and like every cleaning step it is opt-in, never under LIGHT. Deletion leaves whitespace gaps; a later CollapseWhitespace tidies them. The step is a contextual rule over a UCD-derived span pattern (built lazily, once per process), so it stays its own pass (ADR-0006).
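The span rule described for RemoveForeign (a span opens on a non-Arabic letter, and combining marks ride along) can be approximated with a small scanner. This is a sketch under stated assumptions, not the library's UCD-derived pattern: it treats a letter as Arabic when its Unicode name starts with "ARABIC", and the name remove_foreign_sketch is hypothetical.

```python
import unicodedata
from typing import Optional


def _is_foreign_letter(ch: str) -> bool:
    # Assumption: "Arabic script" is approximated by the Unicode character name.
    is_letter = unicodedata.category(ch).startswith("L")
    return is_letter and not unicodedata.name(ch, "").startswith("ARABIC")


def _is_mark(ch: str) -> bool:
    return unicodedata.category(ch).startswith("M")


def remove_foreign_sketch(text: str, placeholder: Optional[str] = None) -> str:
    out = []
    i, n = 0, len(text)
    while i < n:
        if _is_foreign_letter(text[i]):
            # A span must START with a letter; marks only extend an open span.
            j = i + 1
            while j < n and (_is_foreign_letter(text[j]) or _is_mark(text[j])):
                j += 1
            if placeholder is not None:
                out.append(placeholder)
            i = j
        else:
            # Digits, punctuation, whitespace, symbols and lone marks pass through.
            out.append(text[i])
            i += 1
    return "".join(out)


print(remove_foreign_sketch("كتاب book جديد", placeholder="[FOREIGN]"))
print(repr(remove_foreign_sketch("cafe\u0301 2024")))  # ' 2024'
```

In delete mode the first call would leave two adjacent spaces where the word was, which is the gap a later CollapseWhitespace tidies.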

Step options

The closed option sets the configurable steps accept (as enum members or their string values).

MarkClass

Bases: StrEnum

A class of tashkeel marks RemoveTashkeel can remove independently (story 26).

English: diacritic class. The vocalization-mark taxonomy (GLOSSARY: Tashkeel) split into the units a caller selects between. SUKUN is not a member — it is the vowelless mark (the absence of a vowel, not a haraka), removed together with HARAKAT for convenience and not selectable on its own (GLOSSARY: Harakat).

TehMarbutaTarget

Bases: StrEnum

What FoldTehMarbuta rewrites the teh marbuta ة to (story 29).

English: teh-marbuta target. HEH (the common search fold, default) and TEH (its underlying value) are the standard targets; KEEP leaves ة in place so a profile can pin "do not fold".

DigitTarget

Bases: StrEnum

Which digit system MapDigits converts every digit to (story 31).

English: digit target. ASCII (default) makes numbers parse and match consistently; the two Arabic systems are ARABIC_INDIC (Eastern ٠-٩) and EXTENDED_ARABIC_INDIC (Persian/Urdu ۰-۹).
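As an illustration of what mapping every digit to one target system involves, here is a minimal sketch built on str.translate. The lowercase keys ("ascii", "arabic_indic", "extended_arabic_indic") are assumed string values for the enum members, and build_digit_table is a hypothetical helper, not part of araclean.

```python
def _digits(start: int) -> str:
    """Ten consecutive code points starting at `start`."""
    return "".join(chr(start + i) for i in range(10))


# Assumed string values for the three DigitTarget members.
DIGIT_SYSTEMS = {
    "ascii": "0123456789",
    "arabic_indic": _digits(0x0660),           # Eastern digits, U+0660..U+0669
    "extended_arabic_indic": _digits(0x06F0),  # Persian/Urdu digits, U+06F0..U+06F9
}


def build_digit_table(target: str) -> dict:
    """Build a translate table sending the other two systems to `target`."""
    dest = DIGIT_SYSTEMS[target]
    table = {}
    for name, digits in DIGIT_SYSTEMS.items():
        if name == target:
            continue
        for src, dst in zip(digits, dest):
            table[ord(src)] = dst
    return table


# Built once, reused for every string.
TO_ASCII = build_digit_table("ascii")

mixed = _digits(0x0660)[2] + _digits(0x06F0)[4] + "7"
print(mixed.translate(TO_ASCII))  # 247
```

Building the table once and reusing it keeps the per-string call free of setup work.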

CleanMode

Bases: StrEnum

Whether a cleaning step deletes the matched noise or replaces it with a placeholder token.

English: cleaning mode. DELETE (default) removes the span outright; PLACEHOLDER swaps in a fixed token (e.g. [URL]) so a model keeps "a link was here" as a feature without a noisy unique value — the entrenched AraBERT expectation, so it is first-class, not just delete.

EmojiMode

Bases: StrEnum

How HandleEmoji treats emoji (story 35).

English: emoji handling. KEEP (default) leaves emoji in place so affective signal survives; STRIP removes them; DEMOJIZE replaces each with its text alias (😍 → :heart_eyes:), which needs the optional emoji library (the [emoji] extra).

HashtagMode

Bases: StrEnum

How CleanHashtags treats a #hashtag (roadmap Phase 1).

English: hashtag handling. SEGMENT (default — the entrenched AraBERT recipe) drops the # and maps _ to a space, so the tag's words survive as text; DELETE removes the tag; PLACEHOLDER swaps in a fixed token; KEEP leaves it untouched (the no-op a config override can pin).
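The four modes can be sketched with one regular expression, relying on Python's \w being Unicode-aware for str patterns (so Arabic letters and _ both match). The function name, the lowercase mode strings, and the [HASHTAG] token are assumptions for illustration; the library's actual defaults are not stated in this reference.

```python
import re

# '#' followed by Unicode word characters (letters, digits, underscore).
_HASHTAG = re.compile(r"#(\w+)")


def clean_hashtags_sketch(text: str, mode: str = "segment",
                          placeholder: str = "[HASHTAG]") -> str:
    if mode == "keep":
        return text  # lossless no-op
    if mode == "delete":
        return _HASHTAG.sub("", text)
    if mode == "placeholder":
        return _HASHTAG.sub(lambda _m: placeholder, text)
    if mode == "segment":
        # Drop the '#' and turn '_' into a space so the tag's words survive.
        return _HASHTAG.sub(lambda m: m.group(1).replace("_", " "), text)
    raise ValueError(f"unknown mode: {mode!r}")


print(clean_hashtags_sketch("go #big_win"))                      # go big win
print(clean_hashtags_sketch("go #big_win", mode="placeholder"))  # go [HASHTAG]
print(clean_hashtags_sketch("#كرة_القدم"))
```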

Extension & protocol types

Step

Bases: Protocol

A single normalization transform: a safety class plus a pure str -> str call.

The reserved, optional alignment hook apply_aligned(s) -> (str, OffsetMap) (ADR-0005) is intentionally absent from this contract; a Pipeline detects it when present and raises a clear error otherwise. Steps precompute any table/regex at construction so __call__ does no setup and no validation (ADR-0006).

A step satisfies the contract by exposing a readable safety attribute — the natural idiom is a class-level safety = SafetyClass.… assignment (what built-in and custom steps both use). It is a read-only member: a step's safety class is an intrinsic trait, never reassigned, so a class variable, a frozen field (when the class varies it by config, like HandleEmoji), or a property all satisfy it.
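A sketch of the two idioms mentioned above, using local stand-ins so it runs without araclean installed. SafetyClass and Step here are simplified re-declarations (the enum values and the runtime_checkable decorator are assumptions), and CollapseBangs and ToyEmojiStep are hypothetical steps, not library classes.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Protocol, runtime_checkable


class SafetyClass(Enum):
    """Local stand-in with only the two classes named in this reference."""
    ENCODING_REPAIR = "encoding_repair"
    CLEANING = "cleaning"


@runtime_checkable
class Step(Protocol):
    """Stand-in contract: a readable safety attribute plus a str -> str call."""

    @property
    def safety(self) -> SafetyClass: ...

    def __call__(self, text: str) -> str: ...


class CollapseBangs:
    """Hypothetical custom step: collapse runs of '!' to a single '!'."""

    safety = SafetyClass.CLEANING  # class-level assignment, never reassigned

    def __init__(self) -> None:
        # The regex is compiled at construction, so __call__ does no setup.
        self._pattern = re.compile(r"!{2,}")

    def __call__(self, text: str) -> str:
        return self._pattern.sub("!", text)


@dataclass(frozen=True)
class ToyEmojiStep:
    """Hypothetical step whose safety class depends on its configuration."""

    mode: str = "keep"

    @property
    def safety(self) -> SafetyClass:
        if self.mode == "keep":
            return SafetyClass.ENCODING_REPAIR
        return SafetyClass.CLEANING

    def __call__(self, text: str) -> str:
        if self.mode == "keep":
            return text
        return text.replace("\U0001F60D", "")  # toy strip of a single emoji


print(isinstance(CollapseBangs(), Step))  # True
print(CollapseBangs()("wow!!!"))          # wow!
print(ToyEmojiStep(mode="strip").safety)  # SafetyClass.CLEANING
```

Both classes satisfy the stand-in protocol structurally; neither inherits from it.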

Stopword data

The curated, versioned list behind RemoveStopwords — see the stopwords guide. STOPWORDS (the matching set), STOPWORDS_VERSION, STOPWORDS_LICENSE, and NEGATION_PARTICLES (the polarity particles deliberately excluded) are importable from the module:

stopwords

The curated, versioned Arabic stopword list backing RemoveStopwords (issue 0017).

Provenance — this list is freshly authored for araclean from common Modern Standard Arabic function words (prepositions, pronouns, demonstratives, relative pronouns, conjunctions, and neutral particles). It is not derived from the GPL-licensed Arabic-Stopwords package or any other copyleft source, so it does not encumber araclean's MIT core; the list itself is dedicated to the public domain under CC0-1.0 (STOPWORDS_LICENSE). It carries a STOPWORDS_VERSION so a Profile can pin an exact list and stopword removal stays reproducible across releases.

Design properties (surfaced in the docs, issue 0023):

  • FOLDED FORM — run AFTER the letter folds (list version 2). Every entry is stored in its letter-folded spelling (no hamza-bearing alef أ/إ, no alef maqsura ى, no hamza carrier ؤ/ئ — the output of FoldAlef + FoldAlefMaqsura + FoldHamza), and RemoveStopwords requires those folds plus RemoveTashkeel to run before it (Pipeline enforces this at construction). The pipeline itself is the spelling-variant generator: the canonical إلى, the routinely-typed hamza-less الى, and the vocalized إلَى all fold to the one entry الى→الي, so matching is robust without enumerating variants. _validated enforces fold-stability per entry at import.
  • Flat, not clitic-aware (ADR-0001 — no morphology). Each entry is a whole bare token; the list does not know about proclitics/enclitics, so والكتاب (and-the-book) and فيها (in-it) are kept — only a standalone و / في token would be removed.
  • Negation-safe by default. The polarity-bearing particles ما / لا / لم / لن / ليس (NEGATION_PARTICLES) are deliberately excluded, so RemoveStopwords can never silently flip the sentiment of a sentence by deleting its negation (story 37).
  • Homograph policy — one stated principle. An entry is kept iff its FUNCTION-word reading dominates its content readings by token frequency in running text (in the folded spelling, since folding merges words). Dropped under that principle: أم (or — but commonly mother: أم محمد), كم (how many — but also km), نفس (same — but commonly soul/self/breath), كأن (as if — folds onto كان was, one of the most frequent verbs), أية (any fem. — rare standalone, and its folded ايه is the dialectal what?). Kept despite a known collision: علي (the fold of the preposition على — the preposition dwarfs the name Ali in running text; the residual is that a bare standalone علي is removed), لان (the fold of لأن because — the verb reading softened is rare), ان (the shared fold of إن and أن — the rare آن time also lands there and is accepted).
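The flat and negation-safe properties above can be illustrated with a toy filter. This is a sketch, not RemoveStopwords itself: the six entries are a tiny subset taken from the examples in this section (already in folded spelling), whitespace tokenization is an assumption, and the _SKETCH names are hypothetical.

```python
# A few folded-form entries mentioned in this section; the real list is larger.
STOPWORDS_SKETCH = frozenset({"في", "و", "الي", "علي", "ان", "لان"})

# The polarity particles the section says are deliberately excluded.
NEGATION_PARTICLES_SKETCH = frozenset({"ما", "لا", "لم", "لن", "ليس"})

# Negation-safe by construction: the two sets never overlap.
assert STOPWORDS_SKETCH.isdisjoint(NEGATION_PARTICLES_SKETCH)


def remove_stopwords_sketch(text: str) -> str:
    """Drop whole tokens that exactly equal a stopword entry (no morphology)."""
    kept = [tok for tok in text.split() if tok not in STOPWORDS_SKETCH]
    return " ".join(kept)


# A standalone folded preposition is removed.
print(remove_stopwords_sketch("ذهب الي البيت"))

# Cliticized forms are whole different tokens, so they are kept.
print(remove_stopwords_sketch("والكتاب فيها"))

# The negation particle survives; the standalone preposition does not.
print(remove_stopwords_sketch("لا اريد في"))
```

Because matching is exact on folded tokens, the input is expected to have passed through the letter folds and RemoveTashkeel first, as the folded-form bullet requires.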

DataFrame accessors

Installed by importing araclean.pandas / araclean.polars (each needs its extra) — see pandas & polars.

AracleanAccessor

The .araclean Series accessor: series.araclean.normalize(profile=..., **overrides).

pandas instantiates this with the Series the accessor was reached through.

normalize

normalize(
    *, profile: str | None = None, **overrides: object
) -> pd.Series[Any]

Normalize each value in the Series with a named profile (default LIGHT) + overrides.

Equivalent to series.map(lambda x: normalize(x, profile=..., **overrides)) but builds the pipeline once. profile and the per-knob **overrides (e.g. map_digits=True, emoji="strip") are validated through the config trust boundary, so an unknown profile, knob, or value raises the same clear error as the normalize facade. Missing values (NaN/None) pass through unchanged (na_action="ignore"); empty strings normalize to empty strings.

AracleanNamespace

The .araclean Series namespace: series.araclean.normalize(profile=..., **overrides).

polars instantiates this with the Series the namespace was reached through.

normalize

normalize(
    *, profile: str | None = None, **overrides: object
) -> pl.Series

Normalize each value in the Series with a named profile (default LIGHT) + overrides.

Equivalent to mapping normalize(x, profile=..., **overrides) element-wise but builds the pipeline once. profile and the per-knob **overrides (e.g. map_digits=True, emoji="strip") are validated through the config trust boundary, so an unknown profile, knob, or value raises the same clear error as the normalize facade. Null values pass through unchanged (map_elements skips nulls); empty strings normalize to empty strings. The result is a String Series, matching the pandas accessor value-for-value.

Errors

AlignmentNotSupportedError

Bases: NotImplementedError

Raised when offsets/alignment are requested through a step that lacks apply_aligned.

Offset tracking is reserved but not implemented in v1 (ADR-0005). Subclasses NotImplementedError so callers probing for the capability can fall back.

EmojiSupportNotInstalledError

Bases: ImportError

Raised when HandleEmoji(mode="demojize") is built without the optional emoji extra.

KEEP/STRIP need no dependency; only DEMOJIZE requires the emoji library, kept out of the lean MIT core (ADR-0003). Subclasses ImportError so a caller probing for the capability can catch it; the message says how to install the extra.
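Because the error subclasses ImportError, a caller can probe for the capability and fall back with an ordinary except clause. The sketch below simulates that with a local stand-in class and a hypothetical build_emoji_step function that pretends the extra is missing; it does not call araclean.

```python
class EmojiSupportNotInstalledError(ImportError):
    """Local stand-in mirroring the documented base class."""


def build_emoji_step(mode: str) -> str:
    """Hypothetical builder that simulates an environment without the emoji extra."""
    if mode == "demojize":
        raise EmojiSupportNotInstalledError("the optional emoji extra is not installed")
    return mode  # a real builder would return a step object


def pick_emoji_mode(preferred: str = "demojize", fallback: str = "strip") -> str:
    try:
        return build_emoji_step(preferred)
    except ImportError:
        # Catching the base class also catches the subclass.
        return build_emoji_step(fallback)


print(pick_emoji_mode())        # strip
print(pick_emoji_mode("keep"))  # keep
```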

CLIExtraNotInstalledError

Bases: ImportError

Raised when the CLI is built without the optional [cli] extra (Typer) installed.

Subclasses ImportError (so a caller probing for the capability can catch it); the message says how to install it. Mirrors EmojiSupportNotInstalledError for the [emoji] extra.

PandasExtraNotInstalledError

Bases: ImportError

Raised when the pandas accessor is used without the optional [pandas] extra installed.

Subclasses ImportError (so a caller probing for the capability can catch it); the message says how to install it. Mirrors EmojiSupportNotInstalledError / CLIExtraNotInstalledError.

PolarsExtraNotInstalledError

Bases: ImportError

Raised when the polars namespace is used without the optional [polars] extra installed.

Subclasses ImportError (so a caller probing for the capability can catch it); the message says how to install it. Mirrors PandasExtraNotInstalledError / CLIExtraNotInstalledError.