Skip to content

Architecture & performance

araclean is a small library with a deliberate shape: validate once at the edge, run a precompiled plan in the middle, and keep every behavior in one deep core that several thin adapters share.

Three layers, one validation boundary

Layer Surface Validates?
3 — facade normalize(text, profile=…, **overrides) yes — the trust boundary
2 — composition Pipeline, Profile, named presets at construction only
1 — primitives Step objects and their bare functions (remove_tatweel(s), …) no — the hot path

All untrusted input — profile names, override knobs, serialized configs — is validated at the facade through one pydantic model (NormalizeConfig, a closed set of enums and bounded scalars), so a bad option fails there with a clear error. Below the boundary, nothing validates per string: steps precompute their tables and regexes at construction, and Pipeline.__call__ just runs them. That split is why the facade can be friendly and the per-string path fast.

The CLI, the pandas accessor, and the polars namespace are adapters at the same seam: each parses its own input format, passes through the same boundary once, then streams or maps the resulting pipeline. None of them contains normalization logic, which is why behavior is identical across all four entry points.

The fused execution engine

Most repair and fold steps are, in implementation terms, a single str.translate over a static table — a context-free, per-character map that never re-scans its own output. Composing two such maps is itself such a map, so when a pipeline contains a run of consecutive translate steps, araclean fuses the run into one combined table applied in a single C-level pass at pipeline construction. SEARCH, for example, composes 18 steps but executes 11 passes; LIGHT's 7 steps run as 5.

The fusion is exact — output is identical to running the steps one by one — and invisible: pipe.steps, repr, select/drop, audit() and serialization all speak in terms of the steps you composed. Contextual steps (regex-based cleaning, ReduceElongation, NormalizeUnicode) cannot be expressed as a per-character map, so each stays its own pass, in order, and fusion happens within the stretches between them.

In the repo's benchmark suite this is worth roughly a 1.7× throughput advantage over the incumbent multi-pass approach on character-level normalization, measured continuously in CI; an asv suite additionally tracks araclean against itself across commits so regressions are caught. Custom steps whose behavior is one translate table can opt into fusion by exposing translate_table.

Invariants the pipelines maintain

Every built-in profile guarantees, by construction:

  • Output is NFC and whitespace-collapsed. Profiles open and close with a Unicode NFC pass (folds in the middle can re-expose non-canonical mark order), and close with a whitespace collapse (deletions leave gaps). The closing pair makes the postcondition unconditional.
  • Idempotence. Normalizing an already-normalized text is a no-op:
>>> from araclean import normalize
>>> out = normalize("اَلسّلامُ عليكم", profile="search")
>>> normalize(out, profile="search") == out
True
>>> normalize(out) == out  # lossy output is already LIGHT-stable too
True
  • Profile containment. SEARCH, ML, and SOCIAL all begin with LIGHT's exact steps, so "every profile does everything LIGHT does" holds by construction, and ML sits strictly between LIGHT and SEARCH in what it removes.

These are pinned by property-based tests (Hypothesis) and snapshot tests in the repository, not just asserted in prose.

Correctness machinery behind the scenes

Worth knowing even if you never touch it:

  • Whole-category character tables are derived from the live Unicode Character Database and enforced by invariant tests, so e.g. the tashkeel repertoire tracks Unicode releases instead of a hand-typed list going stale.
  • Differential oracles: araclean's folds are cross-checked against an independent implementation (pyarabic, as a dev-only test oracle — it never ships as a dependency) on generated inputs.
  • Generated docs with drift guards: the per-profile pages, the glossary tooltips, and the CLI reference are generated from the code and re-checked by tests, and every Python example in these docs runs as a doctest in CI.

Reserved seams

apply_aligned() — offset/alignment tracking that maps normalized spans back to the original text — is reserved on every step and on Pipeline, and currently raises a clear AlignmentNotSupportedError. It is the designed next major feature (the architecture's per-code-point tables and anchored regexes were chosen to make it tractable); until then the contract is honest: it fails loudly rather than approximating.