araclean
Arabic text normalization and cleaning — pure-Python, composable, reproducible, offset-preserving.
araclean fixes mojibake and inconsistent encoding, optionally folds the spelling and vocalization variants that fragment a vocabulary, and does it through one small, serializable interface. Its core install pulls only pydantic — no compiler, Java, or model download.
It is non-destructive by default: the default profile only repairs encoding and never discards linguistic signal. Anything lossy (removing tashkeel, folding alef/hamza, mapping digits) is opt-in through a named profile, and every profile tells you exactly which steps it applies and whether each is lossless or lossy — see Profiles and the safety contract.
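The lossless/lossy split can be pictured as a list of steps that each carry a flag. The following is a minimal standard-library sketch, not araclean's implementation: `SketchStep`, `light_like`, `search_like`, `audit`, and `run` are hypothetical names, and the step contents of the two lists are assumptions made for illustration. Tatweel removal is treated as lossless here because the quickstart describes it that way.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

TATWEEL = "\u0640"  # Arabic elongation character
# Assumed tashkeel range for this sketch: U+064B (fathatan) .. U+0652 (sukun).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


@dataclass(frozen=True)
class SketchStep:
    """Hypothetical step record: a name, a transform, and a lossy flag."""
    name: str
    fn: Callable[[str], str]
    lossy: bool


def remove_tatweel(text: str) -> str:
    # Drops the elongation character; letters and marks are untouched.
    return text.replace(TATWEEL, "")


def strip_tashkeel(text: str) -> str:
    # Drops short-vowel and shadda/sukun marks; this discards vocalization.
    return "".join(ch for ch in text if ch not in TASHKEEL)


# Two illustrative step lists; their contents are assumptions, not real profiles.
light_like: List[SketchStep] = [SketchStep("remove_tatweel", remove_tatweel, lossy=False)]
search_like: List[SketchStep] = light_like + [SketchStep("strip_tashkeel", strip_tashkeel, lossy=True)]


def audit(steps: List[SketchStep]) -> List[Tuple[str, str]]:
    # Report each step's name with its lossless/lossy label, in order.
    return [(s.name, "lossy" if s.lossy else "lossless") for s in steps]


def run(steps: List[SketchStep], text: str) -> str:
    # Apply the steps left to right.
    for s in steps:
        text = s.fn(text)
    return text
```

Calling `audit(search_like)` lists both steps with their labels, so a lossy step cannot enter a pipeline unnoticed. Keeping the flag on the step rather than on the profile means any custom combination of steps can be audited the same way.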
Install
Optional extras: araclean[cli], araclean[pandas], araclean[polars], araclean[emoji], or
araclean[all] — see Getting started.
Quickstart
>>> from araclean import normalize
>>> normalize("العـــربية") # default LIGHT profile: lossless encoding repair (here, drops tatweel)
'العربية'
>>> normalize("اَلسّلامُ عليكم", profile="search") # SEARCH: lossy folds that maximize recall
'السلام عليكم'
That's the whole surface for batch use. For span-level work — RAG citation, NER projection —
apply_aligned returns the normalized text and a map back to every original position:
>>> from araclean import Pipeline, RemoveTatweel, FoldAlef
>>> pipe = Pipeline([RemoveTatweel(), FoldAlef()])
>>> normalized, omap = pipe.apply_aligned("أحمـد")
>>> normalized
'احمد'
>>> omap.to_original((0, 4)) # where does the whole normalized word sit in the original?
(0, 5)
No other Arabic NLP library exposes this. See Offset-preserving normalization.
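To make the idea of an offset map concrete, here is a minimal standard-library sketch of one way such a map could be built. It is an illustration under stated assumptions, not araclean's engine: `normalize_aligned`, `to_original`, and `ALEF_FOLDS` are hypothetical names, and the sketch only handles per-character deletion (tatweel) and one-to-one substitution (alef folding).

```python
from typing import List, Tuple

TATWEEL = "\u0640"
# Assumed subset of alef forms folded to bare alef for this sketch.
ALEF_FOLDS = {"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"}


def normalize_aligned(text: str) -> Tuple[str, List[int]]:
    """Return the normalized text plus, for each output character,
    the index of the original character that produced it."""
    out_chars: List[str] = []
    origins: List[int] = []
    for i, ch in enumerate(text):
        if ch == TATWEEL:
            continue  # dropped: emits no output character, so no origin entry
        out_chars.append(ALEF_FOLDS.get(ch, ch))
        origins.append(i)
    return "".join(out_chars), origins


def to_original(origins: List[int], span: Tuple[int, int]) -> Tuple[int, int]:
    """Map a non-empty half-open span in the normalized text to the
    smallest half-open span in the original that covers it."""
    start, end = span
    if not 0 <= start < end <= len(origins):
        raise ValueError(f"span {span} is empty or out of range")
    return origins[start], origins[end - 1] + 1


original = "\u0623\u062D\u0645\u0640\u062F"  # the quickstart word, with tatweel
normalized, origins = normalize_aligned(original)
whole_word = to_original(origins, (0, 4))  # (0, 5), matching the example above
```

In this sketch a dropped character that falls between two kept characters is included in the projected span, while a dropped character after the last kept one is not. Whether araclean makes the same choice at span edges is not stated here.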
The default LIGHT profile is safe to run on any corpus; reach for SEARCH, ML, SOCIAL, or
CLASSICAL when you want their specific folding. Every Python example in these docs is executed
by the test suite, so what you read is what runs.
Where to next
New to araclean?
- Getting started — install, the first call, and how to pick a profile.
- Profiles — what each profile does, step by step, lossless vs lossy.
Using it day to day
- Offset-preserving normalization — project normalized spans back to original text, for RAG citation and NER/QA span grounding.
- Command line — stream files, stdin/stdout, and JSONL corpora from the shell.
- pandas & polars — normalize a text column in one call.
- Tuning profiles — per-knob overrides (map_digits=True, emoji="strip", …) with loud validation.
- Composing pipelines — build, filter, and reorder your own Pipeline.
- Writing custom steps — drop your own transform into a pipeline.
- Reproducible preprocessing — serialize the exact pipeline a paper or teammate can rerun.
- Stopword removal — the bundled, negation-safe Arabic stopword list.
Understanding it
- Why araclean — the rationale, and what sets it apart.
- The safety contract — lossless vs lossy, and how to audit a pipeline.
- Architecture & performance — the three-layer design and the fused execution engine.
Looking something up
- Python API reference — normalize, the Pipeline, and every Step.
- CLI reference — every flag of araclean normalize, generated from the CLI itself.
- Glossary — the Arabic terminology, glossed to English (and shown on hover).
- FAQ — limitations, edge cases, and project policy.