araclean

Arabic text normalization and cleaning — pure-Python, composable, reproducible, offset-preserving.

araclean fixes mojibake and inconsistent encoding, optionally folds the spelling and vocalization variants that fragment a vocabulary, and does it through one small, serializable interface. Its core install pulls only pydantic — no compiler, Java, or model download.

It is non-destructive by default: the default profile only repairs encoding and never discards linguistic signal. Anything lossy (removing tashkeel, folding alef/hamza, mapping digits) is opt-in through a named profile, and every profile tells you exactly which steps it applies and whether each is lossless or lossy — see Profiles and the safety contract.
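To make the lossless/lossy distinction concrete, here is a minimal standard-library sketch of the three lossy folds named above. It is an illustration under stated assumptions, not araclean's implementation: the function names `strip_tashkeel`, `fold_alef`, and `map_digits` are hypothetical, and the real steps may cover more characters.

```python
# Illustrative stdlib sketch; araclean's real steps and profiles may differ.

# Harakat U+064B..U+0652: fathatan through sukun, including shadda.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}
# Alef with hamza above, hamza below, and madda all fold to bare alef.
ALEF_FOLD = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})
# Arabic-Indic digits U+0660..U+0669 map to ASCII 0-9.
DIGIT_FOLD = {0x0660 + d: str(d) for d in range(10)}


def strip_tashkeel(text):
    """Remove short-vowel and gemination marks (lossy)."""
    return "".join(ch for ch in text if ch not in HARAKAT)


def fold_alef(text):
    """Collapse hamza-bearing alef variants to bare alef (lossy)."""
    return text.translate(ALEF_FOLD)


def map_digits(text):
    """Map Arabic-Indic digits to ASCII digits (lossy)."""
    return text.translate(DIGIT_FOLD)


# Two differently vocalized words collapse to the same string, so the
# original vocalization cannot be recovered from the output.
active = "\u0639\u064E\u0644\u0650\u0645"   # ain+fatha, lam+kasra, meem
passive = "\u0639\u064F\u0644\u0650\u0645"  # ain+damma, lam+kasra, meem
print(strip_tashkeel(active) == strip_tashkeel(passive))  # True
```

Because distinct inputs can map to one output, none of these folds can be undone, which is why such steps are opt-in rather than part of the default profile.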

Install

pip install araclean

Optional extras: araclean[cli], araclean[pandas], araclean[polars], araclean[emoji], or araclean[all] — see Getting started.

Quickstart

>>> from araclean import normalize
>>> normalize("العـــربية")  # default LIGHT profile: lossless encoding repair (here, drops tatweel)
'العربية'
>>> normalize("اَلسّلامُ عليكم", profile="search")  # SEARCH: lossy folds that maximize recall
'السلام عليكم'

That's the whole surface for batch use. For span-level work — RAG citation, NER projection — apply_aligned returns the normalized text and a map back to every original position:

>>> from araclean import Pipeline, RemoveTatweel, FoldAlef
>>> pipe = Pipeline([RemoveTatweel(), FoldAlef()])
>>> normalized, omap = pipe.apply_aligned("أحمـد")
>>> normalized
'احمد'
>>> omap.to_original((0, 4))   # where does the whole normalized word sit in the original?
(0, 5)

No other Arabic NLP library exposes this. See Offset-preserving normalization.
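The idea behind an offset map can be sketched in a few lines of plain Python. This is a simplified illustration, not how araclean implements `apply_aligned`: the helpers `normalize_with_offsets` and `to_original` below are hypothetical, and they handle only tatweel removal and alef folding.

```python
# Illustrative only: a stdlib sketch of offset-preserving normalization,
# not araclean's actual implementation.
TATWEEL = "\u0640"
ALEF_VARIANTS = {"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"}


def normalize_with_offsets(text):
    """Return (normalized, origins), where origins[i] is the index in
    `text` of the character that produced normalized[i]."""
    out, origins = [], []
    for i, ch in enumerate(text):
        if ch == TATWEEL:
            continue  # dropped character: it gets no entry in the map
        out.append(ALEF_VARIANTS.get(ch, ch))
        origins.append(i)
    return "".join(out), origins


def to_original(origins, span):
    """Map a non-empty half-open (start, end) span of the normalized
    text back to a half-open span of the original text."""
    start, end = span
    if not 0 <= start < end <= len(origins):
        raise ValueError(f"span {span} is empty or out of range")
    return origins[start], origins[end - 1] + 1


# alef-with-hamza, hah, meem, tatweel, dal (the same word as above)
normalized, origins = normalize_with_offsets("\u0623\u062D\u0645\u0640\u062F")
print(origins)                       # [0, 1, 2, 4]
print(to_original(origins, (0, 4)))  # (0, 5)
```

Recording one original index per output character is enough for deletions and one-to-one substitutions; steps that expand one character into several would need a richer map.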

The default LIGHT profile is safe to run on any corpus; reach for SEARCH, ML, SOCIAL, or CLASSICAL when you want their specific folding. Every Python example in these docs is executed by the test suite, so what you read is what runs.

Where to next

New to araclean?

  • Getting started — install, the first call, and how to pick a profile.
  • Profiles — what each profile does, step by step, lossless vs lossy.

Using it day to day

Understanding it

Looking something up

  • Python API reference — normalize, the Pipeline, and every Step.
  • CLI reference — every flag of araclean normalize, generated from the CLI itself.
  • Glossary — the Arabic terminology, glossed to English (and shown on hover).
  • FAQ — limitations, edge cases, and project policy.