Getting started

araclean needs Python 3.12+ and installs in seconds: the core depends only on pydantic — no compiler, no Java, no model download.

Install

pip install araclean

Everything beyond the Python API lives behind optional extras, so the core stays lean:

| Extra | Installs | Gives you |
| --- | --- | --- |
| araclean[cli] | Typer | the araclean shell command (see the CLI guide) |
| araclean[pandas] | pandas | the .araclean Series accessor (see pandas & polars) |
| araclean[polars] | polars | the .araclean Series namespace (see pandas & polars) |
| araclean[emoji] | emoji | HandleEmoji's demojize mode (keep/strip need nothing) |
| araclean[all] | all of the above | everything |

Using a feature without its extra never crashes with a bare ImportError: you get a clear error naming the exact pip install command to run.
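
The guarded-import pattern behind this kind of behaviour can be sketched in a few lines. This is a minimal illustration, not araclean's actual code: the names `MissingExtraError` and `require_extra` and the message wording are hypothetical.

```python
import importlib


class MissingExtraError(ImportError):
    """Hypothetical error type: an ImportError that carries install guidance."""


def require_extra(module_name: str, extra: str, package: str = "araclean"):
    """Import an optional dependency, or fail with an actionable message.

    Illustrative helper only; the real library may structure this differently.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Name the exact command to run instead of surfacing a bare ImportError.
        raise MissingExtraError(
            f"This feature needs the optional '{extra}' extra. "
            f"Install it with: pip install '{package}[{extra}]'"
        ) from exc
```

Calling `require_extra("emoji", "emoji")` at the point of use, rather than at import time, is what would keep the core importable when no extras are installed.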

Your first normalization

The quick-start API is a single function:

>>> from araclean import normalize
>>> normalize("ﻣﺮﺣﺒﺎ")  # OCR/copy-paste presentation forms fold back to real letters
'مرحبا'
>>> normalize("العـــربية")  # tatweel (visual elongation) is dropped
'العربية'

With no profile, normalize applies the LIGHT profile: lossless encoding repair. It fixes the Unicode form, strips invisible bidi/zero-width characters, folds presentation-form glyphs back to letters, removes tatweel, unifies look-alike letters (Persian keheh ک → Arabic kaf ك), and collapses whitespace. It never removes tashkeel, never folds alef variants, never touches digits — it discards no linguistic signal, so it is safe to run on any Arabic corpus, including vocalized or Qur'anic text.
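
To make those steps concrete, here is a rough standard-library approximation of the same kind of repair. It is a sketch under stated assumptions, not araclean's implementation: the function name `light_repair_sketch`, the exact set of invisible characters, the use of NFC, and the whitespace rule are all choices made for this illustration.

```python
import re
import unicodedata

# Assumed set of invisible characters: zero-width and bidi controls plus the BOM.
INVISIBLE = dict.fromkeys(
    [*range(0x200B, 0x2010), *range(0x202A, 0x202F), *range(0x2066, 0x206A), 0xFEFF]
)
TATWEEL = "\u0640"
KEHEH, KAF = "\u06A9", "\u0643"


def _is_presentation_form(ch: str) -> bool:
    """True for Arabic Presentation Forms-A and -B code points."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC


def light_repair_sketch(text: str) -> str:
    """Approximate a lossless repair pass using only the standard library."""
    text = unicodedata.normalize("NFC", text)  # canonical Unicode form
    text = text.translate(INVISIBLE)  # drop invisible controls
    # Apply compatibility folding only to presentation-form glyphs,
    # so unrelated characters are not rewritten by NFKC.
    text = "".join(
        unicodedata.normalize("NFKC", ch) if _is_presentation_form(ch) else ch
        for ch in text
    )
    text = text.replace(TATWEEL, "")  # remove visual elongation
    text = text.replace(KEHEH, KAF)  # unify one look-alike letter
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

Note what the sketch leaves alone: there is no step that touches tashkeel, alef variants, or digits, which is the property that makes this class of repair safe on vocalized text.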

Choosing a profile

Anything lossy is opt-in through a named profile. Pass the name to normalize:

>>> normalize("عَلَى")                    # LIGHT: vocalization and spelling preserved
'عَلَى'
>>> normalize("عَلَى", profile="search")  # SEARCH: tashkeel removed, alef maqsura folded
'علي'
>>> normalize("جميييييل", profile="ml")   # ML: dediacritize + collapse emphatic elongation
'جميل'
>>> normalize("رااااائع 😍 https://t.co/xyz", profile="social")  # SOCIAL: clean noise, keep emoji
'راائع 😍 [رابط]'
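
The lossy steps named in the comments above can be illustrated with plain regular expressions. This is only a sketch: the helper names, the tashkeel range, the "three or more repeats" threshold, and the `keep` parameter are assumptions for illustration and may not match what the profiles actually run.

```python
import re

# Assumed tashkeel set: fathatan..sukun (U+064B-U+0652) plus superscript alef (U+0670).
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")
# A run of three or more copies of the same Arabic letter (U+0621-U+064A).
ELONGATION = re.compile("([\u0621-\u064A])\\1{2,}")


def strip_tashkeel(text: str) -> str:
    """Remove vocalization marks (lossy)."""
    return TASHKEEL.sub("", text)


def fold_alef_maqsura(text: str) -> str:
    """Fold alef maqsura (U+0649) to yeh (U+064A) (lossy)."""
    return text.replace("\u0649", "\u064A")


def cap_elongation(text: str, keep: int = 1) -> str:
    """Shrink runs of 3+ identical Arabic letters to `keep` copies (lossy)."""
    return ELONGATION.sub(lambda m: m.group(1) * keep, text)
```

Each helper discards information that cannot be recovered afterwards, which is why steps like these belong behind an explicit profile name rather than in the default.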

Pick by task:

| You are doing | Use | Why |
| --- | --- | --- |
| anything: you just want clean, consistent text | LIGHT (default) | lossless; repairs encoding only |
| search / retrieval / matching | SEARCH | folds spelling & vocalization variants so على matches علي |
| training or feeding a model | ML | dediacritizes and caps elongation, but keeps letter distinctions that carry signal |
| social-media text | SOCIAL | cleans URLs/mentions/HTML, segments hashtags, keeps emoji |
| vocalized / classical / Qur'anic text | CLASSICAL | lossless repair with an explicit every-mark-preserved guarantee |

Each profile page lists the exact steps it runs, in order, each labelled lossless or lossy — the pages are generated from the assembled pipelines themselves, so they cannot drift from the code.

Beyond one call

  • Process files from the shell: pip install 'araclean[cli]', then araclean normalize corpus.txt --profile search — see the CLI guide.
  • Normalize a dataframe column: df["text"].araclean.normalize(profile="search") — see pandas & polars.
  • Adjust one knob of a profile (map_digits=True, emoji="strip", …) — see Tuning profiles.
  • Assemble your own step sequence — see Composing pipelines.
  • Understand exactly what you might be discarding — see the safety contract.