Getting started
araclean needs Python 3.12+ and installs in seconds: the core depends only on pydantic — no compiler, no Java, no model download.
Install
Everything beyond the Python API lives behind optional extras, so the core stays lean:
| Extra | Installs | Gives you |
|---|---|---|
| `araclean[cli]` | Typer | the `araclean` shell command (see the CLI guide) |
| `araclean[pandas]` | pandas | the `.araclean` Series accessor (see pandas & polars) |
| `araclean[polars]` | polars | the `.araclean` Series namespace (see pandas & polars) |
| `araclean[emoji]` | emoji | `HandleEmoji`'s demojize mode (keep/strip need nothing) |
| `araclean[all]` | all of the above | everything |
Using a feature without its extra never crashes with a bare ImportError: you get a clear error
naming the exact pip install command to run.
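The guarded-import pattern behind that kind of message can be sketched as follows. This is a generic illustration, not araclean's actual code: `require_extra` and `MissingExtraError` are hypothetical names, and the wording of the message is an assumption.

```python
import importlib
from types import ModuleType


class MissingExtraError(ImportError):
    """Hypothetical error type: an optional dependency is not installed."""


def require_extra(module_name: str, extra: str, package: str = "araclean") -> ModuleType:
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Name the exact command the user should run instead of re-raising
        # the bare ImportError.
        raise MissingExtraError(
            f"This feature needs the optional '{module_name}' dependency. "
            f"Install it with: pip install '{package}[{extra}]'"
        ) from exc
```

Subclassing `ImportError` keeps existing `except ImportError` handlers working while still carrying the install hint; whether araclean does the same is not stated here.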
Your first normalization
The whole quick-use surface is one function:
>>> from araclean import normalize
>>> normalize("ﻣﺮﺣﺒﺎ") # OCR/copy-paste presentation forms fold back to real letters
'مرحبا'
>>> normalize("العـــربية") # tatweel (visual elongation) is dropped
'العربية'
With no profile, normalize applies the LIGHT profile: lossless encoding repair. It fixes
the Unicode form, strips invisible bidi/zero-width characters, folds presentation-form glyphs back
to letters, removes tatweel, unifies look-alike letters (Persian keheh ک → Arabic kaf ك), and
collapses whitespace. It never removes tashkeel, never folds alef variants, never touches digits —
it discards no linguistic signal, so it is safe to run on any Arabic corpus, including vocalized or
Qur'anic text.
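To make those steps concrete, here is a rough standard-library approximation of the repairs listed above. It is a sketch for intuition only, not araclean's implementation: the function name `light_sketch`, the exact character ranges, and the single look-alike mapping are assumptions, and the real profile may cover more cases.

```python
import re
import unicodedata

# Assumed set of invisible characters: zero-width and bidi controls, plus the BOM.
_INVISIBLE = re.compile("[\u200b-\u200f\u202a-\u202e\u2066-\u2069\ufeff]")
# Arabic Presentation Forms-A and -B (the BOM at U+FEFF is excluded here).
_PRESENTATION = re.compile("[\ufb50-\ufdff\ufe70-\ufefc]")
_TATWEEL = "\u0640"
# Only the one look-alike named in the text: Persian keheh -> Arabic kaf.
_LOOKALIKES = str.maketrans({"\u06a9": "\u0643"})


def light_sketch(text: str) -> str:
    """Approximate a lossless encoding repair using only the stdlib."""
    text = unicodedata.normalize("NFC", text)  # canonical Unicode form
    text = _INVISIBLE.sub("", text)            # drop invisible characters
    # Apply compatibility folding to presentation-form glyphs only, so that
    # unrelated characters are not rewritten by NFKC.
    text = _PRESENTATION.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )
    text = text.replace(_TATWEEL, "")          # remove visual elongation
    text = text.translate(_LOOKALIKES)         # unify look-alike letters
    return " ".join(text.split())              # collapse whitespace
```

Nothing in this sketch touches tashkeel, alef variants, or digits, which mirrors the guarantee described above.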
Choosing a profile
Anything lossy is opt-in through a named profile. Pass the name to normalize:
>>> normalize("عَلَى") # LIGHT: vocalization and spelling preserved
'عَلَى'
>>> normalize("عَلَى", profile="search") # SEARCH: tashkeel removed, alef maqsura folded
'علي'
>>> normalize("جميييييل", profile="ml") # ML: dediacritize + collapse emphatic elongation
'جميل'
>>> normalize("رااااائع 😍 https://t.co/xyz", profile="social") # SOCIAL: clean noise, keep emoji
'راائع 😍 [رابط]'
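The lossy steps named in those comments can be approximated with a few regular expressions. This is a minimal sketch under stated assumptions, not the library's code: the function names, the set of marks treated as tashkeel, and the `keep` parameter are all hypothetical.

```python
import re

# Assumed mark set: fathatan through sukun, plus superscript alef.
_TASHKEEL = re.compile("[\u064b-\u0652\u0670]")
# A run is any character repeated three or more times in a row.
_RUN = re.compile(r"(.)\1{2,}")


def strip_tashkeel(text: str) -> str:
    """Remove vocalization marks (lossy)."""
    return _TASHKEEL.sub("", text)


def fold_alef_maqsura(text: str) -> str:
    """Fold alef maqsura (U+0649) to yeh (U+064A) (lossy)."""
    return text.replace("\u0649", "\u064a")


def cap_elongation(text: str, keep: int = 1) -> str:
    """Shrink runs of 3+ identical characters down to `keep` copies (lossy)."""
    return _RUN.sub(lambda m: m.group(1) * keep, text)


print(fold_alef_maqsura(strip_tashkeel("عَلَى")))  # علي
print(cap_elongation("جميييييل"))                  # جميل
print(cap_elongation("رااااائع", keep=2))          # راائع
```

Runs of exactly two characters are left alone, so legitimately doubled letters survive. The `keep=2` call reproduces the SOCIAL output shown above, though whether the profile uses that threshold internally is an assumption.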
Pick by task:
| You are doing | Use | Why |
|---|---|---|
| anything: you just want clean, consistent text | LIGHT (default) | lossless; repairs encoding only |
| search / retrieval / matching | SEARCH | folds spelling & vocalization variants so على matches علي |
| training or feeding a model | ML | dediacritizes and caps elongation, but keeps letter distinctions that carry signal |
| social-media text | SOCIAL | cleans URLs/mentions/HTML, segments hashtags, keeps emoji |
| vocalized / classical / Qur'anic text | CLASSICAL | lossless repair with an explicit every-mark-preserved guarantee |
Each profile page lists the exact steps it runs, in order, each labelled lossless or lossy — the pages are generated from the assembled pipelines themselves, so they cannot drift from the code.
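One way to picture "generated from the assembled pipelines" is a pipeline whose steps carry their own lossless/lossy label, so the documentation listing is derived from the same objects that run. The sketch below is an illustration of that idea only; `Step`, `run`, and `describe` are hypothetical names, not araclean's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    """A named transformation with a declared lossiness label."""
    name: str
    apply: Callable[[str], str]
    lossy: bool


def run(steps: list[Step], text: str) -> str:
    """Apply every step in order."""
    for step in steps:
        text = step.apply(text)
    return text


def describe(steps: list[Step]) -> list[str]:
    """Render the ordered step list from the pipeline itself."""
    return [
        f"{i}. {step.name} ({'lossy' if step.lossy else 'lossless'})"
        for i, step in enumerate(steps, start=1)
    ]


demo = [
    Step("remove tatweel", lambda t: t.replace("\u0640", ""), lossy=False),
    Step("collapse whitespace", lambda t: " ".join(t.split()), lossy=False),
    Step("fold alef maqsura", lambda t: t.replace("\u0649", "\u064a"), lossy=True),
]

print("\n".join(describe(demo)))
```

Because `describe` reads the same list that `run` executes, adding or reordering a step changes the listing automatically.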
Beyond one call
- Process files from the shell: `pip install 'araclean[cli]'`, then `araclean normalize corpus.txt --profile search` (see the CLI guide).
- Normalize a dataframe column: `df["text"].araclean.normalize(profile="search")` (see pandas & polars).
- Adjust one knob of a profile (`map_digits=True`, `emoji="strip"`, …): see Tuning profiles.
- Assemble your own step sequence: see Composing pipelines.
- Understand exactly what you might be discarding: see the safety contract.
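If you prefer to stay in Python rather than install the CLI extra, a file can also be processed line by line with the core function alone. The helper below is a hypothetical sketch (`normalize_file` is not part of araclean); it takes the normalizer as an argument so it makes no assumptions about the library beyond the `normalize(text, profile=...)` call shown earlier.

```python
from pathlib import Path
from typing import Callable


def normalize_file(src: str, dst: str, normalize_fn: Callable[[str], str]) -> int:
    """Normalize a UTF-8 text file line by line; return the number of lines written."""
    count = 0
    with Path(src).open(encoding="utf-8") as fin, \
            Path(dst).open("w", encoding="utf-8") as fout:
        for line in fin:
            # Strip the newline before normalizing, then restore it.
            fout.write(normalize_fn(line.rstrip("\n")) + "\n")
            count += 1
    return count
```

With araclean installed, a call might look like `normalize_file("corpus.txt", "clean.txt", lambda s: normalize(s, profile="search"))`.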