Skip to content

FAQ

Is the default really lossless?

Yes, with a precise definition: the default LIGHT profile is encoding repair — it discards no linguistic signal from Arabic-language text. Two edges are part of the contract:

  • Output is canonically equivalent, not byte-identical. araclean emits NFC, so a non-canonically-encoded input comes back with the same marks in canonical order.
  • The Arabic-language assumption. Look-alike unification (ک→ك, ی→ي, the heh family) is correct for Arabic text, where those code points are keyboard/encoding artifacts. The one residual: a Farsi yeh ی typed word-finally is visually identical to alef maqsura ى, so such input can merge على→علي even under LIGHT. If that edge matters to your corpus, drop the step: Pipeline.from_profile("light").drop("UnifyLookalikes").

See the safety contract for the full reasoning.

Why did my tashkeel / alef hamza / ة disappear?

You ran a lossy profile. Only SEARCH, ML, and SOCIAL fold or remove anything, and each profile page lists exactly which steps do it. The default (LIGHT) and CLASSICAL never remove a mark. If you want most of a lossy profile but not one fold, use a knob (teh_marbuta="keep", tashkeel_classes={...}) or drop the step.

Can I use it on Persian, Urdu, or other Arabic-script languages?

Not as-is. araclean is Arabic-only by contract: several LIGHT repairs (look-alike unification above all) treat Persian/Urdu letters as encoding artifacts to fold into Arabic ones, which is wrong for those languages. For mixed corpora, separate by language first, or compose an explicit pipeline without the Arabic-assuming steps.

Does araclean stem, lemmatize, segment clitics, or tokenize?

No, deliberately. Morphology needs lexicons or models, which would end the pip-install-and-go core (and most options drag in GPL code, Java, or torch). Compose downstream instead — e.g. snowballstemmer's Arabic algorithm (BSD, zero deps) after an araclean profile. The same goes for dialect ID, Arabizi transliteration, and diacritization restoration. See Why araclean.

Is normalization idempotent?

Yes. Every profile satisfies normalize(normalize(x)) == normalize(x), and lossy-profile output is LIGHT-stable (running LIGHT on SEARCH output changes nothing). Both are pinned by property-based tests.

How do I process a corpus that doesn't fit in memory?

Everything streams. The CLI processes line by line (plain text or JSONL); in Python, Pipeline.batch(texts) is a lazy generator over any iterable. Build the pipeline once, outside the loop.

Can I find out where a normalized span sits in the original text?

Not yet. Offset/alignment tracking (apply_aligned) is reserved on every seam and raises a clear AlignmentNotSupportedError today; it is the planned flagship of a future release, designed for since v1. If you need provenance now, keep the raw text alongside and index by record, not offset.

What does demojize need? Why an extra?

HandleEmoji(mode="demojize") rewrites each emoji to a text alias and needs the emoji library — installed via pip install 'araclean[emoji]'. keep and strip are pure-Python and need nothing. The core stays at one dependency (pydantic); everything else is opt-in extras with actionable errors when missing.

Is it really MIT? I've heard Arabic NLP libraries have license traps.

The distributed package is MIT with a single MIT-compatible dependency (pydantic), and the bundled stopword list is freshly authored, CC0-1.0. GPL software (pyarabic) is used only as a test oracle in the repository's dev environment — it is never imported by, distributed with, or required by pip install araclean or any extra.

How are versions and breaking changes handled?

Versions are derived from Conventional Commits (semver; the project is pre-1.0, so breaking changes bump the minor). Profiles' behavior changes only with a version bump and a changelog entry. The docs are versioned per release — use the version selector to match your installed package, and pin araclean==X.Y.Z together with your serialized pipeline for full reproducibility (see Reproducible preprocessing).

Something looks wrong — where do I report it?

GitHub issues. Bug reports with a minimal input string are gold: every fold is table-driven and tested against the live Unicode database, so a counterexample usually localizes the fix to one table entry.