Skip to content

CLI reference

The araclean command ships with the [cli] extra (pip install 'araclean[cli]'). This page is generated from the CLI definition itself, so it always matches the installed command; see the command-line guide for task-oriented examples.

araclean normalize

Normalize Arabic text from a file or stdin, writing to a file or stdout.

araclean normalize [OPTIONS] [INPUT]

Reads UTF-8 text line by line — from INPUT, or stdin when it is omitted or - — and writes each normalized line to stdout (or --output). The pipeline is built once, then the text streams through it, so input larger than memory is fine. Option values are validated before any input is read; an override that does not apply to the chosen profile is rejected, never a silent no-op.

Option Values Default What it does
[INPUT] path Input file; reads stdin if omitted or '-'.
--profile, -p light · search · ml · classical · social light Named profile to apply.
--output, -o path Write to this file instead of stdout.
--jsonl flag false Treat input as JSONL; normalize --field of each record.
--field text text JSON field to normalize in --jsonl mode.
--map-digits / --no-map-digits flag ML: also fold digits to ASCII.
--remove-stopwords / --no-remove-stopwords flag SEARCH: also remove the bundled stopword list (after the folds).
--emoji keep · strip · demojize SOCIAL: keep/strip/demojize.
--elongation-cap integer SOCIAL: max repeated letters kept.
--url-mode delete · placeholder SOCIAL: URL handling.
--url-token text SOCIAL: URL placeholder.
--mention-mode delete · placeholder SOCIAL: @mention handling.
--mention-token text SOCIAL: @mention placeholder.
--hashtag-mode segment · delete · placeholder · keep SOCIAL: segment/delete/placeholder/keep.
--hashtag-token text SOCIAL: #hashtag placeholder.
--teh-marbuta heh · teh · keep SEARCH: fold teh marbuta to heh/teh, or keep.
--tashkeel-classes text SEARCH/ML/SOCIAL: comma-separated mark classes to remove (harakat,tanween,shadda,madda,dagger_alef,quranic).
--collapse-lines / --no-collapse-lines flag Flatten line breaks to spaces, or keep line structure (ADR-0010).

A default means "use the chosen profile's own default for that step".

Exit status

Code Meaning
0 Success.
1 A streaming failure: unreadable input/output, or an invalid --jsonl record (the error names the line number).
2 Invalid options — an unknown profile, knob, or value — or the [cli] extra is not installed. Nothing is read.

Errors go to stderr; normalized text is the only thing written to stdout.