CLI reference¶
The araclean command ships with the [cli] extra (pip install 'araclean[cli]'). This page is generated from the CLI definition itself, so it always matches the installed command; see the command-line guide for task-oriented examples.
araclean normalize¶
Normalize Arabic text from a file or stdin, writing to a file or stdout.
Reads UTF-8 text line by line — from INPUT, or stdin when it is omitted or - — and writes each normalized line to stdout (or --output). The pipeline is built once, then the text streams through it, so input larger than memory is fine. Option values are validated before any input is read; an override that does not apply to the chosen profile is rejected, never a silent no-op.
| Option | Values | Default | What it does |
|---|---|---|---|
[INPUT] |
path |
— | Input file; reads stdin if omitted or '-'. |
--profile, -p |
light · search · ml · classical · social |
light |
Named profile to apply. |
--output, -o |
path |
— | Write to this file instead of stdout. |
--jsonl |
flag | false |
Treat input as JSONL; normalize --field of each record. |
--field |
text |
text |
JSON field to normalize in --jsonl mode. |
--map-digits / --no-map-digits |
flag | — | ML: also fold digits to ASCII. |
--remove-stopwords / --no-remove-stopwords |
flag | — | SEARCH: also remove the bundled stopword list (after the folds). |
--emoji |
keep · strip · demojize |
— | SOCIAL: keep/strip/demojize. |
--elongation-cap |
integer |
— | SOCIAL: max repeated letters kept. |
--url-mode |
delete · placeholder |
— | SOCIAL: URL handling. |
--url-token |
text |
— | SOCIAL: URL placeholder. |
--mention-mode |
delete · placeholder |
— | SOCIAL: @mention handling. |
--mention-token |
text |
— | SOCIAL: @mention placeholder. |
--hashtag-mode |
segment · delete · placeholder · keep |
— | SOCIAL: segment/delete/placeholder/keep. |
--hashtag-token |
text |
— | SOCIAL: #hashtag placeholder. |
--teh-marbuta |
heh · teh · keep |
— | SEARCH: fold teh marbuta to heh/teh, or keep. |
--tashkeel-classes |
text |
— | SEARCH/ML/SOCIAL: comma-separated mark classes to remove (harakat,tanween,shadda,madda,dagger_alef,quranic). |
--collapse-lines / --no-collapse-lines |
flag | — | Flatten line breaks to spaces, or keep line structure (ADR-0010). |
A — default means "use the chosen profile's own default for that step".
Exit status¶
| Code | Meaning |
|---|---|
0 |
Success. |
1 |
A streaming failure: unreadable input/output, or an invalid --jsonl record (the error names the line number). |
2 |
Invalid options — an unknown profile, knob, or value — or the [cli] extra is not installed. Nothing is read. |
Errors go to stderr; normalized text is the only thing written to stdout.