Reproducible preprocessing¶
Preprocessing is part of your method. If a paper, a teammate, or your future self cannot rerun exactly the normalization a dataset went through, results stop being comparable — and Arabic preprocessing is where that silently goes wrong most often. araclean treats the preprocessing configuration as a first-class, serializable artifact.
Name the profile, pin the version¶
The cheapest reproducibility statement is a profile name plus the araclean version:
"Text was normalized with araclean X.Y.Z, profile
search."
Profiles are versioned with the library and never change silently: behavior changes arrive as version bumps with changelog entries (the version itself is derived from Conventional Commits). Published docs are versioned too — the version selector on this site pins the docs to the release you installed.
Serialize the exact pipeline¶
A Pipeline round-trips through a plain, JSON-friendly dict — every step name and its full
configuration:
>>> import json
>>> from araclean import Pipeline
>>> pipe = Pipeline.from_profile("search").drop("MapDigits")
>>> payload = json.dumps(pipe.to_dict(), ensure_ascii=False) # ship this with your dataset
>>> restored = Pipeline.from_dict(json.loads(payload))
>>> restored("اَلسّلامُ عليكم") == pipe("اَلسّلامُ عليكم")
True
This captures adapted pipelines too — the drop above is in the payload, not a footnote.
Custom steps join the round-trip once they implement to_dict and register a
factory.
Data that can drift is pinned inside the serialized form. RemoveStopwords, for example, embeds
the bundled stopword-list version, and refuses to rehydrate against a release whose list differs:
>>> from araclean import RemoveStopwords
>>> RemoveStopwords().to_dict()
{'name': 'RemoveStopwords', 'config': {'version': '2.0.0'}}
Serialize a configured call¶
If you stayed at the normalize level, the same applies to its configuration: NormalizeConfig
is a frozen pydantic model, so a tuned call serializes to a few bytes of JSON and validates on the
way back in:
>>> from araclean import NormalizeConfig, normalize
>>> config = NormalizeConfig(profile="ml", map_digits=True)
>>> config.model_dump_json(exclude_defaults=True)
'{"profile":"ml","map_digits":true}'
>>> normalize("٢٠٢٤", config=NormalizeConfig.model_validate_json('{"profile":"ml","map_digits":true}'))
'2024'
Because the model is extra="forbid" with closed enums, a stale or typo'd config from disk fails
loudly instead of running something subtly different. There is a published JSON Schema for it, so
non-Python tooling can validate configs too:
>>> schema = NormalizeConfig.model_json_schema()
>>> schema["title"], schema["additionalProperties"]
('NormalizeConfig', False)
State what was lost¶
A methods section should say not just what ran but what it discarded. The audit gives you that sentence from the pipeline itself:
>>> report = Pipeline.from_profile("ml").audit()
>>> report.lossless
False
>>> report.lossy_steps
('RemoveTashkeel', 'ReduceElongation')
See the safety contract for the classes the report buckets steps into.
The full recipe¶
For a dataset card or appendix, ship three things:
- the araclean version (
araclean.__version__), - the serialized pipeline (
pipe.to_dict()) or config (config.model_dump_json()), - the audit summary (
pipe.audit()).
Anyone with pip install araclean==X.Y.Z and the payload reproduces your preprocessing exactly —
no notebook archaeology required.