Offset-preserving normalization

araclean's flagship differentiator is offset-preserving normalization: it normalizes text and keeps a map from every position in the normalized string back to the original. No other Arabic NLP library exposes this.

The problem

When you normalize Arabic for search or ML, you change the text: tatweel is stripped, alef variants fold, tashkeel disappears. That is fine for indexing, but it becomes a problem the moment you need to:

- cite a span in the original document (RAG answer grounding)
- project a model prediction back to raw text (NER / span annotation)

In both cases you need to know where the normalized span came from. Without an offset map, you have two bad choices: skip normalization (losing recall) or normalize and lose provenance.
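
To make the index drift concrete, here is a minimal sketch in plain Python. It does not use araclean: `str.replace` stands in for a tatweel-removal step, and the variable names are illustrative only.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)

# Built by concatenation so the tatweel is unambiguous in any editor.
original = "كتاب أحم" + TATWEEL + "د الكبير"

# Stand-in for a normalization step (assumption: tatweel removal only).
normalized = original.replace(TATWEEL, "")

# Find the last word in the normalized text.
target = "الكبير"
start = normalized.index(target)
end = start + len(target)

# Reusing normalized indices on the original slices the wrong characters,
# because one character was deleted earlier in the string.
naive = original[start:end]
print(naive == target)                         # False
print(original[start + 1:end + 1] == target)   # True: shifted by the one deleted char
```

With a single deletion the correction is a constant shift, but with several deletions, folds, and substitutions spread through a document the shift differs at every position, which is what an offset map records.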

The solution: `apply_aligned`

Every built-in step and every `Pipeline` exposes `apply_aligned`:

normalized, omap = pipe.apply_aligned(text)

`omap` is an `OffsetMap`. Its public surface is two methods:

| Method | Direction | Use |
| --- | --- | --- |
| `omap.to_original((start, end))` | normalized → original | Project a model span back to raw text |
| `omap.to_normalized((start, end))` | original → normalized | Find where an original span ended up |

Spans are half-open `[start, end)` intervals over Unicode code points, matching Python's `str[start:end]` slice convention.
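
As a quick illustration of that convention, the following plain-Python snippet (no araclean involved) shows that combining marks occupy their own positions and that byte offsets would not line up with span indices.

```python
FATHA = "\u064e"   # ARABIC FATHA, a combining tashkeel mark

text = "ق" + FATHA + "ال"

# len() and slicing count code points: the combining mark has its own index.
print(len(text))                    # 4
print(len(text.encode("utf-8")))    # 8, so byte offsets would not line up

# Half-open spans tile without overlap: [0, 2) + [2, 4) is the whole string.
first, rest = (0, 2), (2, 4)
print(text[first[0]:first[1]] + text[rest[0]:rest[1]] == text)   # True

# The length of a span is simply end - start.
print(first[1] - first[0] == len(text[0:2]))                     # True
```

If a tokenizer or search backend reports byte or UTF-16 offsets, they would need converting to code-point offsets before being passed to the map.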

RAG: cite in original, search in normalized

from araclean import Pipeline, RemoveTatweel, FoldAlef, RemoveTashkeel

pipe = Pipeline([RemoveTatweel(), FoldAlef(), RemoveTashkeel()])

# Index-time: normalize, store the original
original = "كتاب أحمـد الكبير"
normalized, omap = pipe.apply_aligned(original)

# Retrieval-time: a search hit in the normalized index gives a span
# Suppose a fuzzy search found "احمد" somewhere in the normalized text
found_start = normalized.index("احمد")
found_end = found_start + len("احمد")

# Project back to the original for citation
orig_start, orig_end = omap.to_original((found_start, found_end))
citation = original[orig_start:orig_end]
print(citation)   # "أحمـد" — the original spelling, with tatweel and hamza intact

NER: project model output to original text

from araclean import Pipeline, FoldAlef, RemoveTashkeel, RemoveTatweel

pipe = Pipeline([RemoveTatweel(), RemoveTashkeel(), FoldAlef()])

original_doc = "قال الرئيسُ محمـدٌ في المؤتمرِ"
normalized, omap = pipe.apply_aligned(original_doc)

# A NER model running on normalized text predicts a PERSON span
ner_start = normalized.index("محمد")
ner_end = ner_start + len("محمد")

# Project to original text
orig_start, orig_end = omap.to_original((ner_start, ner_end))
original_span = original_doc[orig_start:orig_end]
print(original_span)   # "محمـد" — the original spelling, tatweel intact (the deleted
                       # trailing tanween sits past the span: it produced no normalized char)
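
A NER model usually returns several labelled spans at once. The helper below is a hedged sketch of projecting a batch of them; `project_entities`, `Entity`, and `_ListMap` are hypothetical names, not part of araclean. The helper only assumes an object with the `to_original((start, end))` method described above, and `_ListMap` is a stand-in so the snippet runs without the library.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    label: str
    start: int   # offsets into the ORIGINAL text
    end: int
    text: str


def project_entities(omap, original, predictions):
    """Map (label, start, end) predictions on normalized text to the original."""
    projected = []
    for label, start, end in predictions:
        o_start, o_end = omap.to_original((start, end))
        projected.append(Entity(label, o_start, o_end, original[o_start:o_end]))
    return projected


class _ListMap:
    """Hypothetical stand-in for an offset map: one source span per normalized char."""

    def __init__(self, spans):
        self._spans = spans

    def to_original(self, span):
        start, end = span
        return self._spans[start][0], self._spans[end - 1][1]


TATWEEL = "\u0640"
original = "محم" + TATWEEL + "د"
# Normalized text has 4 chars; the tatweel at original index 3 was deleted.
fake_map = _ListMap([(0, 1), (1, 2), (2, 3), (4, 5)])

entities = project_entities(fake_map, original, [("PERSON", 0, 4)])
print(entities[0].start, entities[0].end)   # 0 5
```

In real use, the `omap` returned by `pipe.apply_aligned(...)` would take the place of `fake_map`.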

What the map tracks

Every normalization step is one of a few operation kinds, each tracked by the map:

| Kind | Examples | Alignment |
| --- | --- | --- |
| `str.translate` 1→0 (delete) | `RemoveTatweel`, `RemoveTashkeel` | Deleted char leaves no normalized char; the next normalized char points past it |
| `str.translate` 1→1 (replace) | `FoldAlef`, `MapDigits` | One-to-one; position is preserved |
| `str.translate` 1→N (expand) | `FoldPresentationForms` (lam-alef ligature → two chars) | Both result chars point back to the one source char |
| Regex substitution N→M | `CollapseWhitespace`, `CleanURLs`, `ReduceElongation` | Each replacement char points to the whole matched span |
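
The three `str.translate` rows can be illustrated with a toy aligner in plain Python. This is a sketch of the idea only, not araclean's implementation: `aligned_translate`, `to_original`, and the list-of-spans representation are hypothetical names chosen for this example, and the regex N→M case is left out.

```python
def aligned_translate(s, table):
    """Apply a per-character table and record, for every output char,
    the half-open source span [start, end) it came from."""
    out, spans = [], []
    for i, ch in enumerate(s):
        repl = table.get(ch, ch)        # unmapped chars pass through
        for r in repl:                  # "" emits nothing (1→0 delete)
            out.append(r)
            spans.append((i, i + 1))    # 1→1 and 1→N both point at char i
    return "".join(out), spans


def to_original(spans, start, end):
    """Project a non-empty normalized span back onto the source."""
    return spans[start][0], spans[end - 1][1]


TATWEEL = "\u0640"
ALEF = "\u0627"
ALEF_HAMZA = "\u0623"   # ARABIC LETTER ALEF WITH HAMZA ABOVE
LAM_ALEF = "\ufefb"     # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
table = {TATWEEL: "", ALEF_HAMZA: ALEF, LAM_ALEF: "\u0644" + ALEF}

original = ALEF_HAMZA + "حم" + TATWEEL + "د"
normalized, spans = aligned_translate(original, table)
print(spans)                        # [(0, 1), (1, 2), (2, 3), (4, 5)]

# The last two normalized chars straddle the deleted tatweel.
print(to_original(spans, 2, 4))     # (2, 5)

# 1→N: both output chars point back to the single ligature.
print(aligned_translate(LAM_ALEF, table)[1])   # [(0, 1), (0, 1)]
```

Storing one source span per normalized character is the simplest representation to reason about; a production map would likely use something more compact.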

Pipeline composition chains the per-step maps automatically.
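
To show what chaining means, here is a minimal sketch that composes two toy steps by hand. It is an illustration under stated assumptions, not the library's code: `delete_tatweel`, `collapse_spaces`, and `compose` are hypothetical helpers, and each map is represented as a list holding one source span per output character.

```python
import re

TATWEEL = "\u0640"


def delete_tatweel(s):
    """1→0 step: each kept char points at its own source position."""
    kept = [(i, ch) for i, ch in enumerate(s) if ch != TATWEEL]
    return "".join(ch for _, ch in kept), [(i, i + 1) for i, _ in kept]


def collapse_spaces(s):
    """N→1 step: the single replacement space points at the whole run."""
    out, spans, pos = [], [], 0
    for m in re.finditer(r" {2,}", s):
        for i in range(pos, m.start()):       # copy untouched chars
            out.append(s[i])
            spans.append((i, i + 1))
        out.append(" ")
        spans.append(m.span())
        pos = m.end()
    for i in range(pos, len(s)):              # copy the tail
        out.append(s[i])
        spans.append((i, i + 1))
    return "".join(out), spans


def compose(first, second):
    """Chain two maps: `second` indexes into the output of `first`."""
    return [(first[a][0], first[b - 1][1]) for a, b in second]


original = "ك" + TATWEEL + "تاب" + "  " + "جديد"
step1, map1 = delete_tatweel(original)
step2, map2 = collapse_spaces(step1)
composed = compose(map1, map2)

# Project the first word of the final text, span [0, 4), to the original.
o_start, o_end = composed[0][0], composed[3][1]
print((o_start, o_end))    # (0, 5): the deleted tatweel falls inside the span
print(composed[4])         # (5, 7): one space maps to the two-space run
```

The composed map answers queries against the original directly, so callers never need to see the intermediate strings.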

Composing across profiles

`apply_aligned` works with any pipeline, including named profiles:

from araclean import Pipeline, SEARCH

pipe = Pipeline.from_profile(SEARCH)
normalized, omap = pipe.apply_aligned("وَقَالَ الرَّئيسُ")

The output of `pipe(text)` is always identical to the `normalized` string returned by `apply_aligned`: the method only adds the map and never changes the normalization.

Custom steps

A custom step that does not implement `apply_aligned` raises `AlignmentNotSupportedError` when `Pipeline.apply_aligned` reaches it. Add the hook to opt in:

from araclean import OffsetMap, SafetyClass

class MyStep:
    safety = SafetyClass.ENCODING_REPAIR

    def __call__(self, s: str, /) -> str:
        return s.replace("x", "y")  # 1→1 replace

    def apply_aligned(self, s: str, /) -> tuple[str, OffsetMap]:
        # 1→1 replacement: length is unchanged, each char maps to itself
        return self(s), OffsetMap.identity(len(s))

For deletions or multi-char substitutions, use `OffsetMap.from_translate` or `OffsetMap.from_regex_sub`; see the OffsetMap API reference.