Skip to content

Offset-preserving normalization

araclean's flagship differentiator: normalize text and keep a map from every position in the normalized string back to the original. No other Arabic NLP library exposes this.

The problem

When you normalize Arabic for search or ML, you change the text. Tatweel is stripped, alef variants fold, tashkeel disappears. That is fine for indexing — but the moment you need to:

  • cite a span in the original document (RAG answer grounding)
  • project a model prediction back to raw text (NER / span annotation)

you need to know where the normalized span came from. Without an offset map, you have two bad choices: skip normalization (losing recall) or lose provenance.

The solution: apply_aligned

Every built-in step and every Pipeline exposes apply_aligned:

normalized, omap = pipe.apply_aligned(text)

omap is an OffsetMap. Its public surface is two methods:

Method Direction Use
omap.to_original((start, end)) normalized → original Project a model span back to raw text
omap.to_normalized((start, end)) original → normalized Find where an original span ended up

Spans are half-open [start, end) intervals over Unicode code points, matching Python's str[start:end] slice convention.

RAG: cite in original, search in normalized

from araclean import Pipeline, RemoveTatweel, FoldAlef, RemoveTashkeel

pipe = Pipeline([RemoveTatweel(), FoldAlef(), RemoveTashkeel()])

# Index-time: normalize, store the original
original = "كتاب أحمـد الكبير"
normalized, omap = pipe.apply_aligned(original)

# Retrieval-time: a search hit in the normalized index gives a span
# Suppose a fuzzy search found "احمد" somewhere in the normalized text
found_start = normalized.index("احمد")
found_end = found_start + len("احمد")

# Project back to the original for citation
orig_start, orig_end = omap.to_original((found_start, found_end))
citation = original[orig_start:orig_end]
print(citation)   # "أحمـد" — the original spelling, with tatweel and hamza intact

NER: project model output to original text

from araclean import Pipeline, FoldAlef, RemoveTashkeel, RemoveTatweel

pipe = Pipeline([RemoveTatweel(), RemoveTashkeel(), FoldAlef()])

original_doc = "قال الرئيسُ محمـدٌ في المؤتمرِ"
normalized, omap = pipe.apply_aligned(original_doc)

# A NER model running on normalized text predicts a PERSON span
ner_start = normalized.index("محمد")
ner_end = ner_start + len("محمد")

# Project to original text
orig_start, orig_end = omap.to_original((ner_start, ner_end))
original_span = original_doc[orig_start:orig_end]
print(original_span)   # "محمـد" — the original spelling, tatweel intact (the deleted
                       # trailing tanween sits past the span: it produced no normalized char)

What the map tracks

Every normalization step is one of a few operation kinds, each tracked by the map:

Kind Examples Alignment
str.translate 1→0 (delete) RemoveTatweel, RemoveTashkeel Deleted char leaves no normalized char; the next normalized char points past it
str.translate 1→1 (replace) FoldAlef, MapDigits One-to-one; position is preserved
str.translate 1→N (expand) FoldPresentationForms (lam-alef ligature → two chars) Both result chars point back to the one source char
Regex substitution N→M CollapseWhitespace, CleanURLs, ReduceElongation Each replacement char points to the whole matched span

Pipeline composition chains the per-step maps automatically.

Composing across profiles

apply_aligned works with any pipeline, including named profiles:

from araclean import Pipeline, SEARCH

pipe = Pipeline.from_profile(SEARCH)
normalized, omap = pipe.apply_aligned("وَقَالَ الرَّئيسُ")

The output of pipe(text) is always identical to the normalized from apply_aligned — the method only adds the map, never changes the normalization.

Custom steps

A custom step that does not implement apply_aligned raises AlignmentNotSupportedError when Pipeline.apply_aligned reaches it. Add the hook to opt in:

from araclean import OffsetMap, SafetyClass

class MyStep:
    safety = SafetyClass.ENCODING_REPAIR

    def __call__(self, s: str, /) -> str:
        return s.replace("x", "y")  # 1→1 replace

    def apply_aligned(self, s: str, /) -> tuple[str, OffsetMap]:
        # 1→1 replacement: length is unchanged, each char maps to itself
        return self(s), OffsetMap.identity(len(s))

For deletions or multi-char substitutions, use OffsetMap.from_translate or OffsetMap.from_regex_sub — see the OffsetMap API reference.