Indic Text Normalization
Deterministic, low-latency normalization for 19 Indian languages.

A comprehensive WFST-based library built on Pynini that converts numbers, dates, currency, measurements and more into natural spoken form. Designed for TTS, ASR, and NLP pipelines. An extension of NVIDIA NeMo for Indic languages.

Why Not Just Use an LLM?

The current trend is to throw an LLM at text normalization. That introduces too much latency and unpredictability for real-time voice applications.

Average normalization latency

WFST deterministic traversal: 1 – 5 ms
LLM (local): ~500+ ms (100x)
LLM (API): ~2000+ ms (400x)

Deterministic FST traversal, not autoregressive token generation.
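One way to check latency figures like these on your own hardware is a small timing harness. The sketch below is only an illustration: `measure_latency_ms` and `fake_normalize` are hypothetical names that are not part of the library, and the stand-in normalizer exists only so the example runs without any dependencies.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_latency_ms(normalize: Callable[[str], str],
                       sentences: Iterable[str],
                       warmup: int = 3) -> dict:
    """Time one call per sentence and summarize the latencies in milliseconds."""
    sentences = list(sentences)
    # Warm-up calls so one-time setup cost is not counted in the timings.
    for text in sentences[:warmup]:
        normalize(text)
    timings = []
    for text in sentences:
        start = time.perf_counter()
        normalize(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "count": len(timings),
        "mean_ms": statistics.mean(timings),
        "median_ms": statistics.median(timings),
        "max_ms": max(timings),
    }


# Hypothetical stand-in normalizer (an assumption, not the library's code).
def fake_normalize(text: str) -> str:
    return text.replace("25", "पच्चीस")


stats = measure_latency_ms(fake_normalize, ["मैं 25 साल का हूं"] * 100)
print(stats)
```

To time the real library, pass the `normalizer.normalize` method from the Quick Start in place of `fake_normalize`.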

Deterministic

Same input, same output. No hallucinations, no temperature tuning.

CPU Only

No GPU needed. Deploy anywhere with minimal infrastructure.

Production Proven

Born from real-time TTS needs with Svara, Kenpath’s Indic TTS engine.

19 Supported Languages

Normalization preview: how “25” is spoken.
Hindi (हिन्दी): 25 → पच्चीस

12 Semiotic Classes

Each language module normalizes these written forms into their spoken equivalents.
Cardinal Numbers: 25 → पच्चीस
Ordinal Numbers: 3rd → तीसरा
Decimals: 3.14 → तीन दशमलव एक चार
Fractions: ½ → आधा
Dates: 15/08/2024 → पंद्रह अगस्त
Time: 10:30 → साढ़े दस बजे
Phone Numbers: 9876 → नौ आठ सात छह
Measurements: 5kg → पांच किलोग्राम
Currency: ₹500 → पांच सौ रुपये
Electronic: URLs, emails, hashtags
Roman Numerals: IV → चार
Abbreviations: Dr. → डॉक्टर
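The class matters because the same digits are read differently depending on context: a phone number is spoken digit by digit, while a cardinal is spoken as one number. The following is a minimal pure-Python sketch of the digit-by-digit reading, not the library's WFST grammar. The names `HINDI_DIGITS`, `read_digits`, and `read_decimal` are hypothetical, and the digit words are taken from the examples on this page.

```python
# Hindi words for the digits 0-9, as they appear in the examples on this page.
HINDI_DIGITS = {
    "0": "शून्य", "1": "एक", "2": "दो", "3": "तीन", "4": "चार",
    "5": "पांच", "6": "छह", "7": "सात", "8": "आठ", "9": "नौ",
}


def read_digits(digits: str) -> str:
    """Read a digit string one digit at a time (phone-number style)."""
    return " ".join(HINDI_DIGITS[d] for d in digits)


def read_decimal(text: str) -> str:
    """Read a decimal such as '3.14'.

    Toy limitation: only a single-digit integer part is supported, because
    a longer integer part would need a full cardinal grammar.
    """
    integer_part, fractional_part = text.split(".")
    if len(integer_part) != 1:
        raise ValueError("toy sketch supports a single-digit integer part only")
    return f"{HINDI_DIGITS[integer_part]} दशमलव {read_digits(fractional_part)}"


print(read_digits("9876"))   # phone-number reading
print(read_decimal("3.14"))  # decimal reading
```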

Quick Start

Three lines of code. Same API across all 19 languages.
example.py
from indic_text_normalization import Normalizer

# Initialize normalizer for Hindi
normalizer = Normalizer(input_case='cased', lang='hi')

# Normalize text
text = "मैं 25 साल का हूं और मेरा फोन नंबर 9876543210 है।"
normalized = normalizer.normalize(text)
print(normalized)
# → मैं पच्चीस साल का हूं और मेरा फोन नंबर
# नौ आठ सात छह पांच चार तीन दो एक शून्य है।

The WFST Pipeline

Deterministic, explainable, and fully traceable at every step.
1. Tokenize: Split input into tokens
2. Classify: Identify semiotic class
3. Verbalize: Convert to spoken form
4. Post-process: Clean up and format
Example
Input: ₹500 देने हैं
Classified: money { currency: ₹ amount: 500 }
Verbalized: पांच सौ रुपये देने हैं
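The four stages can be illustrated with a toy pure-Python pipeline. This is a sketch under loud assumptions: the real library compiles these stages into WFSTs with Pynini, whereas the code below uses plain functions and a two-entry lookup table. Every name in it (`CARDINALS`, `tokenize`, `classify`, `verbalize`, `post_process`, `normalize`) is hypothetical and chosen only to mirror the stage names above.

```python
import re

# Tiny lookup table holding only the cardinals shown on this page.
# A real grammar covers every number; this toy raises KeyError for others.
CARDINALS = {25: "पच्चीस", 500: "पांच सौ"}


def tokenize(text: str) -> list:
    """Stage 1: split the input into whitespace-separated tokens."""
    return text.split()


def classify(token: str) -> dict:
    """Stage 2: tag each token with a semiotic class."""
    money = re.fullmatch(r"₹(\d+)", token)
    if money:
        return {"class": "money", "currency": "₹", "amount": int(money.group(1))}
    if re.fullmatch(r"\d+", token):
        return {"class": "cardinal", "value": int(token)}
    return {"class": "word", "text": token}


def verbalize(tagged: dict) -> str:
    """Stage 3: convert a tagged token to its spoken form."""
    if tagged["class"] == "money":
        return f'{CARDINALS[tagged["amount"]]} रुपये'
    if tagged["class"] == "cardinal":
        return CARDINALS[tagged["value"]]
    return tagged["text"]


def post_process(parts: list) -> str:
    """Stage 4: join the pieces and drop any empty fragments."""
    return " ".join(part for part in parts if part)


def normalize(text: str) -> str:
    return post_process([verbalize(classify(tok)) for tok in tokenize(text)])


print(classify("₹500"))
print(normalize("₹500 देने हैं"))
```

Because every stage is a pure function of its input, the intermediate tagged form can be inspected at any point, which is the traceability property described for this pipeline.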

Start Contributing

Open source under Apache 2.0. Built on NVIDIA NeMo Text Processing.

$ git clone https://github.com/kenpath/indic-text-normalization.git

Need text normalization for your Indic language pipeline? Get in touch