
Deterministic, low-latency normalization for 19 Indian languages.
Indic Text Normalization
A comprehensive WFST-based library built on Pynini that converts numbers, dates, currency, measurements and more into natural spoken form. Designed for TTS, ASR, and NLP pipelines. An extension of NVIDIA NeMo for Indic languages.
Why Not Just Use an LLM?
The current trend is to throw an LLM at text normalization. That introduces too much latency and unpredictability for real-time voice applications.
Average normalization latency
WFST deterministic traversal: 1 – 5 ms
vs. LLM approaches
LLM (local): ~500+ ms (100x slower)
LLM (API): ~2000+ ms (400x slower)
1 – 5 ms
Deterministic FST traversal, not autoregressive token generation.
Deterministic
Same input, same output. No hallucinations, no temperature tuning.
CPU Only
No GPU needed. Deploy anywhere with minimal infrastructure.
Production Proven
Born from real-time TTS needs with Svara, Kenpath’s Indic TTS engine.
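To check the latency figures above on your own hardware, a small timing harness is enough. The sketch below is not part of the library: `measure_latency_ms` is a hypothetical helper that accepts any callable taking a string, so you can pass a normalizer's `normalize` method in place of the stand-in used here.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(normalize: Callable[[str], str], text: str, runs: int = 100) -> float:
    """Return the median wall-clock latency of normalize(text) in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        normalize(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# str.upper is only a stand-in so the sketch runs without the library installed.
# Swap in a real normalizer's normalize method to measure it.
latency = measure_latency_ms(str.upper, "hello 25", runs=50)
print(f"median latency: {latency:.4f} ms")
```

The median is reported rather than the mean so that a single slow run (for example, the first call after loading grammars) does not dominate the result.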
19 Supported Languages
Normalization Preview
Hindi (हिन्दी): 25 → पच्चीस
12 Semiotic Classes
Each language module normalizes these written forms into their spoken equivalents.
Cardinal Numbers
25 → पच्चीस
Ordinal Numbers
3rd → तीसरा
Decimals
3.14 → तीन दशमलव एक चार
Fractions
½ → आधा
Dates
15/08/2024 → पंद्रह अगस्त
Time
10:30 → साढ़े दस बजे
Phone Numbers
9876 → नौ आठ सात छह
Measurements
5kg → पांच किलोग्राम
Currency
₹500 → पांच सौ रुपये
Electronic
URLs, emails, hashtags
Roman Numerals
IV → चार
Abbreviations
Dr. → डॉक्टर
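As the Phone Numbers example above shows, some classes are read digit by digit rather than as a single quantity. A minimal pure-Python sketch of that idea follows. It is an illustration only, not the library's WFST grammar; `HINDI_DIGITS` and `verbalize_digits` are hypothetical names, and the digit words are the ones that appear in the examples on this page.

```python
# Hindi digit words, as they appear in the examples on this page.
HINDI_DIGITS = {
    "0": "शून्य", "1": "एक", "2": "दो", "3": "तीन", "4": "चार",
    "5": "पांच", "6": "छह", "7": "सात", "8": "आठ", "9": "नौ",
}


def verbalize_digits(number: str) -> str:
    """Read an ASCII digit string one digit at a time."""
    if not number or any(ch not in HINDI_DIGITS for ch in number):
        raise ValueError(f"expected ASCII digits only, got {number!r}")
    return " ".join(HINDI_DIGITS[ch] for ch in number)


print(verbalize_digits("9876"))  # → नौ आठ सात छह
```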
Quick Start
Three lines of code. Same API across all 19 languages.
example.py
from indic_text_normalization import Normalizer

# Initialize normalizer for Hindi
normalizer = Normalizer(input_case='cased', lang='hi')

# Normalize text
text = "मैं 25 साल का हूं और मेरा फोन नंबर 9876543210 है।"
normalized = normalizer.normalize(text)
print(normalized)
# → मैं पच्चीस साल का हूं और मेरा फोन नंबर
#   नौ आठ सात छह पांच चार तीन दो एक शून्य है।
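Because the same input always produces the same output, repeated strings can be served from a cache without changing results. The sketch below assumes only that you have some callable mapping a string to a string, such as `normalizer.normalize`. The `cached` helper and `fake_normalize` stand-in are hypothetical and not part of the library.

```python
from functools import lru_cache
from typing import Callable, List


def cached(normalize: Callable[[str], str], maxsize: int = 4096) -> Callable[[str], str]:
    """Wrap a deterministic normalize callable with an in-memory LRU cache."""
    return lru_cache(maxsize=maxsize)(normalize)


# Stand-in for normalizer.normalize so the sketch runs without the library installed.
calls: List[str] = []


def fake_normalize(text: str) -> str:
    calls.append(text)  # record every real (uncached) invocation
    return text.replace("25", "पच्चीस")


fast_normalize = cached(fake_normalize)
fast_normalize("मैं 25 साल का हूं")
fast_normalize("मैं 25 साल का हूं")  # second call is served from the cache
print(len(calls))  # → 1
```

Caching like this is only safe for deterministic functions, which is why it would not carry over to a sampled LLM output.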
The WFST Pipeline
Deterministic, explainable, and fully traceable at every step.
1. Tokenize: Split input into tokens
2. Classify: Identify semiotic class
3. Verbalize: Convert to spoken form
4. Post-process: Clean up and format
Example
₹500 देने हैं
money { currency: ₹ amount: 500 }
पांच सौ रुपये देने हैं
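The four stages can be traced on the example above with a toy pure-Python pipeline. This is a sketch for illustration only: the real library compiles these stages into WFST grammars with Pynini, whereas the version below uses a regular expression and lookup tables. All function names and the two tables are hypothetical, and the tables cover only the values shown in this example.

```python
import re
from typing import Dict, List

# Lookup tables limited to the values in the example; the real grammars are far larger.
CURRENCY_WORDS = {"₹": "रुपये"}
NUMBER_WORDS = {"500": "पांच सौ"}
MONEY = re.compile(r"^(₹)([0-9]+)$")


def tokenize(text: str) -> List[str]:
    """Stage 1: split input into whitespace-separated tokens."""
    return text.split()


def classify(token: str) -> Dict[str, str]:
    """Stage 2: tag a token with its semiotic class."""
    match = MONEY.match(token)
    if match:
        return {"class": "money", "currency": match.group(1), "amount": match.group(2)}
    return {"class": "word", "text": token}


def verbalize(tagged: Dict[str, str]) -> str:
    """Stage 3: convert a tagged token to its spoken form."""
    if tagged["class"] == "money":
        # Unknown amounts fall back to the raw digits in this toy version.
        amount = NUMBER_WORDS.get(tagged["amount"], tagged["amount"])
        return f"{amount} {CURRENCY_WORDS[tagged['currency']]}"
    return tagged["text"]


def post_process(spoken: List[str]) -> str:
    """Stage 4: join the spoken pieces with single spaces."""
    return " ".join(part for part in spoken if part)


def normalize(text: str) -> str:
    return post_process([verbalize(classify(tok)) for tok in tokenize(text)])


print(classify("₹500"))
print(normalize("₹500 देने हैं"))  # → पांच सौ रुपये देने हैं
```

Keeping the tagged intermediate form explicit is what makes each step inspectable: you can print the output of `classify` to see why a token was verbalized the way it was.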
Start Contributing
Open source under Apache 2.0. Built on NVIDIA NeMo Text Processing.
$ git clone https://github.com/kenpath/indic-text-normalization.git
Need text normalization for your Indic language pipeline? Get in touch