Deterministic, low-latency normalization for 19 Indian languages.

Indic Text Normalization

A comprehensive library, based on weighted finite-state transducers (WFSTs) and built on Pynini, that converts numbers, dates, currency, measurements and more into natural spoken form. Designed for TTS, ASR, and NLP pipelines. An extension of NVIDIA NeMo for Indic languages.

Why Not Just Use an LLM?

The current trend is to throw an LLM at text normalization. That introduces too much latency and unpredictability for real-time voice applications.

1 – 5 ms

Deterministic FST traversal, not autoregressive token generation.

Deterministic

Same input, same output. No hallucinations, no temperature tuning.

CPU Only

No GPU needed. Deploy anywhere with minimal infrastructure.

Production Proven

Built to meet the real-time TTS needs of Svara, Kenpath’s Indic TTS engine.

Average normalization latency

WFST deterministic traversal: 1 – 5 ms

vs. LLM approaches

LLM (local): ~500+ ms (100x slower)

LLM (API): ~2000+ ms (400x slower)
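
To check latency figures like these on your own hardware, a small timing harness is usually enough. The sketch below is an illustration, not the benchmark behind the numbers above: `measure_latency_ms` and `fake_normalize` are hypothetical names, and the stand-in normalizer exists only so the snippet runs without the library installed.

```python
import statistics
import time


def measure_latency_ms(normalize, texts, warmup=10, repeats=100):
    """Return (mean, p95) latency in milliseconds for a normalize callable."""
    # Warm-up calls so one-time setup costs do not skew the samples.
    for _ in range(warmup):
        for text in texts:
            normalize(text)

    samples = []
    for _ in range(repeats):
        for text in texts:
            start = time.perf_counter()
            normalize(text)
            samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return statistics.mean(samples), p95


# Hypothetical stand-in so the sketch is self-contained; pass the real
# normalizer's normalize method here to time the library itself.
def fake_normalize(text):
    return text.replace("25", "पच्चीस")


mean_ms, p95_ms = measure_latency_ms(fake_normalize, ["मैं 25 साल का हूं"])
print(f"mean={mean_ms:.4f} ms  p95={p95_ms:.4f} ms")
```

Reporting a high percentile next to the mean matters for real-time voice, where occasional slow calls are more noticeable than the average.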

19 Supported Languages

Normalization preview (हिन्दी, Hindi): 25 → पच्चीस

12 Semiotic Classes

Each language module normalizes these written forms into their spoken equivalents.

Cardinal Numbers

25 → पच्चीस

Ordinal Numbers

3rd → तीसरा

Decimals

3.14 → तीन दशमलव एक चार

Fractions

½ → आधा

Dates

15/08/2024 → पंद्रह अगस्त

Time

10:30 → साढ़े दस बजे

Phone Numbers

9876 → नौ आठ सात छह

Measurements

5kg → पांच किलोग्राम

Currency

₹500 → पांच सौ रुपये

Electronic

URLs, emails, hashtags

Roman Numerals

IV → चार

Abbreviations

Dr. → डॉक्टर
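
Several of the classes above differ mainly in how digits are read: a phone number is spoken digit by digit, and so is the fractional part of a decimal. The plain-Python sketch below illustrates that distinction using only the Hindi digit names that appear in the examples on this page. It is not how the library is implemented (the library uses WFST grammars), and `read_digits` and `read_decimal` are hypothetical helper names.

```python
# Digit names taken from the Hindi examples on this page.
HINDI_DIGITS = {
    "0": "शून्य", "1": "एक", "2": "दो", "3": "तीन", "4": "चार",
    "5": "पांच", "6": "छह", "7": "सात", "8": "आठ", "9": "नौ",
}


def read_digits(digits: str) -> str:
    """Verbalize a digit string one digit at a time (phone-number style)."""
    return " ".join(HINDI_DIGITS[d] for d in digits)


def read_decimal(number: str) -> str:
    """Read a decimal: integer part, the word for 'point', then single digits.

    Assumption: only single-digit integer parts are handled here; larger
    integer parts would need a full cardinal grammar.
    """
    integer, fraction = number.split(".")
    if len(integer) != 1:
        raise ValueError("this sketch only covers single-digit integer parts")
    return f"{HINDI_DIGITS[integer]} दशमलव {read_digits(fraction)}"


print(read_digits("9876"))   # नौ आठ सात छह
print(read_decimal("3.14"))  # तीन दशमलव एक चार
```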

Quick Start

Three lines of code. Same API across all 19 languages.

example.py

from indic_text_normalization import Normalizer

# Initialize normalizer for Hindi
normalizer = Normalizer(input_case='cased', lang='hi')

# Normalize text
text = "मैं 25 साल का हूं और मेरा फोन नंबर 9876543210 है।"
normalized = normalizer.normalize(text)
print(normalized)
# → मैं पच्चीस साल का हूं और मेरा फोन नंबर
#   नौ आठ सात छह पांच चार तीन दो एक शून्य है।

The WFST Pipeline

Deterministic, explainable, and fully traceable at every step.

1. Tokenize: split input into tokens

2. Classify: identify semiotic class

3. Verbalize: convert to spoken form

4. Post-process: clean up and format

Example

Input: ₹500 देने हैं

After classification: money { currency: ₹ amount: 500 }

Spoken output: पांच सौ रुपये देने हैं
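
The four stages can be sketched in plain Python to make the data flow concrete. This is a toy illustration under stated assumptions, not the library's code: the real pipeline compiles these stages into WFST grammars, whereas the sketch swaps in a regular expression and two tiny lookup tables that cover only this one example. All function and table names are hypothetical.

```python
import re

# Tiny lookup tables, just enough for the example; not the library's grammars.
CARDINALS = {"500": "पांच सौ"}
CURRENCY = {"₹": "रुपये"}


def tokenize(text):
    """Stage 1: split input on whitespace."""
    return text.split()


def classify(token):
    """Stage 2: tag a token with a semiotic class and its fields."""
    match = re.fullmatch(r"(₹)(\d+)", token)
    if match:
        return {"class": "money", "currency": match.group(1), "amount": match.group(2)}
    return {"class": "word", "text": token}


def verbalize(tagged):
    """Stage 3: turn a tagged token into its spoken form."""
    if tagged["class"] == "money":
        # Spoken order is amount first, then the currency name.
        return f'{CARDINALS[tagged["amount"]]} {CURRENCY[tagged["currency"]]}'
    return tagged["text"]


def post_process(words):
    """Stage 4: join the words and collapse stray whitespace."""
    return re.sub(r"\s+", " ", " ".join(words)).strip()


def normalize(text):
    return post_process([verbalize(classify(tok)) for tok in tokenize(text)])


print(normalize("₹500 देने हैं"))  # पांच सौ रुपये देने हैं
```

Keeping classification separate from verbalization is what makes each step traceable: the intermediate tagged form shows which class a token was assigned before any words are produced.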

Start Contributing

Open source under Apache 2.0. Built on NVIDIA NeMo Text Processing.

$ git clone https://github.com/kenpath/indic-text-normalization.git

Need text normalization for your Indic language pipeline? Get in touch