
Daybreak: India’s AI still doesn’t speak India. Can it?
Feb 2, 2026
The episode tests ChatGPT's Punjabi and finds spelling errors and Hindi bleeding into its responses. It explores how Hindi dominates datasets while many regional languages and dialects are ignored, contrasts fast private datasets with underused government corpora, and explains why multimodal data and legal costs matter. The conversation warns that AI is flattening India’s linguistic diversity.
AI Snips
The Two Worlds Of Vernacular AI
- Vernacular AI sits in two fragmented universes: private datasets optimised for speed and public corpora focused on nuance.
- These datasets rarely cross, leaving major gaps for Indian-language AI.
Hindi Bots Boost Tax Collection
- Gurugram used Hindi AI calls to remind taxpayers and collected about Rs 200 crore.
- The success hinged on Hindi's strong representation in datasets, not on broad linguistic coverage.
Performance Trumps Inclusivity
- Private-sector language datasets prioritise speed and accuracy over inclusivity and dialectal nuance.
- That optimisation flattens languages into standardised forms that miss real regional speech.
