Context-Aware Homograph Disambiguation in Bangla: A Multi-Model Machine Learning Approach
DOI:
https://doi.org/10.63512/sustjst.2024.1005
Keywords:
Homograph Disambiguation, Bangla TTS, Machine Learning, Naive Bayes, Random Forest, SVM
Abstract
This article addresses the challenge of Bangla homograph disambiguation in Text-to-Speech (TTS) synthesis, where identically spelled words (homographs) take different meanings or pronunciations depending on context. We propose a machine learning-based solution that compares five classifiers: Naive Bayes, Support Vector Machine (SVM), Logistic Regression, Random Forest, and XGBoost. Each model was evaluated after rigorous feature engineering and strict hyperparameter tuning to measure how accurately it disambiguates homographs. The study focused on five homograph pairs, Dak_Dako (ডাক/ডাকো), Bol_Bolo (বল/বলো), Kal_Kalo (কাল/কালো), Komol_Komlo (কমল/কমলো), and Mot_Moto (মত/মতো), and reports significant improvements in TTS accuracy and fewer pronunciation errors. Our findings show that contextual n-grams are crucial for accurate disambiguation. SVM was the top-performing model on every dataset, with accuracy ranging from 0.9215 (Komol_Komlo) to 0.9752 (Kal_Kalo). The results suggest a robust framework for homograph disambiguation in resource-constrained languages like Bangla.
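To make the idea of contextual n-gram features concrete, the following is a minimal sketch in pure Python and not the authors' pipeline. It extracts position-tagged unigrams and joined left/right context n-grams around a target homograph, then trains a small multinomial Naive Bayes classifier (one of the five model families named above) on them. The window size, the feature naming scheme, the `context_ngrams` and `NaiveBayes` helpers, the whitespace tokenization, and the four toy sentences are all assumptions made for illustration; none of them come from the article or its datasets.

```python
import math
from collections import Counter, defaultdict


def context_ngrams(tokens, index, window=2):
    """Return context features for the homograph at tokens[index].

    Hypothetical feature scheme: position-tagged unigrams plus one joined
    n-gram for the left context and one for the right context.
    """
    padded = ["<s>"] * window + list(tokens) + ["</s>"] * window
    center = index + window
    feats = []
    # Position-tagged unigrams, e.g. "w-1=..." is the word just before the target.
    for offset in range(-window, window + 1):
        if offset != 0:
            feats.append(f"w{offset:+d}={padded[center + offset]}")
    # Joined n-grams covering the whole left and right context windows.
    feats.append("left=" + "_".join(padded[center - window:center]))
    feats.append("right=" + "_".join(padded[center + 1:center + window + 1]))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes over string features with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, feature_lists, labels):
        for feats, label in zip(feature_lists, labels):
            self.label_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            denom = sum(self.feat_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(count / total)  # log prior
            for f in feats:
                if f in self.vocab:  # features never seen in training are skipped
                    score += math.log((self.feat_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples for the Kal_Kalo pair: (tokens, target index, label).
TRAIN = [
    (["আমি", "কাল", "আসব"], 1, "kal"),
    (["সে", "কাল", "যাবে"], 1, "kal"),
    (["কাল", "বিড়াল", "দেখলাম"], 0, "kalo"),
    (["একটি", "কাল", "বিড়াল"], 1, "kalo"),
]

model = NaiveBayes().fit(
    [context_ngrams(tokens, idx) for tokens, idx, _ in TRAIN],
    [label for _, _, label in TRAIN],
)

print(model.predict(context_ngrams(["একটি", "কাল", "বিড়াল", "দেখলাম"], 1)))
print(model.predict(context_ngrams(["আমরা", "কাল", "আসব"], 1)))
```

In a TTS front end, the predicted label would select which pronunciation to pass to the synthesizer. A real system would need one labelled dataset per homograph pair, as the five pairs in the study suggest, and would likely swap this toy classifier for a tuned SVM given the accuracies reported above.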
Copyright (c) 2026 SUST Journal of Science and Technology (SUST JST)

This work is licensed under a Creative Commons Attribution 4.0 International License.
