Context-Aware Homograph Disambiguation in Bangla: A Multi-Model Machine Learning Approach

Sayma Sultana Chowdhury

doi:10.63512/sustjst.2024.1005

Authors

Sayma Sultana Chowdhury, Shahjalal University of Science and Technology

Keywords:

Homograph Disambiguation, Bangla TTS, Machine Learning, Naive Bayes, Random Forest, SVM

Abstract

This article addresses the challenge of Bangla homograph disambiguation in Text-to-Speech (TTS) synthesis, where identically spelled words (homographs) take different meanings or pronunciations depending on context. We propose a machine learning-based solution that evaluates five classifiers: Naive Bayes, Support Vector Machine (SVM), Logistic Regression, Random Forest, and XGBoost. Strict hyperparameter tuning and rigorous feature engineering were applied before evaluating how effectively each model disambiguates homographs. The study focused on five homograph pairs: Dak_Dako (ডাক/ডাকো), Bol_Bolo (বল/বলো), Kal_Kalo (কাল/কালো), Komol_Komlo (কমল/কমলো), and Mot_Moto (মত/মতো). The approach yielded significant improvements in TTS accuracy and fewer pronunciation errors. Our findings show that contextual n-grams are crucial for accurate disambiguation. SVM was the top-performing model, achieving the highest accuracy on all five datasets, with values ranging from 0.9215 (Komol_Komlo) to 0.9752 (Kal_Kalo). The results suggest a robust framework for homograph disambiguation in resource-constrained languages such as Bangla.
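
The idea of classifying a homograph from contextual n-grams can be illustrated with a minimal sketch. It is not the article's implementation: the abstract does not specify the feature set, training data, or hyperparameters. The sketch assumes a small context window, unigram and bigram features, and a hand-written multinomial Naive Bayes with add-one smoothing (Naive Bayes being one of the five models named above). The helper names `context_ngrams`, `NaiveBayes`, and `predict_sense` are hypothetical, and the romanized context tokens are invented toy data, not drawn from the study's datasets.

```python
import math
from collections import Counter, defaultdict


def context_ngrams(tokens, target_index, window=2):
    """Unigram and bigram features from the tokens around the homograph."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    feats = []
    # Position-tagged unigrams, e.g. "w[+1]=biral" for the next token.
    for i in range(lo, hi):
        if i != target_index:
            feats.append(f"w[{i - target_index:+d}]={tokens[i]}")
    # Bigrams over the window, with the homograph replaced by a placeholder.
    span = ["<T>" if i == target_index else tokens[i] for i in range(lo, hi)]
    for left, right in zip(span, span[1:]):
        feats.append(f"bg={left}_{right}")
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, samples, labels):
        for feats, label in zip(samples, labels):
            self.class_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            denom = sum(self.feat_counts[label].values()) + self.alpha * vocab_size
            score = math.log(count / total)  # log prior
            for f in feats:
                score += math.log((self.feat_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples for the Kal_Kalo pair (NOT the study's data).
# Each item: (tokens, index of the homograph, pronunciation label).
TRAIN = [
    (["ami", "কাল", "bari", "jabo"], 1, "kal"),
    (["se", "কাল", "dhaka", "jabe"], 1, "kal"),
    (["tar", "কাল", "biral", "ache"], 1, "kalo"),
    (["ekta", "কাল", "biral", "dekhlam"], 1, "kalo"),
]

model = NaiveBayes().fit(
    [context_ngrams(tokens, idx) for tokens, idx, _ in TRAIN],
    [label for _, _, label in TRAIN],
)


def predict_sense(tokens, target_index):
    """Predict the pronunciation label of the homograph at target_index."""
    return model.predict(context_ngrams(tokens, target_index))


if __name__ == "__main__":
    print(predict_sense(["amar", "কাল", "biral", "ache"], 1))
    print(predict_sense(["tumi", "কাল", "bari", "jabe"], 1))
```

On this toy data the two calls select different labels because each test sentence shares context features with only one class. A real system would need far more training examples per pair, and the article reports SVM, not Naive Bayes, as the strongest of the five models.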
