SUMono: A Representative Modern Bengali Corpus

Corresponding Author : Md. Abdullah Al Mumin (mumin-cse@sust.edu)

Authors : Abu Awal Md. Shoeb (shoeb-cse@sust.edu), Mohammad Reza Selim (selim@sust.edu), M. Zafar Iqbal (mzi@sust.edu)

Keywords : monolingual corpora, representative corpus, modern Bengali, Bengali corpus, Zipf's law

Abstract :

The development of Language Engineering applications requires the availability of sizable, reliable and representative corpora. However, such corpora are not routinely available for the Bengali language. This paper introduces the Shahjalal University Monolingual (SUMono) corpus, a representative modern Bengali corpus consisting of more than 27 million words, which is the largest of its kind. We describe how we constructed the SUMono corpus from available online and offline Bengali texts, with articles tagged as belonging to 6 domains: Natural Science, Social Science, Computer and IT, Literature, Mass Media and Blogs. We present some characteristics of the Bengali language based on a statistical analysis of this corpus. We also compare the 'inherent sparseness' of Bengali with that of English and Arabic by observing the type-to-token ratio of each language. We assess our corpus in terms of its representativeness, homogeneity and vocabulary growth rate using established techniques: Zipf's law, the distribution of function words and Baayen's equation, respectively. We found that our corpus is balanced with respect to the frequency distribution as well as to the range of idiosyncratic phenomena.
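
As a rough illustration of two of the measures named in the abstract, the sketch below computes a type-to-token ratio and the slope of a log-log rank-frequency fit, which Zipf's law predicts to be close to -1. It is not the authors' code: the function names `type_token_ratio` and `zipf_slope` are hypothetical, and the input is assumed to be an already tokenized list of words, since the tokenization used for SUMono is not described here.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct word forms (types) divided by total words (tokens)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is what Zipf's law predicts for natural text.
    Assumption: at least two distinct word types are present.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct types to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy input whose frequencies are exactly 6/rank (6, 3, 2).
toy_tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
toy_ttr = type_token_ratio(toy_tokens)
toy_slope = zipf_slope(toy_tokens)
```

Because the type-to-token ratio falls as a text grows, a comparison across languages is only meaningful on samples of equal size; the toy input above is far too small to say anything about a real corpus.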

Published on June 30th, 2014 in Volume 21, Issue 1, Applied Sciences and Technology