SUMono: A Representative Modern Bengali Corpus
Corresponding Author : Md. Abdullah Al Mumin (mumin-cse@sust.edu)
Authors : Abu Awal Md. Shoeb (shoeb-cse@sust.edu), Mohammad Reza Selim (selim@sust.edu), M. Zafar Iqbal (mzi@sust.edu)
Keywords : monolingual corpora, representative corpus, modern Bengali, Bengali corpus, Zipf's law
Abstract :
The development of language engineering applications requires sizable, reliable and
representative corpora. However, such corpora are not routinely available for the Bengali
language. This paper introduces the Shahjalal University Monolingual (SUMono) corpus, a
representative modern Bengali corpus of more than 27 million words, which is the
largest of its kind. We describe how we constructed the SUMono corpus from available
online and offline Bengali texts, with articles tagged as belonging to six domains: Natural Science,
Social Science, Computer and IT, Literature, Mass Media and Blogs. We present some
characteristics of the Bengali language based on a statistical analysis of this corpus. We also
compare the 'inherent sparseness' of Bengali with that of English and Arabic by observing the
type-to-token ratio of each language. We assess the corpus in terms of its representativeness,
homogeneity and vocabulary growth rate using established techniques: Zipf's law, the
distribution of function words and Baayen's equation, respectively. We found that the corpus is
balanced with respect to the frequency distribution as well as the range of idiosyncratic phenomena.
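
As a rough illustration of two of the measures named above, the following sketch computes a type-to-token ratio and fits the slope of the log-log rank-frequency curve that Zipf's law predicts to be close to -1. It is not the authors' procedure: the function names, the plain least-squares fit, and the assumption that the text is already split into word tokens are all illustrative choices that do not come from the paper.

```python
from collections import Counter
import math


def type_token_ratio(tokens):
    """Number of distinct word types divided by the total number of tokens."""
    if not tokens:
        raise ValueError("token list is empty")
    return len(set(tokens)) / len(tokens)


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Under Zipf's law the frequency of the r-th most common word is roughly
    proportional to 1/r, so the fitted slope should be near -1.
    """
    # Frequencies sorted from most to least common; rank is position + 1.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct word types")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Toy input (hypothetical); a real run would use tokens from a corpus file.
    sample = "a b a c a b".split()
    print(type_token_ratio(sample))
    print(zipf_slope(sample))
```

A higher type-to-token ratio on samples of equal size indicates a sparser vocabulary, which is the sense in which such ratios are compared across languages. The ratio depends strongly on sample size, so comparisons are only meaningful between samples of the same length.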
Published on June 30th, 2014 in Volume 21, Issue 1, Applied Sciences and Technology