SUPara: A Balanced English-Bengali Parallel Corpus

Corresponding Author : Md. Abdullah Al Mumin (mumin-cse@sust.edu)

Authors : Abu Awal Md. Shoeb (shoeb-cse@sust.edu), Md. Reza Selim (selim@sust.edu), M. Zafar Iqbal (mzi@sust.edu)

Keywords : parallel corpora, corpus design, balanced corpus

Abstract :

Parallel corpora have become an essential resource in natural language processing. Despite their importance in many multilingual applications, few effective English-Bengali corpora have been made available, owing to the scarcity of resources and the intensive labor required to create them. This paper introduces the Shahjalal University Parallel (SUPara) corpus, an English-Bengali sentence-aligned parallel corpus of more than 200,000 words in each language, making it the largest freely released corpus of its kind. A balanced corpus is a carefully selected and fully described body of natural language texts that is broadly representative of the language. SUPara is balanced across five text types (literature, journalistic texts, instructive texts, administrative texts, and texts treating external communication) and is freely accessible to the research community. In this paper, we describe the development of the SUPara corpus in terms of its balanced design, universal encoding, linguistic markup, and sentence alignment. Statistics of the corpus are also presented. To the best of our knowledge, SUPara is the first freely released balanced English-Bengali corpus.

Published on December 31st, 2012 in Volume 16, Issue 1, Applied Sciences and Technology