SUPara: A Balanced English-Bengali Parallel Corpus
Corresponding Author : Md. Abdullah Al Mumin (mumin-cse@sust.edu)
Authors : Abu Awal Md. Shoeb (shoeb-cse@sust.edu), Md. Reza Selim (selim@sust.edu), M. Zafar Iqbal (mzi@sust.edu)
Keywords : parallel corpora, corpus design, balanced corpus
Abstract :
Parallel corpora have become an essential resource in natural language processing. Despite
their importance in many multilingual applications, few effective English-Bengali corpora have
been made available, owing to the scarcity of resources and the intensive labor required for their
creation. This paper introduces the Shahjalal University Parallel (SUPara) corpus, an English-Bengali
sentence-aligned parallel corpus consisting of more than 200,000 words in either language, which
makes it the largest freely released corpus of its kind.
A balanced corpus is a carefully selected and fully described body of natural language
texts that more or less represents the language. SUPara is balanced across five text types
(literature, journalistic texts, instructive texts, administrative texts and texts treating external
communication) and is freely accessible to the research community. In this paper, we describe the
development process of the SUPara corpus in the context of its balanced design, universal encoding,
linguistic markup and sentence alignment. The statistics of the corpus are also presented. To
the best of our knowledge, SUPara is the first freely released balanced English-Bengali
corpus.
Published on December 31st, 2012 in Volume 16, Issue 1, Applied Sciences and Technology