Embedding Linguistic features for Improving Neural Machine Translation from English to Malayalam

Premjith B

Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/423359

Title:	Embedding Linguistic features for Improving Neural Machine Translation from English to Malayalam
Researcher:	Premjith B
Guide(s):	Soman K P and Rajendran S
Keywords:	Computer Science Interdisciplinary Applications; Neural Machine Translation system; NMT; Machine Translation; Machine Learning; Deep Learning; Indian languages; Computational linguistics
University:	Amrita Vishwa Vidyapeetham University
Completed Date:	2021
Abstract:	The objective of the work presented in the thesis is to study the significance of linguistic features used in a Neural Machine Translation (NMT) system for improving the quality of translation from English to Malayalam. Morphological characteristics of Malayalam nouns and verbs and a preposition sense disambiguation model are selected as the linguistic features for this work. Data is a crucial component of any Machine Learning model. The performance of a model is directly proportional to the size and quality of the data. As far as Machine Translation (MT) from English to Malayalam is concerned, a large gold standard corpus is not available. The available corpora contain noise and require much time to clean. Therefore, a bilingual corpus of 40,000 sentence pairs was developed. These sentences were collected from various resources, span multiple domains, and are used along with other publicly available data. The basic framework of the NMT system is an encoder-decoder architecture. The initial model consists of a Bidirectional Long Short-Term Memory (Bi-LSTM) network at the encoder and a Long Short-Term Memory (LSTM) network at the decoder. This model also has an attention layer to selectively focus on certain parts of the English text to predict the proper Malayalam words. Another model architecture replaces the Bi-LSTM and LSTM networks with a transformer network, which is observed to outperform the former by a significant margin. During the development of this NMT model, it is observed that the model cannot translate some of the sentences effectively. This problem occurs mainly when the number of words in the input English sentence is considerably higher than that in the Malayalam sentence. It is common in Malayalam to construct word forms by conjoining root words, morphemes, and words with different Parts-of-Speech (POS). It may even lead to the construction of a sentence that contains only one word. Translation quality is greatly reduced in such scenarios. A morphological segmentation model is utilized...
Pagination:	xxxv, 262
URI:	http://hdl.handle.net/10603/423359
Appears in Departments:	Center for Computational Engineering and Networking (CEN)
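
Illustrative note: the abstract mentions an attention layer that lets the decoder selectively focus on parts of the English input. The sketch below shows one common way such a layer can be computed. It is not taken from the thesis: the dot-product scoring function, the names `softmax` and `attend`, and the toy vectors are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Dot-product attention (an assumed scoring function).

    decoder_state:  list of floats, the current decoder hidden state
    encoder_states: list of lists, one hidden state per source token
    Returns (weights, context), where context is the weighted sum of
    the encoder states.
    """
    # Score each source position against the decoder state.
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted average of encoder states.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three source positions with 2-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

In this toy case the first and third source positions align equally well with the decoder state, so they receive equal weight, and the second position receives much less. A trained NMT model would learn the hidden states, and possibly a different scoring function, from data.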

Files in This Item:

File	Description	Size	Format
01_title.pdf	Attached File	141.72 kB	Adobe PDF
02_prelim pages.pdf		1.13 MB	Adobe PDF
03_contents.pdf		107.73 kB	Adobe PDF
04_abstract.pdf		52.28 kB	Adobe PDF
05_chapter 1.pdf		279.18 kB	Adobe PDF
06_chapter 2.pdf		168.75 kB	Adobe PDF
07_chapter 3.pdf		494.86 kB	Adobe PDF
08_chapter 4.pdf		315.3 kB	Adobe PDF
09_chapter 5.pdf		222.91 kB	Adobe PDF
10_chapter 6.pdf		321.58 kB	Adobe PDF
11_chapter 7.pdf		523.42 kB	Adobe PDF
12_chapter 8.pdf		431.4 kB	Adobe PDF
13_chapter 9.pdf		91.42 kB	Adobe PDF
14_annexures.pdf		167.57 kB	Adobe PDF
80_recommendation.pdf		232.7 kB	Adobe PDF


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).


Shodhganga : a reservoir of Indian theses @ INFLIBNET