Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/423359
Title: Embedding Linguistic features for Improving Neural Machine Translation from English to Malayalam
Researcher: Premjith B
Guide(s): Soman K P and Rajendran S
Keywords: Computer Science Interdisciplinary Applications; Neural Machine Translation system; NMT; Machine Translation; Machine Learning; Deep Learning; Indian languages; Computational linguistics
University: Amrita Vishwa Vidyapeetham University
Completed Date: 2021
Abstract: The objective of the work presented in this thesis is to study the significance of linguistic features in a Neural Machine Translation (NMT) system for improving the quality of translation from English to Malayalam. Morphological characteristics of Malayalam nouns and verbs and a preposition sense disambiguation model are selected as the linguistic features for this work. Data is a crucial component of any Machine Learning model, and a model's performance depends heavily on the size and quality of its data. For Machine Translation (MT) from English to Malayalam, no large gold-standard corpus is available; the available corpora contain noise and take considerable time to clean. Therefore, a bilingual corpus of 40,000 sentence pairs was developed. These sentences were collected from various resources, span multiple domains, and are used along with other publicly available data. The basic framework of the NMT system is an encoder-decoder architecture. The initial model consists of a Bidirectional Long Short-Term Memory (Bi-LSTM) network at the encoder and a Long Short-Term Memory (LSTM) network at the decoder. This model also has an attention layer that selectively focuses on certain parts of the English text to predict the proper Malayalam words. Another model architecture replaces the Bi-LSTM and LSTM networks with a transformer network, which is observed to outperform the former by a significant margin. During the development of this NMT model, it was observed that the model cannot translate some sentences effectively. This problem occurs mainly when the number of words in the input English sentence is considerably higher than that in the Malayalam sentence. It is common in Malayalam to construct word forms by conjoining a root word, morphemes, and words with different Parts of Speech (POS); this may even produce a sentence that contains only one word. Translation quality is greatly reduced in such scenarios. A morphological segmentation model is utilized...
Pagination: xxxv, 262
URI: http://hdl.handle.net/10603/423359
Appears in Departments: Center for Computational Engineering and Networking (CEN)

Files in This Item:
File                   Description    Size       Format
01_title.pdf           Attached File  141.72 kB  Adobe PDF
02_prelim pages.pdf                   1.13 MB    Adobe PDF
03_contents.pdf                       107.73 kB  Adobe PDF
04_abstract.pdf                       52.28 kB   Adobe PDF
05_chapter 1.pdf                      279.18 kB  Adobe PDF
06_chapter 2.pdf                      168.75 kB  Adobe PDF
07_chapter 3.pdf                      494.86 kB  Adobe PDF
08_chapter 4.pdf                      315.3 kB   Adobe PDF
09_chapter 5.pdf                      222.91 kB  Adobe PDF
10_chapter 6.pdf                      321.58 kB  Adobe PDF
11_chapter 7.pdf                      523.42 kB  Adobe PDF
12_chapter 8.pdf                      431.4 kB   Adobe PDF
13_chapter 9.pdf                      91.42 kB   Adobe PDF
14_annexures.pdf                      167.57 kB  Adobe PDF
80_recommendation.pdf                 232.7 kB   Adobe PDF


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
