Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/423359
Title: | Embedding Linguistic features for Improving Neural Machine Translation from English to Malayalam |
Researcher: | Premjith B |
Guide(s): | Soman K P and Rajendran S |
Keywords: | Computer Science Interdisciplinary Applications; Neural Machine Translation system; NMT; Machine Translation; Machine Learning; Deep Learning; Indian languages; Computational linguistics |
University: | Amrita Vishwa Vidyapeetham University |
Completed Date: | 2021 |
Abstract: | The objective of the work presented in the thesis is to study the significance of linguistic features used in a Neural Machine Translation (NMT) system for improving the quality of translation from English to Malayalam. Morphological characteristics of Malayalam nouns and verbs and a preposition sense disambiguation model are selected as the linguistic features for this work. Data is a crucial component of any Machine Learning model. The performance of a model is directly proportional to the size and quality of the data. As far as Machine Translation (MT) from English to Malayalam is concerned, a large gold standard corpus is not available. The available corpora contain noise and require much time to clean. Therefore, a bilingual corpus of 40,000 sentence pairs was developed. These sentences were collected from various resources, span multiple domains, and are used along with other publicly available data. The basic framework of the NMT system is an encoder-decoder architecture. The initial model consists of a Bidirectional Long Short-Term Memory (Bi-LSTM) network at the encoder and a Long Short-Term Memory (LSTM) network at the decoder. This model also has an attention layer to selectively focus on certain parts of the English text to predict the proper Malayalam words. Another model architecture replaces the Bi-LSTM and LSTM networks with a transformer network, which is observed to outperform the former by a significant margin. During the development of this NMT model, it is observed that the model cannot translate some of the sentences effectively. This problem occurs mainly when the number of words in the input English sentence is relatively higher than that in the Malayalam sentence. It is common in Malayalam to construct wordforms by conjoining a root word, morphemes, and words of different Parts of Speech (POS). It may even lead to the construction of a sentence that contains only one word. Translation quality is highly reduced in such scenarios. A morphological segmentation model is utilized... |
Pagination: | xxxv, 262 |
URI: | http://hdl.handle.net/10603/423359 |
Appears in Departments: | Center for Computational Engineering and Networking (CEN) |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
01_title.pdf | Attached File | 141.72 kB | Adobe PDF | View/Open |
02_prelim pages.pdf | | 1.13 MB | Adobe PDF | View/Open |
03_contents.pdf | | 107.73 kB | Adobe PDF | View/Open |
04_abstract.pdf | | 52.28 kB | Adobe PDF | View/Open |
05_chapter 1.pdf | | 279.18 kB | Adobe PDF | View/Open |
06_chapter 2.pdf | | 168.75 kB | Adobe PDF | View/Open |
07_chapter 3.pdf | | 494.86 kB | Adobe PDF | View/Open |
08_chapter 4.pdf | | 315.3 kB | Adobe PDF | View/Open |
09_chapter 5.pdf | | 222.91 kB | Adobe PDF | View/Open |
10_chapter 6.pdf | | 321.58 kB | Adobe PDF | View/Open |
11_chapter 7.pdf | | 523.42 kB | Adobe PDF | View/Open |
12_chapter 8.pdf | | 431.4 kB | Adobe PDF | View/Open |
13_chapter 9.pdf | | 91.42 kB | Adobe PDF | View/Open |
14_annexures.pdf | | 167.57 kB | Adobe PDF | View/Open |
80_recommendation.pdf | | 232.7 kB | Adobe PDF | View/Open |
Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).