Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/360622
Title: Exploration of Semantic Space of Word Vectors Using Word Embedding
Researcher: Sanjanasri J P
Guide(s): Soman K P
Keywords: Center for Computational Engineering and Networking; Natural Language Processing; NLP; Neural Networks; semantic space; bilingual word; Word Embedding; Deep Learning; machine learning; Pruning; Indian languages
Computer Science; Interdisciplinary Applications
University: Amrita Vishwa Vidyapeetham University
Completed Date: 2021
Abstract: The prime objective of the investigation presented in this thesis was to explore the semantic space in word vectors using neural word embedding. The non-existence of a clean, sentence-aligned parallel corpus for the English-Tamil language pair calls for a sufficiently large bilingual corpus for the implementation of various Natural Language Processing (NLP) applications such as machine translation, cross-lingual information retrieval and semantic comparison. Although word embedding has been in vogue in recent years, an adequate method for the evaluation of word embedding still demands attention. Besides an in-depth discussion of the intrinsic and extrinsic evaluation of bilingual word embedding models, a data set was developed for the evaluation of English-Tamil bilingual word embedding algorithms. The data set was evaluated on a bilingual model; analysis of the experimental results yielded insightful inferences into the semantics captured by word vectors and human cognition. However, bilingual embeddings typically capture common semantics and reject variations. Hence, transfer function-based generated embedding (TFGE), a deeply learned transfer function, was developed, in which vectors from the embedding space of one language are projected onto that of the other language. Three well-regarded off-the-shelf embedding algorithms, Word2Vec, GloVe, and FastText, were used to train the TFGE model from English, a resource-rich source language, to Tamil, a resource-deficient target language, in a data-efficient way. The efficacy of the proposed TFGE model was confirmed by a better synthesis of new vectors for unknown source language words. Pre-trained Word2Vec Hindi and Chinese embeddings were marshalled to appraise the deployable capability of the TFGE model across other target languages.
The versatility of the developed model was substantively demonstrated in selected NLP use-cases: Text Summarization, Part of Speech (POS) Tagging, and Bilingual Dictionary Induction (BDI). In a nutshell, the following developments are the major ...
Pagination: xxi, 162
URI: http://hdl.handle.net/10603/360622
Appears in Departments: Center for Computational Engineering and Networking (CEN)

Files in This Item:
File                        Description    Size       Format
01_title.pdf                Attached File  145.17 kB  Adobe PDF
02_certificate.pdf                         194.19 kB  Adobe PDF
03_ preliminary pages.pdf                  421.73 kB  Adobe PDF
04_chapter 1.pdf                           157.8 kB   Adobe PDF
05_chapter 2.pdf                           440.2 kB   Adobe PDF
06_chapter 3.pdf                           381.74 kB  Adobe PDF
07_chapter 4.pdf                           423.78 kB  Adobe PDF
08_chapter 5.pdf                           1.13 MB    Adobe PDF
09_chapter 6.pdf                           114.09 kB  Adobe PDF
10_bibliography.pdf                        156.11 kB  Adobe PDF
11_publications.pdf                        74.95 kB   Adobe PDF
80_recommendation.pdf                      258.83 kB  Adobe PDF


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
