Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/325097
Title: | Generative models for learning document representations along with their uncertainties |
Researcher: | Santosh Kesiraju |
Guide(s): | Suryakanth V Gangashetty and Lukas Burget |
Keywords: | Computer Science; Computer Science Artificial Intelligence; Engineering and Technology |
University: | International Institute of Information Technology, Hyderabad |
Completed Date: | 2021 |
Abstract: | The majority of speech and NLP applications rely on word and document embeddings. Document embeddings encode semantic information, which makes them suitable for tasks such as topic identification (topic ID), topic discovery, language model (LM) adaptation, and query-based document retrieval. These embeddings are usually learned from widely available unlabelled data; hence generative models are suitable.

Most existing models fail to capture the uncertainty in the estimated embeddings. Thus, any error in the embeddings affects performance in downstream tasks. The uncertainty in the embeddings is usually due to short, ambiguous, or noisy sentences.

This thesis presents models for learning document embeddings in the form of Gaussian distributions, thereby encoding the uncertainty in their covariances. These learned uncertainties are then exploited by the proposed generative Gaussian linear classifier (GLC) for topic ID.

The subspace multinomial model (SMM) is proposed for learning document embeddings. Experiments on the 20Newsgroups (20NG) corpus show that embeddings from SMM are superior to those from popular topic models such as LDA and sparse topical coding in topic ID and document clustering tasks. Using the variational Bayes (VB) framework on SMM, the model is able to infer the uncertainty in document embeddings. Additionally, the common problem of intractability in VB for mixed-logit models is addressed using Monte Carlo sampling via the re-parametrization trick. The resulting Bayesian SMM achieves state-of-the-art perplexity results on the 20NG text and Fisher speech corpora.

The proposed generative GLC exploits the uncertainty in the document embeddings and achieves state-of-the-art classification results on the aforementioned corpora as compared to other unsupervised models.

Furthermore, a multilingual variant of the Bayesian SMM is proposed that achieves superior zero-shot cross-lingual topic ID results on the MLDoc corpus as compared to multilingual word embeddings and seq2seq BiLSTM systems. |
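The abstract mentions addressing VB intractability with Monte Carlo sampling via the re-parametrization trick. As a minimal illustrative sketch (not code from the thesis, and with toy values chosen here for illustration), the trick expresses a Gaussian sample as a deterministic function of the distribution's parameters plus independent noise, so that gradients of a sampled objective can flow back to those parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_sigma, rng):
    """Draw z ~ N(mu, diag(sigma^2)) as z = mu + sigma * eps, eps ~ N(0, I).

    Because z is a deterministic, differentiable function of (mu, log_sigma),
    a Monte Carlo estimate of an expectation under q(z) can be differentiated
    with respect to the variational parameters.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

# Toy variational posterior for one document embedding (hypothetical values):
mu = np.array([0.5, -1.0, 2.0])
log_sigma = np.array([-1.0, -0.5, 0.0])

# Averaging many reparameterized samples recovers the posterior mean.
samples = np.stack([reparameterize(mu, log_sigma, rng) for _ in range(20000)])
print(samples.mean(axis=0))  # close to [0.5, -1.0, 2.0]
```

In the Bayesian SMM setting described above, such samples of the document embedding stand in for the intractable expectation over the softmax (mixed-logit) likelihood inside the VB objective.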
Pagination: | |
URI: | http://hdl.handle.net/10603/325097 |
Appears in Departments: | Computer Science and Engineering |
Files in This Item:
File | Size | Format
---|---|---
80_recommendation.pdf | 30.96 kB | Adobe PDF
appendix_a.pdf | 47.01 kB | Adobe PDF
certificate.pdf | 173.32 kB | Adobe PDF
chapter_1.pdf | 89.74 kB | Adobe PDF
chapter_2.pdf | 39.33 kB | Adobe PDF
chapter_3.pdf | 888.86 kB | Adobe PDF
chapter_4.pdf | 1.49 MB | Adobe PDF
chapter_5.pdf | 2.72 MB | Adobe PDF
chapter_6.pdf | 752.19 kB | Adobe PDF
chapter_7.pdf | 468.69 kB | Adobe PDF
chapter_8.pdf | 69.85 kB | Adobe PDF
title_page.pdf | 177.91 kB | Adobe PDF
Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).