Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/325097
Title: Generative models for learning document representations along with their uncertainties
Researcher: Santosh Kesiraju
Guide(s): Suryakanth V Gangashetty and Lukas Burget
Keywords: Computer Science
Computer Science Artificial Intelligence
Engineering and Technology
University: International Institute of Information Technology, Hyderabad
Completed Date: 2021
Abstract: The majority of speech and NLP applications rely on word and document embeddings. Document embeddings encode semantic information, which makes them suitable for tasks such as topic identification (topic ID), topic discovery, language model adaptation, and query-based document retrieval. These embeddings are usually learned from widely available unlabelled data; hence generative models are well suited to the task.

Most existing models do not capture the uncertainty in the estimated embeddings, so any error in an embedding propagates to downstream tasks. This uncertainty usually stems from short, ambiguous, or noisy sentences.

This thesis presents models that represent document embeddings as Gaussian distributions, thereby encoding the uncertainty in their covariances. These learned uncertainties are then exploited by the proposed generative Gaussian linear classifier (GLC) for topic ID.

The subspace multinomial model (SMM) is proposed for learning document embeddings. Experiments on the 20Newsgroups (20NG) corpus show that embeddings from the SMM outperform those from popular topic models such as latent Dirichlet allocation (LDA) and sparse topical coding in topic ID and document clustering tasks. Using a variational Bayes (VB) framework on the SMM, the model infers the uncertainty in document embeddings. The common intractability of VB in mixed-logit models is addressed by Monte Carlo sampling via the re-parametrization trick. The resulting Bayesian SMM achieves state-of-the-art perplexity results on the 20NG text and Fisher speech corpora. The proposed generative GLC exploits the uncertainty in the document embeddings and achieves state-of-the-art classification results on these corpora as compared to other unsupervised models.
Furthermore, a multilingual variant of the Bayesian SMM is proposed, which achieves superior zero-shot cross-lingual topic ID results on the MLDoc corpus as compared to multilingual word embeddings and seq2seq BiLSTM systems.
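The re-parametrization trick mentioned in the abstract can be sketched as follows. This is an illustrative fragment, not the thesis code: the 50-dimensional diagonal-covariance posterior and the sample count are assumptions chosen for the example; the point is that a sample from q(w) = N(mu, diag(sigma^2)) is written as a deterministic function of (mu, sigma) plus standard normal noise, so gradients of a Monte Carlo ELBO estimate can flow back to the variational parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_embedding(mu, log_sigma2, n_samples=8):
    """Draw samples from the Gaussian posterior q(w) = N(mu, diag(sigma^2))
    via the re-parametrization trick: w = mu + sigma * eps, eps ~ N(0, I).
    Because the noise eps is independent of (mu, log_sigma2), the samples
    are differentiable w.r.t. the variational parameters."""
    sigma = np.exp(0.5 * log_sigma2)          # sigma from log-variance
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + sigma * eps                   # shape: (n_samples, dim)

# Hypothetical 50-dimensional document embedding posterior
mu = np.zeros(50)
log_sigma2 = np.full(50, -2.0)                # sigma^2 = exp(-2)
w = sample_embedding(mu, log_sigma2)
print(w.shape)                                # (8, 50)
```

In an autodiff framework the same expression would be used inside the ELBO so that the expected log-likelihood term, intractable in closed form for mixed-logit models, is replaced by an average over these samples.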
Pagination: 
URI: http://hdl.handle.net/10603/325097
Appears in Departments: Computer Science and Engineering

Files in This Item:
File                     Size       Format
80_recommendation.pdf    30.96 kB   Adobe PDF
appendix_a.pdf           47.01 kB   Adobe PDF
certificate.pdf          173.32 kB  Adobe PDF
chapter_1.pdf            89.74 kB   Adobe PDF
chapter_2.pdf            39.33 kB   Adobe PDF
chapter_3.pdf            888.86 kB  Adobe PDF
chapter_4.pdf            1.49 MB    Adobe PDF
chapter_5.pdf            2.72 MB    Adobe PDF
chapter_6.pdf            752.19 kB  Adobe PDF
chapter_7.pdf            468.69 kB  Adobe PDF
chapter_8.pdf            69.85 kB   Adobe PDF
title_page.pdf           177.91 kB  Adobe PDF


Items in Shodhganga are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence (CC BY-NC-SA 4.0).
