Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/325097
Title: Generative models for learning document representations along with their uncertainties
Researcher: Santosh Kesiraju
Guide(s): Suryakanth V Gangashetty and Lukas Burget
Keywords: Computer Science
Computer Science Artificial Intelligence
Engineering and Technology
University: International Institute of Information Technology, Hyderabad
Completed Date: 2021
Abstract: The majority of speech and NLP applications rely on word and document embeddings. Document embeddings encode semantic information, which makes them suitable for tasks such as topic identification (topic ID), topic discovery, language model adaptation, and query-based document retrieval. These embeddings are usually learned from widely available unlabelled data; hence generative models are well suited to the task.

Most existing models do not capture the uncertainty in the estimated embeddings, so any error in an embedding propagates to downstream tasks. This uncertainty usually stems from short, ambiguous, or noisy sentences.

This thesis presents models that represent document embeddings as Gaussian distributions, thereby encoding the uncertainty in their covariances. These learned uncertainties are then exploited by the proposed generative Gaussian linear classifier (GLC) for topic ID.

The subspace multinomial model (SMM) is proposed for learning document embeddings. Experiments on the 20Newsgroups (20NG) corpus show that embeddings from the SMM outperform those from popular topic models such as latent Dirichlet allocation (LDA) and sparse topical coding in topic ID and document clustering tasks. Using a variational Bayes (VB) framework on the SMM, the model infers the uncertainty in document embeddings. The common intractability of VB in mixed-logit models is addressed by Monte Carlo sampling via the re-parametrization trick. The resulting Bayesian SMM achieves state-of-the-art perplexity results on the 20NG text and Fisher speech corpora. The proposed generative GLC exploits the uncertainty in the document embeddings and achieves state-of-the-art classification results on these corpora as compared to other unsupervised models.
Furthermore, a multilingual variant of the Bayesian SMM is proposed, which achieves superior zero-shot cross-lingual topic ID results on the MLDoc corpus as compared to multilingual word embeddings and seq2seq BiLSTM systems.
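The re-parametrization trick mentioned in the abstract can be sketched as follows. This is an illustrative fragment, not the thesis code: the 50-dimensional diagonal-covariance posterior and the sample count are assumptions chosen for the example; the point is that a sample from q(w) = N(mu, diag(sigma^2)) is written as a deterministic function of (mu, sigma) plus standard normal noise, so gradients of a Monte Carlo ELBO estimate can flow back to the variational parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_embedding(mu, log_sigma2, n_samples=8):
    """Draw samples from the Gaussian posterior q(w) = N(mu, diag(sigma^2))
    via the re-parametrization trick: w = mu + sigma * eps, eps ~ N(0, I).
    Because the noise eps is independent of (mu, log_sigma2), the samples
    are differentiable w.r.t. the variational parameters."""
    sigma = np.exp(0.5 * log_sigma2)          # sigma from log-variance
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + sigma * eps                   # shape: (n_samples, dim)

# Hypothetical 50-dimensional document embedding posterior
mu = np.zeros(50)
log_sigma2 = np.full(50, -2.0)                # sigma^2 = exp(-2)
w = sample_embedding(mu, log_sigma2)
print(w.shape)                                # (8, 50)
```

In an autodiff framework the same expression would be used inside the ELBO so that the expected log-likelihood term, intractable in closed form for mixed-logit models, is replaced by an average over these samples.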
Pagination: 
URI: http://hdl.handle.net/10603/325097
Appears in Departments: Computer Science and Engineering

Files in This Item:
File                     Size       Format
80_recommendation.pdf    30.96 kB   Adobe PDF
appendix_a.pdf           47.01 kB   Adobe PDF
certificate.pdf          173.32 kB  Adobe PDF
chapter_1.pdf            89.74 kB   Adobe PDF
chapter_2.pdf            39.33 kB   Adobe PDF
chapter_3.pdf            888.86 kB  Adobe PDF
chapter_4.pdf            1.49 MB    Adobe PDF
chapter_5.pdf            2.72 MB    Adobe PDF
chapter_6.pdf            752.19 kB  Adobe PDF
chapter_7.pdf            468.69 kB  Adobe PDF
chapter_8.pdf            69.85 kB   Adobe PDF
title_page.pdf           177.91 kB  Adobe PDF


Items in Shodhganga are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence (CC BY-NC-SA 4.0).
