Text Classification for Telugu Datasets Embeddings and Models for Downstream NLP Tasks

Mounika Marreddy

Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/496791

Title:	Text Classification for Telugu Datasets Embeddings and Models for Downstream NLP Tasks
Researcher:	Mounika Marreddy
Guide(s):	Radhika Mamidi
Keywords:	Computer Science; Computer Science Theory and Methods; Engineering and Technology
University:	International Institute of Information Technology, Hyderabad
Completed Date:	2023
Abstract:	Language understanding has become crucial in different text classification tasks in Natural Language Processing (NLP) applications to get the desired output. Over the past decade, machine learning and deep learning algorithms have been evolving with efficient feature representations to give better results. NLP applications are becoming more powerful and increasingly domain- and language-specific. For resource-rich languages like English, NLP applications give the desired results due to the availability of large corpora, different kinds of annotated datasets, efficient feature representations, and tools. Due to the lack of large corpora and annotated datasets, many resource-poor Indian languages struggle to reap the benefits of deep feature representations. Moreover, adopting existing language models trained on large English corpora for Indian languages is often limited by data availability, rich morphological variation, and syntactic and semantic differences. Most of the work being done in Indian languages is from a machine translation perspective. One solution is to use translation to re-create datasets in low-resource languages from English. But in the case of Indian languages like Telugu, the meaning may change and some crucial information may be lost in translation, because of structural differences, morphological complexities, and semantic differences. In this thesis, our main objective is to mitigate the low-resource problem for Telugu. Overall, to accelerate NLP research in Telugu, we present several contributions: (1) A large Telugu raw corpus of 80,15,588 sentences (16,37,408 sentences from Telugu Wikipedia and 63,78,180 sentences crawled from different Telugu websites). (2) Annotated datasets in Telugu for sentiment analysis, emotion identification, hate speech detection, sarcasm identification, and clickbait detection. (3) For the Telugu corpus, we are the first to generate pre-trained distributed word and sentence embeddings such as Word2Vec-Te, GloVe-Te, FastText-Te, MetaE
Pagination:
URI:	http://hdl.handle.net/10603/496791
Appears in Departments:	Computer Science and Engineering

Files in This Item:

File	Description	Size	Format
20172152_abstract.pdf	Attached File	52.3 kB	Adobe PDF
20172152_chapter1.pdf		766.42 kB	Adobe PDF
20172152_chapter2.pdf		121.58 kB	Adobe PDF
20172152_chapter3.pdf		1.14 MB	Adobe PDF
20172152_chapter4.pdf		283.27 kB	Adobe PDF
20172152_chapter5.pdf		889.65 kB	Adobe PDF
20172152_chapter6.pdf		668.92 kB	Adobe PDF
20172152_chapter7.pdf		1.75 MB	Adobe PDF
80_recommendation.pdf		61.81 kB	Adobe PDF
chapter10_anneures.pdf		71.75 kB	Adobe PDF
contents.pdf		57.93 kB	Adobe PDF
preliminary.pdf		56.3 kB	Adobe PDF
title.pdf		33.08 kB	Adobe PDF

Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Shodhganga : a reservoir of Indian theses @ INFLIBNET