Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/496791
Title: Text Classification for Telugu: Datasets, Embeddings and Models for Downstream NLP Tasks
Researcher: Mounika Marreddy
Guide(s): Radhika Mamidi
Keywords: Computer Science; Computer Science Theory and Methods; Engineering and Technology
University: International Institute of Information Technology, Hyderabad
Completed Date: 2023
Abstract: Language understanding has become crucial to text classification tasks in Natural Language Processing (NLP) applications to produce the desired output. Over the past decade, machine learning and deep learning algorithms have evolved alongside efficient feature representations to give better results. NLP applications are becoming more powerful and increasingly domain- and language-specific. For resource-rich languages like English, NLP applications give the desired results because of the availability of large corpora, different kinds of annotated datasets, efficient feature representations, and tools.
Due to the lack of large corpora and annotated datasets, many resource-poor Indian languages struggle to reap the benefits of deep feature representations. Moreover, adopting existing language models trained on large English corpora for Indian languages is often limited by data availability, rich morphological variation, and syntactic and semantic differences. Most of the work being done in Indian languages is from a machine translation perspective. One solution is to use translation to re-create datasets in low-resource languages from English. But in the case of Indian languages like Telugu, the meaning may change and some crucial information may be lost in translation, because of structural differences, morphological complexity, and semantic differences.
In this thesis, our main objective is to mitigate the low-resource problem for Telugu. To accelerate NLP research in Telugu, we present several contributions:
(1) A large Telugu raw corpus of 80,15,588 sentences (16,37,408 sentences from Telugu Wikipedia and 63,78,180 sentences crawled from different Telugu websites).
(2) Annotated datasets in Telugu for sentiment analysis, emotion identification, hate speech detection, sarcasm identification, and clickbait detection.
(3) For the Telugu corpus, we are the first to generate pre-trained distributed word and sentence embeddings such as Word2Vec-Te, GloVe-Te, FastText-Te, MetaE
Pagination: 
URI: http://hdl.handle.net/10603/496791
Appears in Departments: Computer Science and Engineering

Files in This Item:
File                     Description     Size        Format
20172152_abstract.pdf    Attached File   52.3 kB     Adobe PDF
20172152_chapter1.pdf                    766.42 kB   Adobe PDF
20172152_chapter2.pdf                    121.58 kB   Adobe PDF
20172152_chapter3.pdf                    1.14 MB     Adobe PDF
20172152_chapter4.pdf                    283.27 kB   Adobe PDF
20172152_chapter5.pdf                    889.65 kB   Adobe PDF
20172152_chapter6.pdf                    668.92 kB   Adobe PDF
20172152_chapter7.pdf                    1.75 MB     Adobe PDF
80_recommendation.pdf                    61.81 kB    Adobe PDF
chapter10_anneures.pdf                   71.75 kB    Adobe PDF
contents.pdf                             57.93 kB    Adobe PDF
preliminary.pdf                          56.3 kB     Adobe PDF
title.pdf                                33.08 kB    Adobe PDF

Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
