Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/496791
Title: | Text Classification for Telugu: Datasets, Embeddings and Models for Downstream NLP Tasks |
Researcher: | Mounika Marreddy |
Guide(s): | Radhika Mamidi |
Keywords: | Computer Science; Computer Science Theory and Methods; Engineering and Technology |
University: | International Institute of Information Technology, Hyderabad |
Completed Date: | 2023 |
Abstract: | Language understanding has become crucial for text classification tasks in Natural Language Processing (NLP) applications. Over the past decade, machine learning and deep learning algorithms have evolved alongside efficient feature representations to give better results. NLP applications are becoming more powerful and increasingly domain- and language-specific. For resource-rich languages like English, NLP applications give the desired results because of the availability of large corpora, different kinds of annotated datasets, efficient feature representations, and tools. Due to the lack of large corpora and annotated datasets, many resource-poor Indian languages struggle to reap the benefits of deep feature representations. Moreover, adapting existing language models trained on large English corpora to Indian languages is often limited by data availability, rich morphological variation, and syntactic and semantic differences. Most of the work being done in Indian languages is from a machine translation perspective. One solution is to use translation to re-create datasets in low-resource languages from English. But for Indian languages like Telugu, the meaning may change and some crucial information may be lost in translation, because of structural differences, morphological complexity, and semantic differences. In this thesis, our main objective is to mitigate the low-resource problem for Telugu. To accelerate NLP research in Telugu, we present several contributions: (1) A large Telugu raw corpus of 80,15,588 sentences (16,37,408 sentences from Telugu Wikipedia and 63,78,180 sentences crawled from different Telugu websites). (2) Annotated datasets in Telugu for sentiment analysis, emotion identification, hate speech detection, sarcasm identification, and clickbait detection. (3) For the Telugu corpus, we are the first to generate pre-trained distributed word and sentence embeddings such as Word2Vec-Te, GloVe-Te, FastText-Te, MetaE |
Pagination: | |
URI: | http://hdl.handle.net/10603/496791 |
Appears in Departments: | Computer Science and Engineering |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
20172152_abstract.pdf | Attached File | 52.3 kB | Adobe PDF | View/Open |
20172152_chapter1.pdf | | 766.42 kB | Adobe PDF | View/Open |
20172152_chapter2.pdf | | 121.58 kB | Adobe PDF | View/Open |
20172152_chapter3.pdf | | 1.14 MB | Adobe PDF | View/Open |
20172152_chapter4.pdf | | 283.27 kB | Adobe PDF | View/Open |
20172152_chapter5.pdf | | 889.65 kB | Adobe PDF | View/Open |
20172152_chapter6.pdf | | 668.92 kB | Adobe PDF | View/Open |
20172152_chapter7.pdf | | 1.75 MB | Adobe PDF | View/Open |
80_recommendation.pdf | | 61.81 kB | Adobe PDF | View/Open |
chapter10_anneures.pdf | | 71.75 kB | Adobe PDF | View/Open |
contents.pdf | | 57.93 kB | Adobe PDF | View/Open |
preliminary.pdf | | 56.3 kB | Adobe PDF | View/Open |
title.pdf | | 33.08 kB | Adobe PDF | View/Open |
Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).