Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/544188
Title: | Recall Oriented Approaches for improved Indian Language Information Access |
Researcher: | Pingali V.V. Prasad Rao |
Guide(s): | Vasudeva Varma |
Keywords: | Engineering; Engineering and Technology; Engineering Multidisciplinary |
University: | International Institute of Information Technology, Hyderabad |
Completed Date: | 2009 |
Abstract: | This thesis is an investigation into Indian language information access. The investigation shows that Indian language information access technologies face a severe recall problem when using conventional IR techniques (those used for English-like languages). During this investigation we crawled the web extensively for Indian languages, characterized the Indian language web, and in the process came up with some solutions for the low recall problem. We focused our investigation on the loss of recall in monolingual and cross-lingual IR and text summarization. The following are some of the major contributions of this thesis. We built a language-focused web crawler that can optimally find and fetch pages of a given language using a syllable-based IR index model. We conducted a language-focused crawl for a period of 6 months to analyze the characteristics and distribution of Indian language content on the web. We crawled the web for a continuous period of 2 years to collect a large corpus of Indian language web pages, which was used to build IR and summarization evaluation datasets. We showed that Indian language information access technologies that use state-of-the-art techniques developed for English-like languages face low recall. We observed the recall loss to be relatively higher when the target language corpus is English. We came up with a semi-automatic method of converting text in proprietary encodings into UTF-8; left unconverted, such text was causing low recall in document retrieval. We came up with a unified information access framework which can address the problems of monolingual and cross-lingual information retrieval and text summarization. We defined a single index model that can be used for many of the scoring functions used in our monolingual and cross-lingual IR and summarization. This index model can help in identifying the languages of a document and in handling morphological variations, spelling variations, co-occurrence-based query term expansion, and weighted term dictionary translation. The score of each document or sentence (i |
Pagination: | 205 |
URI: | http://hdl.handle.net/10603/544188 |
Appears in Departments: | Computational Linguistics |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
80_recommendation.pdf | Attached File | 85.98 kB | Adobe PDF | View/Open |
abstract.pdf | | 64.57 kB | Adobe PDF | View/Open |
annexures.pdf | | 97.8 kB | Adobe PDF | View/Open |
chapter 1.pdf | | 458.2 kB | Adobe PDF | View/Open |
chapter 2.pdf | | 226.38 kB | Adobe PDF | View/Open |
chapter 3.pdf | | 421.66 kB | Adobe PDF | View/Open |
chapter 4.pdf | | 303.5 kB | Adobe PDF | View/Open |
chapter 5.pdf | | 216.84 kB | Adobe PDF | View/Open |
chapter 6.pdf | | 257.04 kB | Adobe PDF | View/Open |
chapter 7.pdf | | 76.62 kB | Adobe PDF | View/Open |
content.pdf | | 60.16 kB | Adobe PDF | View/Open |
preliminary pages.pdf | | 70.25 kB | Adobe PDF | View/Open |
title page.pdf | | 57.04 kB | Adobe PDF | View/Open |
Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).