Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/544188
Title: Recall Oriented Approaches for improved Indian Language Information Access
Researcher: Pingali V.V. Prasad Rao
Guide(s): Vasudeva Varma
Keywords: Engineering; Engineering and Technology; Engineering Multidisciplinary
University: International Institute of Information Technology, Hyderabad
Completed Date: 2009
Abstract: This thesis is an investigation into Indian language information access. The investigation shows that Indian language information access technologies face a severe recall problem when using conventional IR techniques (those used for English-like languages). During this investigation we crawled the web extensively for Indian languages, characterized the Indian language web, and in the process came up with some solutions for the low recall problem. We focused our investigation on the loss of recall in monolingual and cross-lingual IR and text summarization. The following are some of the major contributions of this thesis.
- We built a language-focused web crawler that can optimally find and fetch pages of a given language using a syllable-based IR index model.
- We conducted a language-focused crawl for a period of 6 months to analyze the characteristics and distribution of Indian language content on the web.
- We crawled the web for a continuous period of 2 years to collect a large corpus of Indian language web pages, which was used to build IR and summarization evaluation datasets.
- We showed that Indian language information access technologies that rely on state-of-the-art technologies used for English-like languages face low recall. We observed the recall loss to be relatively higher when the target language corpus is English.
- We came up with a semi-automatic method of converting text in proprietary encodings into UTF-8; such text was otherwise causing low recall in document retrieval.
- We came up with a unified information access framework which can address the problems of Monolingual and Cross-lingual Information Retrieval and Text Summarization.
- We defined a single index model that can be used for many scoring functions used in our Monolingual and Cross-lingual IR and Summarization. This index model can help in identifying the languages of a document, handling morphological variations and spelling variations, co-occurrence-based query term expansion, and weighted term dictionary translation. The score of each document or sentence (i
Pagination: 205
URI: http://hdl.handle.net/10603/544188
Appears in Departments: Computational Linguistics

Files in This Item:
File                   Description    Size       Format
80_recommendation.pdf  Attached File  85.98 kB   Adobe PDF
abstract.pdf                          64.57 kB   Adobe PDF
annexures.pdf                         97.8 kB    Adobe PDF
chapter 1.pdf                         458.2 kB   Adobe PDF
chapter 2.pdf                         226.38 kB  Adobe PDF
chapter 3.pdf                         421.66 kB  Adobe PDF
chapter 4.pdf                         303.5 kB   Adobe PDF
chapter 5.pdf                         216.84 kB  Adobe PDF
chapter 6.pdf                         257.04 kB  Adobe PDF
chapter 7.pdf                         76.62 kB   Adobe PDF
content.pdf                           60.16 kB   Adobe PDF
preliminary pages.pdf                 70.25 kB   Adobe PDF
title page.pdf                        57.04 kB   Adobe PDF


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
