Recall Oriented Approaches for improved Indian Language Information Access

Pingali V.V. Prasad Rao

Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/544188

Title:	Recall Oriented Approaches for improved Indian Language Information Access
Researcher:	Pingali V.V. Prasad Rao
Guide(s):	Vasudeva Varma
Keywords:	Engineering; Engineering and Technology; Engineering Multidisciplinary
University:	International Institute of Information Technology, Hyderabad
Completed Date:	2009
Abstract:	This thesis investigates Indian language information access. The investigation shows that Indian language information access technologies face a severe recall problem when using conventional IR techniques (those used for English-like languages). During this investigation we crawled the web extensively for Indian languages, characterized the Indian language web, and in the process came up with some solutions for the low recall problem. We focused our investigation on the loss of recall in monolingual and cross-lingual IR and text summarization. The following are some of the major contributions of this thesis.
- We built a language-focused web crawler that can optimally find and fetch pages of a given language using a syllable-based IR index model.
- We conducted a language-focused crawl for a period of 6 months to analyze the characteristics and distribution of Indian language content on the web.
- We crawled the web for a continuous period of 2 years to collect a large corpus of Indian language web pages, which was used to build IR and summarization evaluation datasets.
- We showed that Indian language information access technologies that use state-of-the-art technologies developed for English-like languages face low recall. We observed the recall loss to be relatively higher when the target language corpus is English.
- We came up with a semi-automatic method of converting proprietary encoded text into UTF-8, which otherwise was causing low recall in document retrieval.
- We came up with a unified information access framework which can address the problems of monolingual and cross-lingual information retrieval and text summarization.
- We defined a single index model that can be used for many scoring functions used in our monolingual and cross-lingual IR and summarization. This index model can help in identifying the languages of a document, handling morphological variations, spelling variations, co-occurrence based query term expansion and weighted term dictionary translation. The score of each document or sentence (i
Pagination:	205
URI:	http://hdl.handle.net/10603/544188
Appears in Departments:	Computational Linguistics

Files in This Item:

File	Description	Size	Format
80_recommendation.pdf	Attached File	85.98 kB	Adobe PDF
abstract.pdf		64.57 kB	Adobe PDF
annexures.pdf		97.8 kB	Adobe PDF
chapter 1.pdf		458.2 kB	Adobe PDF
chapter 2.pdf		226.38 kB	Adobe PDF
chapter 3.pdf		421.66 kB	Adobe PDF
chapter 4.pdf		303.5 kB	Adobe PDF
chapter 5.pdf		216.84 kB	Adobe PDF
chapter 6.pdf		257.04 kB	Adobe PDF
chapter 7.pdf		76.62 kB	Adobe PDF
content.pdf		60.16 kB	Adobe PDF
preliminary pages.pdf		70.25 kB	Adobe PDF
title page.pdf		57.04 kB	Adobe PDF

Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Shodhganga : a reservoir of Indian theses @ INFLIBNET