Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/22920
Title: Duplicate Record Detection Using Soft Computing Approaches
Researcher: Deepa K
Guide(s): Rangarajan R
Keywords: ANFIS; Duplicate Record detection; K-means algorithm; Soft Computing
Upload Date: 19-Aug-2014
University: Anna University
Completed Date: n.d.
Abstract: The abundance of data produced and the need to merge data from different sources have made the efficient detection of duplicate records in databases a significant challenge. Since the data sources are independent, they may adopt their own conventions, and integrating data from different sources often leads to erroneous duplication of data. To ensure high-quality data, the database must validate and cleanse the incoming data from external sources. In this regard, data cleaning has become essential to ensure the quality of the data stored in real-world databases. The process of identifying the record pairs that represent the same entity is commonly known as duplicate record detection, and it is one of the important tasks of data cleaning.

The proposed work suggests several new approaches to improve the accuracy of the duplicate record detection process along with other well-known measures. The first part of the work adopts Adaptive Neuro-Fuzzy Inference Systems (ANFIS) for duplicate record detection by means of similarity functions, not only to reduce the time taken to decide whether records are duplicates but also to reduce the time needed to hard-code the matching rules that ANFIS uses. To attain accurate similarity computations, it is necessary to adapt a similarity measure for each field of the database with respect to its particular data domain. Consequently, the proposed approach combines the similarity values obtained from different similarity measures to compute the distance between any two records.

In the traditional approach, each record is selected and compared with the rest of the tuples one by one, making it a time-consuming process. To reduce the computational time, the cleaned records are clustered by the K-means algorithm, which groups the records most likely to be duplicates into one cluster. All possible pairs are then selected from each cluster, so that only the records within each cluster are compared. ANFIS uses the similarities for comparing a pair of records and detecting the duplicates.
Pagination: xix, 175p
URI: http://hdl.handle.net/10603/22920
Appears in Departments: Faculty of Information and Communication Engineering
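The abstract above describes two ideas that lend themselves to a short illustration: combining per-field similarity measures into a single record-level score, and clustering records with K-means so that only records within the same cluster are compared. The Python sketch below is illustrative only and is not the thesis implementation: the sample records, the numeric encoding used for clustering, the equal field weights, and the 0.8 decision threshold are all assumptions, and in the thesis the combined similarities are fed to an ANFIS model rather than a fixed threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans

# Toy records; two pairs are near-duplicates (hypothetical example data).
records = [
    {"name": "John A. Smith", "city": "Chennai", "phone": "044-2357-1100"},
    {"name": "Jon Smith",     "city": "Chennai", "phone": "04423571100"},
    {"name": "Priya Raman",   "city": "Madurai", "phone": "0452-998-2211"},
    {"name": "Priya R.",      "city": "Madurai", "phone": "04529982211"},
]

def field_similarity(a: str, b: str) -> float:
    """Generic string similarity; a real system would adapt the measure to
    each field's data domain (names, addresses, phone numbers, ...)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1: dict, r2: dict) -> float:
    """Combine per-field similarities into one record-level score.
    Equal weights here; the thesis learns the combination with ANFIS."""
    sims = [field_similarity(r1[f], r2[f]) for f in r1]
    return sum(sims) / len(sims)

def encode(r: dict) -> list:
    """Crude numeric features so K-means can group likely duplicates together
    (an assumed encoding; the abstract does not specify the representation)."""
    return [float(ord(r["name"][0].upper())),
            sum(map(ord, r["city"].lower())) / 10.0]

X = np.array([encode(r) for r in records])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Compare pairs only within each cluster instead of all O(n^2) record pairs.
for cluster in set(labels):
    members = [i for i, c in enumerate(labels) if c == cluster]
    for i, j in combinations(members, 2):
        score = record_similarity(records[i], records[j])
        if score > 0.8:  # illustrative threshold standing in for ANFIS
            print(f"Likely duplicates: record {i} and record {j} "
                  f"(score = {score:.2f})")
```

Clustering first turns the quadratic all-pairs comparison into a much smaller set of within-cluster comparisons, which is the computational gain the abstract attributes to the K-means step.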
Files in This Item:
File | Description | Size | Format
---|---|---|---
01_title.pdf | Attached File | 251.01 kB | Adobe PDF
02_certificate.pdf | | 3.76 MB | Adobe PDF
03_abstract.pdf | | 64.94 kB | Adobe PDF
04_acknowledgement.pdf | | 57.24 kB | Adobe PDF
05_contents.pdf | | 134.55 kB | Adobe PDF
06_chapter 1.pdf | | 301.81 kB | Adobe PDF
07_chapter 2.pdf | | 2.7 MB | Adobe PDF
08_chapter 3.pdf | | 1.19 MB | Adobe PDF
09_chapter 4.pdf | | 1.06 MB | Adobe PDF
10_chapter 5.pdf | | 3.35 MB | Adobe PDF
11_chapter 6.pdf | | 1.96 MB | Adobe PDF
12_chapter 7.pdf | | 65.78 kB | Adobe PDF
13_appendix.pdf | | 83.91 kB | Adobe PDF
14_references.pdf | | 131.24 kB | Adobe PDF
15_publications.pdf | | 59.77 kB | Adobe PDF
16_vitae.pdf | | 53.37 kB | Adobe PDF
Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).