Inproceedings,

Quality and complexity measures for data linkage and deduplication

P. Christen and K. Goiser.
In Quality Measures in Data Mining, Studies in Computational Intelligence, pages 127--151. Springer, 2007.

Abstract

Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity.

Key words: data or record linkage, data integration and matching, deduplication, data mining pre-processing, quality and complexity measures

BibTeX key: Christen07qualityand
entry type: inproceedings
booktitle: Quality Measures in Data Mining
year: 2007
pages: 127--151
publisher: Springer
series: Studies in Computational Intelligence
url: http://dblp.uni-trier.de/db/series/sci/sci43.html#ChristenG07


Cite this publication

@inproceedings{Christen07qualityand,
  abstract = {Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity. Key words: data or record linkage, data integration and matching, deduplication, data mining pre-processing, quality and complexity measures},
  added-at = {2018-07-11T17:10:05.000+0200},
  author = {Christen, Peter and Goiser, Karl},
  biburl = {https://puma.ub.uni-stuttgart.de/bibtex/2b6120f43d3be34047fafe5e50abf56dd/diglezakis},
  booktitle = {Quality Measures in Data Mining},
  description = {CiteSeerX — Quality and complexity measures for data linkage and deduplication},
  interhash = {8d702a102574e5e29f6b1315cc0c0d4c},
  intrahash = {b6120f43d3be34047fafe5e50abf56dd},
  keywords = {forschungsdaten quality},
  pages = {127--151},
  publisher = {Springer},
  series = {Studies in Computational Intelligence},
  timestamp = {2018-07-11T15:10:05.000+0200},
  title = {Quality and complexity measures for data linkage and deduplication},
  url = {http://dblp.uni-trier.de/db/series/sci/sci43.html#ChristenG07},
  year = 2007
}
