Extending dependencies for improving data quality


Organization: Edinburgh Research Archive, United Kingdom

Details

Title	:	Extending dependencies for improving data quality
Researcher	:	Ma, Shuai
Keywords	:	data quality, data repairing, data dependencies
Organization	:	Edinburgh Research Archive, United Kingdom
Collaborator	:	Fan, Wenfei
Year published	:	2554 BE (2011 CE)
Reference	:	http://hdl.handle.net/1842/5045
Source	:	-
Expertise	:	-
Relation	:
Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Yinghui Wu. Adding Regular Expressions to Graph Reachability and Pattern Queries. In Proceedings of the 27th International Conference on Data Engineering (ICDE), Hannover, Germany, 2011.
Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, Yinghui Wu, and Yun-peng Wu. Graph Pattern Matching: From Intractable to Polynomial Time. Proceedings of the VLDB Endowment (PVLDB), Volume 3, Singapore, 2010.
Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, and Wenyun Yu. Towards Certain Fixes with Editing Rules and Master Data. Proceedings of the VLDB Endowment (PVLDB), Volume 3, Singapore, 2010.
Wenfei Fan, Jianzhong Li, Shuai Ma, Hongzhi Wang, and Yinghui Wu. Graph Homomorphism Revisited for Graph Matching. Proceedings of the VLDB Endowment (PVLDB), Volume 3, Singapore, 2010.
Wenfei Fan, Floris Geerts, Shuai Ma, and Heiko Müller. Detecting Inconsistencies in Distributed Data. In Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE), California, USA, 2010.
Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. Reasoning about Record Matching Rules. Proceedings of the VLDB Endowment (PVLDB), Volume 2, France, 2009.
Wenguang Chen, Wenfei Fan, and Shuai Ma. Analyses and Validation of Conditional Dependencies with Built-in Predicates. In Proceedings of the 20th International Conference on Database and Expert Systems Applications (DEXA), Austria, 2009.
Wenguang Chen, Wenfei Fan, and Shuai Ma. Incorporating Cardinality Constraints and Synonym Rules into Conditional Functional Dependencies. Information Processing Letters, 109(14), 783–789, 2009.
Wenfei Fan, Shuai Ma, Yanli Hu, Jie Liu, and Yinghui Wu. Propagating Functional Dependencies with Conditions. Proceedings of the VLDB Endowment (PVLDB), Volume 1, New Zealand, 2008.
Loreto Bravo, Wenfei Fan, Floris Geerts, and Shuai Ma. Increasing Expressivity of Conditional Functional Dependencies without Extra Charge Complexity. In Proceedings of the 24th International Conference on Data Engineering (ICDE), Cancun, Mexico, 2008.
Gao Cong, Wenfei Fan, Floris Geerts, Xibei Jia, and Shuai Ma. Improving Data Quality: Consistency and Accuracy. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), Austria, 2007.
Loreto Bravo, Wenfei Fan, and Shuai Ma. Extending Dependencies with Conditions. In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), Austria, 2007.
Gao Cong, Wenfei Fan, Xibei Jia, and Shuai Ma. PRATA: A System for XML Publishing, Integration and View Maintenance (poster paper). In Proceedings of the UK e-Science All Hands Meeting, Nottingham, UK, 2006.
Scope of content	:	-
Abstract/Description	:	This doctoral thesis presents the results of my work on extending dependencies for improving data quality, both in a centralized environment with a single database and in a data exchange and integration environment with multiple databases. The first part of the thesis proposes five classes of data dependencies, referred to as CINDs, eCFDs, CFDcs, CFDps and CINDps, to capture data inconsistencies commonly found in practice in a centralized environment. For each class of these dependencies, we investigate two central problems: the satisfiability problem and the implication problem. The satisfiability problem is to determine, given a set Σ of dependencies defined on a database schema R, whether or not there exists a nonempty database D of R that satisfies Σ. The implication problem is to determine whether or not a set Σ of dependencies defined on a database schema R entails another dependency φ on R; that is, whether each database D of R that satisfies Σ must satisfy φ as well. These problems are important for the validation and optimization of data-cleaning processes. We establish complexity results for the satisfiability problem and the implication problem for all five classes of dependencies, both in the absence of finite-domain attributes and in the general setting with finite-domain attributes. Moreover, SQL-based techniques are developed to detect data inconsistencies for each class of the proposed dependencies, which can be easily implemented on top of current database management systems. The second part of the thesis studies three important topics for data cleaning in a data exchange and integration environment with multiple databases. One is the dependency propagation problem, which is to determine, given a view defined on data sources and a set of dependencies on the sources, whether another dependency is guaranteed to hold on the view.
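To illustrate the propagation question, the following minimal Python sketch checks, on one small hypothetical instance, whether a dependency that holds on a source still holds on a union view. The relation contents, the attribute names (cc, zip, street) and the helper fd_holds are assumptions made for illustration; they are not taken from the thesis. Checking a single instance only demonstrates the idea: it can exhibit a counterexample, but it is not a decision procedure for propagation.

```python
def fd_holds(rows, lhs, rhs, condition=None):
    """Check lhs -> rhs on the rows that match a constant condition.

    With condition=None this is a plain FD check; with a condition such
    as {"cc": "44"} it behaves like a simple CFD with a constant pattern.
    """
    condition = condition or {}
    seen = {}
    for row in rows:
        # Skip rows outside the scope of the condition.
        if any(row[attr] != const for attr, const in condition.items()):
            continue
        key = tuple(row[attr] for attr in lhs)
        val = tuple(row[attr] for attr in rhs)
        # Two rows agreeing on lhs but not on rhs violate the dependency.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical sources: zip determines street in the first, not in the second.
uk_source = [
    {"zip": "EH8 9AB", "street": "Crichton St"},
    {"zip": "EH8 9AB", "street": "Crichton St"},
]
us_source = [
    {"zip": "07974", "street": "Mountain Ave"},
    {"zip": "07974", "street": "Main St"},
]

# Union view that tags each row with a country code.
view = [dict(r, cc="44") for r in uk_source] + [dict(r, cc="01") for r in us_source]

print(fd_holds(uk_source, ["zip"], ["street"]))           # True
print(fd_holds(view, ["zip"], ["street"]))                # False
print(fd_holds(view, ["zip"], ["street"], {"cc": "44"}))  # True
```

In this sketch the plain FD zip → street fails on the view because the second source does not satisfy it, while the conditional version restricted to cc = 44 still holds on the view.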
We investigate dependency propagation for views defined in various fragments of relational algebra, with conditional functional dependencies (CFDs) [FGJK08] as view dependencies and with source dependencies given as either CFDs or traditional functional dependencies (FDs). We establish lower and upper bounds, all matching, ranging from PTIME to undecidable. These not only provide the first results for CFD propagation, but also extend the classical work on FD propagation by giving new complexity bounds in the presence of finite domains. We finally provide the first algorithm for computing a minimal cover of all CFDs propagated via SPC views. The algorithm has the same complexity as one of the most efficient algorithms for computing a cover of FDs propagated via a projection view, despite the increased expressive power of CFDs and SPC views. Another topic is matching records from unreliable data sources. A class of matching dependencies (MDs) is introduced for specifying the semantics of unreliable data. As opposed to static constraints for schema design such as FDs, MDs are developed for record matching, and are defined in terms of similarity metrics and a dynamic semantics. We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. We also propose a mechanism for inferring MDs with a sound and complete system, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. We finally provide a quadratic time algorithm for inferring MDs, and an effective algorithm for deducing quality RCKs from a given set of MDs.
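The role of RCKs in record matching can be sketched roughly as follows. This is a simplified illustration, not the formalism of the thesis: an RCK is modeled here as a list of (attribute in the first relation, attribute in the second relation, comparison operator) triples, and two records are taken to match if all comparisons of at least one key succeed. The attribute names, the sample records, the 0.8 threshold and the use of difflib.SequenceMatcher as the similarity metric are all assumptions.

```python
from difflib import SequenceMatcher


def equal(a, b):
    """Exact equality comparison."""
    return a == b


def similar(a, b, threshold=0.8):
    """Approximate string match; the metric and threshold are assumed."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# Hypothetical keys: each lists which attributes to compare and how.
RCKS = [
    [("email", "email", equal)],
    [("name", "holder", similar), ("phone", "tel", equal)],
]


def records_match(r1, r2, rcks=RCKS):
    """Records match if every comparison of some key succeeds."""
    return any(all(cmp(r1[a], r2[b]) for a, b, cmp in key) for key in rcks)


card = {"name": "Mark Smith", "phone": "555-0100", "email": "m.smith@example.org"}
billing = {"holder": "Marc Smith", "tel": "555-0100", "email": "msmith@example.org"}
other = {"holder": "Jane Doe", "tel": "555-0199", "email": "jane@example.org"}

print(records_match(card, billing))  # True, via the name/phone key
print(records_match(card, other))    # False
```

Here the email addresses disagree, so the first key fails, but the second key still identifies the pair through a similar name and an identical phone number, which mirrors the idea of falling back on more reliable attributes.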
The last topic is finding certain fixes for data monitoring [CGGM03, SMO07], which is to find and correct errors in a tuple when it is created, either entered manually or generated by some process. That is, we want to ensure that a tuple t is clean before it is used, to prevent errors introduced by adding t. As noted by [SMO07], it is far less costly to correct a tuple at the point of entry than to fix it afterward. Data repairing based on integrity constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured to be correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct.
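The interplay of a certain region, master data and editing rules can be sketched as below. This is a deliberately simplified reading of the description above, not the rule syntax or the algorithms of the thesis: a rule is modeled as a pair (attributes to match against master data, attributes to copy from it), the input tuple and the master data are assumed to share attribute names, and a rule fires only when its match attributes are already assured correct and the master data yields exactly one fix. All names and data are hypothetical.

```python
def apply_editing_rules(t, certain, master, rules):
    """Fix attributes of t using master data, starting from a certain region.

    Returns the updated tuple and the enlarged set of attributes that are
    now considered correct.
    """
    t = dict(t)
    certain = set(certain)
    changed = True
    while changed:
        changed = False
        for match_attrs, fix_attrs in rules:
            # Only trust a rule whose match attributes are assured correct,
            # and skip it if it has nothing new to fix.
            if not set(match_attrs) <= certain or set(fix_attrs) <= certain:
                continue
            candidates = [m for m in master
                          if all(m[a] == t[a] for a in match_attrs)]
            fixes = {tuple(m[a] for a in fix_attrs) for m in candidates}
            if len(fixes) != 1:
                continue  # no master match, or the fix would be ambiguous
            for attr, value in zip(fix_attrs, fixes.pop()):
                t[attr] = value
            certain |= set(fix_attrs)
            changed = True
    return t, certain


master = [{"zip": "EH8 9AB", "city": "Edinburgh", "area_code": "0131"}]
rules = [(("zip",), ("city",)), (("city",), ("area_code",))]
dirty = {"zip": "EH8 9AB", "city": "Edinburg", "area_code": "020"}

fixed, region = apply_editing_rules(dirty, {"zip"}, master, rules)
print(fixed)
print(sorted(region))
```

Starting from a certain region that contains only zip, the first rule corrects city, which in turn lets the second rule correct area_code. With an empty certain region no rule fires and the tuple is returned unchanged.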
Bibliography	:
APA: Ma, Shuai. (2554). Extending dependencies for improving data quality. Bangkok: Edinburgh Research Archive, United Kingdom.
Chicago: Ma, Shuai. 2554. "Extending dependencies for improving data quality". Bangkok: Edinburgh Research Archive, United Kingdom.
MLA: Ma, Shuai. "Extending dependencies for improving data quality." Bangkok: Edinburgh Research Archive, United Kingdom, 2554. Print.
Vancouver: Ma, Shuai. Extending dependencies for improving data quality. Bangkok: Edinburgh Research Archive, United Kingdom; 2554.