Lossless Data Deduplication: An Alternative Solution for Handling Duplicated Records

Authors

  • Ardijan Handijono, Universitas Pamulang

Keywords:

Duplicated Records, Deduplication, Data Cleansing, Supervised Duplicated Records Identification, Data Warehouse

Abstract

Poor data quality has negative effects on an organisation: increased operational costs, inefficient decision-making, lower performance, and reduced employee and customer satisfaction. Duplicated records are generally handled by elimination or merging, but when the duplicates occur in a master table and are referenced by transactions, handling them is no longer straightforward. This paper provides a data deduplication solution that preserves the historical value of the transactions. To retain all duplicated records that are related to transactions, we use a mapping table between the dimension table and the fact table. With this approach the quality of the dimension table improves, since the handling of duplicated records includes an enrichment step and the deletion of dirty records, while all duplicated records related to transactions remain fully accessible in the data warehouse: no transaction data is lost.
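The mapping-table idea described in the abstract can be sketched in a few lines of SQL. The following is a minimal, hypothetical illustration (table and column names are assumptions, not taken from the paper): duplicated dimension keys are mapped to one surviving master key, the dirty duplicate is deleted from the dimension, and transactions that referenced it remain reachable through the mapping table.

```python
import sqlite3

# In-memory database with a hypothetical star-schema fragment:
# a dimension table, a fact table, and the mapping (bridge) table.
con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.executescript("""
CREATE TABLE dim_customer (cust_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (sale_id INTEGER PRIMARY KEY, cust_key INTEGER, amount REAL);
-- Mapping table: every original (possibly duplicated) dimension key
-- points to the surviving, enriched master key.
CREATE TABLE map_customer (orig_key INTEGER PRIMARY KEY, master_key INTEGER);
""")

# Two duplicated dimension records for the same real-world customer.
cur.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                [(1, "PT Maju"), (2, "P.T. Maju")])
# Transactions reference both duplicates.
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(10, 1, 100.0), (11, 2, 250.0)])

# Deduplication: keep key 1 as the master, map both original keys to it,
# then delete the dirty duplicate from the dimension table.
cur.executemany("INSERT INTO map_customer VALUES (?, ?)", [(1, 1), (2, 1)])
cur.execute("DELETE FROM dim_customer WHERE cust_key = 2")

# All transactions remain reachable through the mapping table:
# the join resolves every historical key to the clean master record.
rows = cur.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f
    JOIN map_customer m ON f.cust_key = m.orig_key
    JOIN dim_customer d ON m.master_key = d.cust_key
    GROUP BY d.name
""").fetchall()
print(rows)  # both sales survive under the single clean dimension record
```

Because the fact table keeps its original keys unchanged, no transaction rows are touched during cleansing; the mapping table alone absorbs the key consolidation, which is what makes the deduplication "lossless" with respect to history.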

Author Biography

Ardijan Handijono, Universitas Pamulang

Undergraduate Accounting Program (S1)

References

Babu, K. (2012). Business intelligence: Concepts, components, techniques and benefits. Components, Techniques and Benefits (September 22, 2012).

Babu, S. A. (2017). Duplicate Record Detection and Replacement within a Relational Database. Advances in Computational Sciences and Technology, 10(6), 1893-1901.

Bajpai, J., & Metkewar, P. S. (2016). Data quality issues and current approaches to data cleaning process in data warehousing. Glob. Res. Dev. J. Eng, 1(10), 14-18.

Chandrasekar, C. (2013). An optimized approach of modified bat algorithm to record deduplication. International Journal of Computer Applications, 62(1).

Culotta, A., & McCallum, A. (2005). Joint deduplication of multiple record types in relational data. Paper presented at the Proceedings of the 14th ACM international conference on Information and knowledge management.

Elkington, D., Zeng, X., & Morris, R. (2016). Resolving and merging duplicate records using machine learning. In: Google Patents.

Elkington, D. R., Zeng, X., & Morris, R. G. (2014). Resolving and merging duplicate records using machine learning. In: Google Patents.

Fleckenstein, M., & Fellows, L. (2018). Data Warehousing and Business Intelligence. In Modern Data Strategy (pp. 121-131): Springer.

Haug, A., Zachariassen, F., & Van Liempd, D. (2011). The costs of poor data quality. Journal of Industrial Engineering and Management (JIEM), 4(2), 168-193.

Ker, C., Vaishnav, P., & Dvinov, D. (2017). Merging multiple groups of records containing duplicates. In: Google Patents.

Kimball, R., & Caserta, J. (2011). The data warehouse ETL toolkit: practical techniques for extracting, cleaning, conforming, and delivering data: John Wiley & Sons.

Kimball, R., & Ross, M. (2013). The data warehouse toolkit: The definitive guide to dimensional modeling: John Wiley & Sons.

Marsh, R. (2005). Drowning in dirty data? It's time to sink or swim: A four-stage methodology for total data quality management. Journal of Database Marketing & Customer Strategy Management, 12(2), 105-112.

Meadows, A., Pulvirenti, A. S., & Roldán, M. C. (2013). Pentaho Data Integration Cookbook: Over 100 Recipes for Building Open Source ETL Solutions with Pentaho Data Integration (2nd ed.). Birmingham: Packt Publishing.

Papenbrock, T., Heise, A., & Naumann, F. (2014). Progressive duplicate detection. IEEE Transactions on knowledge and data engineering, 27(5), 1316-1329.

Santos, V., & Belo, O. (2011). No need to type slowly changing dimensions. Paper presented at the IADIS International Conference Information Systems.

Sitas, A., & Kapidakis, S. (2008). Duplicate detection algorithms of bibliographic descriptions. Library Hi Tech, 26(2), 287-301.

Skandar, A., Rehman, M., & Anjum, M. (2015). An Efficient Duplication Record Detection Algorithm for Data Cleansing. International Journal of Computer Applications, 127(6), 28-37.

Tamilselvi, J. J., & Gifta, C. B. (2011). Handling duplicate data in data warehouse for data mining. International Journal of Computer Applications, 15(4), 7-15.

Tamilselvi, J. J., & Saravanan, V. (2009). Detection and elimination of duplicate data using token-based method for a data warehouse: A clustering based approach. International Journal of Dynamics of Fluids, 5(2), 145-164.

Weis, M., Naumann, F., Jehle, U., Lufter, J., & Schuster, H. (2008). Industry-scale duplicate detection. Proceedings of the VLDB Endowment, 1(2), 1253-1264.

Published

2020-02-04

How to Cite

Handijono, A. (2020). Lossless Data Deduplication: Alternatif Solusi untuk Mengatasi Duplicated Record. Jurnal Teknologi Sistem Informasi Dan Aplikasi, 3(1), 33–41. Retrieved from https://openjournal.unpam.ac.id/index.php/JTSI/article/view/4223