Lossless Data Deduplication: An Alternative Solution for Handling Duplicated Records
Keywords:
Duplicate Records, Deduplication, Data Cleansing, Supervised Duplicated Record Identification, Data Warehouse
Abstract
Poor data quality has negative consequences for organizations: increased operational costs, inefficient decision-making, lower performance, and reduced employee and customer satisfaction. Duplicated records are commonly handled by elimination or merging, but when duplicates occur in a master table and have already been used in transactions, handling them is no longer straightforward. This paper proposes a data deduplication solution that preserves the historical value of transactions. To retain every duplicated record associated with a transaction, we use a mapping table between the Dimension table and the Fact table. With this approach, the quality of the Dimension table improves, because handling the duplicated records includes enriching the data and removing dirty records, while all duplicated records related to transactions remain fully accessible in the Data Warehouse; no transaction data is lost.
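A minimal sketch of the mapping-table idea described in the abstract, assuming a simple star schema with a customer Dimension table and a sales Fact table. The table and column names (dim_customer, fact_sales, dim_customer_map, master_key) are illustrative assumptions, not taken from the paper; the point is only that duplicated dimension keys are routed to a surviving master key instead of being deleted, so existing fact rows stay resolvable.

```python
# Hypothetical sketch of the mapping-table approach: duplicated dimension keys
# are kept and linked to a surviving "master" key through a mapping table, so
# fact rows that reference the duplicates remain accessible and no transaction
# history is lost. Names and data are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: a customer dimension containing a duplicated record and a fact
# table whose rows reference both the original and the duplicated key.
cur.executescript("""
CREATE TABLE dim_customer     (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE fact_sales       (sale_id INTEGER PRIMARY KEY, customer_key INTEGER, amount REAL);
-- Mapping table that links every duplicated key to its surviving master key.
CREATE TABLE dim_customer_map (customer_key INTEGER PRIMARY KEY, master_key INTEGER);

INSERT INTO dim_customer VALUES (1, 'Budi Santoso', 'Jakarta'),
                                (2, 'Budi  Santoso', 'JAKARTA');   -- duplicate of key 1
INSERT INTO fact_sales   VALUES (100, 1, 250.0), (101, 2, 400.0);  -- both keys used in transactions
""")

# Deduplication step: instead of deleting key 2 (which would orphan sale 101),
# record that both keys resolve to the enriched master record, key 1.
cur.executemany("INSERT INTO dim_customer_map VALUES (?, ?)", [(1, 1), (2, 1)])

# Queries join facts -> mapping -> dimension, so totals consolidate onto the
# master record while every original transaction stays in the warehouse.
cur.execute("""
SELECT d.name, SUM(f.amount)
FROM fact_sales f
JOIN dim_customer_map m ON f.customer_key = m.customer_key
JOIN dim_customer d     ON m.master_key   = d.customer_key
GROUP BY d.customer_key
""")
print(cur.fetchall())   # [('Budi Santoso', 650.0)] -- no sales lost
conn.close()
```

Under these assumptions, the dirty duplicate can later be removed or archived from the Dimension table without touching the Fact table, since the mapping table carries the key translation.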
License
Copyright (c) 2020 Ardijan Handijono
This article is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) License.