An Artificial Intelligence-Based Data Integration Framework for Real-Time Cross-Source Data Harmonization
DOI:
https://doi.org/10.71426/jcdt.v2.i1.pp131-139
Keywords:
Data ecosystems, Data transformation, Data Harmonization, Artificial Intelligence, World Development Indicators.
Abstract
The rapid growth of digital data ecosystems has intensified the need for efficient methods that integrate and harmonize heterogeneous data sources in real time. Conventional rule-based data integration pipelines often struggle with schema heterogeneity, semantic inconsistencies, and incomplete records across distributed data repositories. This study proposes an artificial intelligence-based data integration framework for real-time cross-source data harmonization across heterogeneous data environments. The framework integrates schema matching, entity resolution, and data transformation modules into a unified pipeline that combines lexical similarity, semantic normalization, and attribute-profile consistency analysis. To evaluate the framework, experiments were conducted on a dataset extracted from the World Development Indicators (WDI) repository. Because the available dataset was a single-source extract, a heterogeneous secondary schema was constructed through controlled attribute renaming and semantic perturbation to emulate realistic cross-source integration scenarios. The evaluation assessed schema matching accuracy, entity resolution performance, integration latency, and data completeness. On this benchmark, the proposed framework outperformed the conventional integration baselines. Specifically, it achieved a schema matching accuracy of 1.000, compared with 0.857 for similarity-based matching and 0.571 for lexical rule-based matching. In the entity resolution task, the framework obtained perfect precision, recall, and F1-score, while the baseline approaches degraded substantially under heterogeneous naming conditions. Although the proposed system incurred a modest increase in computational latency (12.4 ms) relative to lightweight baselines, the latency remained within real-time operational limits. Additionally, the harmonization process improved dataset completeness from 91.67% to 100%.
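As an illustration of how lexical similarity, semantic normalization, and attribute-profile consistency might be combined into a single schema matching score, the following is a minimal sketch and not the implementation evaluated in the study. The synonym table, the two profile features (numeric fraction and distinct-value ratio), the weights, the column names, and the sample values are all assumptions introduced for the example; only the three-signal combination comes from the abstract.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table for semantic normalization (assumption, not from the paper).
SYNONYMS = {"nation": "country", "yr": "year", "pop": "population"}


def normalize(name):
    """Lowercase the name, split it into tokens, and map each token to a canonical form."""
    tokens = name.lower().replace("-", "_").replace(" ", "_").split("_")
    return "_".join(SYNONYMS.get(t, t) for t in tokens if t)


def lexical_sim(a, b):
    """Character-level similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a, b).ratio()


def profile(values):
    """Toy attribute profile: (fraction of numeric values, distinct-value ratio)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return (0.0, 0.0)
    numeric = sum(isinstance(v, (int, float)) for v in non_null) / len(non_null)
    distinct = len(set(non_null)) / len(non_null)
    return (numeric, distinct)


def profile_sim(p, q):
    """One minus the mean absolute difference between two profiles."""
    return 1.0 - sum(abs(x - y) for x, y in zip(p, q)) / len(p)


def match_schemas(src, tgt, weights=(0.3, 0.4, 0.3)):
    """Map each source attribute to the target attribute with the highest combined score.

    The weights (lexical, semantic, profile) are illustrative assumptions.
    """
    w_lex, w_sem, w_prof = weights
    mapping = {}
    for s_name, s_vals in src.items():
        best, best_score = None, -1.0
        for t_name, t_vals in tgt.items():
            score = (
                w_lex * lexical_sim(s_name.lower(), t_name.lower())
                + w_sem * lexical_sim(normalize(s_name), normalize(t_name))
                + w_prof * profile_sim(profile(s_vals), profile(t_vals))
            )
            if score > best_score:
                best, best_score = t_name, score
        mapping[s_name] = best
    return mapping


# Made-up sample columns; the second schema mimics controlled attribute renaming.
source = {
    "country_name": ["India", "Brazil", "Kenya"],
    "year": [2019, 2020, 2021],
    "population": [1.38e9, 2.1e8, 5.3e7],
}
renamed = {
    "nation_name": ["India", "Brazil", "Kenya"],
    "yr": [2019, 2020, 2021],
    "pop_total": [1.38e9, 2.1e8, 5.3e7],
}
print(match_schemas(source, renamed))
```

In this sketch the semantic term carries the renamed attributes that plain string similarity would score poorly, and the profile term separates numeric from textual columns when names are uninformative. A greedy best-match per source attribute is used for brevity; it does not enforce a one-to-one mapping.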
License
Copyright (c) 2026 Sarvendra Aeturu (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.