Data Integration

Data Integration is a well-established research area with tangible results for relational databases. However, the PG data model differs significantly from the relational one, especially because graph instances can be defined without an a priori schema. While entity alignment has been studied for knowledge graphs in other data models (e.g., RDF) by leveraging ontologies and RDF types, schema-based PG integration is largely unexplored, owing to the lack of definitions of schemas and constraints for this data model. In this direction, methods for schema discovery, which are necessary for establishing mappings and transformation rules between multiple PGs, are currently under development. Property graph schemas are being defined as part of the LDBC standardisation activities, and these definitions are expected to be adopted by graph database vendors. Schema inference methods can then be used to extract standard schemas from PGs, and these schemas can be used to specify mappings across different PGs. Similarly, an appropriate mapping language for data exchange and transformation does not yet exist and should be defined on top of standard graph query languages. The correctness and validity of the generated mappings are critical for processing tasks; thus, there is a need for methods to ensure that the available mappings represent the intended transformations appropriately. Finally, the integration of PGs with other graph data models (e.g., from/to RDF) is relevant, with appropriate characterization of the cases where loss of information might occur.
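
To make the notion of schema inference concrete, the following minimal sketch derives node types and edge types from a schema-less PG instance by grouping nodes by their label sets and collecting the property keys observed for each group. It is only an illustration under simplifying assumptions: the in-memory representation, the function name `infer_schema` and the sample data are hypothetical, and the result is neither an LDBC schema nor the output of any method proposed here.

```python
from collections import defaultdict


def infer_schema(nodes, edges):
    """Infer a naive schema from a property graph instance.

    nodes: dict mapping node id -> (set of labels, dict of properties)
    edges: list of (source id, edge label, target id, dict of properties)
    (Hypothetical representation, chosen only for this illustration.)
    """
    # Node type = set of labels; record every property key seen for it.
    node_types = defaultdict(set)
    for labels, props in nodes.values():
        node_types[frozenset(labels)].update(props)

    # Edge type = (source labels, edge label, target labels).
    edge_types = defaultdict(set)
    for src, label, tgt, props in edges:
        key = (frozenset(nodes[src][0]), label, frozenset(nodes[tgt][0]))
        edge_types[key].update(props)

    return dict(node_types), dict(edge_types)


# Hypothetical sample instance with no schema declared up front.
nodes = {
    1: ({"Person"}, {"name": "Ada", "born": 1815}),
    2: ({"Person"}, {"name": "Alan"}),
    3: ({"City"}, {"name": "London"}),
}
edges = [
    (1, "LIVES_IN", 3, {"since": 1835}),
    (2, "LIVES_IN", 3, {}),
]

node_types, edge_types = infer_schema(nodes, edges)
```

Even this naive grouping shows why inference is non-trivial: the two `Person` nodes carry different property keys, so a realistic method must decide which properties are mandatory and which are optional.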

Graphs provide a very flexible data model, which makes them appropriate for integrating data from multiple disparate sources. Inspired by the work done for relational databases, TALE will tackle several foundational issues in integrating PGs, starting by adapting to PGs the well-known three-layered architecture of relational data integration systems, i.e., the sources to be integrated, the target providing an integrated view over the sources, and the mappings establishing the relationship between the sources and the target. In this direction, TALE will carry out a thorough investigation of the complexity of the basic foundational services of graph data integration. It will also investigate the definition of a mapping language based on existing standards, capitalizing on the experience accumulated over the years with mapping languages designed by the FORTH team (i.e., X3ML23), thereby establishing the formal underpinnings of graph data integration and exchange. TALE will consider both schema-based and schema-less mappings: for the former, it will explore methods for discovering the schema of PGs that will be used for establishing mappings and transformations; for the latter, different techniques will have to be devised. Finally, mappings involving other data models (e.g., RDF) will be explored, both for the sources and the target, with appropriate characterization of the cases where loss of information might occur.

This task will initially explore schema extraction for PGs and will then investigate how to capitalise on schemas for integrating data from multiple heterogeneous sources. To this end, a mapping language will be defined that enables both schema-based and schema-less mappings and the corresponding data transformations. The formal semantics of the language will be established, and tools for performing the integration and the transformations will be delivered. The task will thus investigate issues concerning both virtual data integration, where the target is virtual in the sense that the actual integration is performed “on the fly” (e.g., when performing data analytics), and materialized data integration, also called data exchange, where the target is a PG that is computed and accessed when needed.
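
The difference between the two integration modes can be illustrated with a minimal sketch. Here a mapping is assumed to be nothing more than a renaming of labels and property keys from the source vocabulary into the target vocabulary. This is a deliberate simplification, not the mapping language to be defined in this task. All names (`apply_mapping`, `materialize`, `virtual_query`) and the sample sources are hypothetical.

```python
def apply_mapping(source_nodes, label_map, key_map):
    """Lazily translate source nodes into the target vocabulary."""
    for node_id, (labels, props) in source_nodes.items():
        target_labels = frozenset(label_map.get(l, l) for l in labels)
        target_props = {key_map.get(k, k): v for k, v in props.items()}
        yield node_id, (target_labels, target_props)


def materialize(sources, label_map, key_map):
    """Data exchange: compute the target PG once and store it."""
    target = {}
    for name, nodes in sources.items():
        for node_id, node in apply_mapping(nodes, label_map, key_map):
            # Qualify ids by source name to avoid clashes between sources.
            target[(name, node_id)] = node
    return target


def virtual_query(sources, label_map, key_map, predicate):
    """Virtual integration: answer a query on the fly, storing nothing."""
    for name, nodes in sources.items():
        for node_id, (labels, props) in apply_mapping(nodes, label_map, key_map):
            if predicate(labels, props):
                yield (name, node_id), props


# Two hypothetical sources that use different vocabularies.
sources = {
    "hr": {1: ({"Employee"}, {"fullName": "Ada"})},
    "crm": {1: ({"Person"}, {"name": "Alan"})},
}
label_map = {"Employee": "Person"}
key_map = {"fullName": "name"}

# Materialized: the target graph exists as a stored object.
target = materialize(sources, label_map, key_map)

# Virtual: the same mapping is applied only while the query runs.
names = sorted(
    props["name"]
    for _, props in virtual_query(
        sources, label_map, key_map, lambda labels, props: "Person" in labels
    )
)
```

Both modes share the same mapping. They differ only in whether the translated graph is stored (`target`) or produced transiently while a query is evaluated (`names`).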