The main objective of this work is to design a technological stack, based on solid theoretical underpinning, around the property graph data model that can enable graph processing and analytics at scale. The aim has largely escaped the attention of the academic community despite its prominence in practical applications. In particular, such a stack shall ensure correctness of graph specifications, graph queries, and graph transformations. The overarching aims of this work is to capitalize expertise and technology that is developed for decades at FORTH for semantic graphs, such as those of graph databases, semantic web technologies, foundations of databases and data models, big data systems, as well as graph algorithms and techniques and to generate a whole new field of interest the scientific community focusing on Property Graphs.

To this extent, this work will progress along four research axes/sub-objectives:

  • Declarative graph language. Even though the specification of GQL v1 will appear early in 2024, it will still be a graphs-to-relations language borrowing its pattern-matching engine from its purely relational counterpart SQL/PGQ. The key reason the design committee is not yet ready to move to the next versions of GQL as a proper graph-to-graph language is the lack of research underpinnings of such languages. Therefore, a key objective is to establish the theoretical foundations of a fully- fledged graph language that is ready for commercial adoption.
  • Schema Discovery for Property Graphs (PG). Although early approaches on schema discovery have started to appear there already exists a significant amount of work performed in semantic graphs which can be adapted and leveraged for PGs. This is still an unexplored area and will be the key to enable schema-based data integration for PGs.
  • Data Integration for Property Graphs (PG). Enabling data integration for property graphs is the third objective of this proposal. Data integration is a well-studied research area with tangible results for relational databases. However, techniques for PG integration are currently non-existent despite the fact that the PG data model significantly differs from the relational one and that more and more data are becoming available and necessitate methods for their effective and efficient integration.
  • PG data processing systems at scale. Advancing the field by identifying challenges that impede the widespread use of graph data management and processing systems at scale, as well as promising solutions to overcome them. Indeed, although systems for graph data management and processing can count on prominent industrial adoption, we still witness a gap in their adoption for business intelligence use-cases and analytical tasks.

TALE will also design and implement a novel infrastructure focusing on the four main phases of the project, as shown in Figure 1.

Figure 1: A novel infrastructure of TALE project
  • Design: The design methodology will involve a detailed survey on systems and languages focusing on property graphs, identifying bottlenecks of existing solutions and setting the path for the following work in this project. The survey will focus on existing languages and relevant approaches for schema discovery, identifying also applicable approaches from semantic graphs. The survey will also include currently running EU projects in the area and a thorough analysis of the results of past projects. PG processing infrastructures will be examined and experiences from past projects will be directly transferred to TALE.
  • Development: The development phase will focus on individual areas, i.e. PG schema discovery focusing on extracting a structure that can be used as a schema for PGs, how to exploit those schemas for establishing declarative mappings between multiple PGs leading eventually to PG Data integration, and finally on schema-based data partitioning for efficient query answering over big PG.
  • Evaluation Phase: During this phase, the individual components will be internally tested and evaluated. The results of this phase will drive the final version of the software all components will be integrated. The final version will be evaluated holistically evaluating how multiple PGs can be integrated exploiting schema information and enabling efficient query answering on top. The outcome will also influence the business/exploitation plan especially shedding light in the added value of the developed technologies.
  • Exploitation: This phase spans throughout the project and after completion continues via the sustainability plan and the release of services in an open-source repository. Also, project results will be shared with the industry and through existing channels (e.g. Oracle Labs) and will also be communicated through a COST action in relevant fields co-organized by FORTH submitted in parallel with this proposal.