Graphs are a flexible and agile data model for representing complex, network-structured data, and are used in a wide range of application domains, including social networks, biological networks, and knowledge management. Graphs, by nature, leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena.

Today, we are witnessing an unprecedented growth of interconnected data, which underscores the vital role of graph processing in our society. As pointed out in a recent CACM article by key figures in data management, “the future is big graphs”. Instead of one single, compelling (“killer”) application, we see big graph processing systems underpinning many emerging, but already complex and diverse, data management ecosystems in many areas of societal interest. Indeed, the effects of such growth are evident, and big graph processing systems are already applied to many sophisticated graph data management tasks. For instance, the timely Graphs COVID-19 initiative is evidence of the importance of big graph analytics in understanding the pandemic.

To address the growing presence of graphs, academics, start-ups, and big tech companies such as Google, Facebook, and Microsoft have introduced various systems for managing and processing big graphs. Currently, diverse models and systems populate a fragmented market, leaving the community without a clear direction. According to the aforementioned CACM article, the Resource Description Framework (RDF) and the Property Graph (PG) are the most prominent data models for graph data management. The former is a W3C recommendation designed to serve data sharing and interoperability scenarios; indeed, RDF is at the core of the Linked Data and FAIR data initiatives, which originated within the Semantic Web community. The latter, which is the focus of this action, emerged in the context of enterprise data management.

Property graphs are multigraphs whose nodes and edges can carry labels and properties (i.e., key-value pairs). The model is becoming very popular and widespread: PG solutions now serve 75% of the Fortune, and Gartner predicts that by 2025, graph technologies will be used in 80% of data and analytics innovations. Note that, at the foundational level, all the models underlying graph database systems are subsumed by the PG model. PG’s popularity in the industrial community is reinforced by the fact that its standardization was taken up by the main international standards body, namely ISO (International Organization for Standardization). However, diverse languages and systems for PG processing and analysis populate a fragmented market, causing a lack of clear direction for the research and industrial communities.
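To make the model concrete, the following minimal Python sketch represents a property graph as a multigraph whose nodes and edges carry labels and key-value properties; all class and attribute names are illustrative and not part of any PG standard.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        id: int
        labels: set[str]                # e.g. {"Person"}
        properties: dict[str, object]   # key-value pairs, e.g. {"name": "Ada"}

    @dataclass
    class Edge:
        id: int
        label: str                      # e.g. "KNOWS"
        src: int                        # source node id
        dst: int                        # target node id
        properties: dict[str, object] = field(default_factory=dict)

    @dataclass
    class PropertyGraph:
        # Multigraph: edges are keyed by their own id, so parallel edges
        # between the same pair of nodes are allowed.
        nodes: dict[int, Node] = field(default_factory=dict)
        edges: dict[int, Edge] = field(default_factory=dict)

    # A tiny two-person social graph.
    g = PropertyGraph()
    g.nodes[1] = Node(1, {"Person"}, {"name": "Ada", "born": 1815})
    g.nodes[2] = Node(2, {"Person"}, {"name": "Alan"})
    g.edges[1] = Edge(1, "KNOWS", 1, 2, {"since": 2020})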

Graph databases that adopt property graphs, e.g., Amazon Neptune, Neo4j, Oracle, SAP, and TigerGraph, enable graph access either via non-declarative APIs, such as Gremlin, or, in the style of traditional relational databases, via declarative languages, such as Cypher, PGQL, and GSQL. An upcoming graph query language standard from ISO, called GQL, aims to unify these declarative languages, much as SQL did for relational databases. The first version of the GQL standard is scheduled to appear in early 2024, but it will have a number of important omissions. The two most notable are the lack of support for sophisticated graph schemas and the complete absence of graph-to-graph transformations. Indeed, GQL currently allows writing only graph-to-relational queries, which are adequate for many scenarios, though not all. For example, in data analytics tasks that require extensive data exploration, a user expects to see graphs as query answers. To cater to these needs, existing PG solutions provide various ad-hoc tools (basic visualization, library functions for limited graph projection, etc.). However, a proper graph query language must treat graphs as first-class citizens and support, among other features, query compositionality, i.e., the ability to use the result of one query as the input of another. In fact, treating graphs as first-class citizens is a stated goal of the GQL design. However, compositionality has been dropped from the first version of GQL for a simple reason: there is no underlying research telling us how to add such facilities to the language.
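The contrast between graph-to-relational and graph-to-graph queries, and why only the latter compose, can be illustrated with the following Python sketch; the graph encoding and function names are ours, purely for illustration, and do not reflect GQL semantics.

    # A graph here is a dict: {"nodes": {id: props}, "edges": [(src, label, dst)]}.

    def knows_table(g, label="KNOWS"):
        # Graph-to-relational query: the answer is a table of rows, so the
        # graph structure is lost and no further graph query can apply.
        return [{"src": s, "dst": d} for (s, l, d) in g["edges"] if l == label]

    def induced_subgraph(g, label):
        # Graph-to-graph query: the answer is again a graph, so queries compose.
        edges = [(s, l, d) for (s, l, d) in g["edges"] if l == label]
        keep = {s for (s, _, _) in edges} | {d for (_, _, d) in edges}
        return {"nodes": {n: p for n, p in g["nodes"].items() if n in keep},
                "edges": edges}

    g = {"nodes": {1: {"name": "Ada"}, 2: {"name": "Alan"}, 3: {"name": "Kurt"}},
         "edges": [(1, "KNOWS", 2), (2, "KNOWS", 3), (1, "CITES", 3)]}

    rows = knows_table(g)              # a table: the query pipeline ends here
    g2 = induced_subgraph(g, "KNOWS")  # a graph: can feed another graph query
    g3 = induced_subgraph(g2, "KNOWS") # composition works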

Graphs constantly need to be transformed, updated with new information, and moved between applications. Our understanding of such transformations is very preliminary (as indicated, for example, by the multiple known issues in the updating facilities of the leading graph query language Cypher). Crucially, we completely lack a framework for checking the correctness of such transformations, such as adherence to typing information or compatibility with the requirements of an application that consumes the output of a graph query.
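As a rough illustration of what checking adherence to typing information could mean, the following Python sketch validates a proposed node update against a simple schema mapping node labels to required property types; the schema format and the check_node function are hypothetical simplifications, not an existing proposal.

    # A schema maps each node label to its required properties and their types.
    schema = {
        "Person": {"name": str},
        "Paper":  {"title": str, "year": int},
    }

    def check_node(labels, properties):
        """Return the schema violations a proposed node (or node update) would cause."""
        errors = []
        for label in labels:
            required = schema.get(label)
            if required is None:
                errors.append(f"unknown label {label!r}")
                continue
            for key, typ in required.items():
                if key not in properties:
                    errors.append(f"{label}: missing property {key!r}")
                elif not isinstance(properties[key], typ):
                    errors.append(f"{label}.{key}: expected {typ.__name__}")
        return errors

    # An update that drops 'year' from a Paper node is caught before being applied.
    print(check_node({"Paper"}, {"title": "On Graphs"}))  # ["Paper: missing property 'year'"]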

The lack of research underpinnings for a proper graph-to-graph language is currently limiting the development of essential features such as views, subqueries, and updates in graph query languages. Indeed, the first specification of the GQL industrial standard, which will appear in the first half of 2024, will still be a graphs-to-relations language, borrowing its core mechanism, pattern matching, from its purely relational counterpart SQL/PGQ. As such, a concentrated effort is required to understand and ultimately unlock the aforementioned features. In particular, TALE will focus on sophisticated types of graph modifications that we currently do not know how to perform safely, and on how to incorporate schemas into querying, an aspect of relational databases that is well understood and commonly used but requires much new foundational research for graphs. FORTH already possesses significant experience in formalizing languages for graphs (e.g., ICS-FORTH RDFSuite and RQL), which will be capitalized on for this purpose.