natanael.arndt.xyz: Decentralized Collaborative Knowledge Management using Git (Extended Abstract)

Abstract

Collaboration of people and machines is a major aspect of the World Wide Web as well as of the Semantic Web. As a result of the collaboration process, structural and content interferences as well as varying models and contradictory statements are inevitable. Currently, collaboration on Linked Data Sets is mainly done by keeping a central version of a dataset. This central approach for a synchronized state has drawbacks in scenarios in which the existence of different versions of the dataset is preferable. Furthermore, the evolution of a dataset in a distributed setup does not necessarily happen in a linear manner. We present a system that fosters the evolution of a dataset in a distributed collaborative setup: it supports divergence of datasets in the form of asynchrony and dissent, reconciles diverged states of datasets, and synchronizes different distributed derivatives of the dataset. The data is kept in a distributed version control system with support to branch, merge, and synchronize distributed RDF datasets. Each version can be queried and updated via standard SPARQL 1.1 Query & Update interfaces, and the related provenance information can be queried as well. The system makes it possible to build knowledge engineering processes similar to well-established methods from the software engineering domain.

Introduction

Apart from documents, datasets are gaining more attention on the World Wide Web. An increasing number of the datasets on the Web are available as Linked Data, also called the Linked Open Data Cloud or Giant Global Graph. Collaboration of people and machines is a major aspect of the World Wide Web as well as of the Semantic Web. Currently, RDF data on the Semantic Web can be accessed by applying the Linked Data principles and the SPARQL specification, which enables clients to access and retrieve data stored and published via SPARQL endpoints. RDF resources in the Semantic Web are interconnected and often correspond to previously created vocabularies and patterns. This way of reusing existing knowledge facilitates the modeling and representation of information and can reduce the development costs of a knowledge base. As a result of the collaborative reuse process, structural and content interferences as well as varying models and contradictory statements are inevitable.

Projects from a number of domains are striving for distributed models to collaborate on common knowledge bases. In the domain of e-humanities, projects often come with a need to explore and track provenance and the evolution of the domain data [RMA⁺10, RB16]. In the context of managing historical prosopographical data, the source of the statements is relevant to evaluate their credibility and to consider the influence of their environment. In libraries, metadata of electronic library resources are gathered and shared among stakeholders to collaboratively curate and manage the resources as Linked Data [ANN⁺14, NAR⁺14]. In a collaborative data curation setup, the origin of any statement needs to be identified in order to be able to trace back the conclusion of license contracts and identify sources of defective metadata. Enterprises, too, need to manage data in distributed setups to organize the communication of data along supply chains or business processes [FATP16].

Distributed systems such as the Solid platform, an advancement of the architecture of a distributed semantic social network, provide possibilities to collaborate in a distributed network. Nevertheless, the subject of collaboration is currently kept in a central place where all contributions are incorporated; the organization of a fully decentralized collaboration process is still subject to future work. In general, collaboration on Linked Data Sets is currently mainly done by keeping a central version of a dataset. The systems available to collaborate on Linked Data are central SPARQL endpoints and Wiki systems, where collaboration happens on a single, shared instance. This central approach for a synchronized state has drawbacks in scenarios in which the existence of different versions of the dataset is preferable. Furthermore, the evolution of a dataset in a distributed setup does not necessarily happen in a linear manner. Multiple versions of a dataset occur if the participants do not all have simultaneous access to the central dataset. If a consensus on the statements in a dataset is not yet reached, multiple viewpoints need to be expressed as different versions of the dataset. Hence, a system that fosters the evolution of a dataset in a distributed collaboration setup needs to support divergence of datasets in the form of asynchrony and dissent; reconcile diverged states of datasets; and synchronize different distributed derivatives of the dataset. As a consequence of the reconciliation, we also need to identify conflicts and contradictions that may occur, and offer workflows to resolve them. The dimensions of consensus vs. dissent and synchronicity vs. asynchrony are depicted in fig. 1. While the dissent dimension comes with the collaborative character, asynchrony is introduced by the distributed conception of our setup. Both dimensions can lead to a diverged state of a dataset in a collaborative curation scenario.

In the early days of computers, the term software crisis was coined to describe the immaturity of the software engineering process and software engineering domain. The process of creating software could be made more reliable and controllable by introducing software engineering methods. Version control is an important aspect of organizing the collaborative evolution of software. Early version control systems (VCS), such as CVS and Subversion, allowed central repositories with a linear version history to be created. Distributed VCS (DVCS), such as Darcs, Mercurial, and Git, were developed to allow every member of a distributed team to fork the current state of the program's source code and individually contribute new features or bug fixes as pull requests. Learning from software engineering history, where DVCS have helped to overcome the software crisis, we claim that adapting DVCS to Linked Data is a means to support decentralized and distributed collaboration processes in knowledge management. The subject of collaboration in the context of Linked Data is the dataset instead of the source code file. Similar to source code development with DVCS, individual local versions of a dataset are curated by data scientists and domain experts.

In our previously published paper [ANR⁺18] we present the Quit Store, which was inspired by and builds upon the successful Git system. The approach is based on a formal expression of evolution and reconciliation of distributed datasets. It provides support to branch, merge, and synchronize distributed RDF datasets. During the collaborative curation process, the system automatically versions the RDF dataset and tracks provenance information. The provenance information is expressed in RDF using PROV-O and can be accessed through a dedicated SPARQL 1.1 endpoint. To version the data, the system relies on the pure RDF data model and not on support for additional semantics such as OWL or SKOS. To support distributed collaboration, we propose a methodology of using a Git repository to store the data in combination with a SPARQL 1.1 interface to access it. The SPARQL 1.1 interface provides an integration layer that makes the collaboration features of Quit accessible to applications operating on RDF datasets. Most recently, we have extended the Quit system with the Quit Editor Interface Concurrency Control [AR19] to support editors in managing overlapping operations. To reconcile diverged datasets, a merge process is provided. The merge process is guided by merge strategies specific to RDF data: Union Merge, All Ours/All Theirs, Three-Way-Merge, and Context Merge. This setup can enable complex distributed collaboration strategies. Just as there is a large ecosystem of methodologies and tools around Git to support the software development process, the Quit Store can support the creation of such an ecosystem for RDF dataset management.
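To make the merge strategies named above more concrete, the following is a minimal sketch that treats an RDF graph as a plain set of (subject, predicate, object) triples. It is an illustration under stated assumptions, not the Quit Store implementation: the function names, the example IRIs, and the set-based graph representation are all hypothetical, blank nodes are ignored, and Context Merge is left out because its conflict detection goes beyond simple set arithmetic.

```python
from typing import FrozenSet, Tuple

# Assumption: a graph is modelled as an immutable set of ground triples.
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def union_merge(ours: Graph, theirs: Graph) -> Graph:
    """Keep every statement that exists in either branch."""
    return ours | theirs


def all_ours(ours: Graph, theirs: Graph) -> Graph:
    """Discard the other branch and keep our state unchanged."""
    return ours


def all_theirs(ours: Graph, theirs: Graph) -> Graph:
    """Discard our branch and adopt the other state unchanged."""
    return theirs


def three_way_merge(base: Graph, ours: Graph, theirs: Graph) -> Graph:
    """Apply the additions and deletions of both branches to the common base."""
    added = (ours - base) | (theirs - base)
    removed = (base - ours) | (base - theirs)
    return (base - removed) | added


# Hypothetical example data: both branches diverge from a common base.
A = ("ex:alice", "foaf:knows", "ex:bob")
B = ("ex:alice", "foaf:name", "Alice")
C = ("ex:bob", "foaf:name", "Bob")
D = ("ex:alice", "foaf:name", "Alice A.")

BASE: Graph = frozenset({A, B})
OURS: Graph = frozenset({A, B, C})   # our branch adds C
THEIRS: Graph = frozenset({A, D})    # their branch removes B and adds D

MERGED_UNION = union_merge(OURS, THEIRS)
MERGED_THREE_WAY = three_way_merge(BASE, OURS, THEIRS)
```

In this example the union merge retains the statement B that the other branch deliberately removed, whereas the three-way merge honours the deletion because it can compare both branches against their common base. This is the same distinction that motivates three-way merging of source code in Git.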

Acknowledgements

This work was partly supported by a grant from the German Federal Ministry of Education and Research (BMBF) for the LEDS Project under grant agreement No 03WKCG11C, by the Federal Ministry for Economic Affairs and Energy (BMWi) for the PlatonaM project under grant number 01MT19005A, and by the DFG project Professorial Career Patterns of the Early Modern History: Development of a scientific method for research on online available and distributed research databases of academic history under grant agreement No 317044652.
