Managing Heterogeneity

Besides integrating large-scale data sources, SemaGrow also tackles the integration of heterogeneous data sources. This pertains to efficiently applying resource mappings within the query execution engine of the SemaGrow Stack, but also to a system of tools that complement the SemaGrow Stack and provide the mappings to be applied. While the first period of the project focused on developing technologies that support robust alignment over heterogeneous knowledge models and languages (MAPLE, Lime) and on refining the SYNTHESIS system for automatic alignment, the second period focused on robustness and integration. Specifically, SYNTHESIS has been provided with an appropriate data access layer, moved to a thread-based architecture, and equipped with ontology modularization methods to improve scalability. Furthermore, the architecture of the VocBench environment for manual alignment, and of its backing framework Semantic Turkey, has been reworked to allow for multi-project and multi-graph management, dynamic context injection inside the services, and seamless navigation of local and Web data, laying out a common ground for a user-centric ontology alignment experience. Semantic Turkey also features a new Linked Data explorer that will be integrated into VocBench during the final period of the project.

Finally, during its second period the project experimented with the CODA system for knowledge extraction and transformation, applying it to a spreadsheet import scenario. The objective is to strike a balance between enforcing conventions (affording ease of use) and the capacity to deal with complexity. This effort resulted in Sheet2RDF, a spreadsheet-to-RDF import and transformation system which, by following a “convention over configuration” approach, ensures that “evident” and trivial imports can be handled almost automatically, whereas more complex transformations can still be managed through the more powerful capabilities of CODA.
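
To make the “convention over configuration” idea concrete, the following minimal Java sketch (using Apache Jena) illustrates the principle; it is not Sheet2RDF's actual implementation, and the class name and example namespace are hypothetical. The convention assumed here is that the first column identifies the subject resource and every other column header names a property, so a trivial sheet is triplified with no configuration at all.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    public class NaiveSheetTriplifier {

        // Hypothetical base namespace; a real tool would take it as input.
        private static final String NS = "http://example.org/data/";

        public static Model triplify(String[] header, java.util.List<String[]> rows) {
            Model model = ModelFactory.createDefaultModel();
            for (String[] row : rows) {
                // Convention: the first column identifies the subject resource.
                Resource subject = model.createResource(NS + row[0]);
                // Convention: every other header becomes a property local name.
                for (int c = 1; c < header.length && c < row.length; c++) {
                    Property p = model.createProperty(NS, header[c]);
                    subject.addProperty(p, row[c]);
                }
            }
            return model;
        }

        public static void main(String[] args) {
            String[] header = {"id", "label", "country"};
            java.util.List<String[]> rows =
                    java.util.List.of(new String[]{"farm42", "Test Farm", "GR"});
            triplify(header, rows).write(System.out, "TURTLE");
        }
    }

Anything beyond such trivial column-to-property projections is where the more powerful transformation capabilities of CODA would take over.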

Both the alignment and knowledge acquisition systems developed represent an “offline” contribution to the stack, i.e. integration is guaranteed at the level of standards compliance. We detail here the flow of data and the standards being adopted:

Alignment:

  • SYNTHESIS provides results (i.e. alignments) according to the Alignment Ontology adopted by INRIA’s Alignment API framework: a de facto standard, currently adopted in most ontology alignment environments, that, by reifying alignments, provides more insightful information about them, such as the reliability of each alignment and its validation status.
  • Semantic Turkey’s Alignment Validator manages the Alignment Ontology’s format and extends it with further metadata about the alignments, facilitating backup and recovery of uncompleted human validation processes. It also allows alignments to be projected into plain mapping triples according to existing alignment vocabularies (OWL and SKOS mapping properties); a sketch of this projection follows the list.
  • The output of Semantic Turkey’s Alignment Validator (plain mapping triples) is then ingested by VocBench and collaboratively maintained there.
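
As a concrete illustration of this flow, the sketch below uses INRIA's Alignment API to parse an alignment in the Alignment format, and Apache Jena to project its correspondences onto plain skos:exactMatch mapping triples. It is a hedged approximation of the projection step, not the Alignment Validator's actual code; in particular, the 0.8 confidence threshold and the uniform choice of skos:exactMatch are assumptions made for the example.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.vocabulary.SKOS;
    import org.semanticweb.owl.align.Alignment;
    import org.semanticweb.owl.align.Cell;
    import fr.inrialpes.exmo.align.parser.AlignmentParser;

    public class AlignmentProjection {
        public static void main(String[] args) throws Exception {
            // Parse a file following the Alignment format (debug level 0).
            AlignmentParser parser = new AlignmentParser(0);
            Alignment alignment = parser.parse(new java.io.File(args[0]).toURI());

            Model mappings = ModelFactory.createDefaultModel();
            for (Cell cell : alignment) {
                // Keep only correspondences the matcher is confident about;
                // the 0.8 threshold is an assumption of this sketch.
                if (cell.getStrength() >= 0.8) {
                    mappings.add(
                        mappings.createResource(cell.getObject1AsURI(alignment).toString()),
                        SKOS.exactMatch,
                        mappings.createResource(cell.getObject2AsURI(alignment).toString()));
                }
            }
            mappings.write(System.out, "TURTLE");
        }
    }

A real validator would also inspect each cell's relation and validation metadata rather than emitting skos:exactMatch uniformly.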

Triplification/Evolution:

  • CODA, and its specific application Sheet2RDF developed entirely in the context of SemaGrow, are able to process unstructured content. The general case of CODA covers any type of content, and is then narrowed by specific applications developed on top of it; in the case of Sheet2RDF, this is tabular data in different formats, such as Microsoft Excel, open formats such as ODS, and less descriptive formats such as CSV and TSV. The output format is RDF which, depending on specific needs, can be modelled according to different vocabularies. The result of this processing can be used in many scenarios:
    • Triplifying existing data in order to generate new datasets.
    • Triplifying data that is continuously produced, in order to update existing datasets. In this case, the clear separation between the acquisition process and the triplification makes it easy to update the data projection as the vocabularies of the target dataset evolve (see the sketch below).
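
The following Java sketch illustrates that separation under assumed interfaces; it is not CODA's actual API, and all names in it are hypothetical. Acquisition yields vocabulary-neutral key/value records, while a pluggable projection encapsulates everything that depends on the target vocabulary, so a vocabulary evolution only requires swapping the projection step.

    import java.util.List;
    import java.util.Map;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;

    public class AcquisitionVsProjection {

        /** Projects one vocabulary-neutral record into RDF; implementations
         *  encapsulate everything that depends on the target vocabulary. */
        interface Projection {
            void project(Map<String, String> record, Model model);
        }

        public static Model triplify(List<Map<String, String>> records,
                                     Projection projection) {
            Model model = ModelFactory.createDefaultModel();
            records.forEach(r -> projection.project(r, model));
            return model;
        }

        public static void main(String[] args) {
            // Acquisition output: neutral records (e.g. parsed from a sheet).
            List<Map<String, String>> records =
                    List.of(Map.of("id", "farm42", "label", "Test Farm"));

            // Projection onto a target vocabulary; when that vocabulary
            // evolves, only this lambda changes, not the acquisition code.
            Projection skosLike = (record, model) -> {
                Resource r = model.createResource(
                        "http://example.org/data/" + record.get("id"));
                r.addProperty(
                        model.createProperty(
                                "http://www.w3.org/2004/02/skos/core#", "prefLabel"),
                        record.get("label"));
            };

            triplify(records, skosLike).write(System.out, "TURTLE");
        }
    }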