Data Source Descriptions

Query optimization relies on histograms, data structures that approximate the distribution of data, to apply its cost model. Histograms can be constructed by scanning the database tables and aggregating the values of the attributes in each table, and can be maintained in a similar way in the face of database updates. This histogram lifecycle, however, cannot be applied efficiently to large-scale and frequently updated databases, such as stores of sensor data. An alternative approach is taken by adaptive query processing systems, which update their histograms by observing and analysing the results of the queries that make up the client-requested workload, rather than issuing a maintenance workload solely for updating the histograms. The relevant database literature focuses on numerical attributes, exploiting the concept of an interval as a succinct description of a set of numerical values whose length can be used to estimate the cardinality of many different intervals that have roughly the same density.
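The interval-based estimate described above can be illustrated with a minimal sketch. It assumes a one-dimensional histogram of non-overlapping buckets and uniform density within each bucket; the `Bucket` class, the `estimate_cardinality` function and the sample numbers are hypothetical and are not taken from any particular database system.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    low: float   # inclusive lower bound of the interval
    high: float  # exclusive upper bound of the interval
    count: int   # number of values observed in [low, high)


def estimate_cardinality(buckets, low, high):
    """Estimate how many values fall in [low, high).

    Assumes values are uniformly distributed inside each bucket, so a
    bucket contributes in proportion to the length of its overlap with
    the queried interval.
    """
    total = 0.0
    for b in buckets:
        overlap = min(high, b.high) - max(low, b.low)
        if overlap > 0:
            total += b.count * overlap / (b.high - b.low)
    return total


# Hypothetical histogram: 100 values in [0, 10) and 40 values in [10, 20).
histogram = [Bucket(0, 10, 100), Bucket(10, 20, 40)]

# The query interval [5, 15) covers half of each bucket.
estimate = estimate_cardinality(histogram, 5, 15)
print(estimate)
```

The estimate is only as good as the uniformity assumption, which is why histogram construction aims for buckets with roughly constant density.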

SemaGrow extended adaptive query processing so that it can be applied to the domain of strings, which the database state of the art treated as purely categorical symbols that can only be described by enumeration. This treatment disregards the fact that several classes of strings have an internal structure and can be handled in a more sophisticated manner. In URIs, for example, prefixes can express sub-spaces of the overall string space that are useful for providing query optimization statistics. Although less expressive than regular expressions, prefixes can be applied very efficiently and can capture interesting ranges in URIs and other hierarchically structured string domains.
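As a rough illustration of how prefixes can play the role that intervals play for numbers, the sketch below keeps a count per URI prefix and looks up the most specific prefix that covers a given URI. The function name, the `example.org` URIs and the counts are all hypothetical; this is not the SemaGrow implementation, only a minimal sketch of prefix matching.

```python
def longest_matching_prefix(prefixes, uri):
    """Return the longest known prefix that the URI starts with, or None."""
    matches = [p for p in prefixes if uri.startswith(p)]
    return max(matches, key=len) if matches else None


# Hypothetical statistics: number of resources observed under each prefix.
prefix_counts = {
    "http://example.org/": 1000,
    "http://example.org/sensors/": 800,
    "http://example.org/sensors/temperature/": 300,
}

uri = "http://example.org/sensors/temperature/42"
bucket = longest_matching_prefix(prefix_counts, uri)
print(bucket, prefix_counts[bucket])
```

A prefix test is a single string comparison, which is what makes prefixes cheap to apply compared with general regular expressions.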

Naturally, this attention to URIs is motivated by their prominent position in the Semantic Web and Linked Data infrastructures for publishing data. In fact, these paradigms motivate adaptive query processing for a further reason besides the scale of the data: distributed querying engines often operate over loose federations of publicly readable remote data sources, and cannot ensure that these sources maintain and publish histograms. Furthermore, the URIs of large-scale datasets are not hand-crafted names but are automatically generated following naming conventions, which are usually hierarchical. These observations both motivate extending adaptive query processing to Semantic Web data stores and present an opportunity for the SemaGrow string prefix extension. The SemaGrow work also formalized the key concepts of the STHoles workload-aware histogram in a way that subsumes STHoles as the specialization for numerical intervals, and provided an extension to string prefixes.

Work in SemaGrow extended VoID into Sevod, an RDF vocabulary that is expressive enough to represent detailed instance-level metadata akin to relational database histograms. This improves cost estimation, but Sevod models are practically impossible to maintain manually. To address this, SemaGrow adapted adaptive query processing methods from the relational database literature to Semantic Web data, yielding a method that maintains histograms by analysing the results of the user-requested query workload. In this manner, SemaGrow builds and maintains its Sevod descriptions without issuing any additional overhead queries and without requiring access to data dumps.
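The idea of maintaining a histogram from query feedback can be sketched as follows. This is a deliberately simplified, one-dimensional illustration and not the STHoles algorithm or the SemaGrow method: it assumes non-overlapping buckets and a query range that lies inside a single bucket, and the `Bucket` class, the `refine` function and the numbers are hypothetical. When a query over a range returns an observed number of results, the enclosing bucket is split so that the queried range carries the exact observed count and the remainder is spread over the rest of the bucket in proportion to length.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    low: float    # inclusive lower bound
    high: float   # exclusive upper bound
    count: float  # estimated or observed number of values in [low, high)


def refine(buckets, q_low, q_high, observed):
    """Split the bucket containing [q_low, q_high) using the observed count."""
    refined = []
    for b in buckets:
        inside = b.low <= q_low and q_high <= b.high and q_low < q_high
        if not inside:
            refined.append(b)  # query range is not inside this bucket
            continue
        # Values of the old bucket that were not returned by the query.
        rest = max(b.count - observed, 0)
        outside = (q_low - b.low) + (b.high - q_high)
        if q_low > b.low:
            refined.append(Bucket(b.low, q_low, rest * (q_low - b.low) / outside))
        refined.append(Bucket(q_low, q_high, observed))
        if q_high < b.high:
            refined.append(Bucket(q_high, b.high, rest * (b.high - q_high) / outside))
    return refined


# Hypothetical start: a single coarse bucket with 1000 values in [0, 100).
histogram = [Bucket(0, 100, 1000)]

# A client query over [20, 40) returned 500 results; fold that in.
histogram = refine(histogram, 20, 40, 500)
for b in histogram:
    print(b)
```

The refinement costs no extra queries: it reuses the result count of a query that the client asked for anyway.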

Although the research above has brought considerable advances in handling the needle-in-a-haystack problem of identifying the best strategy for retrieving data from a large-scale repository, intelligent execution planning does not make retrieving large amounts of data any faster, and it breaks down when remote endpoints are unexpectedly slow or unresponsive. To combine query results efficiently at a large scale and to handle the dynamic nature of the LOD cloud, we are carrying over and adapting optimized query execution methods from relational database research.