Efficient query processing depends on the construction of an efficient query plan to guide query execution. More specifically, given a query in a declarative language such as SPARQL, there are many plans that a query processing system can consider to retrieve the answer. All those plans are equivalent in the sense that they return the same answer, but they can vary by orders of magnitude in their cost in computational resources and time.
The implication is that detailed instance-level metadata about the data sources can be used to estimate cost and select among alternative plans. Although this is true of any query processing system offering a declarative query language, SemaGrow deals with federated querying, where additional challenges with respect to conventional and even distributed stores are present: distributed triple stores exercise full control over a cluster of storage nodes to efficiently scale out without exposing to client applications the internal organization of data among the nodes of the cluster. Federated querying, on the other hand, accommodates the more dynamic combination of query endpoints that are maintained and populated independently of their membership in one (or multiple) federations. In fact, federated endpoints need not even be aware that some of their clients are not end-applications but federation services and that the queries they serve are executed in the context of a more complex federated query.
This dynamic, loose integration of endpoints is more characteristic of the kind of scalability envisaged for the Semantic Web and the Linked Data cloud than the tighter integration of centralized distributed stores. But it leaves open questions in query optimization, since it introduces factors that are not relevant to single-node or tightly integrated distributed databases and triple stores: that endpoint metadata might not be available or not be reliable; endpoints might get updated, might be temporarily unavailable; or might have varying cost parameters depending on load other than the load generated by the federation service.
Two systems that stand out among the federated querying literature for their low query processing time, are also characteristic of different approaches to efficient federated querying. The FedX system plans query execution without relying on any metadata about the endpoints it federates. FedX issues low-cost probing queries to identify candidate data sources among the federated endpoints and then applies coarse heuristics to optimize the query plan. Although this algorithm can potentially produce suboptimal plans, it introduces minimal overhead to query execution. Together with a very efficient execution engine, this strategy makes FedX extremely fast executing low-cost queries and, more generally, in situations where more sophisticated optimization has little impact. SPLENDID, on the other hand, employs data source metadata to estimate the cost of executing query plans in order to select the most advantageous one to execute. This introduces query planning overhead and the requirement that such metadata is available. To counter-balance this requirement, SPLENDID relies on endpoint descriptions using the Vocabulary of Interlinked Datasets (VoID), an independently motivated and widely used vocabulary for describing linked datasets.
Although SPLENDID is generally slower than FedX, its plans can sometimes require an order of magnitude less data to be fetched from the remote endpoints in order to prepare the same final results. Besides the execution efficiency improvement, this is particularly important when federating publicly available endpoints that often apply a limit in the number of rows in a result set. The observation that optimized planning often does not pay off but can sometimes have a huge impact has prompted us to investigate how to optimize query plans while at the same time performance remains competitive despite the optimization overhead in situations where optimization does not yield considerable performance improvements.
SemaGrow balances between a query optimizer that introduces little overhead, has appropriate fall backs in the absence of metadata, but at the same time produces optimal plans in most (although not theoretically all) possible situations. The SemaGrow execution engine is as efficient as that used by FedX, but with the added benefit of being based on state-of-the-art stream processing technologies to achieve to operate in an asynchronous and non-blocking way.