Abstract of MS Thesis - Kuldeep Reddy

Monday 09 April 2012 at 11:25 pm.

The web of linked data represents a globally distributed dataspace that allowspublishing and connecting structured data across the web by creating typed linksbetween data from different sources. The data published on the web is machinereadable, that is, its meaning is explicitly defined. It can be queried with theSPARQL query language whose execution takes place, according to the bottom-upapproach, by asynchronously traversing the RDF links to discover data sourcesat run-time. The optimization of SPARQL queries over the web of dataremains a challenge and in the first part of the thesis an approach addressing thisproblem is presented. The proposed approach works in two-phases to optimizethe SPARQL queries. The first phase analyzes the query before its execution anddiscovers classes of data that do not contribute towards answering it which canthen be prevented from being fetched. However, this analysis may miss a numberof patterns that can only be discovered at run-time. The second phase analyzesthe query execution to discover more patterns which are used to further reducethe amount of data fetched from the web to answer the query.

With the growth in size and complexity of the web of linked data, itbecomes impractical for the user to know enough about its structure and semanticsfor the user queries to produce enough answers. This problem is addressed in thesecond part of the thesis by making use of ontologies available on the web of linkeddata to produce approximate results. The existing approach, which generates mul-tiple relaxed queries and executes them sequentially one by one, is improved byintegrating the approximation steps with the query execution itself. Thus, by per-forming query relaxation on-the-fly at runtime, the shared data between relaxedqueries are not fetched repeatedly, resulting in significant performance benefits.Further opportunities for optimization during query execution are identified andare used to prune relaxation steps which do not produce results. The implemen-tation of the proposed approach demonstrates its benefits. In order to the ensurequality of results given the prevalence of unreliable data on the web of linked data,the component of trust is introduced to rate the results in addition to the similaritymetrics. The trust scores are computed for the triples by detecting inconsistentinformation in the description of the entity. Using the trust scores, further efficiency is achieved by pruning unreliable data at an early stage in query executionitself.

The third part of the thesis proposes a distributed approach to process SPARQLqueries, where the query itself is routed amongst sources providing SPARQLendpoints as opposed to the existing approaches which retrieve the data andexecute the query locally over it. To decide whom to forward the queries to, itmakes use of service descriptions of SPARQL endpoints, which basically describethe data present in the source. So, each source will execute the triple patternmeant for it, match the bindings produced with the constraints present in theservice descriptions of the sources and if there is a match forward the query tothe source and the process repeats itself till the results are produced. To performapproximate querying of the Web of Linked data with the distributed approach,the relaxation of query conditions is also distributed amongst the sources. Eachsource now has the responsibility to relax the triple patterns it is concerned withusing the available ontologies, execute them and forward the relaxed queries usingthe service descriptions of the sources.