Serving Sophisticated Ad Hoc Information Needs Based on Beforehand Unknown, Autonomous, & Heterogeneous XML Data Sources

Acta Universitatis Tamperensis No. 1878


By: Turkka Näppilä
March 2014
Tampere University Press
Distributed by Coronet Books
ISBN: 9789514492846
178 Pages
$97.50 Paper original


In this dissertation, we study how to serve the user's sophisticated ad hoc information needs based on beforehand unknown, autonomous, and heterogeneous XML data sources. Serving sophisticated information needs that concern comparing and analyzing products, places, individuals, and so on often requires that the underlying data be combined from disparate and independently organized data sources. In other words, data integration is needed. However, when the information needs at hand are ad hoc or short-term in nature, traditional data integration techniques are often not an option because of the time and resources their effective application requires. Moreover, they typically presume considerable technical expertise on the part of the user as well as intimate familiarity with the underlying data sources.

The dataspace approach has been introduced to overcome the limitations of traditional data integration approaches in modern information environments. A dataspace is a collection of data sources intended to provide all of the information relevant to a particular user or task, regardless of the format or the systems and interfaces through which they are accessed. Data integration in dataspaces follows the so-called pay-as-you-go principle. The idea is that the user can be provided with useful services without full integration between the underlying data sources from the start; instead, the integration is tightened gradually as the user interacts with the underlying dataspace over time. The work presented in this dissertation can be characterized as a dataspace-oriented approach.

While dataspace systems are supposed to support multiple data formats in parallel, our starting point is that comparable applicability can be achieved by adopting XML (Extensible Markup Language) as the general data format and model. Several reasons support this decision. Most importantly, XML, being based on a semistructured data model, provides a platform-independent means to represent information, and it is today the de facto standard for exchanging both unstructured documents and structured data.

This dissertation presents a four-phased framework for describing how the user's sophisticated ad hoc information needs are served based on beforehand unknown, autonomous and heterogeneous XML data sources. Within this framework, we contribute several novel methods, techniques, and tools (i) for searching potentially useful data sources; (ii) for assessing their actual suitability with respect to the task at hand as well as their mutual consistency; (iii) for overcoming and removing the possible inconsistencies in them; and, finally, (iv) for satisfying the user's information needs through a powerful query language, in which data integration is seamlessly combined with typical data-centric manipulation.
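
As a rough illustration of the kind of problem described above, the following Python sketch combines two small, differently organized XML sources to answer a simple comparison question. It is not the dissertation's method or query language: the sample documents, the element and attribute names, and the helper functions (read_source_a, read_source_b, compare_prices) are all hypothetical, and the mapping between the two layouts is written by hand rather than discovered.

```python
import xml.etree.ElementTree as ET

# Two hypothetical, independently organized XML sources describing products.
# Source A stores values in child elements; source B stores them in attributes.
SOURCE_A = """
<catalog>
  <product><name>Laptop X</name><price>900</price></product>
  <product><name>Phone Y</name><price>500</price></product>
</catalog>
""".strip()

SOURCE_B = """
<items>
  <item title="Laptop X" cost="850"/>
  <item title="Tablet Z" cost="300"/>
</items>
""".strip()


def read_source_a(xml_text):
    """Map product name to price for the element-based layout (assumed)."""
    root = ET.fromstring(xml_text)
    return {
        p.findtext("name"): float(p.findtext("price"))
        for p in root.iter("product")
    }


def read_source_b(xml_text):
    """Map product name to price for the attribute-based layout (assumed)."""
    root = ET.fromstring(xml_text)
    return {i.get("title"): float(i.get("cost")) for i in root.iter("item")}


def compare_prices(a, b):
    """Pair up the prices of products that appear in both sources."""
    return {name: (a[name], b[name]) for name in sorted(a.keys() & b.keys())}


prices = compare_prices(read_source_a(SOURCE_A), read_source_b(SOURCE_B))
print(prices)
```

Even in this toy case, the two sources disagree on both structure and content (the same product carries different prices), which hints at why assessing consistency and resolving inconsistencies are treated as separate phases before querying.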