Introduction
In the 1990s, Ian Foster introduced a novel conceptual approach to computing: “Grid computing has emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation.” Over the last few years, researchers have been analyzing the potential of Grid computing for Spatial Data Infrastructures (SDI). Is it possible to utilize the enormous computing power of a grid for processing tasks from the spatial domain?
Are the standards and specifications for data retrieval and processing from the SDI domain compatible with the concept of a parallelized, gridified computing and storage infrastructure? While it has become apparent that it is generally possible to combine SDI services and Grid infrastructures, some conceptual gaps have to be addressed in the process. This article investigates some of them and gives an overview of the GDI-Grid project, a research project funded by the German Ministry for Research and Education.
The resources of conventional SDIs are limited when accessing Earth Observation data inventories or databases of national and international data centers. This is especially true whenever large-scale data sets have to be processed.
For example, the calculation of a continental vegetation index requires Map Algebra on multiple terabytes of data, a task that cannot normally be handled by local workstation computers. Other tasks that exceed the resources of conventional SDIs by large margins, due to the highly complex algorithms or the huge amounts of input data involved, include the generation of climate models or accurate simulations of noise propagation in precisely modeled cities.
A Grid infrastructure’s capability to present users with the processing power of thousands of Central Processing Units (CPUs) and its ability to store vast amounts of data make it an ideal platform to perform these tasks in an acceptable amount of time and with the much-needed reliability. The potential advantages include:
Means to share storage and computing resources between the members of a so-called virtual organization, thereby reducing the initial costs of hardware acquisition
Means to react to an increasing demand for storage and computing resources on the fly
Means to handle all transactions securely and reliably
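To make the Map Algebra example more concrete, the following minimal sketch computes a normalized difference vegetation index (NDVI) cell by cell. It assumes that NDVI is the index in question and uses two tiny, made-up rasters; the function name ndvi and all values are illustrative only. Because each output cell depends only on the matching input cells, an operation of this kind can in principle be split across many compute nodes.

```python
# Illustrative local map algebra: NDVI = (NIR - RED) / (NIR + RED), cell by cell.
# The tiny 2x3 rasters below are made-up values, not real observations.

def ndvi(nir, red, nodata=None):
    """Apply the NDVI formula to two equally sized rasters (lists of rows)."""
    if len(nir) != len(red) or any(len(a) != len(b) for a, b in zip(nir, red)):
        raise ValueError("rasters must have identical dimensions")
    result = []
    for nir_row, red_row in zip(nir, red):
        out_row = []
        for n, r in zip(nir_row, red_row):
            total = n + r
            # Avoid division by zero where both bands are empty.
            out_row.append(nodata if total == 0 else (n - r) / total)
        result.append(out_row)
    return result

nir_band = [[0.8, 0.6, 0.0], [0.5, 0.4, 0.3]]
red_band = [[0.2, 0.2, 0.0], [0.5, 0.1, 0.1]]
ndvi_grid = ndvi(nir_band, red_band)
```

At continental scale the same per-cell rule applies, only to rasters that no longer fit on a single workstation.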
Differences between SDIs and Grid Infrastructures
Because the concepts of conventional SDIs and Grid infrastructures differ significantly, an actual implementation of a grid-enabled SDI (here: a set of conventional OGC-compliant SDI services attached to a Grid infrastructure back-end) has to address several incompatibilities. Especially service descriptions, service interfaces, service states and security mechanisms are handled in fundamentally different ways.
Service descriptions
Grid services always come with a Web Service Description Language document (WSDL, www.w3.org/TR/wsdl) describing the service’s methods and input parameters. OGC web services, on the other hand, are described using Capabilities documents as well as service-specific metadata for different operations (e.g. DescribeFeatureType, DescribeCoverage etc.). WSDL documents and Capabilities documents differ significantly and are not easily converted into each other. To deploy an OGC web service in a Grid infrastructure, it is necessary to create a WSDL description manually. At the moment there is no method for doing this automatically.
Service interfaces
The set of operations an OGC Web Service supports is defined in OGC service specification documents. There are different ways to invoke these operations, but the preferred ways are key-value-pair requests via HTTP GET and requests encoded in XML documents sent via HTTP POST. Additionally, most of the recent service specifications include instructions for using SOAP to invoke an operation. Grid services are usually addressed through a Grid middleware. The Globus Toolkit 4, for example, delivers service calls using SOAP. Services that do not support SOAP or have no WSDL description cannot be integrated into a Grid workflow.
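The two preferred OGC request encodings can be sketched as follows, using a WPS 1.0.0 service as the example. The endpoint URL and the process identifier are hypothetical, and the helper names kvp_request_url and xml_execute_body are introduced here only for illustration. The sketch builds the requests without sending them and leaves out the SOAP variant.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint and process identifier, used only for illustration.
ENDPOINT = "http://example.org/wps"
PROCESS_ID = "example:Buffer"

def kvp_request_url(endpoint, **params):
    """Encode an operation as key-value pairs for an HTTP GET request."""
    return endpoint + "?" + urlencode(params)

def xml_execute_body(process_id):
    """Encode a minimal WPS 1.0.0 Execute request as XML for an HTTP POST request."""
    wps = "http://www.opengis.net/wps/1.0.0"
    ows = "http://www.opengis.net/ows/1.1"
    root = ET.Element(f"{{{wps}}}Execute", {"service": "WPS", "version": "1.0.0"})
    identifier = ET.SubElement(root, f"{{{ows}}}Identifier")
    identifier.text = process_id
    return ET.tostring(root, encoding="unicode")

# Key-value-pair encoding of a DescribeProcess operation.
get_url = kvp_request_url(ENDPOINT, service="WPS", version="1.0.0",
                          request="DescribeProcess", identifier=PROCESS_ID)
# XML encoding of an Execute operation without any data inputs.
post_body = xml_execute_body(PROCESS_ID)
```

Neither encoding carries a SOAP envelope, which illustrates why such services need an additional SOAP binding and a WSDL description before a Grid middleware can address them.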
Service states
Apart from using SOAP for communication and WSDL for description, some Grid services implement the Web Services Resource Framework (WSRF) developed by the Organization for the Advancement of Structured Information Standards (OASIS). While conventional web services (i.e. non-grid services), including OGC web services, are stateless, the WSRF enables a service to manage state information between service calls. These values are stored as resources at a so-called service endpoint. Every service endpoint has its own Uniform Resource Identifier (URI) that may be used to access the stored information. Storing intermediate results needed for later calculations is a common use case for stateful services.
OGC specifications (apart from the Web Processing Service specification, WPS) do not yet include any instructions regarding service states. An optional part of the WPS specification introduces a request parameter for storing the results of a process at an external resource, but because this part of the specification is optional, it is not widely supported at the moment.
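The idea of a stateful service can be illustrated with a toy in-memory model. This is not the WSRF API; the class StatefulEndpoint, its methods and the example URI are assumptions made for this sketch, which only mimics how intermediate results could be kept between calls and addressed through a URI.

```python
import uuid

class StatefulEndpoint:
    """Toy in-memory stand-in for a service endpoint that keeps state.

    This is NOT the WSRF API; it only mimics the idea of addressing
    stored values through a resource-specific URI.
    """

    def __init__(self, base_uri):
        self.base_uri = base_uri.rstrip("/")
        self._resources = {}

    def create_resource(self, **initial_state):
        """Create a resource and return the URI that identifies it."""
        uri = f"{self.base_uri}/resources/{uuid.uuid4()}"
        self._resources[uri] = dict(initial_state)
        return uri

    def update(self, uri, **values):
        """Store intermediate results between two service calls."""
        self._resources[uri].update(values)

    def read(self, uri):
        """Return a copy of the state stored for the given resource URI."""
        return dict(self._resources[uri])

# Hypothetical service URI; the state keys are made up for the example.
endpoint = StatefulEndpoint("http://example.org/grid/noise-service")
job_uri = endpoint.create_resource(status="started")
endpoint.update(job_uri, status="tiles-processed", tiles_done=12)
state = endpoint.read(job_uri)
```

A stateless OGC service, by contrast, would have to receive all of this information again with every request.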
Security mechanisms
OGC specifications do not yet include any statements regarding security issues. The transport protocol is usually secured by using secure HTTP (HTTPS). Furthermore, there is no specification on how to authenticate different users at a service. Thus, security mechanisms in conventional SDIs are established on a per-project basis, and different vendors handle security in different ways. For Grid infrastructures, on the other hand, this low level of security does not suffice.
It is paramount that every resource access can be assigned to exactly one particular user, as the vast amounts of computing power and storage capacity of Grid infrastructures hold a high potential for misuse. Therefore, security is a significant element of Grid infrastructures (Foster 1998). If OGC web services are to be used in Grid infrastructures, they have to provide the means to perform user authentication as well as encryption for communicating with Grid resources.
Case Study: The GDI-Grid project
The GDI-Grid project (GDI being the German acronym for spatial data infrastructure) is a national research project funded by the German Ministry for Research and Education, the BMBF. The project’s goal is to merge the components of a standard-compliant SDI with a Grid infrastructure and solve the aforementioned compatibility issues. Combining these technologies serves two purposes:
First and foremost, it will enable users of SDI technologies to access the superior storage and compute resources of a Grid infrastructure in a standardized way. Furthermore, it will enable users of Grid technologies to integrate geospatial service calls into Grid workflows that contain a whole chain of Grid service invocations.
GDI-Grid Objectives
One part of the project aims at adapting the architectural building blocks of spatial data infrastructures and Grid infrastructures to enable communication between them. By combining the knowledge of members of the geospatial community and members of the Grid community, the challenges of matching both worlds are addressed from two perspectives: examining how to extend Grid components to address OGC Web Services, and identifying ways to modify OGC Web Services so that they can communicate with Grid resources. The proposed architecture for a grid-enabled SDI developed in the GDI-Grid project is shown below (see figure 1).
Further topics addressed include the creation and validation of automated Grid workflows, as well as the development of methods for data processing using Grid compute resources. Integration of data from different sources, generalization and enrichment of data are some of the processing steps being implemented; they provide access to Grid resources through an OGC-compliant service interface (i.e. Web Processing Service).
Possible use cases for the research results are addressed within three scenarios. The scenarios represent tasks that could also be fulfilled by conventional SDIs, which makes them suitable for determining the potential advantages of Grid-based SDIs.
Scenario 1: Noise propagation
Calculation of noise propagation is a valuable instrument for urban planning and, due to the EU directive 2002/49/EC, a necessity when constructing highways or railways in European countries. The scenario aims at transferring existing workflows for creating elaborate simulations of noise propagation into the Grid. Especially inside cities, the propagation of noise is a highly complex phenomenon with a multitude of influencing factors.
Due to the vast amounts of data and the extensive algorithms involved, the need arises to accelerate the computation of an exhaustive acoustic simulation. The approach followed in the GDI-Grid project to speed up this task consists of dividing the investigation area into a set of tiles. These tiles are processed individually, thereby splitting up the calculation so that it can be run on several compute resources at the same time. The workflow therefore needs to be modified to include the necessary preprocessing and postprocessing steps, i.e. splitting and merging of data sets. In addition to the Grid-enablement of the service, an appropriate user interface will be created.
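The split, process and merge pattern described above can be sketched locally as follows. The tile size, the bounding box and the placeholder function process_tile are assumptions made for illustration; the actual acoustic simulation and the project's real tiling scheme are not shown, and local threads merely stand in for Grid compute resources.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_tiles(min_x, min_y, max_x, max_y, tile_size):
    """Preprocessing: cover the bounding box with square tiles (edge tiles are clipped)."""
    tiles = []
    y = min_y
    while y < max_y:
        x = min_x
        while x < max_x:
            tiles.append((x, y, min(x + tile_size, max_x), min(y + tile_size, max_y)))
            x += tile_size
        y += tile_size
    return tiles

def process_tile(tile):
    """Placeholder for the per-tile simulation; here it only returns the tile area."""
    x0, y0, x1, y1 = tile
    return tile, (x1 - x0) * (y1 - y0)

def merge_results(results):
    """Postprocessing: combine per-tile results into one mapping keyed by tile."""
    return dict(results)

# Made-up investigation area of 250 x 100 units, cut into tiles of up to 100 x 100.
tiles = split_into_tiles(0, 0, 250, 100, tile_size=100)
# Threads stand in for Grid compute resources in this local sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    merged = merge_results(pool.map(process_tile, tiles))
total_area = sum(merged.values())
```

A real acoustic simulation would probably also need some overlap between neighbouring tiles, since noise sources in one tile can affect receivers in the next; the sketch ignores this.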
Scenario 2: Flood simulation
TU Hamburg-Harburg is adapting an existing application for the generation of flood forecasting models to operate inside a Grid environment. Flood forecasting models are used to determine the extent of flooding events, allowing local authorities to estimate the possible damage and facilitating effective early warning measures for residents of the area at risk. In urban regions, not only the underlying terrain but also a detailed 3D city model has to be taken into account when simulating floods. For an accurate simulation, these data sets are needed in a very high spatial resolution, which drastically increases the amount of data to be processed. During the project, a sophisticated mechanism for parallelizing the computation is being developed to speed up the generation of flood forecasting models.
Scenario 3: Emergency routing
Routing algorithms based on high-quality data tend to be highly complex. In a typical disaster management scenario, the complexity of routing increases because the passability of danger areas changes constantly. Propagating plumes of toxic gases or spreading fires may make arterial roads impassable, thus rendering long-standing evacuation plans useless. In such circumstances, speeding up routing algorithms becomes necessary to guarantee up-to-date evacuation plans. Incorporating real-time sensor measurements and simulations like the flood forecasting model described in the previous section further increases the complexity of routing algorithms and the amount of relevant data. In such a scenario, Grid computing might prove to be one way to satisfy the need for additional compute resources.
It has to be noted that the gain in speed comes with a caveat: because multiple users generally share the same Grid resources, actual real-time applications are not yet possible using Grid technologies. For reliable real-time applications, new scheduling algorithms that ensure the necessary quality of service have to be developed.
Lessons Learned
By combining storage and computing back-ends from Grid infrastructures with an OGC-compliant service front-end, SDI users are enabled to access the vast storage and processing capacities the Grid has to offer. As GDI-Grid offers the means to preserve the conventional service interfaces, the complexity for the users stays comparatively low. OGC-compliant data access services like WFS and WCS benefit from the storage resources provided by a Grid infrastructure, while the data processing service WPS benefits from the vast computational resources the Grid provides.
While the advantages of Grid computing sound promising for large data sets, the usage of Grid technologies comes with a lot of overhead. Authentication and Grid Security Infrastructure (GSI) integration delay common request-response cycles.
This is particularly significant when processing rather small data sets, especially when no parallel algorithms are available. Parallelizing processes is complicated, since several questions have to be answered. Is the process itself capable of being parallelized? If not, is it sensible to achieve spatial parallelization by dissecting the spatial data and stitching the results together after processing?
A meaningful performance evaluation cannot be accomplished inside the current Grid environment, since all processing jobs are scheduled by the Grid middleware depending on the current workload. In some cases, processing jobs wait in a queue for a couple of hours before being processed. This also hinders the integration of real-time data provided by sensors. Nevertheless, it is likely that large amounts of data and complex processing routines lead to situations where Grid infrastructures are superior to standard computing platforms.
This “break-even point”, where the speed gain achieved compensates for the overhead induced by Grid technologies, has to be determined on the basis of case studies. Furthermore, the type of function that describes this point is not yet certain. Its input parameters are the number of processors performing a parallel computation and the size of the input data; the function could be linear, exponential or logarithmic. An estimate of such a function is shown in figure 2.
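As a purely illustrative sketch of how such a break-even point could be computed, the following code assumes the simplest possible model: processing time grows linearly with the input size, the work divides evenly across all processors, and the Grid adds a fixed overhead per job. None of these assumptions is confirmed by the project results, and all function names and numbers are made up.

```python
def local_time(size, cost_per_unit):
    """Assumed local processing time: linear in the input size."""
    return size * cost_per_unit

def grid_time(size, cost_per_unit, processors, overhead):
    """Assumed Grid processing time: fixed overhead plus evenly divided work."""
    return overhead + size * cost_per_unit / processors

def break_even_size(cost_per_unit, processors, overhead):
    """Input size at which local and Grid processing take equally long.

    Solves size * c = overhead + size * c / p for size.
    Returns None if there is no speed gain that could offset the overhead.
    """
    if processors <= 1:
        return None
    return overhead / (cost_per_unit * (1 - 1 / processors))

# Made-up numbers: 2 s per data unit, 4 processors, 600 s of Grid overhead.
size = break_even_size(cost_per_unit=2.0, processors=4, overhead=600.0)
same_local = local_time(size, 2.0)
same_grid = grid_time(size, 2.0, 4, 600.0)
```

Under this model, inputs smaller than the computed size are processed faster locally, and larger inputs are processed faster on the Grid. A non-linear cost function or queueing delays would shift the point, which is why case studies are needed.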
The drawbacks mentioned, especially the need for a sophisticated parallelization of the process logic, should not obscure the fact that distributing a calculation over a multitude of computational resources can decrease processing time significantly. However, this gain in speed is not Grid-specific. The same results could also be achieved using the distributed computing mechanisms provided by a local cluster.
There are other characteristics that set Grid computing apart from using a local cluster. The most significant difference between conventional clusters and Grid computing is the ability of the Grid to provide a far larger number of computational resources as the need arises.
If a given process can be split up into thousands of subprocesses that may be executed independently, a Grid infrastructure could possibly execute all subprocesses simultaneously, thus significantly speeding up the process as a whole. A cluster, on the other hand, has a natural upper limit on the number of subprocesses that can be executed simultaneously. The same scaling advantages of Grid infrastructures apply to storage resources. Furthermore, security is not merely a feature that Grid infrastructures offer at a high level; it is one of the main concepts of Grid computing. Without security, Grid computing would not be possible at all, so Grid infrastructures inherently support sophisticated security mechanisms. Establishing a comparable level of security in local clusters takes a lot of time and effort.
———————————————————————————————
Authors:
Dr. Christian Kiehle, lat/lon GmbH, Bonn, Germany.
M.Sc. Alexander Padberg, Institute for Geography, University of Bonn.
Links
lat/lon GmbH
http://www.lat-lon.de
GIS workgroup, Department of Geography, University of Bonn
http://aggis.uni-bonn.de
The deegree project
http://www.deegree.org
German Ministry for Research and Education
http://www.bmbf.de
The German Grid initiative, D-Grid
http://www.d-grid.de
The GDI-Grid project
http://www.gdi-grid.de
GIS.science special issue on Grid computing in the spatial domain
http://portal.opengeospatial.org/files/?artifact_id=35975
The place for everybody to learn about Grid technology
http://www.gridcafe.org