The National Science Foundation has recently awarded a grant to researchers at the San Diego Supercomputer Center (SDSC) to explore new ways to manage extremely large data sets hosted on massive clusters, which have become known as computing “clouds”.
The project will study dynamic strategies for provisioning such applications, evaluating the performance of alternative approaches to serving very large data sets. The cloud platforms used in the project will be the Google-IBM CluE cluster and the HP-Intel-Yahoo cluster, both of which have been assembled in collaboration with NSF for cloud computing research. The LiDAR processing application hosted at the OpenTopography portal has been selected as the representative application for this study. The application allows users to (i) subset remote sensing data (stored as “point cloud” data sets), (ii) process the subsets using different algorithms, and (iii) visualize the output. The project will study alternative implementations of each step—using database technology as well as Hadoop (http://www.hadoop.org)—and run a series of performance evaluation experiments.
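To make the first step concrete, the sketch below shows how the subset operation might be phrased in a MapReduce style, as Hadoop would organize it: a map function filters each point against a query region, and no reduce phase is needed for a pure spatial filter. The point layout (x, y, elevation tuples) and the bounding-box query are illustrative assumptions, not the actual OpenTopography data model.

```python
# Hypothetical sketch: the "subset" step as a MapReduce-style filter.
# A point here is an (x, y, z) tuple; bbox is (xmin, ymin, xmax, ymax).

def map_subset(point, bbox):
    """Map phase: emit the point only if it lies inside the query box."""
    x, y, z = point
    xmin, ymin, xmax, ymax = bbox
    if xmin <= x <= xmax and ymin <= y <= ymax:
        yield point

def subset(points, bbox):
    """Apply the map function to every point. A spatial filter needs
    no reduce phase; the mapper outputs are simply collected."""
    return [p for point in points for p in map_subset(point, bbox)]

# Keep only the points inside the unit square.
points = [(0.2, 0.3, 10.0), (1.5, 0.1, 12.0), (0.9, 0.9, 8.5)]
inside = subset(points, (0.0, 0.0, 1.0, 1.0))
```

In a real Hadoop job the same map function would run in parallel across the cluster, each mapper reading its own block of the point-cloud file.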
Cloud platforms with thousands of processors and access to hundreds of terabytes of storage provide a natural environment for implementing OpenTopography processing routines, which are highly data-parallel.
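The data parallelism can be illustrated by splitting a point cloud into spatial tiles, processing each tile independently, and merging the per-tile results. The toy "maximum elevation per grid cell" computation and the tile boundaries below are assumptions for illustration; in a cloud deployment, each call to the per-tile function could run on a different node.

```python
# Sketch of data-parallel point-cloud processing: non-overlapping
# spatial tiles are gridded independently, then the partial rasters
# are merged. The gridding step is a toy example.

def grid_tile(points, cell=1.0):
    """Compute max elevation per grid cell for one tile of points."""
    cells = {}
    for x, y, z in points:
        key = (int(x // cell), int(y // cell))
        cells[key] = max(cells.get(key, float("-inf")), z)
    return cells

def merge(partials):
    """Combine per-tile rasters; taking the max resolves any cell
    that appears in more than one partial result."""
    out = {}
    for cells in partials:
        for key, z in cells.items():
            out[key] = max(out.get(key, float("-inf")), z)
    return out

# Two tiles processed independently (each map() call could be
# dispatched to a separate worker), then merged into one raster.
tiles = [[(0.2, 0.3, 10.0), (0.7, 0.4, 12.0)],
         [(3.1, 2.2, 8.0), (3.9, 2.8, 9.5)]]
raster = merge(map(grid_tile, tiles))
```

Because the tiles share no state, the work scales out naturally to the thousands of processors a cloud platform provides.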
The studies will contribute to our understanding of the performance tradeoffs and feasibility of employing dynamic provisioning strategies for serving large scientific data sets to a broad user community. A possible outcome of this study is a reassessment of how data archives are implemented and how data sets are served, for example, using on-demand and dynamic approaches for provisioning data sets, as opposed to the current static approach. Cloud-based implementations can be exposed to the OpenTopography user community via a Service-Oriented Architecture (SOA), such as that employed by the GEON Portal, thereby bringing the benefits of massively-scaled computing resources to a large community of users.
Specifically, SDSC researchers will explore the use of compute clouds to dynamically provision and manage large-scale scientific data sets. This is in contrast to the current approach using a traditional parallel relational database management system (RDBMS) architecture, which is more structured but also more static. The SDSC team will investigate the feasibility of the cloud computing approach versus conventional approaches, evaluating the trade-offs, advantages, and disadvantages of each.
The NSF CluE program