A Network-Aware Distributed Storage Cache for Data Intensive Environments

Abstract
Modern scientific computing involves organizing, moving, visualizing, and analyzing massive amounts of data at multiple sites around the world. The technologies, the middleware services, and the architectures that are used to build useful high-speed, wide area distributed systems, constitute the field of data intensive computing. In this paper the authors describe an architecture for data intensive applications where they use a high-speed distributed data cache as a common element for all of the sources and sinks of data. This cache-based approach provides standard interfaces to a large, application-oriented, distributed, on-line, transient storage system. They describe their implementation of this cache, how they have made it network aware, and how they do dynamic load balancing based on the current network conditions. They also show large increases in application throughput by access to knowledge of the network conditions.