Estimating simple functions on the union of data streams
- 3 July 2001
- conference paper
- Published by Association for Computing Machinery (ACM)
- p. 281-291
- https://doi.org/10.1145/378580.378687
Abstract
Massive data sets often arise as physically distributed, parallel data streams. We present algorithms for estimating simple functions on the union of such data streams, while using only logarithmic space per stream. Each processor observes only its own stream, and communicates with the other processors only after observing its entire stream. This models the set-up in current network monitoring products. Our algorithms employ a novel coordinated sampling technique to extract a sample of the union; this sample can be used to estimate aggregate functions on the union. The technique can also be used to estimate aggregate functions over the distinct “labels” in one or more data streams, e.g., to determine the zeroth frequency moment (i.e., the number of distinct labels) in one or more data streams. Our space and time bounds are the best known for these problems, and our logarithmic space bounds for coordinated sampling contrast with polynomial lower bounds for independent sampling. We relate our distributed streams model to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms.Keywords
This publication has 13 references indexed in Scilit:
- Towards estimation error guarantees for distinct valuesPublished by Association for Computing Machinery (ACM) ,2000
- Join synopses for approximate query answeringPublished by Association for Computing Machinery (ACM) ,1999
- On Randomized One-round Communication Complexitycomputational complexity, 1999
- Tracking join and self-join sizes in limited storagePublished by Association for Computing Machinery (ACM) ,1999
- New sampling-based summary statistics for improving approximate query answersPublished by Association for Computing Machinery (ACM) ,1998
- Size-Estimation Framework with Applications to Transitive Closure and ReachabilityJournal of Computer and System Sciences, 1997
- The space complexity of approximating the frequency momentsPublished by Association for Computing Machinery (ACM) ,1996
- Private vs. common random bits in communication complexityInformation Processing Letters, 1991
- A linear-time probabilistic counting algorithm for database applicationsACM Transactions on Database Systems, 1990
- Probabilistic counting algorithms for data base applicationsJournal of Computer and System Sciences, 1985