Many computer applications require finding quantitative characteristics of very large data sets. The situation occurs, for instance, in data mining of text-based data, as well as in the detection of anomalies in router activity. Randomization can lead to surprisingly efficient algorithms: for instance, using a single pass over the data and an auxiliary storage of only two kilobytes, one can determine the number of distinct elements in a massive data set with an accuracy typically better than 2\%. We shall review some of these algorithms, including the "Probabilistic Counting" algorithm of Flajolet-Martin and the more recent Loglog Counting algorithm of Durand-Flajolet.
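As an illustration of the idea, here is a minimal sketch, in Python, of a Loglog-style estimator in the spirit of Durand-Flajolet. The bucket count, the choice of SHA-1 as the hash function, and the use of the least-significant set bit as the rank are illustrative assumptions, not the paper's exact specification; the asymptotic bias-correction constant 0.39701 is the one reported for the Loglog algorithm.

```python
import hashlib

def loglog_estimate(stream, k=10):
    """Estimate the number of distinct elements in `stream`
    using a Loglog-style sketch with m = 2**k buckets.

    Each bucket stores the maximum "rank" (position of the first
    1-bit of the hashed value, a geometrically distributed quantity)
    observed among the items hashed into it.
    """
    m = 1 << k
    M = [0] * m                         # one small counter per bucket
    for item in stream:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        j = h & (m - 1)                 # low k bits choose the bucket
        w = h >> k                      # remaining bits feed the rank
        # rank = 1-based position of the least-significant 1-bit of w
        # (an illustrative choice; any fixed bit-scanning convention
        # yields the required geometric distribution)
        rank = (w & -w).bit_length() if w else 1
        M[j] = max(M[j], rank)
    alpha = 0.39701                     # asymptotic bias correction
    return alpha * m * 2 ** (sum(M) / m)
```

With m buckets of a few bits each, the standard error behaves roughly like 1.3/sqrt(m), so k = 10 (1024 buckets) gives accuracy of a few percent from only a handful of bytes per bucket.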
The very design of these algorithms rests on a thorough analysis of the associated probability distributions by methods of analytic combinatorics, which provides the right dimensioning of parameters, the exact constants that make the algorithms unbiased, and correct probabilistic estimates of the risk of anomalous behavior.