Count-distinct problem

From Wikipedia, the free encyclopedia

In computer science, the count-distinct problem[1] (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.

Formal definition

Instance: Consider a stream of elements $x_1, x_2, \ldots, x_s$ with repetitions. Let $n$ denote the number of distinct elements in the stream, with the set of distinct elements represented as $\{e_1, e_2, \ldots, e_n\}$.
Objective: Find an estimate $\hat{n}$ of $n$ using only $m$ storage units, where $m \ll n$.

An example of an instance for the cardinality estimation problem is the stream $a, b, a, c, d, b, d$. For this instance, $n = |\{a, b, c, d\}| = 4$.

Naive solution

The naive solution to the problem is as follows:

 Initialize a counter, $c$, to zero: $c \leftarrow 0$.
 Initialize an efficient dictionary data structure, $D$, such as a hash table or search tree, in which insertion and membership can be performed quickly.
 For each element $x_i$, a membership query is issued.
     If $x_i$ is not a member of $D$ ($x_i \notin D$)
         Add $x_i$ to $D$
         Increase $c$ by one: $c \leftarrow c + 1$
     Otherwise ($x_i \in D$) do nothing.
 Output $n = c$.
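
The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name count_distinct_exact is hypothetical, and a built-in set stands in for the dictionary $D$.

```python
def count_distinct_exact(stream):
    """Exact distinct count; memory grows with the number of distinct elements."""
    seen = set()   # plays the role of the dictionary D
    count = 0      # the counter c
    for x in stream:
        if x not in seen:   # membership query
            seen.add(x)
            count += 1
    return count


# The example stream a, b, a, c, d, b, d from the formal definition.
print(count_distinct_exact("abacdbd"))  # 4
```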

As long as the number of distinct elements is not too big, $D$ fits in main memory and an exact answer can be retrieved. However, this approach does not scale for bounded storage, or if the computation performed for each element $x_i$ should be minimized. In such a case, several streaming algorithms have been proposed that use a fixed number of storage units.

HyperLogLog algorithm

Streaming algorithms

To handle the bounded storage constraint, streaming algorithms use randomization to produce an approximate estimate of the number of distinct elements, $n$. State-of-the-art estimators hash every element $e_j$ into a low-dimensional data sketch using a hash function, $h(e_j)$. The different techniques can be classified according to the data sketches they store.

Min/max sketches

Min/max sketches[2][3] store only the minimum/maximum hashed values. Examples of known min/max sketch estimators: Chassaing et al.[4] present a max sketch which is the minimum-variance unbiased estimator for the problem. The continuous max sketches estimator[5] is the maximum likelihood estimator. The estimator of choice in practice is the HyperLogLog algorithm.[6]

The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element $e_j$ is associated with a uniform random variable, $h(e_j) \sim U(0,1)$, the expected minimum value of $h(e_1), h(e_2), \ldots, h(e_n)$ is $1/(n+1)$. The hash function guarantees that $h(e_j)$ is identical for all the appearances of $e_j$. Thus, the existence of duplicates does not affect the value of the extreme order statistics.
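
This intuition can be illustrated with a small min sketch in Python. The sketch below is an assumption-laden illustration, not one of the cited estimators: it simulates k hash functions by salting SHA-256, keeps one running minimum per function, and inverts the relation E[min] = 1/(n+1) using the average of the minima. The names hash_to_unit, min_sketch and estimate_from_mins are hypothetical.

```python
import hashlib


def hash_to_unit(x, salt):
    """Map x to a roughly uniform value in (0, 1] using a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def min_sketch(stream, k):
    """Keep the minimum hash value seen under each of k salted hash functions."""
    mins = [1.0] * k
    for x in stream:
        for i in range(k):
            h = hash_to_unit(x, i)
            if h < mins[i]:
                mins[i] = h
    return mins


def estimate_from_mins(mins):
    """Invert E[min] = 1/(n+1), using the mean of the minima in place of E[min]."""
    mean_min = sum(mins) / len(mins)
    return 1.0 / mean_min - 1.0


# 500 distinct elements, each appearing three times.
stream = [f"user{i % 500}" for i in range(1500)]
print(round(estimate_from_mins(min_sketch(stream, k=64))))
```

Because every occurrence of an element hashes to the same value, the sketch computed from the stream with repetitions is identical to the one computed from the deduplicated stream. The storage is k numbers regardless of the stream length.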

There are estimation techniques other than min/max sketches. The first paper on count-distinct estimation[7] describes the Flajolet–Martin algorithm, a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by Daniel M. Kane, Jelani Nelson, and David P. Woodruff.[8]
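
A bit pattern sketch of this kind might look as follows in Python. This is a simplified single-bitmap illustration in the spirit of the Flajolet–Martin algorithm, not the full published procedure (which averages many bitmaps to reduce variance). The helper names are hypothetical, SHA-256 is an assumed choice of hash, and 0.77351 is the correction constant from Flajolet and Martin's analysis.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


def hash32(x):
    """32-bit hash of x derived from SHA-256 (an assumed choice of hash)."""
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:4], "big")


def fm_bitmap(stream, bits=32):
    """OR together one-hot vectors marking the lowest set bit of each hash."""
    bitmap = 0
    for x in stream:
        h = hash32(x)
        # Index of the least-significant 1 bit; an all-zero hash maps to the top bit.
        r = (h & -h).bit_length() - 1 if h else bits - 1
        bitmap |= 1 << r
    return bitmap


def fm_estimate(bitmap):
    """Estimate the count as 2^R / PHI, where R is the index of the lowest 0 bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return 2**r / PHI


print(fm_estimate(fm_bitmap(f"item{i % 1000}" for i in range(5000))))
```

A single bitmap gives only a coarse, power-of-two estimate, but it shows why duplicates are harmless: OR-ing the same bit twice changes nothing.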

Bottom-m sketches

Bottom-m sketches[9] are a generalization of min sketches, which maintain the $m$ minimal values, where $m \geq 1$. See Cosma et al.[2] for a theoretical overview of count-distinct estimation algorithms, and Metwally[10] for a practical overview with comparative simulation results.
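
A bottom-m sketch can be illustrated in Python as follows. This sketch assumes the commonly used estimator $(m-1)/v_{(m)}$, where $v_{(m)}$ is the $m$-th smallest hash value in $(0,1)$; the cited works may use a different estimator. The function names are hypothetical, SHA-256 is an assumed hash, and $m \geq 2$ is assumed so that the estimate is non-trivial.

```python
import hashlib
import heapq


def hash_to_unit(x):
    """Map x to a roughly uniform value in (0, 1] using SHA-256."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def bottom_m_estimate(stream, m):
    """Estimate the distinct count from the m smallest distinct hash values."""
    heap = []        # negated values, so heap[0] is the largest value kept
    members = set()  # hash values currently in the sketch
    for x in stream:
        h = hash_to_unit(x)
        if h in members:
            continue  # duplicate element: sketch unchanged
        if len(heap) < m:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < m:
        return float(len(heap))  # fewer than m distinct values seen: count is exact
    return (m - 1) / -heap[0]


print(bottom_m_estimate((f"ip{i % 5000}" for i in range(20000)), m=256))
```

With $m = 1$ the structure degenerates to a plain min sketch; larger $m$ trades memory for lower variance.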

Python implementation of Knuth's CVM algorithm

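from random import random


def algorithm_d(stream, s: int):
    """Estimate the number of distinct elements using a buffer of at most s entries."""
    p = 1.0      # current sampling threshold
    buffer = {}  # element -> random value u drawn at its latest occurrence
    for a in stream:
        # Forget any earlier occurrence of a; only the latest draw counts.
        buffer.pop(a, None)
        u = random()  # uniform in [0, 1)
        if u < p:
            if len(buffer) < s:
                buffer[a] = u
            else:
                # Buffer is full: find the entry with the largest u.
                a_p, u_p = max(buffer.items(), key=lambda x: x[1])
                if u > u_p:
                    # The new element holds the largest value: reject it and lower p.
                    p = u
                else:
                    # Evict the current maximum, keep the new element, and lower p.
                    buffer.pop(a_p)
                    buffer[a] = u
                    p = u_p
    return len(buffer) / p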

CVM algorithm

Unlike other approximation algorithms for the count-distinct problem, the CVM algorithm[11] (named by Donald Knuth after the initials of Sourav Chakraborty, N. V. Vinodchandran, and Kuldeep S. Meel) uses sampling instead of hashing. The CVM algorithm provides an unbiased estimator for the number of distinct elements in a stream,[12] in addition to the standard (ε-δ) guarantees. Below is the CVM algorithm, including the slight modification by Donald Knuth.[12]

 Initialize $p \leftarrow 1$
 Initialize max buffer size $s$, where $s \geq 1$
 Initialize an empty buffer, $B$
 For each element $a_t$ in data stream $A$ of size $n$ do:
   If $(a_t, u)$ is in $B$ for some $u$ then
       Delete $(a_t, u)$ from $B$
   $u \leftarrow$ random number in $[0,1)$
   If $u < p$ then
       If $|B| < s$ then
           insert $(a_t, u)$ in $B$
       else
           $(a', u') \leftarrow$ the pair in $B$ with $u' = \max\{u'' : (a'', u'') \in B\}$ /* $(a', u')$ whose $u'$ is maximum in $B$ */
           If $u > u'$ then
               $p \leftarrow u$
           else
               Replace $(a', u')$ with $(a_t, u)$
               $p \leftarrow u'$
 End For
 return $|B|/p$.

The previous version of the CVM algorithm is improved with the following modification by Donald Knuth, which adds a while loop to ensure that $B$ is reduced.[12]

 Initialize $p \leftarrow 1$
 Initialize max buffer size $s$, where $s \geq 1$
 Initialize an empty buffer, $B$
 For each element $a_t$ in data stream $A$ of size $n$ do:
   If $a_t$ is in $B$ then
       Delete $a_t$ from $B$
   $u \leftarrow$ random number in $[0,1)$
   If $u \leq p$ then
       Insert $(a_t, u)$ into $B$
   While $|B| = s \land u < p$ do
       Remove every element $(a', u')$ of $B$ with $u' > p/2$
       $p \leftarrow p/2$
   End While
   If $u < p$ then
       Insert $(a_t, u)$ into $B$
 End For
 return $|B|/p$.
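
One possible reading of the while-loop variant above can be sketched in Python. The sketch is an interpretation under stated assumptions, not a transcription: it performs the buffer-size check before inserting, halves $p$ first and then discards entries whose value is no longer below $p$, and takes an optional random.Random instance so that runs can be reproduced. The name cvm_halving is hypothetical.

```python
import random


def cvm_halving(stream, s, rng=None):
    """Estimate the distinct count with a buffer of at most s entries, halving p when full."""
    rng = rng or random.Random()
    p = 1.0
    buffer = {}  # element -> random value drawn at its latest occurrence
    for a in stream:
        buffer.pop(a, None)  # forget any earlier occurrence of a
        u = rng.random()     # uniform in [0, 1)
        # Halve p until there is room for a, or until a itself is rejected.
        while len(buffer) >= s and u < p:
            p /= 2
            buffer = {b: v for b, v in buffer.items() if v < p}
        if u < p:
            buffer[a] = u
    return len(buffer) / p


# With fewer distinct elements than buffer slots, p stays 1 and the count is exact.
print(cvm_halving("abacdbd", s=10))  # 4.0
```

In this reading, every entry kept in the buffer has a value below the current $p$, so each halving discards about half of the entries in expectation.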

Weighted count-distinct problem

In its weighted version, each element is associated with a weight and the goal is to estimate the total sum of weights. Formally,

Instance: A stream of weighted elements $x_1, x_2, \ldots, x_s$ with repetitions, and an integer $m$. Let $n$ be the number of distinct elements, namely $n = |\{x_1, x_2, \ldots, x_s\}|$, and let these elements be $\{e_1, e_2, \ldots, e_n\}$. Finally, let $w_j$ be the weight of $e_j$.
Objective: Find an estimate $\hat{w}$ of $w = \sum_{j=1}^{n} w_j$ using only $m$ storage units, where $m \ll n$.

An example of an instance for the weighted problem is: $a(3), b(4), a(3), c(2), d(3), b(4), d(3)$. For this instance, $e_1 = a, e_2 = b, e_3 = c, e_4 = d$, the weights are $w_1 = 3, w_2 = 4, w_3 = 2, w_4 = 3$ and $\sum_{j=1}^{4} w_j = 12$.
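
The exact value for this instance can be checked with a short Python sketch that, like the naive solution, stores every distinct element. The function name weighted_distinct_sum and the representation of the stream as (element, weight) pairs are assumptions made for illustration.

```python
def weighted_distinct_sum(stream):
    """Sum the weights of the distinct elements, counting each element once."""
    weights = {}
    for element, weight in stream:
        weights[element] = weight  # repeated elements carry the same weight
    return sum(weights.values())


stream = [("a", 3), ("b", 4), ("a", 3), ("c", 2), ("d", 3), ("b", 4), ("d", 3)]
print(weighted_distinct_sum(stream))  # 12
```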

As an application example, $x_1, x_2, \ldots, x_s$ could be IP packets received by a server. Each packet belongs to one of $n$ IP flows $e_1, e_2, \ldots, e_n$. The weight $w_j$ can be the load imposed by flow $e_j$ on the server. Thus, $\sum_{j=1}^{n} w_j$ represents the total load imposed on the server by all the flows to which packets $x_1, x_2, \ldots, x_s$ belong.

Solving the weighted count-distinct problem

Any extreme order statistics estimator (min/max sketches) for the unweighted problem can be generalized to an estimator for the weighted problem.[13] For example, the weighted estimator proposed by Cohen et al.[5] can be obtained when the continuous max sketches estimator is extended to solve the weighted problem. In particular, the HyperLogLog algorithm[6] can be extended to solve the weighted problem. The extended HyperLogLog algorithm offers the best performance, in terms of statistical accuracy and memory usage, among the known algorithms for the weighted problem.
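
One way such a generalization can work is sketched below in Python. This is an illustrative construction based on exponential random variables, not necessarily the one used in the cited works: each element's hash is transformed into an exponential value with rate equal to its weight, so the minimum over all distinct elements is exponential with rate $\sum_j w_j$, and $(k-1)$ divided by the sum of $k$ independent minima estimates the total weight. All function names are hypothetical and the salted SHA-256 hash is an assumption.

```python
import hashlib
import math


def hash_to_unit(x, salt):
    """Map x to a roughly uniform value in (0, 1] using a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def weighted_min_sketch(stream, k):
    """Keep, for each of k hash functions, the minimum exponential value seen."""
    mins = [math.inf] * k
    for element, weight in stream:
        for i in range(k):
            # -ln(U) / w is exponentially distributed with rate w.
            value = -math.log(hash_to_unit(element, i)) / weight
            if value < mins[i]:
                mins[i] = value
    return mins


def estimate_total_weight(mins):
    """Estimate the sum of weights as (k - 1) / (sum of the k minima)."""
    return (len(mins) - 1) / sum(mins)


stream = [("a", 3), ("b", 4), ("a", 3), ("c", 2), ("d", 3), ("b", 4), ("d", 3)]
print(estimate_total_weight(weighted_min_sketch(stream, k=512)))
```

With all weights equal to 1 the same sketch estimates the plain distinct count, which is why the unweighted problem is a special case.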

References

  1. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  2. ^ a b Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  3. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  4. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  5. ^ a b Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  6. ^ a b Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  7. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  8. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  9. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  10. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  11. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  12. ^ a b c Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  13. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).