且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

[20141228]关于bloom filter.txt

更新时间:2022-09-12 12:47:58

[20141228]关于bloom filter.txt

--系统升级到11.2.0.4 ,执行计划经常出现bloom filter,自己对这些一点都不了解.
--自己做了google,能查到的资料不多,eygle的blog讲解,我水平有限,至少第1次没看懂,不过里面指向一个链接:
--http://antognini.ch/papers/BloomFilters20080620.pdf

--我仔细阅读,感觉还是不好理解,决定把里面的测试例子执行1次,后面做一个测试:

--实际上bloom filter很早就已经存在,1970年由Burton H.Bloom发明的.我转载里面的介绍:

A bloom filter is a data structure used to support membership queries.Simply put,a bloom filter is
used to test whether an element is a member of a giver set or not. Its main properties are:

. The amount of space needed to store the bloom filter is small compared to the amount of data
  belonging to the set being tested.
. The time needed to check whether an element is a member of a given set is independent of
  the number of elements contained in the set.
. False negatives not possible.
. False positives are possible, bit their frequency can be controlled. In practice,it is a tradeoff
  between space/time efficiency and the false positive frequency.

A bloom filter is based on an array of an bits (b1, b2, ..., bm) that are initially set to 0. To understand
how a bloom filter works, it is essential to describe how these bits are set and checked. For this
purpose, k independent hash functions (hi, h2, ..., hk), each returning a value between 1 and m,
are used. In order to "store" a given element into the bit array, each hash function must be
applied to it and, based on the return value r of each function (r1, r2, ..., rk), the bit with the offset r
is set to 1. Since there are k hash functions, up to k bits in the bit array are set to 1 (it might be
less because several hash functions might return the same value). The following figure is an
example where m = 16, k=4 and e is the element to be "stored" in the bit array.

r1 = h1(e) = 8  -> set b8 to 1
r2 = h2(e) = 1  -> set b1 to 1
r3 = h3(e) = 6  -> set b6 to 1
r4 = h4(e) = 13 -> set b13 to 1

To check whether an element is /"stored" in the bit array, the process is similar. The only
difference is that instead of setting k bits in the array, it is checked whether any of them are set to
0. If yes, this implies that the element is not /"stored" in the bit array.

Let's take a look at an example that illustrates how a bloom filter might be used in practice. An
application receives input data at a high throughput. Before processing it, it has to check whether
that data belongs to a given set stored in a database table. Note that the rejection rate of this
validation is very high. To perform the validation, the application has to call a remote service.
Unfortunately, the time needed for the network roundtrip and to actually do the lookup is far too
long. Since it is not possible to further optimize it (for example, the network latency is subject to
physical limitations), another method must be implemented. ln such cases, caches come to mind.
The general idea is to duplicate the data stored remotely in a local cache and then to do the
validation locally. In this specific case, let's say that it is not practicable. Not enough resources are
available locally to store the data. In such a situation, a bloom filter might be useful. In fact, as
described above, bloom filters need much less space than the actual set. Since the bloom filter is
subject to false positives, the idea is to use it to discard the maximum amount of input data before
calling the remote service. As a result, the original validation is not made faster but it is used
much less frequently.

--还是通过例子来说明问题:

/* Formatted on 2014/12/28 21:47:07 (QP5 v5.227.12220.39754) */
CREATE OR REPLACE PACKAGE bloom_filter
IS
   PROCEDURE init (p_m IN BINARY_INTEGER, p_n IN BINARY_INTEGER);

   FUNCTION add_value (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER;

   FUNCTION contain (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER;
END bloom_filter;
/

CREATE OR REPLACE PACKAGE BODY bloom_filter
IS
   TYPE t_bitarray IS TABLE OF BOOLEAN;

   g_bitarray   t_bitarray;
   g_m          BINARY_INTEGER;
   g_k          BINARY_INTEGER;

   PROCEDURE init (p_m IN BINARY_INTEGER, p_n IN BINARY_INTEGER)
   IS
   BEGIN
      g_m := p_m;
      g_bitarray := t_bitarray ();
      g_bitarray.EXTEND (p_m);
      FOR i IN g_bitarray.FIRST .. g_bitarray.LAST
      LOOP
         g_bitarray (i) := FALSE;
         --dbms_output.put_line(i);
      END LOOP;

      g_k := CEIL (p_m / p_n * LN (2));
   END init;

   FUNCTION add_value (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER
   IS
   BEGIN
      dbms_random.seed (p_value);

      FOR i IN 0 .. g_k
      LOOP
         g_bitarray (dbms_random.VALUE (1, g_m)) := TRUE;
      END LOOP;

      RETURN 1;
   END add_value;

   FUNCTION contain (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER
   IS
      l_ret   BINARY_INTEGER := 1;
   BEGIN
      dbms_random.seed (p_value);

      FOR i IN 0 .. g_k
      LOOP
         IF NOT g_bitarray (dbms_random.VALUE (1, g_m))
         THEN
            l_ret := 0;
            EXIT;
         END IF;
      END LOOP;

      RETURN l_ret;
   END contain;
END bloom_filter;
/

create table t as select dbms_random.string('u',100) as value from dual connect by level

SCOTT@test> execute bloom_filter.init(16484,1000);
PL/SQL procedure successfully completed.

SCOTT@test> select count(bloom_filter.add_value(value)) from t where rownumCOUNT(BLOOM_FILTER.ADD_VALUE(VALUE))
------------------------------------
                                1000

SCOTT@test> select count(*) from t where bloom_filter.contain(value) =1;
  COUNT(*)
----------
      1006

--执行初始化bloom_filter.init,就是设置g_bitarray的16384个二进制位为0.并且设置 g_k := CEIL (p_m / p_n * LN (2))=12;
SCOTT@test> select ceil(16384/1000*ln(2)) from dual ;
CEIL(16384/1000*LN(2))
----------------------
                    12

--  bloom_filter.add_value 就是通过函数(这里是取随机数)变化加入g_bitarray中,相应的二进制位设置1.

-- 后面 执行的select count(*) from t where bloom_filter.contain(value) =1;检查1e4个字串转化的位是否在g_bitarray中,当然存
-- 在并不一定正确.
-- 可以发现存在6个冲突的.

-- 如果能把例子执行,很容易理解bloom filter算法.实际上利用少量的空间保存大量的信息.前面的例子使用16384位=2k的信息保存
1000*100的信息,当然也存在一定的冲突,实际上取决于位图数量以及保存信息的数量.


-