« Kata Six: Anagrams | Main | Kata Four: Data Munging »

January 28, 2007

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83451c41c69e200d8341c850053ef

Listed below are links to weblogs that reference Kata Five - Bloom Filters:

Comments

szeryf

And the number of bits in the comment above is the number of bits per hash function, so the length of bit vector was 2^(number of bits) and it needed 2^(number of bits - 3) bytes. So for the 19 bits it needs 64kB of memory.

szeryf

Here's my solution in Java: http://gist.github.com/557075

What I noticed (and thought it was strange at first) that the number of bits must be at least a couple of time bigger than the number of words in the dictionary. Otherwise the Bloom Filter's bit vector will be almost completely filled and there will be many false positives. There is also no point in increasing the hash functions number for small bit vector as this will also make it even more filled.

My results for 3 hash functions:

16 bits - 88% filled - 56% of false positives
17 bits - 65% filled - 20% of false positives
18 bits - 40% filled - 4% of false positives
19 bits - 23% filled - 1% of false positives

And the best hash functions I found is fragments of MD5 hash for the string.


travesti

This was an interesting puzzle for me, as I've not heard of Bloom filters before - it's an interesting technique to have up your sleeve. Keep up the good work!

Jerome Baum

@Matt: That wouldn't work. Since there are an infinite amount of such words, you'll end up with a full bitmap (of 1s) which will report every word as "not in the dictionary" -- then you'll have the positive filter reporting "it's in there" and the negative one reporting "it's not."

Matt Powell

See, this is fantastic! All you have to do is generate a bitset for all the words that *aren't* in the target language, and your spell-checker is perfect!

Edward S. Marshall

The new URL for the word list appears to be archive.org, since it looks like the list is no longer online:

http://web.archive.org/web/20070704025714/http://pragmaticprogrammer.com/katadata/wordlist.txt

Michael Davies

Sorry, that should have read (1-e^(-kn/m))^k of course.

Michael Davies

There's a small error in the linked summary - the probability of a false positive should be approximated to (1-e^(-kn/m)) rather than (1-e^(kn/m)) as it states.

This was an interesting puzzle for me, as I've not heard of Bloom filters before - it's an interesting technique to have up your sleeve. Keep up the good work!

The comments to this entry are closed.