« Kata Seven: How'd I Do? | Main | Kata Five - Bloom Filters »

January 28, 2007

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d83451c41c69e200d835186bf269e2

Listed below are links to weblogs that reference Kata Six: Anagrams:

Comments

szeryf

The prime multiplication trick is nice one, but there is a simpler trick, too. You can sort the lowercased letters of the string (Aarhus -> aahrsu) and use that as a key in your dictionary.

Dave Aronson

Dave, it may be useful in some circumstances to have a bunch of word lists... but not for validating our results. For that, we need to be using the same list.

BTW, if you follow my URL link, be prepared for disappointment. My (free) web host chose this past weekend to bungle a server move so web access by domain is hosed for now. :-( I'm looking into alternate arrangements....

Martinho Fernandes

I'm getting the exact same results are Chris Hansen. However, if I run my algorithm with what I think is Dave's 234k words dictionary I get the "correct" 15048 sets.

Dave Thomas

Here's a link to lots of word lists: http://www.net-comber.com/wordurls.html

Mike Murray

Does anyone have the word list? The link in the post is broken.

Thanks.

Adrian

the link to the wordlist is broken.

Szczepan Faber

Nice kata but the word list is missing (yields 404 when clicked)

Vinothkumar.R

def sort(word):
res=[];
res+=word[:-1];
for i in range(len(res)):
for j in range(len(res)):
if(res[i] (res[i],res[j])=(res[j],res[i]);
ret="";
for i in range(len(res)):
ret+=res[i];

return ret;
fil=open("inp.txt","r");
x=fil.readlines();
dict={};
for i in range(len(x)):
alp=sort(x[i]);
if(not dict.has_key(alp)):
dict[alp]=[];
dict[alp]+=[x[i][:-1]];
for i in dict.keys():
l=dict[i];
for j in range(len(l)):
print l[j],
print


Vinothkumar.R

The word list link is showing file not found error. Could anyone give out the word list.
Thanks

Chris Hansen

@algol: if you are referring to disregarding duplicates (ignoring case), as is hinted by your alias (ALGOL appears as well as Algol), then you should also exclude arpanet, basic (all 3 occurrences), and several others. I don't see how this changed your count from 2531 to 2530. Care to explain?

I ignored case, counted punctuation (e.g. o'neill does not match lionel), and threw away duplicate words. The number of anagram groups I got after doing so was 2506, which I believe is at least close to the *right* answer. Part of the reason that an answer key was not given was that many of these types of implementation details were not specified, but it was an interesting exercise nonetheless. Did anyone else make the same choices and get the same result?

For those interested in performance, my implementation is in Scala and runs in about half a second. I know for a fact I could trim down that exec time, but I was more interested in keeping development time down instead ;)

algol

Regarding the out by one error, 2530 is the correct number. If you're getting 2531 there is another test that is needed. ;o)

PhilipStevens

Spent sometime playing with possibly implementations, making a lot of silly mistakes till I came up with some versions that worked how I wanted.

I was using python.

Below are my two implementations, the first merely goes line by line, converting the word into its alphabetical order and checking it against a dictionary (hash key), and either adding it to an existing one, or creating a new one for it.

I then had it print out all the anagrams. It would take around 1 second with the print command, 0.85 seconds without it.

(NOTE, ___ is indentation)
def anagram_finder():
____dict, file = {}, open('wordlist.txt', 'r')
____for line in file:
________strip_line = line.strip().lower()
________sort_line = str(sorted(strip_line))
________dict.setdefault(sort_line, []).append(strip_line)
____file.close()
____print '\n'.join(['Anagrams are: %s' % (v) for k, v in dict.items() if len(dict[k]) > 1])

My other version, which after some refraction seemed faster, was using the idea of prime numbers as keys suggested by Cameron Behar.

def anagram_finder():
____primeAlpha = {'a':2, 'b':3, 'c':5, 'd':7,'e' : 11, 'f':13, 'g':17, 'h':19,'i':23, 'j':29, 'k':31, 'l':37, 'm':41, 'n':43, 'o': 47, 'p':53,'q':59, 'r':61, 's':67, 't':71, 'u':73, 'v':79, 'w':83, 'x':89,y':97, 'z':101}
____dictAna = {}
____file = open ("wordlist.txt", "r")
____for line in file: # each line
________value = 1
________for s in line:
____________if s.lower() in primeAlpha:
________________value *= primeAlpha[s.lower()]
________dictAna.setdefault(value, []).append(line.strip())
____file.close()
____print "\n".join(['Anagrams are: %s' % (v) for k, v in dictAna.items() if len(dictAna[k]) > 1])

This was about 1 second with the print command, but 0.80 seconds without it.

This was a pretty fun challenge for me, since I am not the greatest programming, or mathematical thinker.

Cameron Behar

My friend and I once made a similar program, but we approached it differently. The most efficient way we could find was to reformat the dictionary file into a hash table. Each letter a-z was given a prime number. Then, the values for the letters of each word were multiplied and stored along with the string. We could then sort the dictionary based on the hash value so...

When the time came to find the anagrams, we simply did a binary search or similar method to find all the equal hash values (multiplication is commutative so order doesn't matter, and long integer comparison is still faster than string) and grabbed its respective string. In this way, we actually used data collision to our advantage. I also ran a mathematical proof, however, to assure no unintentional data collisions (that might come as a result of common factors) and found none.

This method, though a bit slow for the one time reformat, got our speeds down to ~48 ms in searches returning dozens of anagrams.

Michael Davies

The key appears to be to ignore case and spaces.

This simple Smalltalk code gives 2531, which seems to be a common answer - is Dave suffering from the dreaded Out By One Error?

dict := Dictionary new.
words := self loadFile: 'wordlist.txt'.
words do: [ :word | | matches sorted |
____sorted := word asLowercase sort copyWithout: Character space.
____matches := dict at: sorted ifAbsent: [Array new].
____dict at: sorted put: (matches copyWith: word)].
anagrams := dict values select: [ :each | each size > 1].
(underscores added for indentation).

I used the Dictionary (hash) for a quick prototype - a more efficient implementation would probably be to use a Trie (prefix tree).

Brian

Dave, please post your code, or at least the full output. No one seems
to be getting the same number of groups or words as you. I've pulled
together some links (and my own code) here:

http://brian-jaress.livejournal.com/3000.html

Jason Palmer

I apologize. The case in-sensitive comparison yielded 7886 anagrams not 7898.

Jason Palmer

Hello, first I'd like to thank you for code kata. I really enjoy it.

I designed a program to solve the anagram problem, however I got different results than you. Using a case-sensitive comparison I was able to extract 6034 anagrams. Using a case-insensitive comparison and stripping away special characters I was able to extract 7898 anagrams.

If you wouldn't mind, I'd really like to have you take a look at my script and help me to understand why our results differ. Also, I would be extremely interested in seeing how you got your program to execute in 1.5 seconds.

I've posted my code and the resulting anagrams on my personal website:
http://www.jason-palmer.com/ruby-on-rails-tutorials/anagrams

Thanks in advance!

The comments to this entry are closed.