Back to non-realistic coding this week (sorry, Martin). Let's solve some crossword puzzle clues.
In England, I used to waste hour upon hour doing newspaper crosswords. As crossword fans will know, English cryptic crosswords have a totally different feel to their American counterparts: most clues involve punning or word play, and there are lots of anagrams to work through. For example, a recent Guardian crossword had:
Down:
..
21. Most unusual form of arrest (6)
The hint is the phrase ‘form of,’ indicating that we’re looking for an anagram. Sure enough ‘arrest’ has six letters, and can be arranged nicely into ‘rarest,’ meaning ‘most unusual.’ (Other anagrams include raster, raters, Sartre, and starer)
A while back we had a thread on the Ruby mailing list about finding anagrams, and I’d like to resurrect it here. The challenge is fairly simple: given a file containing one word per line, print out all the combinations of words that are anagrams; each line in the output contains all the words from the input that are anagrams of each other. For example, your program might include in its output:
kinship pinkish enlist inlets listen silent boaster boaters borates fresher refresh sinks skins knits stink rots sort
If you run this on the word list here you should find 2,530 sets of anagrams (a total of 5,680 words). Running on a larger dictionary (about 234k words) on my OSX box produces 15,048 sets of anagrams (including all-time favorites such as actaeonidae/donatiaceae, crepitant/pittancer, and (for those readers in Florida) electoral/recollate).
For added programming pleasure, find the longest words that are anagrams, and find the set of anagrams containing the most words (so "parsley players replays sparely" would not win, having only four words in the set).
Kata Objectives
Apart from having some fun with words, this kata should make you think somewhat about algorithms. The simplest algorithms to find all the anagram combinations may take inordinate amounts of time to do the job. Working though alternatives should help bring the time down by orders of magnitude. To give you a possible point of comparison, I hacked a solution together in 25 lines of Ruby. It runs on the word list from my web site in 1.5s on a 1GHz PPC. It’s also an interesting exercise in testing: can you write unit tests to verify that your code is working correctly before setting it to work on the full dictionary.
The prime multiplication trick is nice one, but there is a simpler trick, too. You can sort the lowercased letters of the string (Aarhus -> aahrsu) and use that as a key in your dictionary.
Posted by: szeryf | August 30, 2010 at 07:23 AM
Dave, it may be useful in some circumstances to have a bunch of word lists... but not for validating our results. For that, we need to be using the same list.
BTW, if you follow my URL link, be prepared for disappointment. My (free) web host chose this past weekend to bungle a server move so web access by domain is hosed for now. :-( I'm looking into alternate arrangements....
Posted by: Dave Aronson | June 25, 2010 at 01:13 PM
I'm getting the exact same results are Chris Hansen. However, if I run my algorithm with what I think is Dave's 234k words dictionary I get the "correct" 15048 sets.
Posted by: Martinho Fernandes | December 16, 2009 at 12:49 PM
Here's a link to lots of word lists: http://www.net-comber.com/wordurls.html
Posted by: Dave Thomas | October 22, 2009 at 04:38 PM
Does anyone have the word list? The link in the post is broken.
Thanks.
Posted by: Mike Murray | October 22, 2009 at 04:36 PM
the link to the wordlist is broken.
Posted by: Adrian | October 18, 2009 at 10:52 PM
Nice kata but the word list is missing (yields 404 when clicked)
Posted by: Szczepan Faber | September 18, 2008 at 05:50 AM
def sort(word):
res=[];
res+=word[:-1];
for i in range(len(res)):
for j in range(len(res)):
if(res[i] (res[i],res[j])=(res[j],res[i]);
ret="";
for i in range(len(res)):
ret+=res[i];
return ret;
fil=open("inp.txt","r");
x=fil.readlines();
dict={};
for i in range(len(x)):
alp=sort(x[i]);
if(not dict.has_key(alp)):
dict[alp]=[];
dict[alp]+=[x[i][:-1]];
for i in dict.keys():
l=dict[i];
for j in range(len(l)):
print l[j],
print
Posted by: Vinothkumar.R | August 26, 2008 at 11:34 PM
The word list link is showing file not found error. Could anyone give out the word list.
Thanks
Posted by: Vinothkumar.R | August 26, 2008 at 11:30 PM
@algol: if you are referring to disregarding duplicates (ignoring case), as is hinted by your alias (ALGOL appears as well as Algol), then you should also exclude arpanet, basic (all 3 occurrences), and several others. I don't see how this changed your count from 2531 to 2530. Care to explain?
I ignored case, counted punctuation (e.g. o'neill does not match lionel), and threw away duplicate words. The number of anagram groups I got after doing so was 2506, which I believe is at least close to the *right* answer. Part of the reason that an answer key was not given was that many of these types of implementation details were not specified, but it was an interesting exercise nonetheless. Did anyone else make the same choices and get the same result?
For those interested in performance, my implementation is in Scala and runs in about half a second. I know for a fact I could trim down that exec time, but I was more interested in keeping development time down instead ;)
Posted by: Chris Hansen | April 30, 2008 at 01:43 AM
Regarding the out by one error, 2530 is the correct number. If you're getting 2531 there is another test that is needed. ;o)
Posted by: algol | February 11, 2008 at 09:04 PM
Spent sometime playing with possibly implementations, making a lot of silly mistakes till I came up with some versions that worked how I wanted.
I was using python.
Below are my two implementations, the first merely goes line by line, converting the word into its alphabetical order and checking it against a dictionary (hash key), and either adding it to an existing one, or creating a new one for it.
I then had it print out all the anagrams. It would take around 1 second with the print command, 0.85 seconds without it.
(NOTE, ___ is indentation)
def anagram_finder():
____dict, file = {}, open('wordlist.txt', 'r')
____for line in file:
________strip_line = line.strip().lower()
________sort_line = str(sorted(strip_line))
________dict.setdefault(sort_line, []).append(strip_line)
____file.close()
____print '\n'.join(['Anagrams are: %s' % (v) for k, v in dict.items() if len(dict[k]) > 1])
My other version, which after some refraction seemed faster, was using the idea of prime numbers as keys suggested by Cameron Behar.
def anagram_finder():
____primeAlpha = {'a':2, 'b':3, 'c':5, 'd':7,'e' : 11, 'f':13, 'g':17, 'h':19,'i':23, 'j':29, 'k':31, 'l':37, 'm':41, 'n':43, 'o': 47, 'p':53,'q':59, 'r':61, 's':67, 't':71, 'u':73, 'v':79, 'w':83, 'x':89,y':97, 'z':101}
____dictAna = {}
____file = open ("wordlist.txt", "r")
____for line in file: # each line
________value = 1
________for s in line:
____________if s.lower() in primeAlpha:
________________value *= primeAlpha[s.lower()]
________dictAna.setdefault(value, []).append(line.strip())
____file.close()
____print "\n".join(['Anagrams are: %s' % (v) for k, v in dictAna.items() if len(dictAna[k]) > 1])
This was about 1 second with the print command, but 0.80 seconds without it.
This was a pretty fun challenge for me, since I am not the greatest programming, or mathematical thinker.
Posted by: PhilipStevens | September 28, 2007 at 06:57 AM
My friend and I once made a similar program, but we approached it differently. The most efficient way we could find was to reformat the dictionary file into a hash table. Each letter a-z was given a prime number. Then, the values for the letters of each word were multiplied and stored along with the string. We could then sort the dictionary based on the hash value so...
When the time came to find the anagrams, we simply did a binary search or similar method to find all the equal hash values (multiplication is commutative so order doesn't matter, and long integer comparison is still faster than string) and grabbed its respective string. In this way, we actually used data collision to our advantage. I also ran a mathematical proof, however, to assure no unintentional data collisions (that might come as a result of common factors) and found none.
This method, though a bit slow for the one time reformat, got our speeds down to ~48 ms in searches returning dozens of anagrams.
Posted by: Cameron Behar | September 18, 2007 at 02:47 PM
The key appears to be to ignore case and spaces.
This simple Smalltalk code gives 2531, which seems to be a common answer - is Dave suffering from the dreaded Out By One Error?
dict := Dictionary new.
words := self loadFile: 'wordlist.txt'.
words do: [ :word | | matches sorted |
____sorted := word asLowercase sort copyWithout: Character space.
____matches := dict at: sorted ifAbsent: [Array new].
____dict at: sorted put: (matches copyWith: word)].
anagrams := dict values select: [ :each | each size > 1].
(underscores added for indentation).
I used the Dictionary (hash) for a quick prototype - a more efficient implementation would probably be to use a Trie (prefix tree).
Posted by: Michael Davies | May 11, 2007 at 03:43 PM
Dave, please post your code, or at least the full output. No one seems
to be getting the same number of groups or words as you. I've pulled
together some links (and my own code) here:
http://brian-jaress.livejournal.com/3000.html
Posted by: Brian | April 30, 2007 at 02:17 AM
I apologize. The case in-sensitive comparison yielded 7886 anagrams not 7898.
Posted by: Jason Palmer | February 12, 2007 at 03:27 PM
Hello, first I'd like to thank you for code kata. I really enjoy it.
I designed a program to solve the anagram problem, however I got different results than you. Using a case-sensitive comparison I was able to extract 6034 anagrams. Using a case-insensitive comparison and stripping away special characters I was able to extract 7898 anagrams.
If you wouldn't mind, I'd really like to have you take a look at my script and help me to understand why our results differ. Also, I would be extremely interested in seeing how you got your program to execute in 1.5 seconds.
I've posted my code and the resulting anagrams on my personal website:
http://www.jason-palmer.com/ruby-on-rails-tutorials/anagrams
Thanks in advance!
Posted by: Jason Palmer | February 12, 2007 at 02:43 PM