Trigrams can be used to mutate text into new, surreal, forms. But what heuristics do we apply to get a reasonable result?
OK, perhaps not. But that won’t stop us trying. I coded up a quick program to generate some swash-buckling scientific adventure on demand. It came up with:
- …it was in the wind that was what he thought was his companion. I think would be a good one and accordingly the ship their situation improved. Slowly so slowly that it beat the band! You’d think no one was a low voice. "Don’t take any of the elements and the inventors of the little Frenchman in the enclosed car or cabin completely fitted up in front of the gas in the house and wringing her hands. "I’m sure they’ll fall!"
- She looked up at them. He dug a mass of black vapor which it had refused to accept any. As for Mr. Swift as if it goes too high I’ll warn you and you can and swallow frequently. That will make the airship was shooting upward again and just before the raid wouldn’t have been instrumental in capturing the scoundrels right out of jail."
Stylistically, it’s Victor Appleton meets Dylan Thomas. Technically, it’s all done with trigrams.
Trigram analysis is very simple. Look at each set of three adjacent words in a document. Use the first two words of the set as a key, and remember the fact that the third word followed that key. Once you’ve finished, you know the list of individual words that can follow each two word sequence in the document. For example, given the input:
I wish I may I wish I might
You might generate:
"I wish" => ["I", "I"]
"wish I" => ["may", "might"]
"may I" => ["wish"]
"I may" => ["I"]
This says that the words "I wish" are twice followed by the word "I", the words "wish I" are followed once by "may" and once by "might" and so on.
To generate new text from this analysis, choose an arbitrary word pair as a starting point. Use these to look up a random next word (using the table above) and append this new word to the text so far. This now gives you a new word pair at the end of the text, so look up a potential next word based on these. Add this to the list, and so on. In the previous example, we could start with "I may". The only possible next word is "I", so now we have:
I may I
The last two words are "may I", so the next word is "wish". We then look up "I wish", and find our choice is constrained to another "I".
I may I wish I
Now we look up "wish I", and find we have a choice. Let’s choose "may".
I may I wish I may
Now we’re back where we started from, with "I may." Following the same sequence, but choosing "might" this time, we get:
I may I wish I may I wish I might
At this point we stop, as no sequence starts "I might."
Given a short input text, the algorithm isn’t too interesting. Feed it a book, however, and you give it more options, so the resulting output can be surprising.
For this kata, try implementing a trigram algorithm that generates a couple of hundred words of text using a book-sized file as input. Project Gutenberg is a good source of online books (Tom Swift and His Airship is here). Be warned that these files have DOS line endings (carriage return followed by newline).
Objectives
Kata’s are about trying something many times. In this one, what we’re experimenting with is not just the code, but the heuristics of processing the text. What do we do with punctuation? Paragraphs? Do we have to implement backtracking if we chose a next word that turns out to be a dead end?
- I’ll fire the signal and the fun will commence…
My girlfriend and I just finished a programming exercise where we each tried to write an n-gram generator (done before I saw this, but interesting that you use it)
An interesting problem with this program is what to do when your object hierarchy gets larger than your available main memory, and you have to go back and break your storage mechanism into smaller chunks so that you can save and load words from harddisk as needed by the generator.
Some things to consider are the size of your chunks (smaller means more lookups, but less space taken up, but larger means more likely to have repeat lookups before having to go fetch again), and indexing structure (alphabetically close words aren't particularly likely to follow eachother, but finding another algorithm for choosing an indexing structure for your words could be complex), and actual storage medium (Dependent on chunk size, plaintext or some structured text such as XML might be quicker to read through)
(To be fair, my n-gram generator without indexing could hold 100 MB worth of text parsed over a 5-gram tree before it decided it didn't want to live..)
Posted by: Chris | August 21, 2008 at 09:31 PM