Automatic text generator using Markov Chains (in C#)
Written on March 30, 2010, by Visar Shehu.

Prandaj, pra, n’e doni fisin,
mali, bregu edhe Malcija
prej njaj goje sod t’brohrisim:
Me gjuhë t’veten e lèn mbas dore.
Në gjuhë shqype nanat tona
qi prej djepit na kanë thânun,
se asht një Zot, qi do ta dona;
njatë, qi jetën na ka dhânun;
edhe shqyp na thanë se Zoti
për shqyptarë Shqypninë e fali,
se sa t’enden stina e nji tërmetit,
ngjashtu â’ gjuha e jonë shqyptare.
Ah! po; â’ e toskë, malci e qyteta,
gjuhën t’uej kurr mos ta gzojn kta djalë mbas dore.
Në gjuhë hyjnore;

In the first glance this text might look like gibberish, however to some people it might resemble to something they have seen or read previously.

This one as well:

praying and looking out the
thigh-bones, wrapped them round in two sons of Olympus, and
plunged into the silver hilt of his tent and all day Achilles called them in offering insult no man. Therefore I prayed, and now you alone of mortals. If I have
ever decked your own prize, while I mean to go out
with the host to plague the people, but upon
the tenth day long the host. But of this she left its parent stem upon the other side. Then
uprose smooth-tongued Nestor, the facile speaker of the hoar
sea,

What we are seeing in the previous two examples is automatically generated text using Markov Chains. The first one resembles the style of a famous Albanian poet: Gjergj Fishta.  The other one resembles The Illiad by Homer.

I needed to benchmark two databases: MySql and MongoDB. However, since I did not have any data (that would resemble human comments) I had to find a way to automatically generate them. That is why I choose to write an application that would automatically generate text.

One can say that automatically generating text is simple, just taking random words from the dictionary and you are done. However, this approach has one big problem: the text would not look natural at all. This is because we do not take into consideration the statistics of a language. In a language, each word is not followed randomly by another word; e.g., the word “hello” in the English language is more likely to be followed by the word “world” than by the word “hi”.

Markov Chain is a process of generating a finite state sequence of events (in our case words), in which you can transfer from each state to another based on some probabilistic model. E.g., suppose your current state is “hello”. From here you can jump to another state like: “world” or “home” because there is a probability that these two words proceed the word “hello”.

Building a Markov Model is fairly simple:

First you will need a text file, from which you will build the model.

For each word from that text file, create e k-level list of words proceeding it. k is a fixed number of words that proceed each word. The higher k is, phrases will make more sense. After you are done with all the words you will have a table, with each row representing a starting word, followed by k-other words.

Let’s suppose our text is the Wikipedia Article for Markov Chains (http://en.wikipedia.org/wiki/Markov_chain).

If k=2 then our table would have the following structure:

In Mathematics, A
Mathematics, A Markov
A Markov Chain,
Markov Chain, Named
….

If k=3 then we would have the following table:

In Mathematics, A Markov
Mathematics A Markov Chain,
A Markov Chain, Named

One can continue with the same pattern for k=4, 5 … N.

Since now have our model, we can generate text from it:

1. Pick a random word (from our list of words). Let this word be the first word of your automatically generated text.

2. Randomly select one of the elements from our table that starts with the word that we picked.

3. Append all subsequent words from the list to the generated text.

4. Repeat from step 2, for the last word on your list.

My approach is a word level Markov Chain generator. One can change the same algorithm to generate character level chains, which also generate interesting text patterns.

You can download the entire solution written in C#:

Markov Chain Source
Markov Chain Source
MarkovTextGenerator.rar
Version: 1.0
38.8 KiB
796 Downloads
Details...

  • Freddy

    Its nice this application. How is the behavior of long-run chain? Does ‘converges’ this to some chain? I don’t know if my question has sense indeed but I suppose if you have a finite Markov chain, you have a stationary distribution. Nice application. Cheers.

  • mark

    did you ever manage to fix the exception with array.copy ? i can’t seem to be able to fix it ..if you did, maybe you could post the fix here? i would really appreciate it a lot. i always get either source array or dest array too small in TrainModel()

blog comments powered by Disqus

5,979 views

© Copyright Phalanx Blogosphere - Powered by Wordpress - Phalanx logo is designed by Leopard Cana