Generating Text

In this final part of the lab, we will use our MyHashMap to implement an n-order Markov chain model that learns to generate new text after training on existing text from real authors.

Application Overview

In particular, we will extend the first-order Markov chain developed in the DNASequence.java file of the Warmup to work for any n-order Markov chain model. Recall that an n-order Markov chain learns the probability of each possible next state (e.g., a character of text) based on the history of the n most recent states (e.g., the most recent String of n characters). After learning these probabilities, we can use them to randomly generate new text that will hopefully look like it was written by a person!

We have provided you with example text files in the samples folder of your GitHub repository. An example run of the program could generate the following text:

Example Text with Order = 10, Length = 1000, Training = shakespeare.txt
THE COMEDY OF ERRORS
ACT I. Scene I.
Elsinore. hall in the palace.
Flourish. Exeunt.
SCENE IV.
A Hall in the Castle. Another part of the world, I give to negligence,
    And hear at large discourse to your friends, I'll leave you till bed time.
  My present business and desires shall pay the sum for him,
  He shall kill two of vs, and men are
onelie turned into tongue, will speak to you like of Paris loue?
  Iul. No Madam, we haue cul'd such necessaries are embark'd. Farewell.            Exit [Polonius].
                               Exit.
  Ham. Good.  
  King. But where's the commission; read it at more leisure,
  And now no soil nor cautel doth besmirch
    The very firstlings of my wife withal;
  And so of these. Which is the natural touch; for the wealth that he hath lam'd me,
  I shall be cal'd assurance double sure,
    Still am I call'd. Unhand me, gentlemen, he hath bid me to interpret between you and your behests, and am enioyn'd
By holy Lawrence, to fall prostrate he

Getting Started

We have provided you with some starter code in the MarkovChain.java file. Your objective in this part of the lab is to fill in the missing pieces so that you can generate text similar to the above.

The first step is to add two instance variables to the MarkovChain class:

  1. transitions, which is a MyHashMap<String, Transitions<Character>> that will store the learned transition counts

  2. order, an integer storing the order n of the model

The class has a single constructor whose body needs to be filled in: save the order parameter into the this.order instance variable, and create a new MyHashMap for the this.transitions instance variable.
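To make this concrete, here is a minimal sketch of the declarations and the constructor (assuming MyHashMap has a no-argument constructor, as in the earlier parts of this lab; the exact skeleton in MarkovChain.java may differ slightly):

    // The two instance variables described above.
    private MyHashMap<String, Transitions<Character>> transitions;
    private int order;

    public MarkovChain(int order) {
        this.order = order;                    // save the order n of the model
        this.transitions = new MyHashMap<>();  // start with an empty map of counts
    }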

ReadMe

To earn full credit, you should use your MyHashMap to complete this part of the lab. However, if your MyHashMap is not quite working, you can still earn partial credit by using Java’s built-in HashMap class, since one of the goals of this part is to gain practice using maps to solve interesting real-world problems.
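If you do fall back on Java’s HashMap, the only change should be the declaration (plus the matching import), roughly:

    import java.util.HashMap;

    private HashMap<String, Transitions<Character>> transitions; // instead of MyHashMap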

The train Method

Similar to the Warmup’s trainCounts method, you should implement the logic for training the Markov chain transition probabilities in the train and updateCount methods.

Your train method should loop through the provided String of input text to find all history and nextCharacter pairs. Start by using the first n = this.order characters as history and the following character as nextCharacter, and pass these two arguments into the updateCount method (a full sketch of this loop appears after the walkthrough below).

Hint

The String class’s substring and charAt methods should be very useful here.

To illustrate, imagine that input = “The quick brown fox jumps over the lazy dog” and this.order = 5. To start,

  • history = “The q” = input.substring(0, 5)
  • nextCharacter = ‘u’ = input.charAt(5).

In the second iteration, shift everything one position to the right in the input String. In our illustrative example:

  • history = “he qu” = input.substring(1, 6)
  • nextCharacter = ‘i’ = input.charAt(6).

In the third iteration, shift over to the right by one more position in the input String. In our illustrative example:

  • history = “e qui” = input.substring(2, 7)
  • nextCharacter = ‘c’ = input.charAt(7).
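Putting this pattern together, the whole train method might look something like the following sketch. Note that the loop stops as soon as there is no nextCharacter left to read past the current history:

    public void train(String input) {
        // Slide a window of this.order characters across the input; the
        // character just after the window is the observed next character.
        for (int i = 0; i + this.order < input.length(); i++) {
            String history = input.substring(i, i + this.order);
            char nextCharacter = input.charAt(i + this.order);
            updateCount(history, nextCharacter);
        }
    }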

The updateCount Method

Inside the updateCount method, you should increase the transition count for observing nextCharacter within the Transitions<Character> stored for history in the this.transitions MyHashMap. If the map does not yet contain an entry for history, you will first need to create a new Transitions<Character> and add it to the map.
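The exact method for recording an observation depends on the provided Transitions class; the sketch below assumes a hypothetical addCount method, a no-argument Transitions constructor, and that MyHashMap offers get and put like java.util.Map, so adjust the names to match your code:

    private void updateCount(String history, char nextCharacter) {
        Transitions<Character> counts = this.transitions.get(history);
        if (counts == null) {
            // First time we have seen this history: start a fresh count table.
            counts = new Transitions<>();
            this.transitions.put(history, counts);
        }
        counts.addCount(nextCharacter); // hypothetical name; check Transitions.java
    }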

The generate Method

Finally, after we have learned the transition probabilities, we want to be able to generate new text, similar to the Warmup’s generate method. The process here is very similar, except we start the new String with all of the characters in the start parameter. Also, each time through the loop used to generate the next character, we update current to be the last n = this.order characters of the generated String (again, the substring method from the String class could be helpful here).
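Under the same assumptions, and supposing Transitions exposes some way to randomly pick a next character weighted by its counts (shown here as a hypothetical sample method), generate might be sketched like this:

    public String generate(String start, int length) {
        StringBuilder result = new StringBuilder(start);
        while (result.length() < length) {
            // current is always the last n characters generated so far
            // (assumes start has at least this.order characters)
            String current = result.substring(result.length() - this.order);
            Transitions<Character> counts = this.transitions.get(current);
            if (counts == null) {
                break; // this history never appeared in the training text
            }
            result.append(counts.sample()); // hypothetical name; check Transitions.java
        }
        return result.toString();
    }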

Experimenting!

Once you have all of the functionality implemented, try experimenting with different parameters for the three command line arguments: (1) the order of the model, (2) the length of the String to generate, and (3) the path to the file you want to use for training.

Especially try experimenting with different orders of the model. What trends do you observe as the order becomes larger or smaller?

Committing Your Progress

Don’t forget to frequently save your progress to GitHub using the add/commit/push commands with git.