Generating Syntactically Correct Sentences by Markov-Modeling Dependency Parse Trees

“we have begun the program and are sagging conceptually.”

I want to share with you a natural language generation experiment I’ve been working on lately. Natural language generation just means teaching computers to pretend they have something to say, then letting them try to say it. It’s a challenge because English grammar is COMPLICATED… but the results can be fun, even poetic.

The mechanism I came up with learns to generate the structure of a sentence first, then randomly fills it in with words that fit that structure.

This experiment lets the computer mix-‘n’-match words from even a small training text (twenty lines or so), while still generating English sentences that have a valid grammatical structure – but which still might not make sense.

My method is an improvement over simple n-gram Markov chain-based language models – the method behind the popular “ebooks”-style Twitter bots, which pick the next word in the sentence based only on the previous word or two. My method may also generate sentences with better syntax than more complicated neural language models, which, while they can do a better job keeping track of long-range dependencies, still don’t dependably generate syntactically correct English.

basics

Effectively, we want the computer to learn that sentences have a subject and a verb, sometimes an object or two and often various adjectives, adverbs and prepositional phrases to modify the other constituents. Then, when the computer generates a sentence, it should know that, say, a prepositional phrase might be a preposition (in or under), an article (the or a), maybe an adjective (tall) and a noun (tree).

In contrast, an n-gram Markov chain language model that has already produced “under a tall” mid-sentence would predict the next word from “a tall” alone, so it might choose “man”. Then, looking at just “tall man”, it would choose “stands” and finish, ending up with “under a tall man stands”… which is ungrammatical.
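To make that failure mode concrete, here is a minimal sketch of a word-level Markov chain that conditions on the previous two words. The function names, the `<s>`/`</s>` boundary markers and the two-sentence toy corpus are all made up for illustration; this is not the code behind any particular bot.

```python
import random
from collections import defaultdict

def train_ngram(sentences, order=2):
    """Map each `order`-word context to the list of words seen right after it."""
    model = defaultdict(list)
    for sentence in sentences:
        # Pad with start markers so the first real word has a context too.
        words = ["<s>"] * order + sentence.split() + ["</s>"]
        for i in range(order, len(words)):
            context = tuple(words[i - order:i])
            model[context].append(words[i])
    return model

def generate_ngram(model, order=2, rng=random, max_words=30):
    """Walk the chain, looking only at the last `order` words at each step."""
    context = ("<s>",) * order
    out = []
    while len(out) < max_words:
        word = rng.choice(model[context])  # duplicates in the list act as weights
        if word == "</s>":
            break
        out.append(word)
        context = context[1:] + (word,)
    return " ".join(out)

# Hypothetical toy corpus, chosen so that "a tall" has two continuations.
toy_corpus = [
    "under a tall tree sits a cat",
    "a tall man stands",
]
ngram_model = train_ngram(toy_corpus)
print(generate_ngram(ngram_model, rng=random.Random(0)))
```

Because both training sentences contain “a tall”, the chain is free to jump from one sentence to the other at that point, so “under a tall man stands” is one of the four sentences this toy model can produce.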

This experiment won’t make that error, though it does make other errors. Take this real example of a generated sentence:

it discusses to get enough cold 2-by-4’s of time like the river, doing flips for an audience.

… and zoom in on this bit:

discusses to get enough

That doesn’t make sense, but the structure’s fine. Imagine a verb like tries there: tries to get enough makes total sense.

cold 2-by-4’s of time

That is probably not something anyone has ever said, but it’s poetic. And that’s the point.

generating new sentences, in detail

To generate a sentence, first we create a structure, then fill it in with words.

To create the structure, first we randomly choose a “root” for the sentence – usually some sort of verb. Let’s say we choose a past tense verb, which is represented as having the dependency role of ROOT and the part of speech tag VBX (a past tense verb); the two are combined into a single tag, ROOT_VBX.

Our tree looks, for now, just like this:

ROOT_VBX

Using a slightly-modified markov chain model, we pick a set of children for ROOT_VBX from those that have occurred in our training corpus. Let’s say what’s chosen is nsubj_NN_left, dobj_NN_right, prep_IN_right and punct_.. (Tags also include left or right, since words may change their morphology if they’re a subject or an object.)

Then, we recursively pick children for each of those nodes, conditioned on the parent too; we’re asking what’s most likely to be the child or children of nsubj_NN_left whose parent is ROOT_VBX. In many cases there may be no children, in which case the node is terminal. Eventually we end up with a structure like this:

       ROOT_VBX 
   /       |     \      \
  /        |      \      \
nsubj     dobj    prep   punct
 NN        NN      IN      . 
left      right   right
                   |
                   |
                  pobj
                   NNS
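The structure step above can be sketched roughly like this. It assumes trees are stored as nested (tag, children) pairs and that the model is just a table of observed child-tag tuples keyed by (parent tag, own tag). The names and the two toy trees are hypothetical, and whatever makes the real model “slightly modified” is not reproduced here.

```python
import random
from collections import defaultdict

# Hypothetical training trees: (tag, [children]) pairs, children in sentence order.
toy_trees = [
    ("ROOT_VBX", [
        ("nsubj_NN_left", []),
        ("dobj_NN_right", []),
        ("prep_IN_right", [("pobj_NNS", [])]),
        ("punct_.", []),
    ]),
    ("ROOT_VBX", [
        ("nsubj_NN_left", [("det_DT_left", [])]),
        ("punct_.", []),
    ]),
]

def train_structure(trees):
    """Record every observed tuple of child tags, keyed by (parent tag, own tag)."""
    model = defaultdict(list)

    def visit(node, parent_tag):
        tag, children = node
        model[(parent_tag, tag)].append(tuple(child[0] for child in children))
        for child in children:
            visit(child, tag)

    for tree in trees:
        visit(tree, None)  # the root has no parent
    return model

def generate_structure(model, tag, parent_tag=None, rng=random):
    """Sample a set of children for `tag`, then recurse into each child."""
    child_tags = rng.choice(model[(parent_tag, tag)])
    # An empty tuple means the node is terminal and the recursion stops there.
    return (tag, [generate_structure(model, c, tag, rng) for c in child_tags])

structure_model = train_structure(toy_trees)
print(generate_structure(structure_model, "ROOT_VBX", rng=random.Random(1)))
```

Sampling a whole tuple of children at once, rather than one child at a time, is what keeps combinations like “subject plus object plus prepositional phrase” limited to ones that actually occurred in the training text.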

Now that we have a structure, we’ve gotta fill it in. Taking inspiration from markov-chain language models, we choose a word based on the tag in the structure, the parent word and the left sibling, if any.

For the root, all we have is the tag, so in essence we’re picking a random verb with that tag that has occurred in our corpus. Let’s say it comes out with see.

For nsubj_NN_left (the subject of the sentence) we’re picking a word with that tag that has see as its parent (and no left sibling), like I. Conditioning on the parent keeps a word like he from being chosen, since it doesn’t agree (i.e. he see is ungrammatical).

We do the same for dobj_NN_right, producing them.

In the middle of this process, here’s how the tree looks.

     see 
   /  |   \
  /   |    \
 I    them  _____ 
           prep_IN_right
              |
              |
            ________
            pobj_NNS_right

Now we have to fill in the prepositional phrase. To predict the head of the prepositional phrase (the structure dictates that it is prep_IN, i.e., a preposition), we randomly pick a word with a parent word see and a left sibling them that’s tagged prep_IN. More formally, we’re picking it from the weighted distribution of words that appeared in the training corpus. If we can’t find one, we try for a word tagged prep_IN that appeared in the training data with a word2vec-derived synonym of see and a left sibling them. (If there still isn’t one, we then back off to ignore the left sibling, then to ignore the parent word.)

In this case, the model did have an answer: at.

     see 
   /  |    \
  /   |     \
 I    them   at 
             |
             |
           ________
           pobj_NNS
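The fallback order described above (exact context, then a similar parent word, then no left sibling, then no parent) can be sketched as a list of lookup keys tried in turn. Everything here is an assumption made for illustration: the function names, the toy observations, and in particular the `synonyms` dictionary, which merely stands in for the word2vec lookup.

```python
import random
from collections import defaultdict

def train_words(observations):
    """observations: (tag, parent word, left sibling word, word) tuples.

    Each word is stored under three keys, from most to least specific.
    """
    model = defaultdict(list)
    for tag, parent, left, word in observations:
        model[(tag, parent, left)].append(word)
        model[(tag, parent)].append(word)
        model[(tag,)].append(word)
    return model

def pick_word(model, tag, parent=None, left=None, synonyms=None, rng=random):
    """Try the most specific context first, then back off step by step."""
    synonyms = synonyms or {}
    keys = [(tag, parent, left)]
    # Stand-in for the word2vec step: retry with words similar to the parent.
    keys += [(tag, alt, left) for alt in synonyms.get(parent, [])]
    # Then ignore the left sibling, and finally ignore the parent as well.
    keys += [(tag, parent), (tag,)]
    for key in keys:
        if model.get(key):
            return rng.choice(model[key])
    raise KeyError(f"no word observed for tag {tag!r}")

# Hypothetical observations; note that "see" never appears with left sibling "them".
toy_observations = [
    ("prep_IN_right", "watch", "them", "at"),
    ("prep_IN_right", "see", "it", "in"),
    ("prep_IN_right", "walk", None, "under"),
]
word_model = train_words(toy_observations)
hypothetical_synonyms = {"see": ["watch"]}

# Exact context is missing, so the "watch" entry is used instead.
print(pick_word(word_model, "prep_IN_right", "see", "them", hypothetical_synonyms))  # at
# Without the synonym table, the left sibling is dropped.
print(pick_word(word_model, "prep_IN_right", "see", "them"))  # in
```

In the walkthrough here the model found at directly; the toy data is deliberately sparse so that the fallbacks are the part being exercised.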

Then, having filled in that preposition, we’ve got to fill in the object of the preposition: a word with the tag pobj_NNS, with no left sibling and a parent at. The model has many choices here, but it picks exhibitions. Great.

Fill that in and we’re done filling in the structure. Linearizing the sentence gets us “I see them at exhibitions.”
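Linearizing can be sketched as a simple tree walk, assuming each node is a (tag, word, children) triple and that a tag ending in _left means the word goes before its parent while everything else goes after it. That reading of the tags, and the helper name, are assumptions made for this example.

```python
def linearize(node):
    """Flatten a filled-in tree: left children, then the head word, then the rest."""
    tag, word, children = node
    before = [w for c in children if c[0].endswith("_left") for w in linearize(c)]
    after = [w for c in children if not c[0].endswith("_left") for w in linearize(c)]
    return before + [word] + after

# The filled-in tree from the walkthrough, written out by hand.
filled_tree = ("ROOT_VBX", "see", [
    ("nsubj_NN_left", "I", []),
    ("dobj_NN_right", "them", []),
    ("prep_IN_right", "at", [("pobj_NNS", "exhibitions", [])]),
    ("punct_.", ".", []),
])

words = linearize(filled_tree)
# Assumes the last token is the closing punctuation mark, so it gets no space.
sentence = " ".join(words[:-1]) + words[-1]
print(sentence)  # I see them at exhibitions.
```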

In this way, we get sentences that are mostly grammatical, or close to it. What makes this often fun and sometimes beautiful is the way that, upon reading the sentence, the brain scrambles to figure out a context where it might make sense, where the sentence has actual meaning.

more examples

Here’s a few more sentences (curated):

the fluorite crystals of a hearty laughter by government’s avery fisher is fourth-quarter picocuries per earnings which patrol municipal roads of manute ontario,, sharing of a crippled presidency in a tradition of the lost only only time.

The errors it makes are more likely to be semantic than clearly syntactic – sharing of doesn’t really make sense, though sharing in would. “only only time” would work fine with other adverbs (and in fact this no longer happens after I added the left-sibling constraint).

Another fun snippet:

in times that try oriental vegetables.

do it yourself

All you need are 20 or more grammatical English sentences. Ideally they’re edited text: a higher-register English, punctuated, capitalized, etc. You may get better results if you use shorter sentences, rather than ones with lots and lots of embedding. (Much like humans, computers that try to write complicated sentences often end up with something confusing.)

pip install -r requirements.txt
python sentences_to_trees.py path/to/your/file.txt
python make_models.py path/to/your/file.txt
python markov.py path/to/your/file.txt

Many thanks to the folks at Explosion AI who create spaCy, which makes a lot of this doable.

the future

Because the “machine learning” behind this technique is quite basic (Markov chains), it’s quick to train, taking about 30 minutes on my laptop on a 5.5M-sentence corpus. A neural model would probably do better, and I might try that out later.

I don’t have any systematic way to test the output, though; that would require a way to distinguish novel but grammatical sentences from ungrammatical ones. (Trigrams overlapping with the corpus might work well, with appropriate cleaning for names. However, I think generating fake cities like “New Francisco” or “San Aviv” would actually be pretty funny and, thus, not an error.)
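For what it’s worth, the trigram idea could look something like the sketch below: score a generated sentence by the share of its word trigrams that also occur in the training text. This is an untested idea, not an evaluation that was actually run, and the function names and the tiny reference corpus are invented for the example. It also does none of the name cleaning mentioned above.

```python
def trigrams(sentence):
    """Lowercased word trigrams of a whitespace-tokenized sentence."""
    words = sentence.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def trigram_overlap(sentence, corpus_trigrams):
    """Fraction of the sentence's trigrams that also occur in the corpus."""
    grams = trigrams(sentence)
    if not grams:
        return 0.0  # too short to have any trigrams
    return sum(g in corpus_trigrams for g in grams) / len(grams)

# Hypothetical reference corpus.
reference_corpus = [
    "I see them at exhibitions",
    "we flew to San Francisco",
    "she moved to New York",
]
corpus_trigrams = {g for s in reference_corpus for g in trigrams(s)}

# Only "we flew to" is attested, so one trigram out of three matches.
print(trigram_overlap("we flew to New Francisco", corpus_trigrams))
```

A score of 1.0 would mean every trigram was copied from the corpus (no novelty), while a score near 0 suggests word sequences the corpus never licensed, so the interesting sentences are presumably somewhere in between.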

The problem – as with a lot of generative machine learning techniques – is that it requires striking the right balance between over-fitting and under-fitting, perfect fidelity and novelty in order to masquerade as real creativity.

Even if this technique is perfected, of course, it still faces the primary barrier to teaching computers to talk: they have nothing to say.

Code

It’s here.