“we have begun the program and are sagging conceptually.”
I want to share with you a natural language generation experiment I’ve been working on lately. Natural language generation just means teaching computers to pretend they have something to say, then letting them try to say it. It’s a challenge because English grammar is COMPLICATED… but the results can be fun, even poetic.
The mechanism I came up with learns to generate the structure of a sentence first, then randomly fills it in with words that fit that structure.
This experiment lets the computer mix-‘n’-match words from even a small training text (twenty lines or so), while still generating English sentences that have a valid grammatical structure – but which still might not make sense.
My method is an improvement over simple n-gram Markov-chain language models – the method behind the popular “ebooks”-style Twitter bots – which pick the next word in the sentence based only on the previous word or two. My method may also generate sentences with better syntax than more complicated neural language models, which, while they do a better job of keeping track of long-range dependencies, still don’t dependably generate syntactically correct English.
Effectively, we want the computer to learn that sentences have a subject and a verb, sometimes an object or two and often various adjectives, adverbs and prepositional phrases to modify the other constituents. Then, when the computer generates a sentence, it should know that, say, a prepositional phrase might be a preposition (in or under), an article (the or a), maybe an adjective (tall) and a noun (tree).
In contrast, a Markov-chain language model that has ended up mid-sentence with “under a tall” would be predicting the word after just “a tall”, so it might choose “man”; then, looking only at “tall man”, it would choose “stands” and finish, ending up with “under a tall man stands”… which is ungrammatical.
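To make the failure mode concrete, here’s a minimal sketch of that kind of word-bigram Markov chain (toy corpus and function names are my own, not from the project):

```python
import random
from collections import defaultdict

def train_bigram_model(sentences):
    """Count, for each word, the words that followed it in the corpus."""
    successors = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            successors[prev].append(nxt)
    return successors

def generate(successors, start, max_len=10):
    """Walk the chain: each word is chosen knowing ONLY the word before it,
    so the model has no idea what phrase it's in the middle of."""
    words = [start]
    while len(words) < max_len and words[-1] in successors:
        words.append(random.choice(successors[words[-1]]))
    return " ".join(words)

corpus = [
    "a tall man stands under a tall tree",
    "the man stands near the river",
]
model = train_bigram_model(corpus)
print(generate(model, "under"))
```

Because the chain forgets everything but the last word, nothing stops it from wandering out of the prepositional phrase it started.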
This experiment won’t make that error, though it does make other errors. Take this real example of a generated sentence:
it discusses to get enough cold 2-by-4’s of time like the river, doing flips for an audience.
… and zoom in on this bit:
discusses to get enough
That doesn’t make sense, but the structure’s fine. Imagine a verb like tries there: tries to get enough makes total sense.
cold 2-by-4’s of time
That is probably not something anyone has ever said, but it’s poetic. And that’s the point.
generating new sentences, in detail
To generate a sentence, first we create a structure, then fill it in with words.
To create the structure, first we randomly choose a “root” for the sentence – usually some sort of verb. Let’s say we choose a past-tense verb, which is represented as having the dependency role ROOT and the part-of-speech tag VBD (past-tense verb), combined into the single tag ROOT_VBD. Our tree, for now, is just that one node: ROOT_VBD.
Using a slightly-modified markov chain model, we pick a set of children for ROOT_VBD from those that have occurred in our training corpus. Let’s say what’s chosen is nsubj_NN_left, dobj_NN_right, prep_IN_right and punct_.. (Tags also include left or right – which side of the parent the word falls on – since words may change their morphology if they’re a subject or an object.)
Then, we recursively pick children for each of those nodes, conditioned on the parent too; we’re asking what’s most likely to be the child or children of nsubj_NN_left whose parent is ROOT_VBD. In many cases there may be no children, in which case the node is terminal. Eventually we end up with a structure like this:
                 ROOT_VBD
          /      |       \       \
      nsubj    dobj     prep    punct
       NN       NN       IN       .
      left     right    right
                          |
                        pobj
                         NNS
                        right
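The recursive structure-generation step might be sketched like this (the toy `child_model` below is hypothetical and keyed only on the node’s own tag, for brevity; as described above, the real model also conditions each node’s children on its parent’s tag):

```python
import random

# Hypothetical training counts: for each tag, the child-tag sequences
# observed beneath it in the corpus (an empty tuple = terminal node).
child_model = {
    "ROOT_VBD": [("nsubj_NN_left", "dobj_NN_right", "prep_IN_right", "punct_.")],
    "nsubj_NN_left": [()],
    "dobj_NN_right": [()],
    "prep_IN_right": [("pobj_NNS_right",)],
    "pobj_NNS_right": [()],
    "punct_.": [()],
}

def generate_structure(tag):
    """Recursively expand a tag into a (tag, children) tree by sampling
    one of the child sequences seen for that tag in training."""
    children = random.choice(child_model.get(tag, [()]))
    return (tag, [generate_structure(child) for child in children])

tree = generate_structure("ROOT_VBD")
```

Running it on this toy model reproduces the structure shown above, down to the pobj_NNS nested inside the prepositional phrase.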
Now that we have a structure, we’ve gotta fill it in. Taking inspiration from markov-chain language models, we choose a word based on the tag in the structure, the parent word and the left sibling, if any.
For the root, all we have is the tag, so we’re in essence picking a random past-tense verb that has occurred in our corpus; let’s say it comes out with see.
For nsubj_NN_left (the subject of the sentence), we’re picking a word with that tag that has see as its parent (and no left sibling), like I. Conditioning on the parent guarantees that a word like he won’t be chosen, since it doesn’t agree (i.e. he see is ungrammatical).
We do the same for dobj_NN_right, producing them.
In the middle of this process, here’s how the tree looks:

          see
        /  |  \
       I  them  prep_IN_right
                      |
                pobj_NNS_right
Now we have to fill in the prepositional phrase. To predict the head of the prepositional phrase (the structure dictates that it is prep_IN, i.e., a preposition), we randomly pick a word tagged prep_IN with a parent word see and a left sibling them.
prep_IN. More formally, we’re picking it from the weighted distribution of words that appeared in the training corpus. If we can’t find one, we try for a word tagged prep_IN that appeared in the training data with a word2vec-derived synonym of see and a left sibling them. (If there still isn’t one, we then back off to ignore the left sibling, then to ignore the parent word.)
In this case, the model did have an answer: at.
          see
        /  |  \
       I  them  at
                 |
           pobj_NNS_right
Then, having filled in that preposition, we’ve got to fill in the object of the preposition: a word with the tag pobj_NNS, with no left sibling and a parent at. The model has many choices here, but it picks exhibitions. Great.
Fill that in and we’re done filling in the structure. Linearizing the sentence gets us “I see them at exhibitions.”
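The linearization step is just an in-order walk of the tree, using the left/right annotations to decide which side of the head each child goes on. A minimal sketch, using a (tag, word, children) tuple representation of my own:

```python
def linearize(node):
    """In-order walk: children tagged ..._left come out before their head,
    everything else after, recovering word order from the tree."""
    tag, word, children = node
    left = [c for c in children if c[0].endswith("_left")]
    right = [c for c in children if not c[0].endswith("_left")]
    parts = [linearize(c) for c in left] + [word] + [linearize(c) for c in right]
    return " ".join(parts)

# The filled-in example tree from above (punctuation node omitted for brevity):
tree = ("ROOT_VBD", "see", [
    ("nsubj_NN_left", "I", []),
    ("dobj_NN_right", "them", []),
    ("prep_IN_right", "at", [("pobj_NNS_right", "exhibitions", [])]),
])
print(linearize(tree))  # -> I see them at exhibitions
```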
In this way, we get sentences that are mostly grammatical, or close to it. What makes this often fun and sometimes beautiful is the way that, upon reading the sentence, the brain scrambles to figure out a context where it might make sense, where the sentence has actual meaning.
Here are a few more sentences (curated):
the fluorite crystals of a hearty laughter by government’s avery fisher is fourth-quarter picocuries per earnings which patrol municipal roads of manute ontario,, sharing of a crippled presidency in a tradition of the lost only only time.
The errors it makes are more likely to be semantic than clearly syntactic – sharing of doesn’t really make sense, though sharing in would. And “only only time” would work fine with other adverbs (and in fact no longer appeared after I added the left-sibling constraint).
Another fun snippet:
in times that try oriental vegetables.
do it yourself
All you need are 20 or more grammatical English sentences. Ideally they’re edited text: a higher-register English, punctuated, capitalized, etc. You may get better results if you use shorter sentences, rather than ones with lots and lots of embedding. (Much like humans, computers that try to write complicated sentences often end up with something confusing.)
pip install -r requirements.txt
python sentences_to_trees.py path/to/your/file.txt
python make_models.py path/to/your/file.txt
python markov.py path/to/your/file.txt
Because the “machine learning” behind this technique is quite basic (markov chains), it’s quick to train, taking about 30 minutes on my laptop for a 5.5M-sentence corpus. A neural model would probably do better, and I might try that out later.
I don’t have any systematic way to test the output, though; that would require a way to distinguish novel but grammatical sentences from ungrammatical ones. (Checking for trigrams overlapping with the corpus might work well, with appropriate cleaning for names. However, I think generating fake cities like “New Francisco” or “San Aviv” would actually be pretty funny and, thus, not an error.)
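That trigram-overlap idea could look something like this rough sketch (it measures novelty only, not grammaticality, which is exactly the part that’s missing):

```python
def trigrams(text):
    """All word trigrams in a whitespace-tokenized string."""
    words = text.split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def novelty(generated, corpus_sentences):
    """Fraction of a generated sentence's trigrams never seen in the corpus:
    0.0 means every trigram was copied verbatim, 1.0 means all are new."""
    corpus_tri = set()
    for sentence in corpus_sentences:
        corpus_tri |= trigrams(sentence)
    gen_tri = trigrams(generated)
    if not gen_tri:
        return 0.0
    return len(gen_tri - corpus_tri) / len(gen_tri)
```

A sentence scoring near 0.0 is plagiarism from the training data; a high score is novel, but could be novel gibberish – telling those apart is the hard part.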
The problem – as with a lot of generative machine learning techniques – is that it requires striking the right balance between over-fitting and under-fitting, between perfect fidelity and novelty, in order to masquerade as real creativity.
Even if this technique is perfected, of course, it still faces the primary barrier to teaching computers to talk: they have nothing to say.