Random Name-Generation Using Grammars

Overview

I got the idea for this topic from reading the article "Algorithms for an infinite Universe" by Guy W. Lecky-Thompson on Gamasutra.com [1]. He describes the advantage of a game-enviroment like stars and terrain, which is not designed when the game was developed but is generated randomly. This saves disk space, provides the opportunity of travelling through an almost infinite universe and, above all, the stories of every game can differ.

In particular he describes a popular technique to randomly generate names for the universe known as the Markovian List [2]. In essence, this involves examining the frequency of repeated elements in a given number of sequences and using this as a basis for generating new sequences. The results of this technique may be satisfying for naming stars but if you think of character names in a role-playing game some don't seem pronounceable:

Amiteq Ntim Thinct Ceeriqu Fran Onddef Itha Vathmp Winca Frri ...

I will now demonstrate a method to build reasonable names using so-called grammars, a topic of the Theoretical Informatics [3]. The result will be randomly generated names with pronouncibility guaranteed, like the following:

Hirema Elanis Serela Ebifah Alanat Relame Latini Etafok Efajet ...

The Method

Using grammars to describe rules

To deal with infinite objects (in our case names) algorithmically we need a way to describe them in a finite manner. Grammars are one way to do this. A Grammar is a 4-tuple G=(V,\Sigma ,P,S) that meets the following conditions: V is a finite set of the variables; \Sigma is a finite set, the termination alphabet; P is a finite set of production rules; S \in V is the starting variable.

In the case of building up a grammar for names the termination symbols \Sigma are the letters of the alphabet, V are the variables we need on the way from starting variable S to the resulting name. The possible combinations are given by the production rules P, which are responsible for the grammar resulting in reasonable names. For example in the German language a sequence of letters is pronounceable if a vowel (a, e, i, o, u) is always followed by a consonant (rest of alphabet) and contrary. So the production rules to build names pronounceable in German could look like this:

S     -> <Vow><A>
S     -> <Con><B>
A     -> <Con><B>
B     -> <Vow><A>

<Vow> -> a
<Vow> -> e
<Vow> -> i
<Vow> -> o
<Vow> -> u

<Con> -> b
<Con> -> c
...
<Con> -> z


If you start with <S> it may be replaced by <Vow><A> or by <Con><B>. The variable <Vow> may again be replaced by one of the vowels and <A> by <Con><B>... The result will be an endless sequence of vowels following on consonants and contrary.

Let accident decide

The possibilities are now defined by the Grammar, but we don't know when to go which way. If you look at the variable <Vow> it could be replaced by any vowel. So what we need is the probability for the occurence of every specific rule (substitution). The sum of probabilities of the rules for a given identifier must be 1. You can obtain the probability of each letter with a little program that counts its frequency of occurence in a sequence of given names. If the junctions in your grammar are simple then calculating the probabilities shouldn't be a problem:

S     -> <Vow><A> 0.5
S     -> <Con><B> 0.5
A     -> <Con><B> 1.0
B     -> <Vow><A> 1.0

Vow   -> a        0.335
Vow   -> e        0.305

Vow   -> i        0.22
Vow   -> o        0.10
Vow   -> u        0.04

Con   -> b        0.02
Con   -> c        0.03
...
Con   -> z        0.001


This grammar is very simple; there are no exceptions and it won't be able to produce every name which is pronounceable in German but the results are very satisfying and you are free to modify the grammar to include rules for more complicated names.

The rules allowing a 'ch' between two vowels, for example, would look like this:

S     -> <Vow><A> 0.5
S     -> <Con><B> 0.5
A     -> <Con><B> 0.97
A     -> ch<B>    0.03
B     -> <Vow><A> 1.0
...


Implementation

It makes sense to place the production rules in a file so they can be changed easily. The rules are read out and placed in a list of CProdRule objects. The left side of the rule, the identifier variable, is stored in sVariable, the right side in sReplace and the probability in dP.

class CProdRule {
public:
string sVariable;
string sReplace;
double dP;
};


The method GenerateString() needs the list of production rules and a string reference which will be traversed from begin to end. If an identifier variable is found a random number between 0 and 1 will be generated, the list of production rules will be traversed, and if a production rule matching to this identifier is found and the random number is below the rule's probability dP then the variable is replaced by the string sReplace. Otherwise the probability of the rule is subtracted from the random number and the list is traversed for the next rule matching the variable. The sum of all probabilities for an identifier variable must be 1 or this won't work.

If *lstRules is a Null-pointer the found variables will only be extracted and not be replaced.

void GenerateString( string& sStr, list<CProdRule>* lstRules ) {
string sBuf;

istringstream istrm( sStr );
sStr.clear();

while (istrm.peek() && !istrm.eof()) {
getline( istrm, sBuf, '<' );
sStr += sBuf;
getline( istrm, sBuf, '>' );

if (lstRules) {
double r = (double)(rand()%1000)/1000;

for (list<CProdRule>::const_iterator iter = lstRules->begin();
iter != lstRules->end(); iter++) {
if ((*iter).sVariable == sBuf) {
if (r < (*iter).dP) {
sStr += (*iter).sReplace;
break;
}
r -= (*iter).dP;
}
}
}
}
}


An example usage would be calling GenerateString() a few times with a string argument initialized to "<S>" and the list of production rules as its arguments, followed by calling GenerateString() once with the same string and a null pointer. The string has all identifier variables from the grammar substituted by valid letter sequences according to the production rules.

References

[1] Lecky-Thompson, Guy W., "Algorithms for an infinite Universe", www.Gamasutra.com, 1999

[2] Dewdney, A.K., "The Tinkertoy Computer", W.H. Freeman, 1990

[3] Prof. Dr. Schoening, Uwe, "Boolsche Funktionen, Logik, Grammatiken und Automaten (Theoretische Informatik II)", Vorlesungsskript, Universitaet Ulm, 2003