Using Text-To-Speech as a Game Programming Tool
The purpose of this article is to introduce the use of Microsoft's Speech API (SAPI) 5.1 as an effective tool in game development. You will need to have the SAPI 5.1 SDK installed, with its library and header paths set in VC++. An understanding of C++, object-oriented programming, and Win32 programming is recommended. SAPI 5.1 is a COM-based API, so an understanding of COM would be an asset, but is not essential.
Note: This article will be using Visual C++ 6.0, on the Win32 platform. The source code for this article is available at the bottom of this page.
The first thing I should tell you is my reason for writing this article. When I sobered up this morning - and inevitably started coding - I was quite annoyed, because I had to keep typing in the message box function, and lots of string formatting garbage. The repetition - without production - was really getting to me. So then, I thought to myself: "Hey, wouldn't it be cool if I could make the computer talk to me instead?" Later on, I realized that it would also save me a lot of time, since all of the string formatting could be done transparently, and I would no longer have to hit 'Enter' whenever a message appeared. Or did I think of it yesterday? It's all a real blur…
Anyway, all joking aside, most games are Win32 applications, and if you've ever programmed in Win32, the first thing you notice is that you don't have the luxury of console input and output. This, to say the least, is not going to help… ever. Now, you could just stick with the tried and true message boxes, but, as I've already said, that can get tedious. The way I see it, this situation is analogous to the difference between printf() and cout. They are both fine for displaying intrinsic types, but cout allows you to output all sorts of data, such as the members of a class (thanks to operator overloading) with ease. Using printf() is also fairly easy, but it is not nearly as versatile. This is similar to the message box versus Text-To-Speech problem, because message boxes can only output text, and you have to do string formatting to output other data types. Now, as the name suggests, a Text-To-Speech engine also has this restriction, but if we put the engine in a class, we can deal with converting all data types using overloaded operators - behind the scenes. The end result is a single class object that is very flexible and easy to use, and that can be employed anywhere in lieu of a message box.
I suppose you could encapsulate a message box in a similar fashion, but, on principle, I refuse to waste a Saturday morning on something so trivial - let alone write my first article about it - when there is something much more useful available. Besides, I think making my computer talk is much more entertaining.
Accompanying this article are two samples. The first is a simple "Hello World!!!" application, and the other is a complete class that is meant to behave much like the cout object. For the article itself, I will be focusing on the "Hello World!!!" application, since it is the simpler of the two. The basic concepts are the same, so my intention is that you will go through the article, then the first sample, and then understand the class (second sample) without difficulty.
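To make the cout comparison concrete, here is a minimal sketch of the idea. It is not the class from the second sample: the names SpeechStream, Flush, and Vec2 are made up for illustration, it assumes a C++11 compiler rather than VC++ 6.0, and it hands the finished text to a callback instead of calling SAPI, so it can be tried without the SDK.

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical cout-like wrapper (not the class from the article's sample).
// Values are formatted into a wide-character buffer; Flush() hands the
// finished text to a sink, which in a real build could call ISpVoice::Speak.
class SpeechStream
{
public:
    using Sink = std::function<void(const std::wstring&)>;

    explicit SpeechStream(Sink sink) : sink_(std::move(sink)) {}

    // Any type that std::wostream can already format is accepted.
    template <typename T>
    SpeechStream& operator<<(const T& value)
    {
        buffer_ << value;
        return *this;
    }

    // Send the accumulated text to the sink and empty the buffer.
    void Flush()
    {
        sink_(buffer_.str());
        buffer_.str(L"");
        buffer_.clear();
    }

private:
    std::wostringstream buffer_;
    Sink sink_;
};

// A user-defined type becomes "speakable" through a single overload.
struct Vec2 { float x; float y; };

SpeechStream& operator<<(SpeechStream& out, const Vec2& v)
{
    return out << L"x " << v.x << L", y " << v.y;
}
```

With a sink that forwards to the voice interface, a line such as speech << L"Lives: " << 3; followed by speech.Flush(); would speak "Lives: 3", with no manual string formatting at the call site.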
I will tell you now that this is a very easy subject to pick up, and I'm quite surprised that nobody has written about it yet. Microsoft has put a lot of time and effort into developing this technology, and it would be foolish for us to not at least consider using it. There are many possible applications of Text-To-Speech engines in modern game development, but using it as a debugging tool is just the first one that I thought was worth writing about. A "Hello World!!!" application, such as the one described in this article, could be made as small as ten executable lines. All of the code for the output class is no more than 500 lines (due to white space, commenting, etc.), and I have documented the code to make it as clear as possible. There are also many tutorials and whitepapers in the SDK documentation (some of which are even shorter than this article).
I'm not going to try to explain every possible use of SAPI to you, but I do hope to spark your interest in it. Maybe I'm off in Never Never Land, but I just think this is neat.
Now, let's get to it. The first step, in any application, is to link the needed libraries, and include the header files. As it happens, SAPI doesn't have any libraries that you need to link manually, but there is one header file, "sapi.h", which must be included. The second step is to initialize COM, and the voice interface. This is shown below:
    #include <sapi.h>

    // The voice interface pointer
    ISpVoice* Voice = NULL;

    // Initialize COM
    CoInitialize ( NULL );

    // Create the voice instance
    CoCreateInstance ( CLSID_SpVoice, NULL, CLSCTX_ALL, IID_ISpVoice, (void**)&Voice );
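The first two lines are pretty straightforward. They initialize the interface pointer and COM (the parameter is reserved, and must be NULL). The third line has a few parameters: CLSID_SpVoice identifies the class of object to create (the SAPI voice); NULL indicates that the object is not being created as part of an aggregate; CLSCTX_ALL lets COM create the object in whatever execution context is available; IID_ISpVoice identifies the interface we want on the new object; and the last parameter is the address of the pointer that receives that interface. The call returns an HRESULT, which production code should test with the FAILED() macro before using the pointer.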
Initialization is just that simple. The next task is making the voice interface speak to us.
Now that the voice interface is initialized, we can use it to speak. The following code accomplishes this:
    // Our text to be spoken (note the L prefix, which makes this a wide-character literal)
    const WCHAR* TextBuffer = L"Hello World";

    // Use our voice interface to speak the contents of the buffer
    Voice -> Speak ( TextBuffer, SPF_DEFAULT, NULL );
You'll notice that our string is in wide characters. To convert an ANSI (narrow) string to wide characters, use the "MultiByteToWideChar" function in the Win32 API. For more information, please refer to the MSDN Library.
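As a rough sketch of that conversion, the helper below wraps "MultiByteToWideChar" in the usual two-call pattern: one call to ask for the required length, and one to fill the buffer. The function name Widen is made up for this example, and the non-Windows branch is only a stand-in so the code can be tried elsewhere; it is correct for plain 7-bit ASCII text only.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: convert a narrow string to a wide string for Speak().
std::wstring Widen(const std::string& narrow)
{
    if (narrow.empty())
        return std::wstring();
#ifdef _WIN32
    // First call: ask how many wide characters the result needs.
    const int length = MultiByteToWideChar(CP_ACP, 0, narrow.c_str(),
                                           static_cast<int>(narrow.size()),
                                           NULL, 0);
    if (length <= 0)
        return std::wstring();

    // Second call: perform the conversion into the buffer.
    std::wstring wide(static_cast<size_t>(length), L'\0');
    MultiByteToWideChar(CP_ACP, 0, narrow.c_str(),
                        static_cast<int>(narrow.size()),
                        &wide[0], length);
    return wide;
#else
    // Stand-in for non-Windows builds: valid for 7-bit ASCII only.
    return std::wstring(narrow.begin(), narrow.end());
#endif
}
```

The result can then be passed to the voice interface as Widen(text).c_str(). CP_ACP selects the system's current ANSI code page; a different code page may suit your data better.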
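At this point, the program will block, and you should hear the contents of the buffer being spoken by the computer's default voice. Once the voice has finished speaking, the program continues normally. The "Speak" member function takes three parameters: the wide-character string to speak; a set of flags that control how it is spoken (SPF_DEFAULT requests the default, synchronous behavior, which is why the call does not return until the speech has finished); and an optional pointer to a variable that receives the stream number assigned to this request, which may be NULL if you don't need it.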
The last thing to do is shut down the application. The following lines accomplish this:
    // Safely release the voice interface
    if ( Voice != NULL )
        Voice -> Release ();
    Voice = NULL;

    // Shutdown COM
    CoUninitialize ();
The first line determines if the interface is in use, and if so, it is released. The pointer is then set to NULL, just to be safe. The second line shuts down COM.
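If you would rather not remember the release step by hand, one option is a small scope guard that calls Release() for you when it goes out of scope. The sketch below is an illustration rather than part of the samples: the name ScopedRelease is made up, it assumes a C++11 compiler for nullptr, and it works with any type that has a Release() member. ATL's CComPtr is a fuller version of the same idea.

```cpp
// Hypothetical scope guard: calls Release() on a COM-style pointer when the
// guard goes out of scope, so an early return cannot leak the interface.
template <typename T>
class ScopedRelease
{
public:
    explicit ScopedRelease(T* pointer) : pointer_(pointer) {}

    ~ScopedRelease()
    {
        if (pointer_ != nullptr)
            pointer_->Release();
    }

    T* get() const { return pointer_; }
    T* operator->() const { return pointer_; }

private:
    // Copying would lead to a double Release(), so it is disabled.
    ScopedRelease(const ScopedRelease&);
    ScopedRelease& operator=(const ScopedRelease&);

    T* pointer_;
};
```

The guard has to be destroyed before CoUninitialize() is called, so in practice it would live in an inner scope or in a function called between the COM startup and shutdown lines.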
Now, as I said at the beginning, this is a "Hello World!!!" program, and consists of about ten lines of code, the bulk of which we have just looked at. You should be able to go through the samples and, almost immediately, use the technique in the debugging code of your games. When you build the samples, you will notice that the speech quality is rather low: words sometimes sound distorted or choppy. I presume this is just the nature of SAPI, and that it will improve in future versions. However, this does not, in any way, stop us from using TTS for debugging and testing.
This is only the first application of TTS in game development. In the future, you may wish to implement this technology in the actual game. A few ideas I'm turning over include speaking into a DirectSound or OpenAL buffer, then rendering in 3D space. Alternatively, one could use SAPI to speak into a wave file, and then simply use the file, as any regular sound effect. But that's for the future…
Anyway, I hope that I've been able to teach you something useful. This is the first article that I have written, and I would appreciate any feedback you may wish to offer. Thanks.