Recognition of Handwritten Gestures
by Oleg Dopertchouk


Did you play the famous Black & White by Lionhead Studios? Remember the magic glyph system, where you could cast a spell simply by drawing a symbol in the air? Or, if you did not play that game, imagine being able to control your game by drawing. Draw a snowflake, and your heroes are protected by an impenetrable Wall Of Ice. Draw another magical symbol, and you summon a mighty demon from The Lower Planes. No menus or other unrealistic interface gimmicks; what you see is what you get. Or do you really think that actual mages employ pop-up menus in their spellcasting?

Just a few strokes

So how do you do it? How do you teach a computer to recognize human handwriting? Well, the general task is a very complex one, one that sometimes confounds even us humans (recall that note your doctor asked you to bring to the pharmacist!). But we don't want to recognize each and every handwritten note under the sun; we simply want to recognize a few simple figures (also known as gestures) of our own design. This is a way, way simpler task, one that a computer can actually handle in real time!

First of all, we need to create and store the master templates for our gestures. We'll choose the simplest possible representation: a sequence of strokes, each of which is itself a sequence of 2D points. Something like this:

#include <list>
using namespace std;

struct CPoint2D
{
   float x,y;
};

typedef list<CPoint2D> CStroke;
typedef list<CStroke> CGesture;

(Note that I used STL containers that store actual values, as opposed to pointers. This is done purely for the sake of exposition; in actual production code you should probably use pointer lists, which will save you a lot of time otherwise wasted on redundant copying.)

Then we can write a simple program to input the gesture with a mouse and store it in a database. (I'll leave that to you as a home exercise.)
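
Capturing the raw input is straightforward. As a rough sketch (the OnMouseDown/OnMouseMove/OnMouseUp callbacks here are hypothetical - wire them up to whatever input API your framework provides), a stroke is just the points collected while the button is held:

CGesture g_gesture;     //the gesture being drawn
CStroke* g_stroke = 0;  //the stroke currently in progress

void OnMouseDown(float x, float y)
{
   g_gesture.push_back(CStroke()); //start a new stroke
   g_stroke = &g_gesture.back();
   CPoint2D pt = { x, y };
   g_stroke->push_back(pt);
}

void OnMouseMove(float x, float y)
{
   if ( !g_stroke ) return; //button not held
   CPoint2D pt = { x, y };
   g_stroke->push_back(pt);
}

void OnMouseUp(float, float)
{
   g_stroke = 0; //stroke finished; a timeout or a key press can
                 //later mark the whole gesture as complete
}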

But now we run into the following problem: some people write small, some people write big; some people write slowly and carefully, some draw with fast sweeping motions; some people will write in the center of the screen, some somewhere in a corner. The end result looks pretty much the same to us - a gesture is a gesture is a gesture - but to the computer these look like completely unrelated sets of coordinates. What does that mean? It means we need to normalize the strokes.

Making things look normal

The normalization proceeds in 3 stages:

  • First, we scale the gesture to a predetermined size, say 1 unit.
  • Second, we space the individual points of the gesture at a uniform distance, so that it doesn't matter how fast or slowly the gesture was drawn.
  • Finally, we center the gesture around the origin (0,0).

The first step is pretty straightforward: we just find the dimensions of the gesture and then scale accordingly. For instance, it can go like this:

#include <cfloat>

void NormalizeSize(CGesture& gesture)
{
   float minX = FLT_MAX;
   float maxX = -FLT_MAX;
   float minY = FLT_MAX;
   float maxY = -FLT_MAX;

   //Calculate extents of the gesture
   CGesture::iterator i;
   CStroke::iterator j;
   for ( i = gesture.begin(); i != gesture.end(); ++i )
   {
      CStroke& stroke = *i;
      for ( j = stroke.begin(); j != stroke.end(); ++j )
      {
         CPoint2D& pt = *j;
         if ( minX > pt.x ) minX = pt.x;
         if ( minY > pt.y ) minY = pt.y;
         if ( maxX < pt.x ) maxX = pt.x;
         if ( maxY < pt.y ) maxY = pt.y;
      }
   }

   //Calculate dimensions 
   float width = maxX - minX;
   float height = maxY - minY;


   //find out scale
   float scale = (width > height) ? width:height;
   if ( scale <= 0.0f ) return; //Empty or a single point stroke!
   scale = 1.0f/scale;

   //Do the actual scaling
   for ( i = gesture.begin(); i != gesture.end(); ++i )
   {
      CStroke& stroke = *i;
      for ( j = stroke.begin(); j!=stroke.end(); ++j )
      {
         CPoint2D& pt = *j;
         pt.x *= scale;
         pt.y *= scale;
      }
   }
}

Now we move to the somewhat tricky part. We need to go through each stroke and figure out how close together its points are. If they crowd too close, we space them out; if they stand too far apart, we insert new ones. At the end we want each stroke to contain a fixed number of points (say, 32), all equally spaced. For this we measure the length of the stroke, then walk along it, marking (adding a point to a new stroke) every 1/31 of the length. Here's the code:

#include <cmath>
          
float GetStrokeLength(const CStroke& stroke)
{
   if ( stroke.size() <= 1 ) return 0;
   float len = 0.0f;

   CStroke::const_iterator i = stroke.begin();
   CPoint2D startPt = *i;

   ++i;
   while ( i != stroke.end() )
   {
      CPoint2D endPt = *i;
      float dx = endPt.x - startPt.x;
      float dy = endPt.y - startPt.y;
      //Add the length of the current segment to the total
      len += sqrtf(dx * dx + dy * dy);
      startPt = endPt;
      ++i;
   }
   return len;
}

const int kPointsPerStroke = 32;

void NormalizeSpacing(CStroke& newStroke, const CStroke& oldStroke)
{
   //Clear the new stroke
   newStroke.clear();
   float newSegmentLen = GetStrokeLength(oldStroke);
   if ( oldStroke.size() <= 1 || newSegmentLen <= 0.0f ) return;
   newSegmentLen /= (kPointsPerStroke-1);
   CStroke::const_iterator i = oldStroke.begin();

   //Add the first point to the new stroke
   newStroke.push_back(*i);
   CPoint2D startPt = *i;  //Ends of the current segment
   CPoint2D endPt = *i;    //(begin with the empty segment)
   ++i;

   //Distance along old stroke at the end of the current segment
   float endOldDist     = 0.0f;
   //Distance along the old stroke at the beginning of the current segment
   float startOldDist   = 0.0f;
   //Distance along new stroke
   float newDist        = 0.0f;
   //Length of the current segment (on the old stroke)
   float currSegmentLen = 0.0f;

   for(;;)
   {
      float excess = endOldDist - newDist;
      //we have accumulated enough length, add a point
      if ( excess >= newSegmentLen )
      {
         newDist+=newSegmentLen;
         float ratio = (newDist - startOldDist)/currSegmentLen;
         CPoint2D newPt;
         newPt.x = ( endPt.x - startPt.x ) * ratio + startPt.x;
         newPt.y = ( endPt.y - startPt.y ) * ratio + startPt.y;
         newStroke.push_back(newPt);
      } else {
         if ( i == oldStroke.end()) break; //No more data

         //Store the start of the current segment
         startPt = endPt;
         endPt = *i; //Get next point
         ++i;
         float dx = endPt.x - startPt.x;
         float dy = endPt.y - startPt.y;

         //Start accumulated distance (along the old stroke)
         //at the beginning of the segment
         startOldDist = endOldDist;
         //Add the length of the current segment to the
         //total accumulated length
         currSegmentLen = sqrtf(dx*dx+dy*dy);
         endOldDist+=currSegmentLen;
      }
   }
   //Due to floating point errors we may miss the last
   //point of the stroke
   if ( newStroke.size() < kPointsPerStroke )
   {
      newStroke.push_back(endPt);
   }
}

Finally, we want to center the gesture at the origin (0,0). This is very easy: we calculate the center of the gesture using the standard arithmetic-mean technique (add up all the coordinates and divide the sum by the number of points). Then we move each point of the gesture by (-centerX, -centerY). This, obviously, brings the center right to the origin.

void NormalizeCenter(CGesture& gesture)
{
   float centerX = 0.0f;
   float centerY = 0.0f;
   int pointCount = 0;

   //Calculate the centroid of the gesture
   CGesture::iterator i;
   CStroke::iterator j;
   for ( i = gesture.begin(); i != gesture.end(); ++i )
   {
     CStroke& stroke = *i;
     //size should always be == kPointsPerStroke
     pointCount += (int)stroke.size();
     for ( j = stroke.begin(); j != stroke.end(); ++j )
     {
       CPoint2D& pt = *j;
       centerX += pt.x;
       centerY += pt.y;
     }
   }
   //Calculate centroid
   if ( pointCount <= 0 ) return; //empty gesture
   centerX /= pointCount;
   centerY /= pointCount;

   //To move the gesture into the origin, subtract centroid coordinates
   //from point coordinates

   for ( i = gesture.begin(); i != gesture.end(); ++i )
   {
      CStroke& stroke = *i;
      for ( j = stroke.begin(); j != stroke.end(); ++j )
      {
         CPoint2D& pt = *j;
         pt.x -= centerX;
         pt.y -= centerY;
      }
   }
}
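
For convenience, here is a small wrapper (my addition, not part of the original listings) that chains the three steps in the order described above:

//Runs all three normalization steps on a raw input gesture.
void NormalizeGesture(CGesture& gesture)
{
   NormalizeSize(gesture);

   //Resample each stroke to kPointsPerStroke equally spaced points
   CGesture resampled;
   for ( CGesture::const_iterator i = gesture.begin();
         i != gesture.end(); ++i )
   {
      CStroke newStroke;
      NormalizeSpacing(newStroke, *i);
      //Drop degenerate strokes (single points, zero length)
      if ( !newStroke.empty() ) resampled.push_back(newStroke);
   }
   gesture.swap(resampled);

   NormalizeCenter(gesture);
}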

Now that we have our gestures nicely scaled, centered, and arranged into a fixed number of points, we can perform our main magic trick: shape matching.

Shape Up!

How would we like our magic shape matcher to work? It'd be nice if it compared the gesture drawn by the user to one of the stored gestures and returned a score: say, 1 for a perfect match, a number slightly less than 1 for a similar shape, and a small or even negative number for a very different shape. We can then compare the input to each of the gestures in our database, pick the one with the highest score, and if that score is high enough, we say "That's what it is!"

OK, that's easy to say, but how do you actually do it? Well, look closer at the requirements for the shape matcher. Does it remind you of something? No? How about the good old dot product? If you give it two normalized vectors, it returns 1 if the vectors are exactly the same, a value slightly less than 1 for vectors that point in more-or-less the same direction, and a low or even negative number for vectors that point in wildly different (or opposite) directions. Amazingly enough, it works just as well for shape matching! The only difference is that our vectors are not 2- or 3-dimensional, but 2*N-dimensional, where N is the number of points in the gesture. Otherwise it's the same familiar x1*x2 + y1*y2 + ... (Professional statisticians call this formula correlation; now you know it's the good old dot product in disguise.)

So how exactly do you use the dot product to compare shapes? You simply unroll each gesture into an array of numbers and manipulate these arrays as if they were familiar geometric vectors. (What if the arrays turn out to have different dimensions? Well, then you know it is not a good match: differing dimensions mean a different number of points, and since every stroke was resampled to exactly 32 points, a different number of points means a different number of strokes in the corresponding gestures, which lets us reject the match right away. This is why we made each stroke precisely 32 points long.)

float GestureDotProduct(const CGesture& gesture1, const CGesture& gesture2)
{
   if ( gesture1.size() != gesture2.size() ) return -1;

   CGesture::const_iterator i1;
   CGesture::const_iterator i2;
   float dotProduct = 0.0f;

   for ( i1 = gesture1.begin(), i2 = gesture2.begin();
         i1 != gesture1.end() && i2 != gesture2.end();
         ++i1, ++i2 )
   {
      const CStroke& stroke1 = *i1;
      const CStroke& stroke2 = *i2;
      if ( stroke1.size() != stroke2.size() ) return -1;

      CStroke::const_iterator j1;
      CStroke::const_iterator j2;
      for ( j1 = stroke1.begin(), j2 = stroke2.begin();
            j1 != stroke1.end() && j2 != stroke2.end();
            ++j1, ++j2 )
      {
         const CPoint2D& pt1 = *j1;
         const CPoint2D& pt2 = *j2;
         dotProduct += pt1.x * pt2.x + pt1.y * pt2.y;
      }
   }

   return dotProduct;
}

float Match(const CGesture& gesture1, const CGesture& gesture2 )
{
   float score = GestureDotProduct(gesture1,gesture2);
   if ( score <= 0.0f ) return 0.0f; //No match for sure
   //At this point our gesture-vectors are not quite normalized yet -
   //their dot product with themselves is not 1 - so we normalize
   //the score itself.

   //this is basically a version of a famous formula for a cosine of the 
   //angle between 2 vectors:
   //cos a = (u.v) / (sqrt(u.u) * sqrt(v.v)) = (u.v) / sqrt((u.u) * (v.v))
   score /= sqrtf(GestureDotProduct(gesture1, gesture1) *
                  GestureDotProduct(gesture2, gesture2));
   return score;
}

Note: In real life, we don't have to recompute GestureDotProduct() of a template with itself for every template gesture in the database. We can easily precalculate and store these values.
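
To make that concrete, here is one possible shape of the database lookup (a sketch of my own: CTemplate, FindBestMatch, and the 0.85 threshold are illustrative choices, not from the article). Each template stores its precomputed norm, so recognition costs only one dot product per template:

#include <string>
#include <vector>

struct CTemplate
{
   std::string name;     //e.g. "wall-of-ice"
   CGesture    gesture;  //already normalized
   float       norm;     //precomputed sqrtf(GestureDotProduct(g,g))
};

const CTemplate* FindBestMatch(const CGesture& input,
                               const std::vector<CTemplate>& db,
                               float minScore = 0.85f)
{
   float inputNorm = sqrtf(GestureDotProduct(input, input));
   if ( inputNorm <= 0.0f ) return 0; //empty input

   const CTemplate* best = 0;
   float bestScore = minScore;
   for ( size_t k = 0; k < db.size(); ++k )
   {
      float score = GestureDotProduct(input, db[k].gesture);
      if ( score <= 0.0f ) continue; //size mismatch or opposite shape
      score /= inputNorm * db[k].norm; //normalized correlation
      if ( score > bestScore )
      {
         bestScore = score;
         best = &db[k];
      }
   }
   return best; //null if nothing scored above minScore
}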

Conclusion

This is about it. The rest is database management, user interface and other maintenance tasks that you can easily provide yourself (and which will vary greatly from game to game anyway). If you wish to play around with an already working gesture recognizer, you can download one.

The algorithm is fairly simple and flexible; it can even be trained to recognize individual letters and digits. However, it is by no means universal: it will not recognize cursive handwriting; it will get confused by symbols that different people draw with a different number of strokes, such as "t"; and sometimes it will even have a hard time telling a circle from a square (this is because a circle drawn starting from the top and a circle drawn starting from the bottom look like entirely different shapes to the computer). Such is the price of simplicity.

[This dependency on the starting point can probably be overcome by sorting the gesture points into clockwise or counter-clockwise order from a specific position, such as the top-right. Overcoming variable-stroke gestures such as "t" would require more work, but a spatial sort can at least improve reliability on symmetric shapes. - Ed.]
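
As a rough illustration of the editor's suggestion (my own interpretation - SortStrokeByAngle is not from the article, and it ignores the "start from top-right" refinement), a centered stroke could be reordered by angle around the origin before matching:

#include <algorithm>
#include <vector>

//Clockwise ordering by angle around the origin (assumes the
//gesture has already been centered by NormalizeCenter)
bool CompareByAngle(const CPoint2D& a, const CPoint2D& b)
{
   return atan2f(a.y, a.x) > atan2f(b.y, b.x);
}

void SortStrokeByAngle(CStroke& stroke)
{
   //Copy to a vector, since std::sort needs random access
   std::vector<CPoint2D> pts(stroke.begin(), stroke.end());
   std::sort(pts.begin(), pts.end(), CompareByAngle);
   stroke.assign(pts.begin(), pts.end());
}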

Nevertheless, I believe that this technique can be used to make your game interface more immersive, more intuitive, and overall more fun. Which is always a Good Thing™.

© Oleg Dopertchouk.