GameDev.netBeat Detection Algorithms

Beat Detection Algorithms
## DisclaimerThis document is to be distributed for free and without any modification from its original state. The author declines all responsibility in the damage this document or any of the things you will do with it might do to anyone or to anything. This document and any of its contents is not copyrighted and is free of all rights, you may thus use it, modify it or destroy it without breaking any international law. However according to the author's will, you may not use this document for commercial profit directly, but you may use indirectly its intellectual contents; in which case I would be pleased to receive a mail of notice or even thanks. This is my first tutorial and I am still a student, you must assume that this document is probably not free of small errors and bugs. In the same state of mind, those algorithms are not fully optimised, they are explained for pedagogical purposes and you may find some redundant computations or other voluntary clumsiness. Please be indulgent and self criticise everything you might read. Hopefully, lots of this stuff was taken in sources and books of reference; as for the stuff I did: it has proven some true efficiency in test programs I made and which work as wanted. As said in the introduction: If you have any question or any comment about this text, please send it to the above email address, I'll be happy to answer as soon as possible. ## IntroductionSimulating a physical phenomena which obeys to known mathematical equations is, with a number of approximations, always feasable. But what about more abstract concepts, such as feelings, which do not follow any laws? The simplest things we can feel are often the hardest things to capture in a program. Beat detection follows this rule : feeling the beat of a song comes naturally to humans or animals. Indeed it is only a feeling one gets when listening to a melody, a feeling which will make you dance in rhythm or hit a table with your hands on the melody beats. Therefore, how can we teach this beat detection to a machine that can only compute logical operations? In fact there are a number of algorithms which manage to approximate, more or less accurately, this beat detection. We will first study the statistical approach of beat detection on a streaming source and secondly a filtering approach of rhythm extraction on a static song. This guide assumes the reader has basic signal processing understanding (FFT, convolutions and correlations should sound common) maybe some stuff in statistics will also help (Variance, Average, Principal Components Analysis, will be quoted among others). The point here is not to actually write the code of these algorithms, but more to understand how they work and to be able to adapt or create the appropriate algorithm to a situation. If you have a question or a comment about this text, please send it to the above email address, I'll be happy to answer as soon as possible. Anyway, the aim here is to give more precise ideas on the subject of beat detection to the reader. Enjoy. ## I Statistical streaming beat detection## 1 Simple sound energy## a - A first analysisThe human listening system determines the rhythm of music by detecting a pseudo periodical succession of beats. The signal which is intercepted by the ear contains a certain energy, this energy is converted into an electrical signal which the brain interprets. Obviously, The more energy the sound transports, the louder the sound will seem. But a sound will be heard as a In this model we will detect sound energy variations by computing The history buffer where we will keep the last 44032 samples wil contain in fact too lists of samples (B[0]) and (B[1]) corresponding to the left (an) and to the right (bn) channels history.
- Use the 1024 new samples taken in a[n] and b[n] to compute the instant energy 'e', using the following formula (i0 is the position of the 1024 samples to process):
- Compute the local average energy '<E>' on the 44100 samples of a history buffer (B):
- Shift the 44032 history buffer (B) of 1024 indexes to the right so that we make room for the 1024 new samples and evacuate the oldest 1024 samples.
- Move the 1024 new samples on top of the history buffer.
- Compare 'e' to 'C * <E>' where C is a constant which will determine the sensibility of the algorithm to beats. A good value for this constant is 1.3. If 'e' is superior to 'C * <E>' then we have a beat!
## b - Some direct optimisationsThis was the basic version of the algorithm, its speed and accurecy can be improved quite easily. The algorithm can be optimised by
- Compute the instant sound energy 'e' on the 1024 new sample values taken in (an) and (bn) using the formula
**(R1)** - Compute the average local energy <E> with (E) sound energy history buffer:
- Shift the sound energy history buffer (E) of 1 index to the right. We make room for the new energy value and flush the oldest.
- Pile in the new energy value 'e' at E[0].
- Compare 'e' to 'C*<E>'.
## c - Sensitivity detectionThe imediate draw back of this algorithm is
- Compute the instant sound energy 'e' on the 1024 new samples taken in (an) and (bn) using the following formula
**(R1)**. - Compute the average local energy <E> with (E) sound energy history buffer using formula
**(R3)**. - Compute the variance 'V' of the energies in (E) using the following formula:
- Compute the 'C' constant using a linear degression of 'C' with 'V', using a linear regression with values
**(R5)**: - Shift the sound energy history buffer (E) of 1 index to the right. We make room for the new energy value and flush the oldest.
- Pile in the new energy value 'e' at E[0].
- Compare 'e' to 'C*<E>', if superior we have a beat!
Those three algorithms were tested with several types of music, among others : pop, rock, metal, techno, rap, classical, punk. The fact is the results are quite unpredictable. I will only talk about Clearly, To explain this phenomena lets say a guitare and flute make alternatively an amplitude constant note. Each time the first finishes the other starts. The note made by the guitare and the note made by the flute have the same energy but the ear detects a certain rhythm because the notes of the instruments are at different pitch. For our algorithm (who is one might say colorblind) it is just an amplitude constant noise with no energy peaks. This partly explains why the algorithm doesn't detect precisely beats in songs with a lot of instruments playing at different rythms and simultaneously. Our next analysis will make us walk through this difficulty. Comparing the results we have obtained with the ## 2 Frequency selected sound energy## a - The idea and the algorithmThe issue with our last analysis of beat detection is that it is colorblind. We have seen that this could raise quite a few problems for noisy like songs in rock or pop music. What we must do is give our algorithm the abbility to determine on which frequency subband we have a beat and if it is powerful enough to take it into account. Basically we will try to detect Okay that was just a bit of sport, lets go back to the mainstream; Here is how the Fast Fourier Transform (FFT). We will thus obtain a 1024 frequency spectrum. We then divide this spectrum into however many subbands we like, here I will take 32. The more subbands you have, the more sensitive the algorithm will be but the harder it will become to adapt it to lots of different kinds of songs. Then we compute the sound energy contained in each of the subbands and we compare it to the recent energy average corresponding to this subband. If one or more subbands have an energy superior to their average we have detected a beat.The great progress with the last algorithm is that we now know more about our beats, and thus we can use this information to change an animation, for example. So here is more precisely the
- Compute the FFT on the 1024 new samples taken in (an) and (bn). The FFT inputs a complex numeric signal. We will say (an) is the real part of the signal and (bn) the imagenary part. Thus the FFT will be made on the 1024 complex values of:
*You can find FFT tutorials and codes in C, Visual Basic or C++ on my homepage in the 'tutorials' section or by typing 'FFT' on Google.* - From the FFT we obtain 1024 complex numbers. We compute the square of their module and store it into a new 1024 buffer. This buffer (B) contains the 1024 frequency amplitudes of our signal.
- Divide the buffer into 32 subbands, compute the energy on each of these subbands and store it at (Es). Thus (Es) will be 32 sized and Es[i] will contain the energy of subband 'i':
- Now, to each subband 'i' corresponds an energy history buffer called (Ei). This buffer contains the last 43 energy computations for the 'i' subband. We compute the average energy <Ei> for the 'i' subband simply by using:
- Shift the sound energy history buffers (Ei) of 1 index to the right. We make room for the new energy value of the subband 'i' and flush the oldest.
- Pile in the new energy value of subband 'i' : Es[i] at Ei[0].
- For each subband 'i' if Es[i] > (C*<Ei>) we have a beat !
To help out visualising how the data piles work have a look at this scheme: Now the 'C' constant of this algorithm has nothing to do with the 'C' of the first algorithm, because we deal here with separated subbands the energy varies globally much more than with colorblind algorithms. Thus 'C' must be about 250. The results of this algorithm are convincing, it detects for example a symbal rhythm among other heavy noises in metal rock, and indeed the algorithm separates the signal into subbands, therefore the ## b - Enhancements and beat decision factors.There are ways to enhance a bit more our First we will increase the - Linear increase of the width of the subband with its index:
- We can choose for example the width of the first subband:
- The sum of all the widths must not exceed 1024 ((B)'s size):
Once you have equations So in fact in It may seem rather complicated but in fact it is not. Replacing this relation Using the advantages of Frequency selected beat detection you can also enhance the beat decision factor. Up to now it was based on a simple comparison between the instant energy of the subband and the average energy of the subband. This algorithm enables you to decide beats differently. You may want for examples to cut beats which correspond to high pitch sounds if you run techno music or you may want to keep only [50-4000Hz] beats if you are working with speech signal. This algorithm has the advantage of being perfectly adaptable to any kind or category of signal which was not the case of - If 'i' < 'N/2' then:
- Else 'i' >= 'N/2' then:
So 'i' is the index of the value in the FFT output buffer, N is the size of the FFT transform (here 1024), fe is the sample frequency (here 44100). Thus index 256 corresponds to 10025 Hz. This formula may be useful if you want to create your own subbands and you want to know what the correspondants Another way of filtering the beats, or selecting them, is choosing only those which are marked and precise enough. As we have seen before, The last (finally) way to enhance your beat detection, is to make the source signal pass through a Concerning the results of the I used The reason why I called this part of the document, 'Statistical streaming beat detection', is that the energy can be considered as a random variable, of which we have been calculating values over time. Thus we could interpret those values as a sampling for the statistical analysis of a random variable. But one can push this approach further. When we have separated the energies histories into 128 subbands we created in fact 128 random energy variables. We can then apply some of the general statistical methods of analysis. For example, the principal components analysis method will enable you to determine if some of the subbands are directly linked or independent. This would help us to regroup some subbands which are directly linked and thus make the beat decision more efficient. However, this method is basically just far too computing expensive and maybe just too hard to implement comparing to the results we want. If you are looking for a really good challenge in beat detection you could push in this direction. ## II Filtering rhythm detection## 1 Derivation and Comb filters## a - Intercorrelation and train of impulsesWhile in the first part of this document we had seen beat detection as a statistical sound energy problem, we will approach it here as a For our purposes we will always take alpha=1. This function quantifies the energy that can be exchanged between the signal x(t) and the signal y(t β), it gives an evaluation of how much those two functions look like each other. The role of the 'β' is to eliminate the time origin problem; two signals should be compared without regarding their possible time axis translation. As you may have noticed, we are creating this algorithm for a computer to execute; basically we have to re-write this formula in discrete mode, it becomes: Now the point is we can valuate the similarity of our song and of a train of impulses using this function. By the way, here is what a train of impulses looks like: So the period Ti of the impulses must correspond to the beats per minutes we want to test on our song. The formula that links a beat per minute value with the Ti in discrete time is the following: fs is the sample frequency, if you are using good quality wave files, it is usually of 44100. BPM is the beats per minute rate. Finally, Ti is the number of indexes between each impulse. Now because it is quite computing expensive to compute the ## b - The algorithmNote that this algorithm is much too computing expensive to be ran on a streaming source, or on a song in a whole. It is executed on a part of the song, like 5 seconds taken somewhere in the middle. Thus we assume that the tempo of the song is overall constant and that the middle of the song is tempo-representative of the rest of the song. So finally here are the steps of our algorithm:
- Choose roughly 5 seconds of data in the middle of the song and copy it to the 'a[k]' and 'b[k]' lists. The dimension of those lists is noted N. As usual 'a[k]' is the array for the left values, and 'b[k]' is the array for the right values.
- Compute the FFT of the complex signal made with a[k] for the real part of the complex signal and b[k] the imaginary part of the complex signal as seen in
*Frequency selected sound energy algorithm #1.*Store the real part of the result in ta[k] and the imaginary in tb[k]. - For all the beats per minute you want to test, for example from 60 BPM to 180 BPM per step of 10, we will note BPMc the current BPM tested:
- Compute the Ti value corresponding to BPMc using formula
**(R19)**. - Compute the train of impulses signal and store it in (l) and (j). (l) and (j) are purely identical, we take two lists of values to have a stereo signal so that we don't have dimension issues further on. Here is how the train of impulses is generated:
for (k = 0; k < N; k++) { if ((k % Ti) == 0) l[k] = j[k] = AmpMax; else l[k] = j[k] = 0; } - Compute the FFT of the complex signal made of (l) and (j). Store the result in tl[k] and tj[k].
- Finally compute the energy of the correlation between the train of impulses and the 5 seconds signal using
**(R17)'**which becomes: - Save E(BPMc) in a list.
- Compute the Ti value corresponding to BPMc using formula
- The rhythm of the song is given by the BPMmax, where the max of all the E(BPMc) is taken. We have the beat rate of the song!
The AmpMax constant which appears in the train of impulse generation, is given by the sample size of the source file. Usually the sample size is 16 bits, for a good quality .wav. If you are working in non-signed mode, the AmpMax value will be of 65535, and more often if you work with 16 bits signed samples the AmpMax will be 32767. One of the first ameliorations that could be given to As I said before, I haven't tested this algorithm myself; I can't compare its results with the algorithms of the first part. However, throughout the researches I made this seems to be the algorithm universally accepted as being the most accurate. It does have disadvantages: Great computing power consumption, can only be done on a small part of a song, assumes the rhythm is constant throughout the song, no possible streaming input (for example from a microphone), the signal needs to be known in advance. However it returns a deeper analysis than the first part algorithms, indeed for each BPM rate you have a value of its importance in the song. ## 2 Frequency selected processing## a - Curing colorblindnessAs in the first part of the tutorial, we have the same problem our last algorithm doesn't make the difference between cymbals beat or a drum beat, so the beat detection becomes a bit dodgy on very noisy music like hard rock. To heal our algorithm we will proceed as we had done in Let's modify the
- Choose roughly 5 seconds of data in the middle of the song and copy it to the 'a[k]' and 'b[k]' lists. The dimension of those lists is noted N. As usual 'a[k]' is the array for the left values, and 'b[k]' is the array for the right values.
- Differentiate the a[k] and b[k] signals using
**(R21)**. - Compute the FFT of the complex signal made with a[k] for the real part of the complex signal and b[k] the imaginary part of the complex signal as seen in
*Frequency selected sound energy algorithm #1.*Store the real part of the result in ta[k] and the imaginary in tb[k]. - Generate the 16 subband array values tas[k] and tbs[k] by cutting ta[k] and tb[k] with a logarithmic rule.
- For all the subbands tas[k] and tbs[k] (s goes from 1 to 16). Ws is the length of the 's' subband.
- For all the beats per minute you want to test, for example from 60 BPM to 180 BPM per step of 10, we will note BPMc the current BPM tested :
- Compute the Ti value corresponding to BPMc using formula
**(R19)**. - Compute the train of impulses signal and store it in (l) and (j). (l) and (j) are purely identical, we take two lists of values to have a stereo signal so that we don't have dimension issues further on. Here is how the train of impulses is generated:
for (k = 0; k < ws; k++) { if ((k % Ti) == 0) l[k] = k[k] = AmpMax; else l[k] = j[k] = 0; } - Compute the FFT of the complex signal made of (l) and (j). Store the result in tl[k] and tj[k].
- Finally compute the energy of the correlation between the train of impulses and the 5 seconds signal using
**(R17)'**which becomes: - Save E(BPMc,s) in a list.
- Compute the Ti value corresponding to BPMc using formula
- The rhythm of the subband is given by the BPMmaxs, where the max of all the E(BPMc,s) is taken (s is fixed). We will call this max EBPMmaxs. We have the beat rate of the subband we store it in a list.
- We do this for all the subbands, we have a list of BPMmaxs for all subbands, we can than decide which one to consider.
As in I must admit I haven't concretely tested this algorithm. Others have done this already and here an overview of the results for ## ConclusionFinding algorithms for beat detection is very frustrating. It seems so obvious to us, humans, to hear those beats and somehow so hard to formalise it. We managed to approximate more or less accurately and more or less efficiently this beat detection. But the best algorithms are always the ones you make yourself, the ones which are adapted to your problem. The more the situation is precise and defined the easier it is! This guide should be used as a source of ideas. Even if beat detection is far from being a crucial topic in the programming scene, it has the advantage of using lots of signal processing and mathematical concepts. More than an end in itself, for me, beat detection was a way to train to signal processing. I hope you will find some of this stuff useful. ## Sources and Links- http://www.owlnet.rice.edu/~elec301/Projects01/beat_sync/beatalgo.html : This site greatly inspired me for the second part of the tutorial, very well explained.
- http://www.cs.princeton.edu/~gessl/papers/amta2001.pdf : Audio analysis using the discrete wavelet transform.
- Tempo and beat analysis of acoustic musical signals by Eric D. Scheirer
- Pacemaker by Olli Parviainen
© 1999-2011 Gamedev.net. All rights reserved. |