There are several ways to encode digitized music. This
thesis focuses on Pulse Code Modulation (PCM) audio data,
into which any digital music representation can be converted. For
example, MP3 files can easily be decoded to PCM using almost any of the
currently available audio players.
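As an illustration, the following minimal sketch decodes an MP3 file to PCM in Python using the pydub library (an assumption; pydub in turn relies on an ffmpeg installation, and the file name song.mp3 is a placeholder):

    # Minimal sketch: decoding an MP3 file to 16-bit PCM samples.
    # Assumes pydub and ffmpeg are installed; "song.mp3" is a placeholder.
    import numpy as np
    from pydub import AudioSegment

    audio = AudioSegment.from_mp3("song.mp3")            # decode MP3 to PCM
    samples = np.array(audio.get_array_of_samples())     # 16-bit integers
    print(audio.frame_rate, audio.channels, audio.sample_width * 8)
    # e.g.: 44100 (Hz), 2 (channels), 16 (bits per sample)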
The PCM data is a discrete representation of a continuous
acoustical wave. The amplitude is usually represented by 16 bits,
which allows the description of 65,536 different
amplitude levels. The time axis is usually sampled 44,100
times per second, i.e., one amplitude value approximately every 23
microseconds. The sampling frequency is measured in
Hertz (Hz), i.e., in cycles per second. The amplitude
values themselves are dimensionless, although they correspond to sound pressure
levels; the actual level depends on the sound system used to
create the physical acoustic waves.
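These quantities can be read directly from a PCM file in WAV format, as the following sketch shows (the file name song.wav is again a placeholder):

    import wave
    import numpy as np

    with wave.open("song.wav", "rb") as w:       # "song.wav" is a placeholder
        rate = w.getframerate()                  # typically 44100 Hz
        width = w.getsampwidth()                 # 2 bytes = 16 bits per sample
        frames = w.readframes(w.getnframes())

    samples = np.frombuffer(frames, dtype=np.int16)
    print("sampling interval:", 1 / rate)        # ~0.0000227s = ~23 microseconds
    print("amplitude levels:", 2 ** (8 * width)) # 65536 for 16 bits
    print("amplitude range:", samples.min(), "...", samples.max())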
Figure 1 illustrates 44kHz PCM data at different
zoom levels. The song Freak on a Leash by Korn
is rather aggressive, loud, and perhaps even a little noisy. At least
one electric guitar, a strange-sounding male voice, and
drums create a freaky sound experience. In contrast, Für Elise by Beethoven is a
peaceful, classical piano solo. These two pieces of music will be
used throughout the thesis to illustrate the different steps of
the feature extraction process. Over a one-minute interval, the envelope of the acoustical wave of Für Elise shows several modulations and seldom reaches the highest levels, while Freak on a Leash stays constantly around the maximum amplitude level. There is also a significant difference at the scale of only 20 milliseconds
(ms), where Freak on a Leash has a much more jagged structure than Für Elise.
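Plots of this kind can be reproduced with a few lines of matplotlib, continuing the sketch above (here samples is assumed to hold a mono signal sampled at rate Hz):

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(2, 1)

    minute = samples[:60 * rate]                      # a one-minute interval
    ax1.plot(np.arange(len(minute)) / rate, minute)
    ax1.set_xlabel("time [s]")

    window = samples[:int(0.020 * rate)]              # a 20ms interval
    ax2.plot(np.arange(len(window)) / rate * 1000, window)
    ax2.set_xlabel("time [ms]")

    plt.tight_layout()
    plt.show()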
Besides finding a representation that corresponds to our perception, it is also necessary to reduce the amount of
data. A stereo sound sampled at 44kHz with 16-bit values
amounts to about 176 kilobytes per second (44,100 samples x 2 channels x 2 bytes), so a typical 5-minute song requires over 50 megabytes. For this reason the music is
down-sampled to 11kHz, the two stereo channels are mixed down to one
(mono), and only a fraction of every song is used for further
processing. Specifically, only every third 6-second sequence is processed further, starting
after the first 12 seconds (fade-in) and ending before the
last 12 seconds (fade-out). Additionally, zeros at the beginning and end of the music are truncated. This leads to a data
reduction by a factor of over 16.
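A sketch of this reduction step, assuming a stereo 16-bit signal stereo of shape (n, 2) sampled at 44100 Hz (all names are placeholders), could look as follows:

    import numpy as np
    from scipy.signal import decimate

    def preprocess(stereo, rate=44100):
        mono = stereo.mean(axis=1)                   # two channels -> one

        nonzero = np.flatnonzero(mono)               # truncate zeros at the
        mono = mono[nonzero[0]:nonzero[-1] + 1]      # beginning and the end

        mono = decimate(mono, 4)                     # 44kHz -> 11kHz, with an
        rate //= 4                                   # anti-aliasing filter

        mono = mono[12 * rate:-12 * rate]            # drop fade-in/fade-out

        seq = 6 * rate                               # keep every third
        return [mono[i:i + seq]                      # 6-second sequence
                for i in range(0, len(mono) - seq + 1, 3 * seq)]

Mono mixing halves the data, down-sampling divides it by four, and keeping only every third sequence divides it by roughly three again, which together accounts for the reduction factor of well over 16.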
Changing stereo to mono has no significant impact on the
perception of the music genre. Using only a small fraction of an
entire piece of music should also be sufficient, since humans are able
to recognize the genre within seconds. Often it is possible to
recognize the genre by listening to only one 6-second sequence of a piece; a more accurate classification is possible if a few sequences
spread throughout the piece are listened to. However, the first and last seconds of a piece of music usually contain fade-in and fade-out effects, which do not help in determining the genre. Neither should the down-sampling affect the ability to recognize the genre: in simple experiments using average computer speakers, it was hardly possible to hear the difference between 44kHz and 11kHz for most
songs, while the genres remained clearly recognizable. Figure
2 depicts the effect of down-sampling. Notice that
some of the fine details are lost; however, the signals still look
alike. It is important to mention that down-sampling to
11kHz means that only frequencies up to 5.5kHz can be represented, since the highest representable (Nyquist) frequency is half the sampling rate. This is far below the 16kHz an average human can hear; however, 5.5kHz is sufficient to cover the spectrum we use in speech and almost
all frequencies used in music. Furthermore, very high frequencies are usually not perceived as pleasant and thus do
not play a significant role in music.
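The Nyquist limit is easy to demonstrate: in the hypothetical test signal below, a 440Hz tone survives down-sampling with scipy.signal.decimate, while an 8kHz overtone is removed by the anti-aliasing filter because it lies above 11,025 / 2, which is approximately 5.5kHz.

    import numpy as np
    from scipy.signal import decimate

    rate = 44100
    t = np.arange(rate) / rate                   # one second of audio
    x = (np.sin(2 * np.pi * 440 * t)
         + 0.5 * np.sin(2 * np.pi * 8000 * t))   # 440Hz tone + 8kHz overtone

    y = decimate(x, 4)                           # down-sample to 11025 Hz
    freqs = np.fft.rfftfreq(len(y), d=4 / rate)
    spectrum = np.abs(np.fft.rfft(y))
    print(freqs.max())                           # ~5512 Hz: the Nyquist limit
    print(freqs[spectrum.argmax()])              # 440.0: the tone survives
    print(spectrum[freqs >= 5000].max()
          / spectrum.max())                      # ~0: the overtone is gone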
Figure 2: The effects of down-sampling on a 20ms sequence of Freak on a Leash. This sequence is the same as the one in the last subplot of Figure 1.