Towards Automatic Monophonic
Music Transcription
Andrew J. Wales
Table of Contents
1. INTRODUCTION
2. REVIEW OF PREVIOUS WORK
2.1 AUDITORY RESEARCH
2.2 SIGNAL PROCESSING
2.3 AUDITORY SCENE ANALYSIS
3. ANALYSIS AND SPECIFICATION
3.1 LIMITATIONS
3.2 THE TEST SET
4. DESIGN
4.1 FAST FOURIER TRANSFORM
4.2 HAMMING WINDOW
4.3 SUBHARMONIC SUMMATION
4.4 VERTICAL EDGE FIND
5. IMPLEMENTATION
6. RESULTS
6.1 OUTPUT FROM THE SYSTEM
6.2 DISCUSSION
6.3 CONCLUSIONS
6.4 FUTURE WORK
7. REFERENCES
1. Introduction

The idea of this project is to create an automatic music transcription system. The system will be 'fed' pre-recorded music, and will analyse it to create the actual score of what is being played. It will determine the pitch, duration and amplitude of all events, and note these in the output.
This project mainly draws on previous research in two fields: digital signal processing and auditory scene analysis. Digital signal processing concerns the mathematical analysis of data to filter it or perform a frequency analysis. Auditory scene analysis is about interpreting sensory input and deriving a meaningful representation.
The main mathematics in this project is based on the work of Jean Baptiste
Joseph Fourier in the early part of the nineteenth century. Fourier
discovered that continuous functions can be expressed as the sum of an
infinite series of sines and cosines, and this result can be used to obtain
a frequency analysis of a periodic waveform. With a few slight modifications
this can be used with aperiodic waveforms and when this analysis is performed
many times over data representing sound, it is possible to determine which
frequencies are present at any given time, and graphical representations
such as Figure 1-1 are produced.
Not everybody has perfect pitch, and it is not always feasible to attempt
a transcription by ear, so this system can be used instead. Even
if this were not the case, there would still be a reason for this work.
A parallel can be drawn with current speech research projects: humans
are perfectly capable of speech recognition, but there are many research
projects around the world covering all aspects of speech recognition, since
there are certain circumstances where it is preferable to have a computer
recognising the speech. There is also the intellectual challenge
of determining a scientific method of performing an intuitive human task.
The literature is also reviewed throughout the body of the report, when
it is appropriate to discuss it in terms of the work carried out.
2. Review of Previous Work

2.1 Auditory Research
Bregman [1] discusses the difference in attitudes that existed towards research in vision and audition. He describes how audition may have been considered a "simpler sense" with "no important perceptual phenomena," but a few examples dismiss this view and establish auditory perception as a valid research topic.
As this work will demonstrate, audition is anything but "simple."
2.2 Signal Processing
According to De Poli et al. [7], "the time-domain representation is not very meaningful." The implication of this is that it is essential to convert the data to the frequency domain before attempting to process it.
The literature concerning signal processing covers the subject at various
levels, ranging from the straightforward (Pettit, for example) to the
highly mathematical. They all have sections on Fourier analysis, and most
cover advanced aspects of digital signal processing not relevant to this
project, such as filters and formant detection.
2.3 Auditory Scene Analysis
The literature on auditory scene analysis covers attempts to interpret sensory input and derive a meaningful representation; it concerns finding scientific methods of performing what the human auditory system does in the perception of sound. These books rarely contain large amounts of mathematical formulae. They often discuss experiments in which human listeners are played a sound, be it noise, music or a combination, and their perceptions are recorded. For example, a listener may hear a tone that is not actually present, and the researchers must then examine their algorithms to see if they fit the results.
Although the book by Cooke et al. [6] is mostly concerned with the analysis of speech signals, it contains a chapter on pitch perception, and one on onset detection.
Handel [9] is particularly relevant to the analysis of music, and even includes a chapter on the physiology of the human ear.
Hermes's chapter in Cooke et al. [6] admits that the zero-crossing pitch determination algorithm is "quite unexacting", but recommends that a more reliable pitch determination algorithm be employed if possible. Hermes says that subharmonic summation is a "very powerful PDA" and is "representative of the second generation" of pitch determination algorithms: the PDAs based on pitch-perception theories.
3. Analysis and Specification

Subharmonic summation is particularly appropriate to the analysis of
music, since it uses the information from the higher harmonics to estimate
the fundamental, and most music has harmonics present. The only exception
would be music consisting of pure sine waves, which is very rare indeed.
Hobley [11] attempted a system similar to this one, but used
zero-crossing for pitch determination. Zero-crossing is a rather
poor pitch determination algorithm: it cannot easily be extended
to cope with polyphony, and it can miss high frequencies altogether.
If, for example, the high-amplitude low-frequency wave of Figure 3-1
is combined with the low-amplitude high-frequency wave of Figure 3-2,
as shown in Figure 3-3, zero-crossing will miss the high-frequency
wave altogether.
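This failure is easy to demonstrate in code. The following sketch is hypothetical (not part of the implemented system), with frequencies and amplitudes chosen purely for illustration: it counts upward zero crossings over one second of a high-amplitude 100Hz sine wave with a low-amplitude 1kHz wave added. The count comes out at about 100, so a zero-crossing PDA would report 100Hz and never see the 1kHz component at all.

program zcdemo;
{ illustrative only: shows zero-crossing missing a weak high frequency }
const
  rate = 11025;                  { sampling frequency in Hz }
var
  i, crossings: integer;
  prev, curr: real;
begin
  crossings := 0;
  prev := -1.0;
  for i := 0 to rate - 1 do      { one second of samples }
  begin
    { high-amplitude 100Hz wave plus low-amplitude 1kHz wave }
    curr := 1.0 * sin(2.0 * pi * 100.0 * i / rate)
          + 0.05 * sin(2.0 * pi * 1000.0 * i / rate);
    if (prev < 0.0) and (curr >= 0.0) then
      crossings := crossings + 1;     { count upward crossings only }
    prev := curr;
  end;
  { upward crossings per second approximate the frequency }
  writeln('upward crossings in one second: ', crossings);
end.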
Given Hermes's endorsement of subharmonic summation in Cooke et al. [6], it was decided early on that subharmonic summation should be used as the pitch determination algorithm.
In order to test the reliability of the pitch determination algorithm,
it was fed the data of a computer-generated square wave and asked to find
its frequency. The results can be found in the section describing
subharmonic summation, in particular Figure 4-11.
The accuracy of the onset and offset detection can be tested in a similar
manner. Passing generated data (with known onsets and offsets) to
the routine and comparing its output with what it should be will give
an accuracy figure such as was found when testing the subharmonic summation.
3.1 Limitations

At the moment the system is set up for the well-tempered diatonic scale with A above middle C at 440Hz, as the following code fragment shows.
procedure do_soundtab;  { makes the lookup sound table }
                        { A above middle C = 440Hz }
var
  oct, note: integer;
  notediff: integer;
  step: real;
begin
  step := exp(ln(2)/12);                  { the twelfth root of 2 }
  for oct := 0 to 7 do
    for note := 0 to 11 do
    begin
      { semitones below A=440Hz (octave 3, note 9) }
      notediff := (3*12 + 9) - (oct*12 + note);
      soundtab[oct*12 + note] := 440/exp(notediff*ln(step));
    end;
end;
This could be modified quite easily to cater for music based on a different frequency for A (such as was in common use before this century), or for the Pythagorean tuning (for example), or for microtonal music such as the quarter-tone scale used in Islamic music.

However, the usefulness of such a transcription would be questionable,
since most commercial music synthesisers limit themselves to the well-tempered
diatonic scale with A=440Hz. According to Moore [15]
this is "because they are intended primarily as cost-effective substitutes
for other traditional instruments rather than augmentations of the musical
palette." Clearly this should not be the case: a composer should
have greater musical freedom by using a computer, not be restricted by
it.
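Whatever tuning the table encodes, the transcription stage must then map each detected frequency onto the nearest entry in it. A minimal sketch, assuming the soundtab array as filled by do_soundtab above (this helper is illustrative, not the system's own code):

{ index into soundtab of the note nearest to a detected frequency }
function nearest_note(freq: real): integer;
var
  i, best: integer;
begin
  best := 0;
  for i := 1 to 95 do                        { 8 octaves of 12 notes }
    if abs(soundtab[i] - freq) < abs(soundtab[best] - freq) then
      best := i;
  nearest_note := best;
end;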
3.2 The Test Set
The following files were used during testing of the system, listed in
order of complexity (judged arbitrarily).
3.2.1 Synthesised files
sin_440.wav: A 440Hz sine wave.

sqr_440.wav: A 440Hz square wave.

440_arps.wav: 0.1s of silence, followed by a 0.5s sine wave at 440Hz, 0.1s of silence, a 0.5s sine wave at 554.37Hz, 0.1s of silence, a 0.5s sine wave at 659.26Hz and 0.1s of silence.

3_4_500.wav: Simultaneous sine waves at 300Hz, 400Hz and 500Hz. These should imply a missing fundamental of 100Hz.

440_arpm.wav: Similar to 440_arps.wav but includes two harmonics for each sine wave.
3.2.2 Acoustic files

promenade.wav: 'Promenade' from Mussorgsky's 'Pictures at an Exhibition'; a solo monophonic piano.

come as you are.wav: 'Come As You Are' by Nirvana; a solo monophonic guitar (with a chorus effect).

adagio sostenuto.wav: From Beethoven's 'Moonlight Sonata'; a solo piano, but polyphonic.

unhatched chickens.wav: 'Unhatched Chickens' by Mussorgsky; a solo piano, but polyphonic.

a hard day's night.wav: The first chord of the Beatles' 'A Hard Day's Night'; many instruments playing just one chord.
All of these files last in the region of five seconds.
4. Design

The work carried out combines four different algorithms.
Once the data is in the computer it is passed through each of the described
stages, as shown in Figure 4-1.
4.1 Fast Fourier Transform

The Fourier transform is based on the work of Jean Baptiste Joseph Fourier, who discovered that continuous functions can be expressed as the sum of an infinite series of sines and cosines. In itself this result can be used to obtain a frequency analysis only of a periodic waveform. It can be adapted for the frequency analysis of aperiodic waveforms by taking a section of the aperiodic data, treating it as a continuous loop representing the entire waveform, and performing the Fourier transform on this data.
The Fourier transform takes the sampled data (such as in Figure 4-2) and determines the frequencies present in the sound. The discrete Fourier transform of the signal x(n) over the range 0<=n<=(N-1) is defined as

    X(k) = sum[n=0..N-1] x(n)e^(-j2πnk/N)    (Equation 4-1)

or more usually simplified as

    X(k) = sum[n=0..N-1] x(n)W^(nk)    (Equation 4-2)

where W is as defined in Equation 4-3:

    W = e^(-j2π/N)    (Equation 4-3)

and the spectral coefficients X(k) are evaluated for 0<=k<=(N-1)
(from Rabiner & Gold [19]).
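To make Equation 4-2 concrete, it can be transcribed directly into Pascal. The sketch below is for illustration only (the transform size and all names are assumptions, not the system's own code); note that each of the N output coefficients requires a pass over all N input samples, a point returned to below.

const
  nfft = 512;                          { transform size N }
type
  buffer = array[0..nfft-1] of real;

{ direct evaluation of Equation 4-2 }
procedure direct_dft(var x: buffer; var xr, xi: buffer);
var
  n, k: integer;
  angle: real;
begin
  for k := 0 to nfft - 1 do
  begin
    xr[k] := 0.0;
    xi[k] := 0.0;
    for n := 0 to nfft - 1 do
    begin
      angle := 2.0 * pi * n * k / nfft;      { the argument of W^(nk) }
      xr[k] := xr[k] + x[n] * cos(angle);    { real part of X(k) }
      xi[k] := xi[k] - x[n] * sin(angle);    { imaginary part of X(k) }
    end;
  end;
end;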
The value of N determines the accuracy of the transform.
The frequency axis is divided into N sections, but in practice only
half this number are usable, since the graph is symmetrical about its midpoint;
so the number of frequency bands is ½N. Since the highest
frequency representable in any data is half the sampling frequency (the
Nyquist rate; Rabiner & Schafer [20]), the width
of each frequency band in the Fourier transform is given by Equation 4-4:

    width = (f/2) / (½N) = f/N    (Equation 4-4)

where f is the sampling frequency. For f=44100Hz and N=1024 this gives a width of about 43Hz.
This accuracy is not in itself enough to resolve pitches at the lower end of the expected range (say, around 65Hz, where the difference between adjacent semitones is less than 4Hz). However, the resolution can be increased by doubling N, which halves the width of each bar in the spectrogram, and by halving the sampling frequency, which also halves the width of each bar. So, if we halve the sampling frequency from 44100Hz to 22050Hz and double N from 1024 to 2048, the width of each band will be 22050/2048, or approximately 10.8Hz.
This is still not accurate enough to resolve individual peaks at around
the 65Hz mark, but if higher harmonics are present then subharmonic summation
can try to resolve these pitches.
Care must be taken when halving the sampling frequency in this manner
that no data is lost: the Nyquist rate must be taken into account,
and the rate halved only if no frequency data will be lost.
Unfortunately, the discrete Fourier transform is not an efficient algorithm: computed directly it requires of the order of N² operations, performing the same sets of calculations several times without reusing the results. The fast Fourier transform is based on work from this century to make the Fourier transform more efficient, and it does make better use of interim results. It gives the same output as the discrete Fourier transform but requires only of the order of N log N operations, which means it is quicker, and that N can be increased with much less of an overhead than if the discrete Fourier transform is used.
The fast Fourier transform splits the data up into pairs and performs
discrete Fourier transforms with N=2, then combines the results
to give the same output as if a discrete Fourier transform of the whole
data had been done.
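That splitting scheme can be written as a compact recursive sketch. This is an illustration only: the buffer is kept small, the even and odd halves are copied rather than transformed in place as a production FFT would do, and this is not the system's own implementation.

const
  maxn = 64;                       { must be a power of two }
type
  cbuf = array[0..maxn-1] of real;

{ recursive radix-2 FFT on real/imaginary buffers }
procedure fft(n: integer; var re, im: cbuf);
var
  k, half: integer;
  evr, evi, odr, odi: cbuf;        { even- and odd-indexed halves }
  ang, wr, wi, tr, ti: real;
begin
  if n = 1 then exit;              { a 1-point transform is the sample itself }
  half := n div 2;
  for k := 0 to half - 1 do
  begin
    evr[k] := re[2*k];    evi[k] := im[2*k];
    odr[k] := re[2*k+1];  odi[k] := im[2*k+1];
  end;
  fft(half, evr, evi);             { transform each half... }
  fft(half, odr, odi);
  for k := 0 to half - 1 do        { ...then combine using W^k }
  begin
    ang := -2.0 * pi * k / n;
    wr := cos(ang);   wi := sin(ang);
    tr := wr * odr[k] - wi * odi[k];
    ti := wr * odi[k] + wi * odr[k];
    re[k]        := evr[k] + tr;   im[k]        := evi[k] + ti;
    re[k + half] := evr[k] - tr;   im[k + half] := evi[k] - ti;
  end;
end;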
The Fourier transform gives the frequency and phase information for
each bar in the spectrogram, but this system disregards the phase information
and works only with the frequency data. De Poli et al. [7] say "the ear
is 'phase-deaf'" and "the description in terms of the spectrum alone is
concise and adequate."
The output of the Fourier transform is something like Figure
4-4.
Data such as in Figure 4-4 can be plotted over time
as in Figure 4-5 to give an idea of what is happening
in the data. It should be noted that the window is usually shifted
by a fraction of N (typically N/4) in order to smooth the
output somewhat.
The accuracy of the time axis (i.e., the width in ms of each vertical bar) can be determined thus:

    width = 1000N/(df) ms

where f is the sampling frequency in Hz, N is the size of the Fourier transform and d is the shift factor. So for this system, N=512, d=4 and f=11025, giving

    width = (1000 × 512)/(4 × 11025) ≈ 11.6ms

Therefore each point in the spectrogram represents the amount of energy present at that frequency for that given 11.6ms window.
This means that once an onset has been found in terms of the number
of windows into the file, it can be calculated in ms.
The data from the FFT is then passed to the subharmonic summation routine
to determine the pitches of the notes present, and to the vertical edge
find to determine the onset times for the notes (as described in Figure
4-1).
The mathematics behind the fast Fourier transform can be found in almost
any textbook on digital signal processing (see Kuc, for example).
4.2 Hamming Window
The FFT works by taking a section of the data, treating it as periodic
and representative of the entire waveform, and then determining
the frequencies present. This can cause errors when a waveform such
as Figure 4-6 is repeated as in Figure 4-7.
The abrupt ending of the data at the loop point can cause the FFT algorithm to believe that another frequency is present when in fact it probably is not.
So before the data is passed to the FFT it has to be passed through some sort of window function. There are several possible ways of doing this. One of the simplest is the Hamming window, whereby the waveform is smoothly faded in and faded out, so that the looping is more gradual.
The Hamming window is defined as

    w(n) = 0.54 - 0.46cos(2πn/(N-1))

where n is taken from 0 to (N-1). This looks like Figure 4-8,
and when the waveform from Figure 4-6 is multiplied by this scaling
factor, it looks like Figure 4-9.
Figure 4-9 then repeats smoothly as Figure
4-10, which is analysed more cleanly than Figure 4-7,
and is clearly more natural-looking.
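Applying the window is then a single multiplicative pass over each analysis frame before the FFT. A minimal sketch (the frame size and names are assumptions, not the system's own code):

const
  nfft = 512;                        { transform size N }
type
  frame = array[0..nfft-1] of real;

{ multiply one analysis frame by the Hamming window, in place }
procedure apply_hamming(var x: frame);
var
  n: integer;
begin
  for n := 0 to nfft - 1 do
    x[n] := x[n] * (0.54 - 0.46 * cos(2.0 * pi * n / (nfft - 1)));
end;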
4.3 Subharmonic Summation
A subharmonic is a frequency which is a unit fraction of that of the fundamental. For example, a 100Hz tone would have subharmonics at 50Hz, 25Hz, 12½Hz, and so on.
Subharmonic summation (SHS) works by examining each peak in the spectrogram (such as in Figure 4-4), then asking "Could this be the fundamental?", "If this is the second harmonic, what is the fundamental?", … , "If this is the n-th harmonic what is the fundamental?" where n is a suitably large value.
It then uses these possible fundamentals to determine which is most
likely by seeing which has the most 'votes' from all the peaks in the spectrogram.
For example, if peaks are detected at 300Hz, 450Hz and 1050Hz, subharmonic
summation will construct the following table:
Harmonic number    300Hz peak    450Hz peak    1050Hz peak
       1               300           450           1050
       2               150           225            525
       3               100           150            350
       4                75           112.5          262.5
       5                60            90            210
       6                50            75            175
       7                42.9          64.3          150
       8                37.5          56.3          131.3
       9                33.3          50            116.7
      10                30            45            105

Table 4-2: Candidate fundamentals (in Hz) for detected peaks at 300Hz, 450Hz and 1050Hz.
So, the subharmonic summation algorithm has determined that the most
likely fundamental is 150Hz. This description of subharmonic summation
owes quite a bit to Brown [2].
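The voting scheme itself can be sketched as follows. This is a deliberately simplified illustration with assumed names and data structures: Hermes's actual algorithm weights each vote by the spectral amplitude and gives decreasing weight to higher subharmonics, whereas this sketch simply counts unweighted votes.

const
  maxpeaks = 16;                 { spectral peaks considered }
  maxsub   = 10;                 { subharmonics tried per peak (the n above) }
type
  peaklist = array[1..maxpeaks] of real;

{ return the candidate fundamental with the most votes }
function shs_estimate(var peak: peaklist; npeaks: integer): real;
const
  tol = 0.03;                    { matching tolerance, roughly half a semitone }
var
  i, j, votes, bestvotes, ncand: integer;
  cand: array[1..maxpeaks*maxsub] of real;
  best: real;
begin
  { generate every candidate: "if peak i is the j-th harmonic..." }
  ncand := 0;
  for i := 1 to npeaks do
    for j := 1 to maxsub do
    begin
      ncand := ncand + 1;
      cand[ncand] := peak[i] / j;
    end;
  { the candidate agreed on by the most peaks wins the vote }
  bestvotes := 0;
  best := 0.0;
  for i := 1 to ncand do
  begin
    votes := 0;
    for j := 1 to ncand do
      if abs(cand[j] - cand[i]) < tol * cand[i] then
        votes := votes + 1;
    if votes > bestvotes then
    begin
      bestvotes := votes;
      best := cand[i];
    end;
  end;
  shs_estimate := best;
end;

Called with the three peaks of the example above (300Hz, 450Hz and 1050Hz), shs_estimate returns 150Hz, reproducing the result of Table 4-2.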
There seems to be no definitive way to determine which n is adequate
for determining the fundamental. Experimenting with values of n
produced the graph in Figure 4-11.
Figure 4-11 suggests that the accuracy of pitch
determination by subharmonic summation is at a maximum of 65% when n
is 19 or 20. (This example is for determining the pitch of a 220Hz square
wave, using a value of N=512 for the FFT. The subharmonic
summation was performed over 200 different sections of the square wave).
Hermes [10] used 15 subharmonics and reported an error
rate of 1.4% for the pitch measurement of speech signals.
It is clear from Figure 4-11 that above a certain
n the accuracy actually decreases. It is possible that this
is caused by calculating so many subharmonics that the subharmonic summation
takes the real fundamental to be the second (or third, ...) harmonic of
some imaginary subharmonic. This can be demonstrated using an example
based on Table 4-2: if the 'Harmonic number'
column were extended to, say, 14 values, the column for 1050Hz would have
an entry of 75Hz. Ambiguities would then arise when the votes were
counted, since there would be an identical number of entries for 150Hz and
75Hz.
It should be noted here that the autocorrelogram is usually the preferred
pitch determination algorithm. While not flawless itself, it can
explain almost all perceptual pitch phenomena. Its chief drawback
is its computational cost (Brown [2]).
4.4 Vertical Edge Find
The vertical edge find is performed on the data passed from the FFT
(such as Figure 4-5) in order to obtain onset and offset
times for notes.
Pratt [17] states that "vertical edge sharpening can
be obtained by the running difference operation which produces an output
image according to the relation"

    G(j, k) = F(j, k) - F(j, k-1)

where F is the input image and G the output image.
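In terms of the spectrogram, with j indexing the frequency bands and k the vertical bars (time), this amounts to subtracting each column from its successor. A sketch (array sizes kept small and all names assumed for illustration):

const
  maxbands = 64;                 { frequency bands }
  maxbars  = 100;                { time steps (vertical bars) }
type
  image = array[0..maxbands-1, 0..maxbars-1] of real;

{ running difference along the time axis, emphasising sudden rises }
procedure running_difference(var f, g: image; nbands, nbars: integer);
var
  j, k: integer;
begin
  for j := 0 to nbands - 1 do
  begin
    g[j, 0] := 0.0;              { the first bar has no left neighbour }
    for k := 1 to nbars - 1 do
      g[j, k] := f[j, k] - f[j, k - 1];
  end;
end;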
However, when the algorithm from Pratt [17] is performed
on real data (such as Figure 4-5), the result (in Figure
4-14) is less than ideal.
The data from Figure 4-14 has to be interpreted
and converted into onset times, which are then combined with the results
from the subharmonic summation into the data representing which notes are
actually present in the sound.
Spectrographic data is interpreted using the following algorithm (presented here in pseudocode).
for each vertical bar in the spectrogram (after vertical edge find) do
    sum all the values present at each frequency
    output this value
next
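Using the image type from the previous sketch, this pseudocode translates to something like the following (illustrative only):

{ collapse each vertical bar (after edge find) to a single value }
procedure column_sums(var g: image; nbands, nbars: integer);
var
  j, k: integer;
  total: real;
begin
  for k := 0 to nbars - 1 do
  begin
    total := 0.0;
    for j := 0 to nbands - 1 do
      total := total + g[j, k];  { sum all values at this time step }
    writeln(total:0:3);          { one value per vertical bar }
  end;
end;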
The graph in Figure 4-15 looks extremely promising
for analysis and indeed can be interpreted simply using a linear decay
function. The output from such an analysis is given in Figure
6-1.
After seeing the output in Figure 4-16 it became
apparent that the algorithm worked very well on real instruments, but fell
down on synthesised sounds. A possible reason for this is that the
synthesised data is too pure, and the noise present in the acoustic data
actually helps the analysis. However, when random noise was added
to the synthesised data it made no noticeable difference to the accuracy
of the onset detection.
The difference function was then changed to a number of alternative forms, and even more elaborate functions were tried, but none of these gave any noticeable improvement.
The function was also widened, so that it was effectively implementing a delay
inhibitor by performing the difference over a wider horizontal range, but
this too made no difference to the accuracy of onset detection.
The system is smart enough to attempt the subharmonic summation only at a recognised onset point. There is little point in trying to determine the frequency continuously through the duration of the note, since it is unlikely to change; if the performer applies a wide vibrato, for example by bending a guitar string, the onset detector may in any case see a new note being played.
However, it must be noted that the system does not attempt the frequency
determination at the exact onset of a note. Because of the way the
onset detection works, the frequency analysis is performed on the third
or fourth window after the onset, which in this system can amount to some
30 or 40ms. This gives the harmonics time to settle down, so that a better
estimate of the true pitch can be made.
5. Implementation

The system was implemented in Borland Pascal 7 under Microsoft Windows 95. However, it only uses standard MS-DOS input and output functions, and so runs in a DOS prompt.
It runs at approximately 12 times real time on an Intel Pentium 133 system, so is not entirely unfeasible for the analysis of music. The operating speed is not affected by the amount of available memory.
Because very few Borland-specific commands have been used in implementing
the system it should be quite portable. For example, it should compile
under UNIX quite easily.
The system assumes that the input file is an eight bit mono 11,025Hz Windows .WAV file, although this could be changed quite simply by altering the code.
This sampling rate was chosen in order to get quite a high-definition
frequency analysis, and the file was restricted to eight-bit mono to
limit the file size.
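For illustration, reading such a file might look like the sketch below. This assumes the common 44-byte PCM header layout and does no error checking; all names are assumptions, the system's own loader is not shown here, and a real run would process more than one block.

const
  maxsamples = 8192;             { one block; a real loader would stream }
type
  samplebuf = array[0..maxsamples-1] of real;

{ read 8-bit mono samples from a .WAV file into buf }
procedure load_wav(name: string; var buf: samplebuf; var count: integer);
var
  f: file of byte;
  i: integer;
  b: byte;
begin
  assign(f, name);
  reset(f);
  for i := 1 to 44 do
    read(f, b);                        { skip the RIFF, fmt and data headers }
  count := 0;
  while (not eof(f)) and (count < maxsamples) do
  begin
    read(f, b);
    buf[count] := (b - 128) / 128.0;   { 8-bit WAV samples are unsigned }
    count := count + 1;
  end;
  close(f);
end;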
So far I have stuck quite well to the timetable in the first interim report (the revised version; the first edition was hopelessly unrealistic).

A pitch determination algorithm (subharmonic summation in this case) has been implemented quite successfully.

Onset and offset detectors (the vertical edge find) have also been implemented.
6. Results

6.1 Output from the System

When the subharmonic summation was finally integrated with the onset
detection, the following results were obtained.
[Table of pitch determination accuracy for each synthesised test file.]
The accuracy percentage in this table reflects the percentage of times that the pitch estimate is exactly correct. Incorrect estimates are disregarded altogether, irrespective of how close they actually are. For example, in the sin_440.wav pitch estimation, every time the pitch estimate was not 440Hz it was in fact 466.16Hz, an error of just one semitone.
These values average out at an accuracy of 65.22%, although it is clear
that some waveforms are analysed more easily than others. A possible
reason for the total failure on the 3_4_500.wav file could be
that the frequencies present do not have precise equivalents in the well-tempered
diatonic scale with A=440Hz, so the system is getting rather confused about
what it is hearing.
Accuracies for the acoustic files would be rather meaningless (especially
for the polyphonic recordings) since the comparison data would have to
be known to be perfect in order to determine the accuracy of the pitch
determination.
There are no tables to show the accuracy of onset detection since all
techniques tried were highly inaccurate and produced few meaningful values.
The system's output for each sample file is not given in full: the outputs
are mostly rather long-winded because of the faulty onset detection,
and can in any case easily be regenerated by simply running the program.
6.2 Discussion

The project changed pace quite dramatically after the first viva. One of the main points that arose from the meeting was how little reading I had done. I had assumed that the project would be completely straightforward. Once a few points had been gently pointed out, I realised how much work really was involved.
The first thing I did after the viva was to read plenty more material,
including re-reading material I had already covered. I then produced
a rather more realistic timetable for the rest of the first semester, which
I managed to keep to quite well.
6.3 Conclusions

While implementing the system, several points arose.

The pitch determination algorithm implemented works fine with synthesised
data but is not reliable for real data. Since Hermes [10] reported an
error rate of only 1.4% on speech signals, this suggests a fault in this
implementation rather than in subharmonic summation itself.

The onset detection implemented is fairly reliable for real data, but
not at all reliable for synthesised data.
6.4 Future Work
There are several ways in which the output of the system could be enhanced
to more accurately reflect the characteristics of the input file.
6.4.1 Interpreting the output with respect to time signature and tempo
It is clearly desirable to have the output score correctly labelled with the time signature and tempo. Both of these may seem trivial at first thought but relatively simple examples can convince the reader otherwise.
The Beatles' song 'Happiness Is A Warm Gun' (Figure 6-4), for example, changes time signature four times in as many bars. A system which only has the data for the notes of the melody has little hope of determining what is happening.
However, this problem is not just difficult for a computer: a musician
would find it difficult to recognise the time signature changes just from
hearing a melody such as this.
If more data were available, for example if the system saw multiple onsets
at the start of each bar, perhaps representing chords or percussion, then
the task might be simplified slightly.
The system should also recognise that an instruction such as accelerando
has been interpreted by the performer and label the score as such instead
of just putting the notes closer together and ignoring the tempo change.
6.4.2 Interpreting the output with respect to key signature
The system should ideally mark the correct key signature on the output
score. Similar examples to the one for changing time signature could
demonstrate that the key signature can also change quite often.
6.4.3 Polyphonic transcription
Polyphonic transcription is a natural progression from monophonic transcription, but it presents a whole new set of problems. These include the pitch determination itself: polyphonic subharmonic summation is nontrivial, although the algorithm is more capable of coping with polyphony than zero-crossing, for example. It is possible to perform the subharmonic summation many times over the same set of data to attempt to find the polyphonic frequencies, as sketched below.
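One possible shape for that idea, given purely as an untested sketch, is to estimate the strongest fundamental, remove its harmonics from the spectrum, and repeat. The type spectrum and the helpers shs_estimate_from_spectrum and remove_peak are hypothetical.

{ hypothetical iterative polyphonic subharmonic summation }
procedure poly_shs(var spect: spectrum; maxnotes: integer);
var
  note, h: integer;
  f0: real;
begin
  for note := 1 to maxnotes do
  begin
    f0 := shs_estimate_from_spectrum(spect);  { assumed helper: best vote }
    if f0 <= 0.0 then exit;                   { nothing left to find }
    writeln('fundamental found at ', f0:0:1, 'Hz');
    for h := 1 to 10 do
      remove_peak(spect, h * f0);   { assumed helper: delete this note's harmonics }
  end;
end;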
6.4.4 Accurate instrument map
The score would ideally be correctly marked for whichever instrument was playing. This is a non-trivial problem, even for monophonic music. The system could use dynamic time warping to pattern match to templates of instruments, perhaps in the frequency domain.
When the correct instrument has been determined it could then notate the music according to the characteristics of that instrument, i.e., taking into account the clef that is normally used and any transposition necessary.
Brown and Cooke [4] describe work to distinguish between
different instruments by examining the brightness and onset asynchrony
of harmonics.
6.4.5 Detection of offsets
This is less trivial than the onset problem since most acoustic instruments
do not have an easily defined offset. The notes often tail away until
they are no longer perceptible. However offset detection is essential
for the correct transcription of music. A pause in the recorded
music should be notated correctly. Notes should not sustain just
because another note has not started, but should accurately reflect the
end point.
6.4.6 Comparison of performances
With a few modifications the system could be used as a music tutor.
It could output a file through a synthesiser and ask the user to play it
back on their desired instrument. The data gathered by the system
could then be compared to what was output and a mark given for accuracy
in terms of correct pitch and timing.