Towards Automatic Monophonic

Music Transcription

 

Andrew J. Wales

 

 



 

Table of Contents
 

1.    INTRODUCTION

2.    REVIEW OF PREVIOUS WORK
2.1    AUDITORY RESEARCH
2.2    SIGNAL PROCESSING
2.3    AUDITORY SCENE ANALYSIS

3.    ANALYSIS AND SPECIFICATION
3.1    LIMITATIONS
3.2    THE TEST SET

4.    DESIGN
4.1    FAST FOURIER TRANSFORM
4.2    HAMMING WINDOW
4.3    SUBHARMONIC SUMMATION
4.4    VERTICAL EDGE FIND

5.    IMPLEMENTATION

6.    RESULTS
6.1    OUTPUT FROM THE SYSTEM
6.2    DISCUSSION
6.3    CONCLUSIONS
6.4    FUTURE WORK

7.    REFERENCES

 




1. Introduction

The idea of this project is to create an automatic music transcription system.  The system will be 'fed' pre-recorded music, and will analyse it to produce a score of what is being played.  It will record the pitch, duration and amplitude of all events, and notate these in the output.

This project mainly draws on previous research in two fields:  digital signal processing and auditory scene analysis.  Digital signal processing concerns the mathematical analysis of data to filter it or perform a frequency analysis.  Auditory scene analysis is about interpreting sensory input and deriving a meaningful representation.

The main mathematics in this project is based on the work of Jean Baptiste Joseph Fourier in the early part of the nineteenth century.  Fourier discovered that continuous functions can be expressed as the sum of an infinite series of sines and cosines, and this result can be used to obtain a frequency analysis of a periodic waveform.  With a few slight modifications this can be used with aperiodic waveforms and when this analysis is performed many times over data representing sound, it is possible to determine which frequencies are present at any given time, and graphical representations such as Figure 1-1 are produced.
 

Figure 1-1, a spectrogram, i.e., a graph of time (covering a range of about 7.5s) against frequency (from 0kHz to about 3kHz), where a darker point means higher energy at that point, representing 'Promenade' from Mussorgsky's 'Pictures At An Exhibition'

Not everybody has perfect pitch, and it is not always feasible to attempt a transcription by ear, so this system can be used instead.  Even if this was not the case, there would still be a reason for this work.  A parallel can be drawn with current speech research projects:  humans are perfectly capable of speech recognition, but there are many research projects around the world covering all aspects of speech recognition, since there are certain circumstances where it is preferable to have a computer recognising the speech.  There is also the intellectual challenge of determining a scientific method of performing an intuitive human task.
 
 




2. Review of Previous Work

The literature is also reviewed throughout the body of the report, when it is appropriate to discuss it in terms of the work carried out.
 

2.1 Auditory Research

Bregman [1] discusses the difference in attitudes that existed towards research in vision and audition, and describes how audition may have been considered a "simpler sense" with "no important perceptual phenomena," but a few examples dismiss this and establish auditory perception as a valid research topic.

As this work will demonstrate, audition is anything but "simple."
 
 

2.2 Signal Processing

According to De Poli et al. [7], "the time-domain representation is not very meaningful."  The implication of this is that it is essential to convert the data to the frequency domain before attempting to process it.

The literature concerning signal processing covers the subject to various levels.  They range from the straightforward (Pettit [16], which describes itself as being particularly appropriate to non-mathematicians) to the exhaustive (Proakis and Manolakis [18], which assumes a background in the field).

They all have sections on Fourier analysis, and most cover advanced aspects of digital signal processing not relevant to this project, such as filters and formant detection.
 
 

2.3 Auditory Scene Analysis

The literature on auditory scene analysis covers the attempts to interpret sensory input and derive a meaningful representation.  It concerns determining scientific methods of performing what the human auditory system does in the perception of sound.  These books rarely contain large amounts of mathematical formulae.  They often discuss experiments where human listeners are played a sound, be it noise, music or a combination, and their perceptions are recorded.  For example the listener may hear a tone that is not actually present, and then the researchers have to examine their algorithms to see if they fit the results.

Although the book by Cooke et al. [6] is mostly concerned with the analysis of speech signals, it contains a chapter on pitch perception, and one on onset detection.

Handel [9] is particularly relevant to the analysis of music, and even goes into such depth as to include a chapter on the physiology of the human ear.

Hermes's chapter in Cooke et al. [6] admits that the zero-crossing pitch determination algorithm is "quite unexacting" and recommends that a more reliable pitch determination algorithm be employed if possible.  Hermes says that subharmonic summation is a "very powerful PDA" and is "representative of the second generation" of pitch determination algorithms, which are the PDAs based on pitch-perception theories.

Subharmonic summation is particularly appropriate to the analysis of music since it uses the information from the higher harmonics to estimate the fundamental and most music has harmonics present.  The only exception would be music consisting of pure sine waves, and music such as this is very rare indeed.
 
 
 
 



 
 

3. Analysis and Specification
 

3.1 Limitations

Hobley [11] attempted a similar system to this but used zero-crossing for pitch determination.  Zero-crossing is a rather poor pitch determination algorithm: it cannot easily be expanded to cope with polyphony, and it can miss high frequencies altogether.  If, for example, the high-amplitude low-frequency wave of Figure 3-1 is combined with the low-amplitude high-frequency wave of Figure 3-2 as shown in Figure 3-3, zero-crossing will miss the high-frequency wave altogether.
 

Figure 3-1, a time-domain representation of a high-amplitude, low-frequency wave
 
Figure 3-2, a time-domain representation of a low-amplitude, high-frequency wave
 
Figure 3-3, the composite of Figure 3-1 and Figure 3-2, which will cause zero-crossing pitch determination to give incorrect results
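This failure mode is easy to reproduce numerically.  The following sketch (illustrative Python, not code from the system, which was written in Pascal) mixes a 100Hz sine at full amplitude with a 1000Hz sine at 5% amplitude; counting zero crossings recovers only the low frequency.

```python
import numpy as np

fs = 11025                                     # sampling rate, as used by the system
t = np.arange(fs) / fs                         # one second of sample times
low = np.sin(2 * np.pi * 100 * t)              # high-amplitude, low-frequency wave
high = 0.05 * np.sin(2 * np.pi * 1000 * t)     # low-amplitude, high-frequency wave
composite = low + high

def zero_crossing_estimate(signal, rate):
    """Estimate frequency as (number of sign changes) / (2 x duration)."""
    crossings = np.sum(np.abs(np.diff(np.signbit(signal).astype(int))))
    return crossings / 2 / (len(signal) / rate)

estimate = zero_crossing_estimate(composite, fs)
# estimate comes out near 100Hz; the 1000Hz component leaves no trace
```

Because the low-frequency component dominates, the composite crosses zero only where the low-frequency wave does, so the estimate sits at the low frequency.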
 

As discussed in section 2.3, Hermes (in Cooke et al. [6]) describes subharmonic summation as a "very powerful PDA" that is "representative of the second generation" of pitch determination algorithms, i.e., the PDAs based on pitch-perception theories.  It was decided early on that subharmonic summation should be used as the pitch determination algorithm.

In order to test the reliability of the pitch determination algorithm it was fed the data of a computer-generated square wave and asked to find its frequency.  The results of this can be found in the section describing subharmonic summation, and in particular Figure 4-11.
 

The accuracy of the onset and offset detection can be tested in a similar manner.  Passing generated data (with known onsets and offsets) to the routine and comparing its outputs to what they should be will give an accuracy such as was found when testing the subharmonic summation.
 

At the moment the system is set up for the well-tempered diatonic scale with A above middle C at 440Hz, as the following code fragment shows.

procedure do_soundtab;
{ makes the lookup sound table }
{ a above middle c (index 3*12+9 = 45) = 440Hz }
var
  oct, note: integer;
  notediff: integer;
  step: real;
begin
  step := exp(ln(2)/12);   { equal-tempered semitone ratio, 2^(1/12) }
  for oct := 0 to 7 do
    for note := 0 to 11 do
    begin
      notediff := (3*12 + 9) - (oct*12 + note);
      soundtab[oct*12 + note] := 440/exp(notediff*ln(step));
    end;
end;
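The same table can be rebuilt in a few lines of Python to check the arithmetic (the indexing scheme and the 440Hz reference are taken from the Pascal fragment above; this sketch is for verification only).

```python
# Rebuild soundtab as do_soundtab does: index = oct*12 + note, with
# A above middle C (index 3*12+9 = 45) fixed at 440Hz
step = 2 ** (1 / 12)                                   # semitone ratio
soundtab = [440 / step ** (45 - i) for i in range(8 * 12)]
# soundtab[45] is A4 (440Hz), soundtab[57] is A5 (880Hz),
# soundtab[36] is middle C (~261.63Hz)
```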

This could be modified quite easily to cater for music based on a different frequency for A (such as was in common use before this century) or the Pythagorean tuning (for example), or for microtonal music such as the quarter tone scale used in Islamic music.

However the usefulness of such a transcription would be questionable, since most commercial music synthesisers limit themselves to the well-tempered diatonic scale with A=440Hz.  According to Moore [15] this is "because they are intended primarily as cost-effective substitutes for other traditional instruments rather than augmentations of the musical palette."  Clearly this should not be the case: a composer should gain greater musical freedom by using a computer, not be restricted by it.
 

3.2 The Test Set

The following files were used during testing of the system, in order of complexity (judged arbitrarily).
 

3.2.1 Synthesised files
 
 
 
sin_440.wav     A 440Hz sine wave.
sqr_440.wav     A 440Hz square wave.
440_arps.wav    0.1s of silence, followed by a 0.5s sine wave at 440Hz, 0.1s of silence, a 0.5s sine wave at 554.37Hz, 0.1s of silence, a 0.5s sine wave at 659.26Hz and 0.1s of silence.
3_4_500.wav     Simultaneous sine waves at 300Hz, 400Hz and 500Hz.  These should imply a missing fundamental of 100Hz.
440_arpm.wav    Similar to 440_arps.wav but includes two harmonics for each sine wave.
 
These synthesised files were all created with Cool Edit, a shareware sample editor/creator, which makes generation of such files quite simple.
 
 

3.2.2 Acoustic files
 
 
promenade.wav           'Promenade' from Mussorgsky's 'Pictures at an Exhibition', consisting of a solo monophonic piano.
come as you are.wav     'Come As You Are' by Nirvana, a solo monophonic guitar (with a chorus effect).
adagio sostenuto.wav    From 'Moonlight Sonata' by Beethoven.  A solo piano, but polyphonic.
unhatched chickens.wav  'Unhatched Chickens' by Mussorgsky.  A solo piano, but polyphonic.
a hard day's night.wav  The first chord of the Beatles' 'A Hard Day's Night'.  Many instruments playing just one chord.
 
The acoustic files were all sampled from compact disc using an 8-bit SoundBlaster clone, and are saved as mono at 11025Hz.
 

All of these files last in the region of five seconds.
 
 




4. Design
 

The work carried out combines four different algorithms.

Before any of this can be done it is necessary to get a representation of the sound into the computer.  If the data is synthesised this could be a software program that outputs waveforms, but if 'real' data (i.e., acoustic instruments) is to be used then some sort of sound sampler is needed.
 

Once the data is in the computer it is passed through each of the described stages, as shown in Figure 4-1.
 

Figure 4-1, a diagram of the flow of data through the system
 
 
 
 

4.1 Fast Fourier Transform
 

The Fourier transform is based on the work of Jean Baptiste Joseph Fourier who discovered that continuous functions can be expressed as the sum of an infinite series of sines and cosines.  This result can only be used to obtain a frequency analysis of a periodic waveform.

The Fourier transform can be adapted for a frequency analysis of aperiodic waveforms by taking a section of the aperiodic data, treating it as a continuous loop representing the entire waveform, and performing the Fourier transform on this data.
 

The Fourier transform takes the sampled data (such as in Figure 4-2) and determines the frequencies present in the sound.
 

Figure 4-2, a time-domain graph (i.e., of time against amplitude) of a representative waveform
 

The discrete Fourier transform of the signal x(n) over the range 0<=n<=(N-1) is defined as Equation 4-1, or more usually simplified as Equation 4-2, where W is as defined in Equation 4-3.

X(k) = sum[n=0..N-1] x(n)e^(-j2πkn/N)

Equation 4-1, standard equation for discrete Fourier transform

X(k) = sum[n=0..N-1] x(n)W^kn

Equation 4-2, simplified Fourier transform

W = e^(-j2π/N)

Equation 4-3, definition of W for simplified Fourier transform

and the spectral coefficients X(k) are evaluated for 0<=k<=(N-1) (from Rabiner & Gold [19]).
 
 

The value for N determines the accuracy of the transform.  The frequency axis is divided into N sections, but in practice only half this number are usable since the graph is symmetrical about its midpoint.  So the number of frequency bands is ½N.  Since the highest frequency representable in any data is half the sampling frequency (the Nyquist rate (Rabiner & Schafer [20])), the width of each frequency band in the Fourier transform is given by Equation 4-4.
 
 

width of each band = f/N  (where f is the sampling frequency in Hz and N the size of the Fourier transform)

Equation 4-4, for determining the width of each bar in a spectrogram for a given sampling rate and size of Fourier transform
 
 
So, with a sampling rate of 44100Hz and a Fourier transform of 1024 points, the width of each band will be 44100/1024 ≈ 43.1Hz.

This accuracy is not in itself enough to resolve pitches at the lower end of the expected range (say, around 65Hz, where the difference between adjacent semitones is less than 4Hz).  However the resolution can be increased by doubling N, which halves the width of each bar in the spectrogram, and by halving the sampling frequency, which also halves the width of each bar in the spectrogram.  So, if we halve the frequency from 44100Hz to 22050Hz and double N from 1024 to 2048, the width of each band will be 22050/2048 ≈ 10.8Hz.

This is still not accurate enough to resolve individual peaks at around the 65Hz mark, but if higher harmonics are present then subharmonic summation can try to resolve these pitches.
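These band widths, and the semitone spacing they must resolve, can be checked with a few lines of arithmetic (illustrative Python; C2 at roughly 65.41Hz is taken as the bottom of the expected range).

```python
band_width_a = 44100 / 1024        # Hz per band at 44100Hz with N = 1024
band_width_b = 22050 / 2048        # Hz per band at 22050Hz with N = 2048

# Semitone spacing near the bottom of the range: C2 (~65.41Hz) to C#2
c2 = 440 * 2 ** (-33 / 12)         # 33 semitones below A4 = 440Hz
semitone_gap = c2 * (2 ** (1 / 12) - 1)
# semitone_gap is about 3.9Hz, narrower than either band width
```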
 

Care must be taken when halving the sampling frequency in this manner that no data is lost.  The Nyquist rate must be taken into account and the rate only halved if no frequency data will be lost.
 

Unfortunately the discrete Fourier transform is not efficient: the number of operations it requires grows with the square of N, because it performs the same sets of calculations several times without reusing the results.  The fast Fourier transform is based on work from this century to make the Fourier transform more efficient, and it does make better use of interim results.  The fast Fourier transform gives the same output as the discrete Fourier transform but is a more optimised algorithm, which means it is quicker.

The fast Fourier transform requires a number of operations proportional to Nlog2N, which means that N can be increased with less of an overhead than if the discrete Fourier transform is used.
 
 
 

Number of       Complex Multiplications   Complex Multiplications   Speed Improvement
Points (N)      in DFT (N^2)              in FFT ((N/2)log2N)       Factor
4               16                        4                         4.00
8               64                        12                        5.33
16              256                       32                        8.00
32              1,024                     80                        12.80
64              4,096                     192                       21.33
128             16,384                    448                       36.57
256             65,536                    1,024                     64.00
512             262,144                   2,304                     113.78
1024            1,048,576                 5,120                     204.80
2048            4,194,304                 11,264                    372.36

Table 4-1, a comparison of the efficiency of the discrete Fourier transform and the fast Fourier transform, in terms of the number of multiplications of complex numbers needed for a given window size
 
Figure 4-3, a graph of N against Speed Improvement Factor, taking data from Table 4-1
 

The fast Fourier transform splits the data up into pairs and performs discrete Fourier transforms with N=2, then combines the results to give the same output as if a discrete Fourier transform of the whole data had been done.
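This divide-and-conquer structure can be sketched in a few lines (a textbook radix-2 decimation-in-time FFT in illustrative Python, checked against the direct DFT; the project itself used a Pascal implementation).

```python
import cmath

def fft(x):
    """Radix-2 FFT: split into even/odd halves, transform each, recombine.
    len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def dft(x):
    """Direct discrete Fourier transform, O(N^2), for comparison."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

samples = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
spectrum = fft(samples)
reference = dft(samples)
# The two agree to within rounding error
```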
 

The Fourier transform gives the frequency and phase information for each bar in the spectrogram, but this system disregards the phase information and only works with the frequency data.  De Poli et al. [7] say "the ear is 'phase-deaf'" and "the description in terms of the spectrum alone is concise and adequate."
 

The output of the Fourier transform is something like Figure 4-4.
 
 

Figure 4-4, a graph of frequency against intensity showing a sample frequency analysis
 

Data such as in Figure 4-4 can be plotted over time as in Figure 4-5 to give an idea of what is happening in the data.  It should be noted that the window is usually shifted by a fraction of N (typically N/4) in order to smooth the output somewhat.
 
 

Figure 4-5, a graph of time against frequency (darker=higher intensity) showing a typical spectrogram
 

The accuracy of the time axis (i.e., the width in ms of each vertical bar) can be determined thus:

width (ms) = 1000s/f

where

s = N/d

Where f is the sampling frequency in Hz, N is the size of the Fourier transform and d is the shift factor (so each window is shifted on by s = N/d samples from the previous one).

So for this system, N=512, d=4 and f=11025.

So width = 1000 × (512/4) / 11025 ≈ 11.6ms
 

Therefore each point in the spectrogram represents the amount of energy present at that frequency for that given 11.6ms window.

This means that once an onset has been found in terms of the number of windows into the file, it can be calculated in ms.
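That conversion can be sketched directly (illustrative Python, using this system's values of N, d and f).

```python
N, d, f = 512, 4, 11025            # FFT size, shift factor, sampling rate

def window_to_ms(window_index):
    """Convert an onset position in FFT windows to milliseconds.
    Each window advances by N/d samples, i.e. (N/d)/f seconds."""
    return window_index * (N / d) / f * 1000

# Each window step is about 11.6ms
```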
 

The data from the FFT is then passed to the subharmonic summation routine to determine the pitches of the notes present, and to the vertical edge find to determine the onset times for the notes (as described in Figure 4-1).
 

The mathematics behind the fast Fourier transform can be found in almost any textbook on digital signal processing (see Kuc [13], Lynn & Fuerst [14], Moore [15], Proakis & Manolakis [18], Rabiner & Gold [19], Rabiner & Schafer [20] or indeed almost any other textbook on the subject).
 
 
 

4.2 Hamming Window
 

The FFT works by taking a section of the data, treating it as periodic and representative of the entire waveform, and then determining the frequencies present.  This can cause errors when a waveform such as Figure 4-6 is repeated as in Figure 4-7.
 

Figure 4-6, a graph of time against amplitude, showing a 320 sample window of a wave which has a period of 360 samples
 
 
Figure 4-7, a graph of time against amplitude, showing what is effectively being analysed by the FFT when Figure 4-6 is processed, although the window is treated as being infinite rather than simply having three copies of the data
 

The abrupt ending of the data at the loop point can cause the FFT algorithm to believe that another frequency is present when in fact it probably is not.

So before the data is passed to the FFT it has to be passed through some sort of filter.  There are several possible ways of doing this.  One of the simplest is the Hamming window whereby the waveform is smoothly faded in and faded out, so the looping is more gradual.

The Hamming window is defined as

x[n] = 0.54 - 0.46cos(2πn/(N-1))

(from Proakis & Manolakis [18])

where n is taken from 0 to (N-1).  This looks like Figure 4-8, and when the waveform from Figure 4-6 is multiplied by this scaling factor, it looks like Figure 4-9.
 
 

Figure 4-8, a graph of n against x[n] showing a plot of the Hamming function with N=360 and n from 0 to 359
 
Figure 4-9, a graph of time against amplitude showing how the wave from Figure 4-6 will look after multiplying by the Hamming function represented in Figure 4-8
 
Figure 4-10, a graph of time against amplitude showing how Figure 4-9 repeats more smoothly than Figure 4-6 repeats in Figure 4-7
 

Figure 4-9 then repeats smoothly as Figure 4-10, which is analysed more cleanly than Figure 4-7, and is clearly more natural-looking.
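A sketch of the window and its application (illustrative Python; the coefficients follow the standard Hamming definition, x[n] = 0.54 - 0.46cos(2πn/(N-1))).

```python
import math

def hamming(N):
    """Hamming window coefficients: x[n] = 0.54 - 0.46cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def apply_window(frame):
    """Scale each sample so the frame fades in and out smoothly."""
    return [s * c for s, c in zip(frame, hamming(len(frame)))]

window = hamming(360)   # as plotted in Figure 4-8
# window starts and ends at 0.08 and peaks near 1.0 in the middle
```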
 
 
 

4.3 Subharmonic Summation

A subharmonic is a frequency that is a unit fraction (1/2, 1/3, 1/4, ...) of the fundamental.  For example, a 100Hz tone would have subharmonics of 50Hz, 33⅓Hz, 25Hz, etc.

Subharmonic summation (SHS) works by examining each peak in the spectrogram (such as in Figure 4-4), then asking "Could this be the fundamental?", "If this is the second harmonic, what is the fundamental?", and so on up to "If this is the n-th harmonic, what is the fundamental?", where n is a suitably large value.

It then uses these possible fundamentals to determine which is most likely by seeing which has the most 'votes' from all the peaks in the spectrogram.
 

For example, if peaks are detected at 300Hz, 450Hz and 1050Hz, subharmonic summation will construct the following table:
 
 

Harmonic Number   300Hz   450Hz   1050Hz
1                 300     450     1050
2                 150     225     525
3                 100     150     350
4                 75      113     263
5                 60      90      210
6                 50      75      175
7                 43      64      150
8                 38      56      131
9                 33      50      117
10                30      45      105

Table 4-2, a demonstration of the subharmonic summation algorithm for ten harmonics, and determined peaks at 300Hz, 450Hz and 1050Hz

So, the subharmonic summation algorithm has determined that the most likely fundamental is 150Hz.  This description of subharmonic summation owes quite a bit to Brown [2].
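The voting procedure behind Table 4-2 can be sketched directly (illustrative Python, not the system's Pascal; rounding candidate fundamentals to the nearest Hz before tallying is an assumption of this sketch).

```python
from collections import Counter

def subharmonic_summation(peaks_hz, n_subharmonics=10):
    """For each spectral peak ask 'if this is the k-th harmonic, what is the
    fundamental?', then return the candidate with the most votes."""
    votes = Counter()
    for peak in peaks_hz:
        for k in range(1, n_subharmonics + 1):
            votes[round(peak / k)] += 1
    fundamental, _ = votes.most_common(1)[0]
    return fundamental

# Peaks at 300, 450 and 1050Hz all cast a vote for 150Hz
# (as their 2nd, 3rd and 7th harmonic respectively)
```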
 

There seems to be no definitive way to determine which n is adequate for determining the fundamental.  Experimenting with values of n produced the graph in Figure 4-11.
 
 
 

Figure 4-11, a graph of the accuracy of subharmonic summation for various different numbers of subharmonics.  For each value of n, that number of subharmonics was assumed and the frequency of a 220Hz square wave was estimated by the algorithm over 200 windows.  The accuracy is the percentage of times that the estimate was exactly correct.  Note that the n scale is non-linear

Figure 4-11 suggests that the accuracy of pitch determination by subharmonic summation is at a maximum of 65% when n is 19 or 20.  (This example is for determining the pitch of a 220Hz square wave, using a value of N=512 for the FFT.  The subharmonic summation was performed over 200 different sections of the square wave.)  Hermes [10] used 15 subharmonics and noticed an error rate of 1.4% for the pitch measurement of speech signals.
 

It is clear from Figure 4-11 that above a certain n the accuracy actually decreases.  It is possible that this is caused by calculating so many subharmonics that the subharmonic summation decides that the real fundamental is the second (or third) harmonic of some imaginary subharmonic.  This can be demonstrated using an example based on Table 4-2.  If the 'Harmonic Number' column was extended to, say, 14 values, the column for 1050Hz would gain an entry of 75Hz.  Ambiguities would then arise when the votes were counted, since there would be an identical number of entries for 150Hz and 75Hz.
 

It should be noted here that the autocorrelogram is usually the preferred pitch determination algorithm.  While not flawless itself, it can explain almost all perceptual pitch phenomena.  Its chief drawback is its computational cost (Brown [2]).
 
 
 

4.4 Vertical Edge Find

The vertical edge find is performed on the data passed from the FFT (such as Figure 4-5) in order to obtain onset and offset times for notes.
 

Pratt [17] states that "vertical edge sharpening can be obtained by the running difference operation which produces an output image according to the relation
 

This amounts to the following function, which is then applied to every point in the spectrogram.
 
 
  1   0   -1

This seems to be quite effective when performed on high-contrast images, such as in Figure 4-12, resulting in the image in Figure 4-13.
 
 
Figure 4-12, abstract figures, NOT real spectrographic data
 
 
Figure 4-13, abstract figures from Figure 4-12 after simple vertical edge find
 

However, when the algorithm from Pratt [17] is performed on real data (such as Figure 4-5), the result (in Figure 4-14) is less than ideal.
 
 

Figure 4-14, copy of Figure 4-5 after vertical edge find
 

The data from Figure 4-14 has to be interpreted and converted into onset times, which are then combined with the results from the subharmonic summation into the data representing which notes are actually present in the sound.
 

Spectrographic data is interpreted using the following algorithm (presented here in pseudocode).

for each vertical bar in the spectrogram (after vertical edge find)
do
  sum all the values present at each frequency
  output this value
next
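The pseudocode above can be sketched as follows (illustrative Python on a toy spectrogram; the real system worked on the FFT output).

```python
import numpy as np

def onset_strength(spectrogram):
    """Run the 1 0 -1 difference along the time axis of each frequency bin,
    then sum all the values present in each vertical bar."""
    diff = spectrogram[:, 2:] - spectrogram[:, :-2]   # out[t] = in[t+1] - in[t-1]
    sums = np.zeros(spectrogram.shape[1])
    sums[1:-1] = diff.sum(axis=0)
    return sums

# Toy spectrogram: 10 frequency bins by 20 time frames;
# three bins switch on at frame 10, mimicking a note onset
spec = np.zeros((10, 20))
spec[2:5, 10:] = 1.0
strength = onset_strength(spec)
# strength peaks at the frames either side of the onset and is zero elsewhere
```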
 
 
 
 

 
Figure 4-15, a graph of time against amplitude showing the onset data for "promenade.wav", plotted against the original spectrogram for the data.  The horizontal axis on the graph is in FFT windows, and the vertical axis (amplitude) is essentially arbitrary.
 

The graph in Figure 4-15 looks extremely promising for analysis and indeed can be interpreted simply using a linear decay function.  The output from such an analysis is given in Figure 6-1.
 
 

Figure 4-16, a graph of time against amplitude showing the onset data for "440_arpm.wav", plotted against the original spectrogram for the data.  The horizontal axis on the graph is in FFT windows, and the vertical axis (amplitude) is essentially arbitrary.  The file "440_arps.wav" gave similar output to this diagram, so it cannot be the harmonics that are confusing the system.
 

After seeing the output in Figure 4-16 it became apparent that the algorithm worked very well on real instruments, but fell down on synthesised sounds.  A possible reason for this is that the synthesised data is too pure, and the noise present in the acoustic data actually helps the analysis.  However when random noise was added to the synthesised data it made no noticeable difference in the accuracy of the onset detection.
 
 

The difference function was then changed to

  1   0   -1
  1   0   -1
  1   0   -1
  1   0   -1
  1   0   -1

to perform vertical smoothing at the same time as the horizontal difference.  The effect it had on the abstract shapes from Figure 4-12 is shown in Figure 4-17, but it made little difference to real spectrographic data.
 
 
Figure 4-17, abstract shapes from Figure 4-12 using vertical smoothing as well as onset detection
 

Even functions such as

  0.25   0   -0.25
  0.5    0   -0.5
  1      0   -1
  0.5    0   -0.5
  0.25   0   -0.25

were tried with little success.
 

The function was also widened so it was effectively implementing a delay inhibitor by performing the difference over a wider horizontal range but this made no difference to the accuracy of onset detection.
 
 

The system is smart enough to attempt the subharmonic summation only at a recognised onset point.  There is little point trying to determine the frequency continuously through the duration of the note, since it is unlikely to change.  If the performer applies a wide vibrato, for example by bending a guitar string, then the onset detector may see a new note being played.

However it must be noted that the system does not attempt the frequency determination at the exact onset of a note.  Because of the way the onset detection works, the frequency analysis is performed on the third or fourth window after the onset, which in this system can amount to some 30 or 40ms, giving the harmonics time to settle down to give a better estimate of the true pitch.
 
 
 
 
 




5. Implementation

The system was implemented in Borland Pascal 7 under Microsoft Windows 95.  However it only uses standard MSDOS input and output functions, and so runs in a DOS prompt.

It runs at approximately 12 times real time on an Intel Pentium 133 system, so is not entirely unfeasible for the analysis of music.  The operating speed is not affected by the amount of available memory.

Because very few Borland-specific commands have been used in implementing the system it should be quite portable.  For example, it should compile under UNIX quite easily.
 

The system assumes that the input file is an eight bit mono 11,025Hz Windows .WAV file, although this could be changed quite simply by altering the code.

This sampling rate was chosen in order to get quite a high definition frequency analysis, and the file was restricted to eight bit mono to restrict the file size.
 
 

So far I have stuck quite well to the timetable in the first interim report (the revised version - the first edition was hopelessly unrealistic).

A pitch determination algorithm (subharmonic summation in this case) has been implemented quite successfully.

Onset and offset detectors (the vertical edge find) have also been implemented.
 
 
 
 
 
 



 

6. Results
 

6.1 Output from the System
 
 
Figure 6-1, onset map for "promenade.wav", before the pitch determination algorithm was integrated with the onset detection.  The only outputs are the amplitude (essentially an arbitrary number) and the onset time in ms.  This demonstrates how well the onset detection works for acoustic monophonic data.
 
Figure 6-2, onset map for "440_arps.wav", before the pitch determination algorithm was integrated with the onset detection.  The only outputs are the amplitude (essentially an arbitrary number) and the onset time in ms.  This shows how the onset detection falls down for synthesised data.
 

When the subharmonic summation was finally integrated with the onset detection, the following results were noticed.
 
 

File            Accuracy
sin_440.wav     73.63%
sqr_440.wav     100%
440_arps.wav    64%
3_4_500s.wav    0%
440_arpm.wav    88.46%
 

The accuracy percentage in this table reflects the percentage of times that the pitch estimation is exactly correct.  Incorrect estimates are disregarded altogether, irrespective of how close they actually are.  For example, in the sin_440.wav pitch estimation, every time the pitch estimate was not 440Hz it was in fact 466.16Hz, an incorrect guess of just one semitone.

These values average out at an accuracy of 65.22%, although it is clear that some waveforms are easier to analyse than others.  A possible reason for the total failure of the 3_4_500s.wav file could be that the frequencies present do not have precise equivalents in the well-tempered diatonic scale with A=440Hz, so the system is getting rather confused about what it is hearing.
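That explanation is easy to check numerically.  The sketch below (a hypothetical helper, not part of the system) finds the nearest equal-tempered pitch with A4 = 440Hz and the error in cents.

```python
import math

def nearest_note(freq_hz):
    """Return (nearest equal-tempered pitch with A4 = 440Hz, error in cents)."""
    semitones = round(12 * math.log2(freq_hz / 440))
    nearest = 440 * 2 ** (semitones / 12)
    cents = 1200 * math.log2(freq_hz / nearest)
    return nearest, cents

# 300Hz lies about 37 cents sharp of D4 (293.66Hz); 400Hz and 500Hz
# likewise miss G4 and B4, so none of the test frequencies (or the
# 100Hz missing fundamental) lands exactly on a scale pitch
```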
 

Accuracies for the acoustic files would be rather meaningless (especially for the polyphonic recordings) since the comparison data would have to be known to be perfect in order to determine the accuracy of the pitch determination.
 

There are no tables to show the accuracy of onset detection since all techniques tried were highly inaccurate and produced few meaningful values.
 

The system's output for each sample file is not given in full: the outputs are mostly rather long-winded because of the faulty onset detection, and can in any case be regenerated by simply running the program.
 
 
 

Figure 6-3, the output from "promenade.wav" as rendered in the commercial sequencer program 'Cakewalk.'  This is supposed to represent 'Promenade' from Mussorgsky's 'Pictures at an Exhibition'.  It does not take a trained musician to spot the difference between this and a playing of the .wav file.  In spite of how it may appear this was the best output from the selected acoustic files.
 
 
 
 

6.2 Discussion

The project changed pace quite dramatically after the first viva.  One of the main points that arose from the meeting was how little reading I had done.  I had assumed that the whole project would be completely straightforward.  Once a few points had been gently pointed out, I realised how much work really was involved.

The first thing I did after the viva was to read plenty more material, including re-reading material I had already read.  I then produced a much more realistic timetable for the rest of the first semester, which I managed to keep to quite well.
 

While implementing the system, several points arose.
 

 
 

6.3 Conclusions

The pitch determination algorithm implemented works fine with synthesised data but is not reliable for real data.  Since Hermes [10] reported errors of only 1.4%, I have to assume that it is my implementation which is at fault.

The onset detection implemented is fairly reliable for real data, but not at all reliable for synthesised data.
 
 
 

6.4 Future Work

There are several ways in which the output of the system could be enhanced to more accurately reflect the characteristics of the input file.
 

6.4.1 Interpreting the output with respect to time signature and tempo

It is clearly desirable to have the output score correctly labelled with the time signature and tempo.  Both of these may seem trivial at first thought but relatively simple examples can convince the reader otherwise.

The Beatles' song 'Happiness Is A Warm Gun' (Figure 6-4), for example, changes time signature four times in as many bars.  A system which only has the data for the notes of the melody has little hope of determining what is happening.

However this problem is not just difficult for a computer.  A musician would find it difficult to recognise the time signature changes just from hearing a melody such as this.
 

Figure 6-4, four bars from Happiness Is A Warm Gun, by John Lennon & Paul McCartney (© 1968 Northern Songs), demonstrating the difficulties in determining the changing time signatures of a piece of music
 

If more data were available, for example if the system saw multiple onsets at the start of each bar, perhaps representing chords or percussion, then the task might be simplified slightly.
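As a sketch of that idea, one could score each candidate bar length by how many detected onsets land on its bar boundaries.  Everything below (the scoring rule, the tolerance, and the toy onset times) is a hypothetical illustration, not part of the implemented system.

```python
def score_bar_length(onsets, bar_len, tol=0.05):
    """Hits per bar boundary: onsets within `tol` seconds of a boundary."""
    start, end = onsets[0], onsets[-1]
    boundaries = int((end - start) / bar_len) + 1
    hits = 0
    for t in onsets:
        phase = (t - start) % bar_len
        if min(phase, bar_len - phase) <= tol:
            hits += 1
    return hits / boundaries

def estimate_bar_length(onsets, candidates):
    # Pick the candidate whose boundaries are best supported by onsets.
    return max(candidates, key=lambda b: score_bar_length(onsets, b))

# Onset times (seconds) of a melody in which every 2.0 s bar starts with a note:
onsets = [0.0, 0.5, 1.5, 2.0, 3.1, 4.0, 4.5, 6.0]
print(estimate_bar_length(onsets, [1.0, 2.0, 3.0]))   # → 2.0
```

Normalising by the number of boundaries matters: otherwise shorter candidate bars always collect at least as many hits, the usual octave-style ambiguity in tempo estimation.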
 

The system should also recognise when an instruction such as accelerando has been interpreted by the performer, and label the score as such instead of just putting the notes closer together and ignoring the tempo change.
 
 
 

6.4.2 Interpreting the output with respect to key signature

The system should ideally mark the correct key signature on the output score.  Examples similar to the one given for changing time signature would demonstrate that the key signature can also change quite often.
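One common approach (not implemented in this project) is to count which pitch classes occur and choose the scale that accounts for most of them.  The sketch below uses a naive scale-membership count; a real system would use weighted pitch-class profiles, and the note list is a toy example.

```python
# Semitone offsets from the tonic that belong to a major scale.
MAJOR_SCALE = {0, 2, 4, 5, 7, 9, 11}
NAMES = ['C', 'C#', 'D', 'Eb', 'E', 'F', 'F#', 'G', 'Ab', 'A', 'Bb', 'B']

def estimate_key(midi_notes):
    """Pick the major key whose scale contains the most of the given notes."""
    def fit(tonic):
        return sum(1 for n in midi_notes if (n - tonic) % 12 in MAJOR_SCALE)
    best = max(range(12), key=fit)
    return NAMES[best] + ' major'

# A G major phrase (the F#, MIDI 66, rules out C major):
print(estimate_key([67, 69, 71, 72, 74, 66, 67]))   # → 'G major'
```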
 
 
 

6.4.3 Polyphonic transcription

Polyphonic transcription is a natural progression from monophonic transcription, but it presents a whole new set of problems.

These include the fact that notes sounding simultaneously often share harmonics, which makes their individual pitches hard to separate, as Figure 6-5 illustrates.
 

 
 
Figure 6-5, showing how middle C (261.63Hz) and the G above (392Hz) share harmonics at multiples of 784Hz
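The overlap shown in Figure 6-5 can be checked numerically: the short sketch below lists the harmonic pairs of the two notes that lie within 2 Hz of each other, which occur around every 784 Hz.

```python
# Harmonics of middle C (261.63 Hz) and the G above (392 Hz).
c4, g4 = 261.63, 392.0
c_harmonics = [round(c4 * n, 1) for n in range(1, 7)]
g_harmonics = [round(g4 * n, 1) for n in range(1, 5)]

# Pairs of harmonics lying within 2 Hz of each other:
close = [(c, g) for c in c_harmonics for g in g_harmonics if abs(c - g) < 2]
print(close)   # → [(784.9, 784.0), (1569.8, 1568.0)]
```

The near-coincidence arises because G is a fifth above C, so the second harmonic of G almost exactly matches the third harmonic of C.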
 

Polyphonic subharmonic summation is nontrivial, but the algorithm is better able than zero-crossing, for example, to cope with polyphony.  It is possible to perform the subharmonic summation many times over the same set of data to attempt to find the polyphonic frequencies.
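One way that repeated application might look is sketched below: estimate the strongest pitch by subharmonic summation, cancel the spectral energy near its harmonics, and run the summation again.  The parameter values and the synthetic two-note spectrum are illustrative only, not those of Hermes [10] or of this project's implementation.

```python
import numpy as np

def shs_pitch(freqs, mags, candidates, n_harm=5, compress=0.84):
    """Subharmonic summation: pick the candidate f0 with highest salience."""
    def salience(f0):
        total = 0.0
        for h in range(1, n_harm + 1):
            i = int(np.argmin(np.abs(freqs - h * f0)))
            total += (compress ** (h - 1)) * mags[i]
        return total
    return max(candidates, key=salience)

def iterative_shs(freqs, mags, candidates, n_voices=2, n_harm=5):
    """Find several pitches by repeatedly estimating and cancelling."""
    mags = mags.copy()
    pitches = []
    for _ in range(n_voices):
        f0 = shs_pitch(freqs, mags, candidates, n_harm)
        pitches.append(f0)
        for h in range(1, n_harm + 1):          # zero bins near this voice's harmonics
            i = int(np.argmin(np.abs(freqs - h * f0)))
            mags[max(i - 1, 0):i + 2] = 0.0
    return pitches

# Build a synthetic spectrum of C4 and G4 sounding together.
freqs = np.arange(0.0, 3000.0, 5.0)
mags = np.zeros_like(freqs)
for f0, amp in [(261.63, 1.0), (392.0, 0.8)]:
    for h in range(1, 6):
        mags[np.argmin(np.abs(freqs - h * f0))] += amp / h

print(sorted(iterative_shs(freqs, mags, [261.63, 293.66, 329.63, 392.0])))
# → [261.63, 392.0]
```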
 
 

6.4.4 Accurate instrument map

The score would ideally be correctly marked with whichever instrument was playing.  This is a non-trivial problem, even for monophonic music.  The system could use dynamic time warping to pattern-match against templates of instruments, perhaps in the frequency domain.

When the correct instrument has been determined it could then notate the music according to the characteristics of that instrument, i.e., taking into account the clef that is normally used and any transposition necessary.

Brown and Cooke [4] describe work to distinguish between different instruments by examining the brightness and onset asynchrony of harmonics.
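The dynamic time warping matching suggested above might look something like the following sketch, in which a note's feature sequence is aligned against stored instrument templates.  The toy one-dimensional 'envelopes' and the template names are purely illustrative.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming alignment cost between two sequences."""
    INF = float('inf')
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignment moves.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def classify(observed, templates):
    """Label the observed sequence with its nearest template."""
    return min(templates, key=lambda name: dtw_distance(observed, templates[name]))

# Toy amplitude envelopes: a flute swells, a piano decays.
templates = {'flute': [0.1, 0.9, 0.8, 0.7], 'piano': [1.0, 0.6, 0.3, 0.1]}
print(classify([0.9, 0.95, 0.55, 0.3, 0.1], templates))   # → 'piano'
```

DTW tolerates notes of different lengths, which is why it suits this task better than a sample-by-sample comparison.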
 
 

6.4.5 Detection of offsets

This is harder than the onset problem, since most acoustic instruments do not have an easily defined offset: the notes often tail away until they are no longer perceptible.  However, offset detection is essential for the correct transcription of music.  A pause in the recorded music should be notated correctly.  Notes should not sustain just because another note has not started, but should accurately reflect their end points.
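One simple rule (certainly not the only possibility) would be to declare a note ended once its short-time energy stays below a fraction of its peak for several consecutive frames.  The threshold, hold time and toy envelope below are arbitrary illustrations.

```python
def find_offset(energies, frame_s=0.01, ratio=0.1, hold=3):
    """Return the time (seconds) at which the note's energy dies away."""
    peak = max(energies)
    below = 0
    for i, e in enumerate(energies):
        below = below + 1 if e < ratio * peak else 0
        if below == hold:
            return (i - hold + 1) * frame_s   # time the decay first dipped
    return len(energies) * frame_s            # note ran to the end of the data

# Frame-by-frame energy envelope of a note decaying into silence:
env = [0.9, 1.0, 0.8, 0.5, 0.3, 0.05, 0.04, 0.03, 0.02]
print(find_offset(env))   # → 0.05
```

The hold-time requirement stops a single quiet frame, e.g. a brief dip during vibrato, from being mistaken for the end of the note.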
 
 

6.4.6 Comparison of performances

With a few modifications the system could be used as a music tutor.  It could output a file through a synthesiser and ask the user to play it back on their desired instrument.  The data gathered by the system could then be compared to what was output and a mark given for accuracy in terms of correct pitch and timing.
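A minimal sketch of such a marking scheme, assuming the performance has already been reduced to (pitch, onset) pairs; the matching rule, tolerances and weights are invented for illustration.

```python
def mark_performance(reference, played, time_tol=0.1):
    """Score a performance out of 100 for pitch and timing accuracy.

    Each note is a (midi_pitch, onset_seconds) pair.
    """
    score = 0
    for pitch, onset in reference:
        # Find the played note nearest in time to this reference note.
        cand = min(played, key=lambda n: abs(n[1] - onset))
        if abs(cand[1] - onset) <= time_tol:
            score += 1                    # timing acceptable
            if cand[0] == pitch:
                score += 1                # pitch correct too
    return 100 * score // (2 * len(reference))

ref = [(60, 0.0), (62, 0.5), (64, 1.0)]           # what the synthesiser played
attempt = [(60, 0.02), (61, 0.55), (64, 1.4)]     # what the pupil played back
print(mark_performance(ref, attempt))   # → 50
```

Here the pupil scores full marks on the first note, loses the pitch mark on the second, and misses the third entirely by playing it far too late.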
 
 
 
 
 



 

7. References
 
 
  1. Bregman, A.S., Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press, 1990
  2. Brown G., COM325/677: Computer Speech and Hearing; Course Notes, University of Sheffield - Department of Computer Science, 1996
  3. Brown, G.J. and Cooke, M., Computational Auditory Scene Analysis, Computer Speech and Language (1994) 8, 297-336
  4. Brown, G.J. and Cooke, M., Perceptual Grouping of Musical Sounds: A Computational Model, Journal of New Music Research, Vol. 23 (1994), pp. 107-132
  5. Cooke, M., COM325: Computer Speech & Hearing; Course Notes, University of Sheffield - Department of Computer Science, 1996
  6. Cooke, M., Beet, S. and Crawford, M. (eds.), Visual Representations of Speech Signals, John Wiley & Sons, 1993
  7. De Poli, G., Piccialli, A. and Roads, C. (eds.), Representations of Musical Signals, MIT Press, 1991
  8. Godsmark, D., A Virtual Drum Kit, University of Sheffield - Department of Computer Science, 1993
  9. Handel, S., Listening: an introduction to the perception of auditory events, MIT Press, 1989
  10. Hermes, D.J., Measurement of pitch by subharmonic summation, Journal of the Acoustical Society of America 83 (1), January 1988
  11. Hobley, S.J., Meta-Karaoke - Music Synthesis by Voice, University of Sheffield - Department of Computer Science, 1992
  12. Howell, P., West, R. and Cross, I. (eds.), Representing Musical Structure, Academic Press, 1991
  13. Kuc, R., Introduction to Digital Signal Processing, McGraw-Hill, 1988
  14. Lynn, P.A. and Fuerst, W., Introductory Digital Signal Processing with Computer Applications, John Wiley, 1989
  15. Moore, F.R., Elements of Computer Music, Prentice Hall, 1990
  16. Pettit, F., Fourier Transforms in Action, Chartwell-Bratt Studentlitteratur, 1985
  17. Pratt, W.K., Digital Image Processing, John Wiley & Sons, 1978
  18. Proakis, J.G. and Manolakis, D.G., Introduction to Digital Signal Processing, Macmillan, 1988
  19. Rabiner, L.R. and Gold, B., Theory and Application of Digital Signal Processing, Prentice-Hall, 1975
  20. Rabiner, L.R. and Schafer, R.W., Digital Processing of Speech Signals, Prentice-Hall, 1978
  21. Rosenfeld, A. and Kak, A.C., Digital Picture Processing, Second Edition, Academic Press, Inc., 1982
  22. Tanguiane, A.S., Artificial Perception and Music Recognition, Lecture Notes in Artificial Intelligence 746, Springer-Verlag, 1993
 
 