ΑΡΙΘΜΟΣ: music

Showing posts with label music. Show all posts

Wednesday, November 10, 2021

FFT Parabolic Interpolation

Someone at the Numerix-DSP blog made the reasonable suggestion of interpolating the results of an FFT using a parabola. I did some experimenting with this method and found that it was not any worse than my other methods and it seemed less likely to turn out spurious results.

I've moved out the Maxima domain with my calculations and into Common Lisp, but to create the calculation for the parabolic interpolation, Maxima was of some help. First, we define a function for taking three points, and producing the parabola that is defined by that.

PolySolve(A, B, C) := block([polyForm, eqn1, eqn2, eqn3],
    polyForm: a*x^2 + b*x + c = y,
    eqn1: subst([x=A[1],y=A[2]], polyForm),
    eqn2: subst([x=B[1],y=B[2]], polyForm),        
    eqn3: subst([x=C[1],y=C[2]], polyForm), 
    subst(solve([eqn1, eqn2, eqn3],[a,b,c]), polyForm)
    );

Then we will solve for it. We make the observation that the shape of the parabola is the same regardless of shifting it left or right along the x-axis (or index axis, which corresponds to the frequency axis when we multiply by sampling rate). This being the case, we have no need to supply the final indices but can consider our middle point as index 0, the point before it as index -1 and the point after it as index 1. This simplifies the calculation so that we can find the location of the peak relative to the index we are examining and apply the results of this calculation to it. Time to pull in some grade 12 math and differentiate that parabola and find the vertex (see Fig. 1).

Fig. 1. The green dot is approximately where the vertex of this parabola is. The result of this calculation would have x as slightly positive.

poly: PolySolve([-1,y0],[0,y1],[1,y2]);
ratsimp(poly);
df: diff(poly, x);
freq: solve(df, [x]);

\[\frac{{{x}^{2}}\, \left( \mathit{y2}-2 \mathit{y1}+\mathit{y0}\right) }{2}-\frac{x\, \left( \mathit{y0}-\mathit{y2}\right) }{2}+\mathit{y1}=y\] \[\frac{\left( {{x}^{2}}+x\right) \, \mathit{y2}+\left( 2-2 {{x}^{2}}\right) \, \mathit{y1}+\left( {{x}^{2}}-x\right) \, \mathit{y0}}{2}=y\] \[x\, \left( \mathit{y2}-2 \mathit{y1}+\mathit{y0}\right) -\frac{\mathit{y0}-\mathit{y2}}{2}=0\] \[[x=-\frac{\mathit{y2}-\mathit{y0}}{2 \mathit{y2}-4 \mathit{y1}+2 \mathit{y0}}]\]

On the basis of this we can produce a common lisp function to perform this calculation.

(defun parabola-peak-frequency (y0 y1 y2)
  (let* ((y2-y0 (- y0 y2))
         (denom (+ (* 2 y2) (* -4 y1) (* 2 y0)))
         (x (/ y2-y0 denom))
         (x^2 (* x x))
         (amp (/ (+ (* (+ x^2 x) y2)
                    (* 2 (- 1 x^2) y1)
                    (* (- x^2 x) y0))
                 2)))
    (list x amp)))

Note that if the peak is to the left of our high point, we will get a negative x value and if it is to the right, a positive value. It will be on the interval \((-1, 1)\). Some code that goes through the results of an FFT and takes the absolute value as in our previous work and finds local peaks is all we need now. Where there is a local peak of consecutive absolute values \(a\), \(b\), and \(c\) such that \(a, c < b\), do some interpolation using parabola-peak-frequency to get an interpolated amplitude and an index adjustment for the frequency index. Then the calculation for the the frequency becomes

\[(o + x) s/N\]

where o is the index of \(b\), x is the result from parabola-peak-frequency, s is the sample rate, and \(N \)is the length of elements in the FFT.

Now, maybe we can start applying some math to some real world problems, like, um, Mozart's Piano Concerto No. 13, movement II? Maybe just a snatch within the first second of it for now.

Fig. 2. Mozart, anyone?

A general outline on how to proceed with this is to identify all of the local peaks, take the top K highest amplitude data points, interpolate those using the parabolic interpolation above, apply pitch correction (if desired) and midi note calculation. This doesn't address determining pitch durations. Another interesting problem is to try to guess which lower peaks are harmonics of lower pitches. These, and probably other confounding problems, yet await.

Friday, March 12, 2021

Finding Frequency Fourier Style

Last week, I had a look at the FFT module in Maxima and it was fun times. I was a little disappointed at the interpolated result for the frequency when analyzing the results of the FFT of the 440 Hz waveform. Mind you, if the application is to find a nearest note frequency and, being a human with an at least average sense of musical perception, you already know you're hearing notes that sound like 12 semitones per octave (exclusive counting, i.e., C up to B), then you can coerce the results to the nearest value of \((440) 2^{i/12}\), for \(i\) over integers. If you don't intend to get any deeper than determining the major pitches, then this is fine. (Could get more tricky if the whole thing is internally shift tuned—everything a half of a semitone flat, but internally tuned, but then maybe you could create a filter to catch that case and accordingly shift your frequency base from 440 to something else that models the sound better.)

We got close enough last time for goal 1 (find the basic notes), but still, I wondered whether I could get closer to the true frequency than last time. I also thought I should use a different example frequency so nobody gets the wrong idea that this is somehow only good for integer frequencies. (Integers are discrete, but discreteness goes beyond just integers.)

So, let's use \((440) 2^{1/12}\) or approximately 466.16 Hz.

Fig. 1. Frequency of 466.16 Hz.

Fig. 2. Zooming in a little closer on the points.

So, let's take these same points and try a cubic spline.

Fig. 3. Cubic spline view of things. Still looks fake. But less fake in the middle.

The cubic spline is helpfully printed out with characteristic functions which are mainly readable. I pick out the one that has the interval of concern, namely, 0.5264689995542815*x^3-738.1894409400452*x^2+345013.3968874384*x-5.374983574143041*10^7. We can use differentiation to find the location of the maximum value.

Which gives roots of [x=465.6995765525049,x=469.0682718425062]. Clearly 465.7 is the one we want. We're not any closer than we were last time using linear interpolation, although the general appearance of the graph looks more realistic and we did include another bit of the Maxima library.

Saturday, March 6, 2021

FFT - Local Symmetry Inferred Peak

As I was thinking about the previous post, I thought there might be a way to estimate the location of the peak. That is, to find the location in between the discrete data points where the peak probably occurs.

I tried using realpart and imagpart and abs to see the differences and it seemed like realpart gives me the best view of the relative amplitudes when there are multiple frequencies involved. I also decided to apply an index shift since that seems to match the frequency better although I'm not satisfied with it technically yet.

Let's zoom in on the location of one of the peaks and see what it looks like:

Fig. 1. Looks like we should be able to make a better guess than just picking one of the points.

Here are the actual points as frequency, absolute, real part value pairs:

[[427.972412109375,1.373307119161833],[430.6640625,1.783783923237471],[433.355712890625,2.525090138349324],[436.04736328125,4.273342049313825],[438.739013671875,13.47741003943344],[441.4306640625,11.94536977902466],[444.122314453125,4.166745114303215],[446.81396484375,2.532438127248608],[449.505615234375,1.822956914133206],[452.197265625,1.426082632315385],[454.888916015625,1.172306075942115]]

For a first crack at it, we might try linear interpolation on the two pairs of points on either side of the gap that contains the peak. Based on the Wikipedia article of the continuous version, we are certainly deviating from the shape of the real thing. Dauntless we press on to see what we get, because maybe we can live with a slight improvement that we know isn't perfect.

So, here we go do linear interpolation and find the line intersection using Maxima.

Line(A, B) := [x, y] . [A[2]-B[2], B[1]-A[1]] = [A[2]-B[2], B[1]-A[1]] . A; float(solve([ Line([436.04736328125,4.273342049313825],[438.739013671875,13.47741003943344]), Line([441.4306640625,11.94536977902466],[444.122314453125,4.166745114303215])], [x, y]));

[[x=439.729058165623,y=16.86285595968834]]

Let's put that point in the middle and see what it looks like:

Fig. 2. Hmm, closer, but looks kinda fake to me.

Somewhat predictably, this looks as fake as it really is. I think this graph makes it just how clear that linear interpolation is slightly bogus here. Not completely bogus though, it got us closer, right? So, we've ended up on the left side of the true peak which we know should happen exactly at 440 Hz. Why? The slope on the right side of the true peak is further away from the peak and therefore has a lower (absolute) slope value—it is less steep than it would be if it was closer to the peak. This is the weakness of linear interpolating this. We will end up closer to the higher of the two near-peak points than we should.

To get closer, we might want to try a different type of interpolation. Maybe a cubic spline?

Friday, March 5, 2021

FFT in Maxima

Maxima has a basic fast Fourier transform package called fft. Here's are really basic introduction to using it on music data. We define a basic sine wave form with frequency 440 Hz and amplitude 40 as our input and then run fft on it. The interpretation of the results as frequency depends on coordinating together the sample rate, the total number of samples (N), and the sample index (i).

Fig 1. Looks like about 440 Hz

The thing that most threw me for a loop in trying to make sense of the various sources I read was that the x axis needs to be inferred in order to be comprehensible. I wasn't really clear on what that x-axis was. If you dig into the details of the output numbers, you will find the 164th value, has the maximum value. This implies 164 * 44100 / 2^14 = 441 Hz. So, a bit of error has crept into our interpretation of the result because of the discrete nature of the intermediate result. What this means in theory is that there is a theoretical peak that "should" exist somewhat before the 164th index, but because we aren't working with continuous data, we can't see that peak (at the ~163.47th index 😜--or maybe we should be considering it at the 164.47 "index" and subtract 1, interpreting it as zero based, since the peak is probably between the 164th and 165th index).

If we change the function to have two wave forms at different frequencies (340 Hz and 440 Hz), we get both of these in the result:

Fig. 2. wave(20, 340, t*timeStep) + wave(40, 440, t*timeStep)

Next step, in search of the true peak.

Friday, December 30, 2016

Sounds Like F#

Last time, I thought that I was going to have to do some special research to account for multiple sources of sound. Reading this short physics snippet I can see that multiple sources is really as simplistic as my initial thinking was inclined to. You can ignore all of the decibel stuff and focus on the statements about amplitude, because, of course, I am already working in terms of amplitude. What can I say, I let the mass of voices on the internet trouble my certainty unduly? Sort of.

There's two important issues to consider:

Offsetting of the signals.
Psychological issues regarding the perception of sound.

Regarding number 2, here's a quick hit I found: http://hyperphysics.phy-astr.gsu.edu/hbase/Sound/loud.html. It seems credible, but I'm not sufficiently interested to pursue it further. And so, my dear Perception, I salute you in passing, Good day, Sir!

Offsetting of the signals is something that is implicitly accounted for by naively adding the amplitudes, but there is a question of what the sineChord() function is trying to achieve. Do I want:

The result of the signals beginning at exactly the same time (the piano player hits all his notes exactly[?] at the same time—lucky dog, guitar player not as much[?]. Hmm...)?
The average case of sounds which may or may not be in phase?

Regarding case 2, Loring Chien suggests in a comment on Quora that the average case would be a 41% increase in volume for identical sounds not necessarily in phase. At first glance, I'm inclined to think it is a plausible statement. Seeming to confirm this suggestion is the summary about Adding amplitudes (and levels) found here, which also explains where the the 41% increase comes from, namely, from \(\sqrt{2}\approx 1.414\dots\). Good enough for me.

All such being said, I will pass it by as I am approaching this from a mechanical perspective. I'm neither interested in the psychological side of things, nor am I interested in getting average case volumes. I prefer the waves to interact freely. If would be neat, for example, to be able to manipulate the offset on purpose and get different volumes. The naive approach is little more subject to experimentation.

I would like to go at least one step further though and produce a sequence of notes and chords and play them without concern of fuzz or pops caused by cutting the tops of the amplitudes off. We need two pieces: a) a limiter and b) a means of producing sequences of sounds in a musically helpful way.

A limiter is pretty straight-forward and is the stuff of basic queries. I have the advantage of being able to determine the greatest amplitude prior to playing any of the sound. The function limiter() finds the largest absolute amplitude and scales the sound to fit.

Here is an example usage:

One concern here is that we may make some sounds that we care about too quiet by this approach. To address such concerns we would need a compressor that affects sounds from the bottom and raises them as well. The general idea of a sound compressor is to take sounds within a certain amplitude range and compress them into a smaller range. So, sounds outside your range (above or below) get eliminated and sounds within your range get squeezed into a tighter dynamic (volume) range. This might be worth exploring later, but I do not intend to go there in this post.

Last post I had a function called sineChord(), which I realize could be generalized. Any instrument (or group of identical instruments) could be combined using a single chord function that would take as a parameter a function that converts a frequency into a specific instrument sound. This would apply to any instruments where we are considering the notes of the chord as starting at exactly the same time (guitar would need to be handled differently). So, instead of sineChord(), let's define instrumentChord():

Next we will create a simple class to put a sequence of notes/chords and durations into which we can then play our sounds from.

Reference Chords and Values

Here are a few reference chord shapes and duration values. I define the chord shape and the caller supplies the root sound to get an actual chord. The durations are relative to the whole note and the chord shapes are semi-tone counts for several guitar chord shapes. The root of the shapes is set to 0 so that by adding the midi-note value to all items you end up with a chord using that midi note as the root. This is the job of chord().

Next, InstrumentSequence is an accumulator object that knows how to output a wave form and play it. For simplicity, it is limited to the classic equal temperment western scale. Note that it takes a function has the task of turning a frequency into a wave form, which means that by creating a new function to generate more interesting wave forms, we can have them played by the same accumulator. If we only want it to produce the wave form and aggregate the wave forms into some other composition, we can access the wave form through the routine WaveForm.

type InstrumentSequence () =
        let chords = new ResizeArray<seq<int>>()
        let metrics = new ResizeArray<float>()

member this.AddNote note duration = 
            chords.Add(Seq.singleton note)
            metrics.Add(duration)

member this.AddChord notes duration = 
            chords.Add(notes)
            metrics.Add(duration)

member this.Chords = chords

member this.WaveForm instrument volume tempo sample_rate =
            let whole_note_duration = 240.0 / (float tempo)
            let runs = Seq.map2 (fun chord metric -> 
                                    let duration = metric * whole_note_duration
                                    instrumentChord instrument volume sample_rate duration chord)
                                    chords metrics
            seq { for r in runs do 
                        yield! r}

member this.PlayInstrument instrument volume tempo sample_rate =
            this.WaveForm instrument volume tempo sample_rate
            |> limiter volume
            |> Seq.map int
            |> PlayWave sample_rate

member this.PlayInstrumentAsync instrument volume tempo sample_rate =
            this.WaveForm instrument volume tempo sample_rate
            |> limiter volume
            |> Seq.map int
            |> PlayWaveAsync sample_rate

A sample use of the class which follows a very basic E, A, B chord progression follows:

To get the same progression with minors, use the EmShape, etc., values. But, at the end of the day, we've still got sine waves, lots of them. And, it's mostly true that,

boring + boring = boring.

I will not attempt the proof. You can make some more examples to see if you can generate a counter-example.

There is also some unpleasant popping that I would like to understand a bit better—for now I'm going to guess it relates to sudden changes in the phases of the waves at the terminations of each section of sound. Not sure what the solution is but I'll guess that some kind of dampening of the amplitude at the end of a wave would help reduce this.

Thursday, December 29, 2016

The Sound of F#

I'm doing a bit or reading and dabbling the in the world of sound with the hope that it will help me to tackle an interesting future project. One of the first steps to a successful project is to know more about what the project will actually entail. Right now I'm in exploratory mode. It's a bit like doing drills in sports or practicing simple songs from a book that are not really super interesting to you except for helping you learn the instrument you want to learn. I'm thinking about learning to "play" the signal processor, but it seemed like a good start to learn to produce signals first. And sound signals are what interest me at the moment.

Sound is pretty complicated, but we'll start with one of the most basic of sounds and what it takes to make them in Windows, writing in F#. Our goal in this post is to play sounds asynchronously in F# Interactive.

The first iteration of this comes from Shawn Kovac's response to his own question found here at Stack Overflow. Below is my port to F#. I have used the use keyword to let F# handle the Disposable interfaces on the stream and writer to take care of all the closing issues. It's important to note that the memory stream must remain open while the sound is playing, which means this routine is stuck as synchronous.

// ported from Shawn Kovac's response to his own question at: 
    // http://stackoverflow.com/questions/19672593/generate-morse-code-or-any-audio-in-net-c-or-vb-net-without-3rd-party-depe
    let PlaySound frequency msDuration volume =
        let TAU = 2.0 * System.Math.PI
        let formatChunkSize = 16
        let headerSize = 8
        let formatType = int16(1)
        let tracks = int16(1)
        let samplesPerSecond = 44100
        let bitsPerSample = int16(16)
        let frameSize = int16(tracks * ((bitsPerSample + int16(7)) / int16(8)))
        let bytesPerSecond = samplesPerSecond * int(frameSize)
        let waveSize = 4
        let samples = (samplesPerSecond * msDuration) / 1000
        let dataChunkSize = samples * int(frameSize)
        let fileSize = waveSize + headerSize + formatChunkSize + headerSize + dataChunkSize
        
        // var encoding = new System.Text.UTF8Encoding();
        use mStrm = new MemoryStream()
        use writer = new BinaryWriter(mStrm)
        writer.Write(0x46464952) // = encoding.GetBytes("RIFF")
        writer.Write(fileSize)
        writer.Write(0x45564157) // = encoding.GetBytes("WAVE")
        writer.Write(0x20746D66) // = encoding.GetBytes("fmt ")
        writer.Write(formatChunkSize)
        writer.Write(formatType)
        writer.Write(tracks)
        writer.Write(samplesPerSecond)
        writer.Write(bytesPerSecond)
        writer.Write(frameSize)
        writer.Write(bitsPerSample)
        writer.Write(0x61746164) // = encoding.GetBytes("data")
        writer.Write(dataChunkSize)
        
        let theta = frequency * TAU / (float samplesPerSecond)
        // 'volume' is UInt16 with range 0 thru Uint16.MaxValue ( = 65 535)
        // we need 'amp' to have the range of 0 thru Int16.MaxValue ( = 32 767)
        let amp = volume >>> 2 // so we simply set amp = volume / 2

for step = 0 to samples-1 do
            let s = int16(float(amp) * sin(theta * (float step)))
            writer.Write(s)

// set ourselves up at the beginning of the file
        mStrm.Seek(int64(0), SeekOrigin.Begin) |> ignore

// we use a player object which requires its memory stream to exist in order to play
        use player = new System.Media.SoundPlayer(mStrm)
        player.PlaySync()

To get an asynchronous sound we can shove the PlaySound function into a background thread and let it go. This looks like the following:

A limitation of the PlaySound() approach given above is that it limits your method of sound production. The details of the sound produced are buried inside the function that plays the sound. I don't want to have a bunch of separate routines: one that plays sine waves, one that plays saw tooth, one that plays chords with two notes, one that plays chords with three notes, one that include an experimental distortion guitar—whatever. The sound player shouldn't care about the details, it should just play the sound, whatever it is. (Not a criticism against Kovac, he had a narrower purpose than I do; his choice was valid for his purpose.)

I want to decouple the details of the sound to be produced from the formatting and playing of the sound. Here we have the basic formatting and playing of the sound packaged up that will take any wave form sequence we give it (sequence of integer values):

// suggested sample_rate = 44100
    let PlayWave sample_rate wave = 
        let TWOPI = 2.0 * System.Math.PI
        let formatChunkSize = 16
        let formatType = int16(1)
        let tracks = int16(1)
        let bitsPerSample = int16(16)
        let frameSize = int16(tracks * ((bitsPerSample + int16(7)) / int16(8)))
        let bytesPerSecond = sample_rate * int(frameSize)

use mStrm = new MemoryStream()
        use writer = new BinaryWriter(mStrm)
        // we write the initial byte of each item of data in the comment to the right
        // this is so we can come back later and write to two of these locations
        writer.Write(0x46464952)        // = encoding.GetBytes("RIFF"): 0
        writer.Write(0)                 // filesize needs to be calculated and filled in at byte 4
        writer.Write(0x45564157)        // = encoding.GetBytes("WAVE"): 8
        writer.Write(0x20746D66)        // = encoding.GetBytes("fmt "): 12
        writer.Write(formatChunkSize)   //: 16
        writer.Write(formatType)        //: 20
        writer.Write(tracks)            //: 22
        writer.Write(sample_rate)       //: 24
        writer.Write(bytesPerSecond)    //: 28
        writer.Write(frameSize)         //: 32
        writer.Write(bitsPerSample)     //: 34
        writer.Write(0x61746164)        // = encoding.GetBytes("data"): 36
        writer.Write(0)                 // dataChunkSize needs to be calculated and filled in: 40
        
        let mutable samples = 0

wave 
        |> Seq.iter (fun (s:int) -> 
                        writer.Write((int16 s))
                        samples <- samples + 1)

// fill in the datachunk
        mStrm.Seek(int64(40), SeekOrigin.Begin) |> ignore
        let dataChunkSize = samples * int(frameSize)
        writer.Write(dataChunkSize)

// fill in the filesize
        let filesize = 4 + (8 + formatChunkSize) + (8 + dataChunkSize)
        mStrm.Seek(int64(4), SeekOrigin.Begin) |> ignore
        writer.Write(filesize)

// set ourselves up at the beginning of the file to play it
        mStrm.Seek(int64(0), SeekOrigin.Begin) |> ignore

// we use a player object which requires its memory stream to exist in order to play
        use player = new System.Media.SoundPlayer(mStrm)
        player.PlaySync()

let PlayWaveAsync sample_rate wave =
        let soundThread = Thread((fun () -> PlayWave sample_rate wave))
        soundThread.IsBackground <- true
        soundThread.Start()

I haven't decoupled this as far as we could. If we wanted to handle multiple tracks, for example, you would need to take this abstraction a little further.

Now, we also need some functions for generating wave forms. Sequences are probably a good choice for this because you will not need to put them into memory until you are ready to write them to the memory stream in the PlayWave function. What exactly happens to sequence elements after they get iterated over for the purpose of writing to the MemoryStream, I'm not quite clear on. I know memory is allocated by MemoryStream, but whether the sequence elements continue to have an independent life for some time after they are iterated over, I'm not sure about. The ideal thing, of course, would be for them to come into existence at the moment they are to be copied into the memory stream (which they will, due to lazy loading of sequences) and then fade immediately into the bit bucket—forgotten the moment after they came into being.

Here are a few routines that relate to making chords that are not without an important caveat: the volume control is a bit tricky.

let ChordEOpen = seq[64; 71; 76; 80; 83; 88]
    let ChordAOpen = seq[69; 76; 81; 85; 88]

// duration is in seconds
    let sinewave volume freq_hertz sample_rate duration =
        let TWOPI = 2.0 * System.Math.PI
        let timestep = 1.0 / (float sample_rate)
        let steps = int(floor(duration / timestep))
        let theta = freq_hertz * TWOPI * timestep
        seq { for i = 1 to steps do 
                yield volume * sin(theta * float(i)) }

let frequency note =         
        440.0 * exp(float(note - 69)/12.0 * log(2.0))

// produce sine wave form for several different notes
    // volume is the volume of each individual note; 
    // it is up to the caller to deal with the possible need for compression on the resultant signal
    // or to keep the volume under control from the start
    let sineChord volume notes sample_rate duration = 
        notes
        |> Seq.map frequency
        |> Seq.map (fun f -> sinewave volume f sample_rate duration)
        |> Seq.reduce (fun a b -> Seq.map2 (fun sa sb -> sa + sb) a b)

Here is a sample script to run in F# Interactive:

The above plays a version of an A chord (5th fret E-form bar chord on the guitar).

Note that I have divided the volume by the number of notes in the chord. If we don't watch the volume like this and we let it get away on us, it will change the sound fairly dramatically. At worst something like the "snow" from the old analog TVs when you weren't getting a signal, but if you just go a little over, you will get a few pops like slightly degraded vinyl recordings.

We could include volume control in our wave form production routines. This wouldn't necessarily violate our decoupling ideals. The main concern I have is that I want the volume indicated to represent the volume of one note played and the other notes played should cause some increase in volume. I want the dynamic fidelity of the wave forms to be preserved with different sounds joining together. Even after that, you may want to run a limiter over the data to make sure the net volume of several sounds put together doesn't take you over your limit.

However, I need to do more research to achieve these ideals. For now, I will break it off there.

ΑΡΙΘΜΟΣ

Pages