Thursday, September 2, 2021

SQLite Bulk Insert

I was trying to insert a lot of records into an SQLite database, and it was taking a long time. "A lot" amounted to over 137,000 records in one table, and the inserts were taking hours to complete. I decided it must be possible to do the insertion faster with some dynamic SQL. The approach is simple and probably lends itself to some basic abstraction, but I didn't go that far. I could probably use reflection to remove some of the icky repetitive code, but this was fine for my immediate needs.

First up, a basic abstract class to inherit. I include some basic convenience functions in it although it might not be great style.


Next up, inherit the class in the table definitions you will use.
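The original listings for these two steps aren't shown here, so as a stand-in, here is a minimal sketch of both pieces together. Every name below (BulkRecord, LogEntry, the helper methods) is hypothetical, not the original code; the point is just that each table class describes its own shape so that generic code can build SQL for it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A sketch only: names and members are hypothetical stand-ins.
public abstract class BulkRecord
{
    // Each table class reports its own shape so generic code can build SQL.
    public abstract string TableName { get; }
    public abstract IReadOnlyList<string> ColumnNames { get; }
    public abstract object[] Values();

    // Convenience: "(col1, col2, ...)" for the INSERT column list.
    public string ColumnList() => "(" + string.Join(", ", ColumnNames) + ")";

    // Convenience: "(?, ?, ...)" with one placeholder per column.
    public string PlaceholderList() =>
        "(" + string.Join(", ", Enumerable.Repeat("?", ColumnNames.Count)) + ")";
}

// A table definition inheriting the base class.
public class LogEntry : BulkRecord
{
    public int Id { get; set; }
    public string Message { get; set; }
    public double Timestamp { get; set; }

    public override string TableName => "LogEntry";
    public override IReadOnlyList<string> ColumnNames =>
        new[] { "Id", "Message", "Timestamp" };
    public override object[] Values() =>
        new object[] { Id, Message, Timestamp };
}
```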

 

Here are the two main reusable routines that save a lot of time doing the inserts. They assume all the records are of the same type, which is the most common application for a bulk insert. I don't know of any reason to make the inserts accommodate multiple types, but it would be possible to partition the records by type at the start if you have an application for that.
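Since the original routines aren't reproduced here, this is a hedged sketch of the idea rather than the actual code: build one multi-row VALUES insert per batch, sized to stay under SQLite's bound-parameter limit (999 in older builds of SQLite), and hand each batch to a caller-supplied execute delegate (e.g. ExecuteAsync on a sqlite-net connection). Wrapping the whole loop in a single transaction is a large part of what saves the time. The names (BulkInsert, BuildInsert, InsertAll) are mine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BulkInsert
{
    // Builds one multi-row INSERT for a batch of records that all share the
    // same shape: "insert into T (a, b) values (?, ?), (?, ?), ...;"
    // plus the flattened argument array. (This VALUES syntax is the
    // SQLite-specific part.)
    public static (string Sql, object[] Args) BuildInsert(
        string table, IReadOnlyList<string> columns,
        IReadOnlyList<object[]> rows)
    {
        string placeholders = "(" + string.Join(", ",
            Enumerable.Repeat("?", columns.Count)) + ")";
        string sql = "insert into " + table
            + " (" + string.Join(", ", columns) + ") values "
            + string.Join(", ", Enumerable.Repeat(placeholders, rows.Count))
            + ";";
        object[] args = rows.SelectMany(r => r).ToArray();
        return (sql, args);
    }

    // Splits the full record list into batches small enough to stay under
    // SQLite's bound-parameter limit, then hands each batch's SQL and
    // arguments to the execute delegate. Run the whole thing inside one
    // transaction for the real speed gain.
    public static void InsertAll(
        string table, IReadOnlyList<string> columns,
        IReadOnlyList<object[]> rows,
        Action<string, object[]> execute,
        int maxParameters = 999)
    {
        int rowsPerBatch = Math.Max(1, maxParameters / columns.Count);
        for (int i = 0; i < rows.Count; i += rowsPerBatch)
        {
            var batch = rows.Skip(i).Take(rowsPerBatch).ToList();
            var (sql, args) = BuildInsert(table, columns, batch);
            execute(sql, args);
        }
    }
}
```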
 

Note that the syntax for these inserts is very specific to SQLite.

Saturday, May 22, 2021

What's in Your Vocabulary Deck?

I am something of a student of New Testament Greek and made my beginnings a few years ago with Mounce's Basics of Biblical Greek. Along with the textbook came a flash card program called FlashWorks, and it is a solid way to learn your vocab. The vocab was indexed by frequency, chapter, and a user-level difficulty (based on how often you answered correctly), which was almost everything I wanted to help me keep track of which words I should be working on.

There was one thing that I felt was missing, though I wasn't sure how to define it or to decide what could be done about it. Something like, "how do I get a balanced randomness to my vocab selection if I only want to do a few right now?"

I don't want to randomly select 10 words out of 500. I won't get much repetition if I do that every day. On the other hand, if I focus only on the most frequent, hard words and repeat them every day, the risk is that I "cram" them and never really get them into long-term memory. So maybe the thing to do is to start with a selection of the 30 most difficult, most frequent words and randomly select 10 from that list. I call the ratio a "pool factor"; in this case it would be 3. Make a selection of the most difficult, frequent words up to 30 long (including a random sorting factor in case there are ties for the last spots in the 30) and then randomly select 10 out of the 30.

The 30 words are a kind of "working set" that I am working toward mastering, although I don't need to pick them by hand; they get selected based on criteria. I do 10 words today out of those 30. As I do the daily work, some of those 30 words drop out because I got them right a bunch of times, and new words enter the selection of 30.

The next issue is to decide what is meant by the hardest, most frequent words. If I use a score that starts high for words that have never been marked correct, then I can do a descending sort on score, then by frequency. The whole deck starts at the highest score, so initially I am getting the most frequent words. To prevent words from disappearing from this set too soon, keep track of not just a score but also the number of consecutive times a word has been marked correct, and only decrease the score after getting the word right, say, 3 consecutive times. (Since the score of a word doesn't change until you have reached a level of mastery with it, you don't run into the scenario of interchanging sets of words that you then never master.)
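To make that rule concrete, here is a small sketch (class and member names are mine, not from the actual app): the score drops only after three consecutive correct answers, and a wrong answer resets the run.

```csharp
using System;

// Hypothetical sketch of the scoring rule described above.
public class WordScore
{
    public int Score { get; private set; }
    public int ConsecutiveCorrect { get; private set; }

    public WordScore(int initialScore = 10) { Score = initialScore; }

    public void Mark(bool correct)
    {
        if (!correct)
        {
            ConsecutiveCorrect = 0;   // any miss resets the run
            return;
        }
        ConsecutiveCorrect++;
        if (ConsecutiveCorrect >= 3 && Score > 0)
        {
            Score--;                  // only now does the word get "easier"
            ConsecutiveCorrect = 0;   // and a new run must be built up
        }
    }
}
```

Because the score is untouched until a run of three is completed, a hard word can't bounce out of the working set on a lucky single answer.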

Note that you can probably apply this strategy to other types of vocabulary that don't reference a fixed body of literature, but you have to come up with some kind of importance rating on each word in your database that serves the same purpose as the frequency field here.

The code below belongs to a C# project of mine that is using SQLiteAsyncConnection with some fields that are still missing data (hence, ifnull):

        public Task<List<Vocab>> GetNRandomHardFrequent(int numberOfWords, double poolFactor)
        {
            int poolSize = Convert.ToInt32(numberOfWords * poolFactor);

            // Inner query: the pool of the hardest, most frequent words.
            // Outer query: a random sample of numberOfWords from that pool.
            string query = @"
                    select 
                        sub.*
                    from
                        (select 
                            v.*
                        from
                            Vocab v
                        order by
                            ifnull(v.score, 0) desc, ifnull(v.frequency, 0) desc, RANDOM()
                        limit ?) sub
                    order by
                        RANDOM()
                    limit ?;";

            var words = database.QueryAsync<Vocab>(query, poolSize, numberOfWords);

            return words;
        }

Wednesday, March 31, 2021

Mixalot - Except Not

{This is a failed attempt at using mixalot in Windows. I am posting it not as information, but the description of a journey. Not all ventures succeed.}

I was looking for a way to read and play music files in Common Lisp and found mixalot on the CLiki. When I tried loading it with quicklisp, it was there to be loaded, but it went looking for a library called libao and couldn't find it. I assume that stands for Library Audio Output; the library is located here from Xiph. So it is easily downloaded, but I had no idea where to put it so that the load for mixalot could find it. Reading the backtrace, I could see that mixalot uses the CFFI library to try to load libao, and the particular function that was failing is the exported function LOAD-FOREIGN-LIBRARY. The documentation for that function is here and has a brief description of where it looks for things. Entering cffi::*foreign-library-directories* at the REPL yielded nil, which left me suspicious that maybe this isn't what mixalot would depend on. (Why would it depend on something that isn't usually set to some meaningful default?)

I decided I needed to look at the code. The main mixalot.lisp has some conditionals and the one I needed to concern myself with was the #+mixalot::use-ao. It is clear from the comments in the code that this was tried specifically on OS X. (I have confessed in previous posts to being a Windows user.)


The first of those lines of file names is macOS-related, apparently. I'm guessing this define-foreign-library function is a macro that turns into a case statement or something similar, and the t line is the catch-all that I need to make work. I had never seen ".so" as a file type before. It turns out to be a Linux thing (Shared Object) that is like a Windows DLL. It looks like I am supposed to compile from source to get myself a DLL. I needed to install updates for Visual Studio to include the "Desktop development with C++" workload. (I might have been able to find a way to use the make files that come with the source code, but I am not familiar with that world, so I defaulted to something more familiar to me.)

Next, I created a solution in VS 2017 Community. I modified the directory structure and include statements so that my compiler could find everything. There was another include that I needed, called dirent.h. I got that from here and copied it to C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\include, based on a suggestion from Darran Rowe here. I also discovered there is a frightening C #define syntax that allows variable-length parameter lists in gcc, and there is something like this also available in Visual Studio (hmm, can't find the barf emoji). I disabled some warnings using a preprocessor option as per this SO answer. I needed to put _ in front of some strdup function calls. I don't care why, I just did what the compiler told me.

On the file ao_wmm.c I was getting:
  • Error C1189 #error:  KS.H must be included before KSMEDIA.H libao2 c:\program files (x86)\windows kits\10\include\10.0.17763.0\shared\ksmedia.h 18
So I put an include of <ks.h> right above <ksmedia.h>, which resolved that.

That took care of the base files of the project, and then I needed to add an entry point. After that, I am stuck because I end up with several errors along the lines of:
  • LNK2019, unresolved external symbol __imp_waveOutGetNumDevs referenced in function ao_wmm_set_option
12 of them. I have been .NET'ing for my day job for a while and never did go deep on the C/C++ end of things, so I have no idea what I'm looking at here or where to start understanding this better. I still have no idea whether I actually need one or more plugins to be included for this project to be useful.

I did find references to pre-compiled binaries for this project online (probably), but they weren't presented on websites curated in a way that gave me confidence I was dealing with reliable, on-the-level sources, so I eschewed every one of them. Call it a refusal to play fast and loose with ethics and security.

Saturday, March 27, 2021

FFT - DIY

There's more than enough information freely available to make your own FFT; you can work from the pseudo-code on Wikipedia. I have been raving about this bit of mathematics for several posts now. I knew that the FFT (\(O(N \log {N})\)) was much faster than computing the DFT (\(O(N^2)\)) from the definition. But how much faster is it really when you compare the methods on realistic data sets? Sorting algorithms show the same kind of complexity difference, but with the small data sets most of us work with (even thousands of points), it doesn't matter that much.

Whenever you set about making something that is somewhat complex, you want a more than casual check to verify that you have the right answer. One way of doing this is to create a simple method that doesn't have any of the complexities related to speeding up the algorithm or computation, and use it as a reference against the more complex method you want to test. You can also find the results of someone else's work and compare against those.

I made both a DFT and an FFT (dit-fft2) method and tried them out. While testing, I noticed that the two agree with each other, but they differ from the results I got in Maxima in two ways: 1) the overall values appear to be scaled in the Maxima FFT, and 2) there was a difference in sign on the imaginary component. (FFT implementations legitimately differ in convention: some apply a 1/N normalization to the forward transform, and some flip the sign of the exponent.) To test for consistency I used create-test-data to make a small data set of 32 points and applied both dft and dit-fft2 to it. They come out the same, so I'm reasonably certain that the calculation was done correctly.

Next, I tried a larger data set, \(2^{14}\) data points. With a sampling rate of 44100 Hz, this is about 0.37s worth of PCM data. The DFT function ran in 2:41 min and the dit-fft2 function ran in 0.125s. Even at 0.125s, that is a long time to process if you have much data to churn through. A brief review of the dit-fft2 function will show that there is a lot of data copying involved in this version and so methods which avoid the need to copy data around by using an indexing scheme are likely to increase performance significantly.

Enough already—code me!
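The original Lisp listings aren't reproduced here; a C# sketch of the same pair of functions might look like this (the method names mirror the Lisp ones, everything else is an assumption): a naive, straight-from-the-definition DFT as the reference, and a recursive radix-2 decimation-in-time FFT to check against it.

```csharp
using System;
using System.Numerics;

public static class Fourier
{
    // Straight-from-the-definition DFT, O(N^2): the simple reference
    // implementation used to check the fast version.
    public static Complex[] Dft(Complex[] x)
    {
        int n = x.Length;
        var result = new Complex[n];
        for (int k = 0; k < n; k++)
            for (int t = 0; t < n; t++)
                result[k] += x[t] * Complex.Exp(new Complex(0, -2 * Math.PI * k * t / n));
        return result;
    }

    // Recursive radix-2 decimation-in-time FFT, O(N log N).
    // Requires the length to be a power of two. Note the copying into
    // even/odd halves; index-based schemes avoid the copying and run faster.
    public static Complex[] DitFft2(Complex[] x)
    {
        int n = x.Length;
        if (n == 1) return new[] { x[0] };
        var even = new Complex[n / 2];
        var odd = new Complex[n / 2];
        for (int i = 0; i < n / 2; i++)
        {
            even[i] = x[2 * i];
            odd[i] = x[2 * i + 1];
        }
        var e = DitFft2(even);
        var o = DitFft2(odd);
        var result = new Complex[n];
        for (int k = 0; k < n / 2; k++)
        {
            Complex twiddle = Complex.Exp(new Complex(0, -2 * Math.PI * k / n)) * o[k];
            result[k] = e[k] + twiddle;           // butterfly: top half
            result[k + n / 2] = e[k] - twiddle;   // butterfly: bottom half
        }
        return result;
    }
}
```

Running both on the same input and comparing element by element is the consistency check described above.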

Monday, March 22, 2021

FFT - Revisiting Symmetry

There was something missing in my last post that might have caused part of the problem with the results. Although I observed what looked like symmetry, I didn't constrain any of my interpolated results based on that assumption of symmetry. So, I thought to myself, there ought to be a mathematical way to tell my equations that they should assume a symmetrical situation, the way I am expecting it to look. I'm not really sure this is broadly applicable, but my intuition about a sine wave is that it should have some kind of symmetry in the resulting appearance. This seems to be borne out by scattered examples of continuous Fourier transforms.

I'm not yet sure how to use symmetry to find the value, but I can see a way to evaluate the accuracy of a given guess. Mind you, I only know how to evaluate it by appearance. Let's introduce a little formula I will call the flip formula.

$$F(x, C) = 2 C - x$$

Suppose you choose \(C = 440\), meaning that you are going to treat 440 as the center. So, for example, \(F(439, 440) = 441\) and \(F(441, 440) = 439\). Let's apply the flip function to the data in one of our previous examples and see what it looks like when we assume the center is 440. 

First, here is the graph we had with just a few points near the peak:

Fig. 1. Just the results of the FFT.

 
Let's have a look at the result when we use 439.5, 439.8, and 440.
Fig. 2. "Guessing" frequency 439.5 Hz. Sure doesn't look smooth.

Fig. 3. "Guessing" frequency 439.8 Hz. Looks better.

Fig. 4. "Guessing" frequency 440 Hz. Looks about as smooth as we're going to get.

There are some considerable drawbacks to this approach. The biggest one is that I had to look at the results. The second drawback is that I don't have a number to put to these to decide which is better. And this is transformed from one of the simplest types of wave, a sine wave. To evaluate the soundness of the guess, I would need some formula to fit the points to.

Friday, March 12, 2021

Finding Frequency Fourier Style

Last week, I had a look at the FFT module in Maxima and it was fun times. I was a little disappointed at the interpolated result for the frequency when analyzing the results of the FFT of the 440 Hz waveform. Mind you, if the application is to find a nearest note frequency and, being a human with an at least average sense of musical perception, you already know you're hearing notes that sound like 12 semitones per octave (exclusive counting, i.e., C up to B), then you can coerce the results to the nearest value of  \((440) 2^{i/12}\), for \(i\) over integers. If you don't intend to get any deeper than determining the major pitches, then this is fine. (Could get more tricky if the whole thing is internally shift tuned—everything a half of a semitone flat, but internally tuned, but then maybe you could create a filter to catch that case and accordingly shift your frequency base from 440 to something else that models the sound better.)
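As a concrete sketch of that coercion (the class and method names are mine): round the measured frequency to the nearest whole number of equal-tempered semitones away from A = 440 Hz.

```csharp
using System;

// Hypothetical helper: snap a measured frequency to the nearest
// equal-tempered semitone relative to A = 440 Hz.
public static class Pitch
{
    public static double NearestNoteFrequency(double hz)
    {
        // i = number of semitones away from 440, rounded to the nearest integer
        int i = (int)Math.Round(12.0 * Math.Log(hz / 440.0, 2.0));
        return 440.0 * Math.Pow(2.0, i / 12.0);
    }
}
```

A shifted tuning base would just replace the two occurrences of 440 with the better-fitting reference frequency.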

We got close enough last time for goal 1 (find the basic notes), but still, I wondered whether I could get closer to the true frequency than last time. I also thought I should use a different example frequency so nobody gets the wrong idea that this is somehow only good for integer frequencies. (Integers are discrete, but discreteness goes beyond just integers.)

So, let's use \((440)  2^{1/12}\) or approximately 466.16 Hz.


Fig. 1. Frequency of 466.16 Hz.

Fig. 2. Zooming in a little closer on the points.

So, let's take these same points and try a cubic spline.


Fig. 3. Cubic spline view of things. Still looks fake. But less fake in the middle.

The cubic spline is helpfully printed out with characteristic functions, which are mostly readable. I pick out the one covering the interval of concern, namely, 0.5264689995542815*x^3-738.1894409400452*x^2+345013.3968874384*x-5.374983574143041*10^7. We can use differentiation to find the location of the maximum value.

 

That gives roots of [x=465.6995765525049, x=469.0682718425062]. Clearly 465.7 is the one we want. We're not any closer than we were last time using linear interpolation, although the general appearance of the graph looks more realistic and we did exercise another bit of the Maxima library.
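The differentiation step can be sketched outside Maxima too. For the cubic a*x^3 + b*x^2 + c*x + d picked out above, the extrema sit at the roots of the derivative 3a*x^2 + 2b*x + c, which the quadratic formula gives directly (the C# below is a hypothetical version, not part of the original post):

```csharp
using System;

// Hypothetical helper: roots of the derivative of a cubic
// a*x^3 + b*x^2 + c*x + d, i.e. of 3a*x^2 + 2b*x + c.
public static class CubicPeak
{
    public static (double Lower, double Upper) DerivativeRoots(double a, double b, double c)
    {
        double qa = 3 * a, qb = 2 * b, qc = c;                 // quadratic coefficients
        double disc = Math.Sqrt(qb * qb - 4 * qa * qc);        // discriminant
        return ((-qb - disc) / (2 * qa), (-qb + disc) / (2 * qa));
    }
}
```

Plugging in the spline coefficients above reproduces the two roots near 465.70 and 469.07; with a positive leading coefficient, the smaller root is the local maximum.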

Saturday, March 6, 2021

FFT - Local Symmetry Inferred Peak

As I was thinking about the previous post, I thought there might be a way to estimate the location of the peak. That is, to find the location in between the discrete data points where the peak probably occurs. 

I tried realpart, imagpart, and abs to see the differences, and it seemed like realpart gives me the best view of the relative amplitudes when there are multiple frequencies involved. I also decided to apply an index shift, since that seems to match the frequency better, although I'm not satisfied with it technically yet.

Let's zoom in on the location of one of the peaks and see what it looks like:

Fig. 1. Looks like we should be able to make a better guess than just picking one of the points.


Here are the actual points, as pairs of frequency and the absolute value of the real part:

[[427.972412109375,1.373307119161833],[430.6640625,1.783783923237471],[433.355712890625,2.525090138349324],[436.04736328125,4.273342049313825],[438.739013671875,13.47741003943344],[441.4306640625,11.94536977902466],[444.122314453125,4.166745114303215],[446.81396484375,2.532438127248608],[449.505615234375,1.822956914133206],[452.197265625,1.426082632315385],[454.888916015625,1.172306075942115]]

For a first crack at it, we might try linear interpolation on the two pairs of points on either side of the gap that contains the peak. Based on the Wikipedia article on the continuous version, we are certainly deviating from the shape of the real thing. Dauntless, we press on to see what we get, because maybe we can live with a slight improvement that we know isn't perfect.

So here we go: do the linear interpolation and find the line intersection using Maxima.

[[x=439.729058165623,y=16.86285595968834]]
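For completeness, the same two-line intersection is easy to sketch outside Maxima (the Maxima session itself isn't shown above); this hypothetical C# version takes a line through the two points left of the gap and a line through the two points right of it:

```csharp
using System;

// Hypothetical helper: intersect the line through p1,p2 (left of the gap)
// with the line through p3,p4 (right of the gap).
public static class PeakGuess
{
    public static (double X, double Y) Intersect(
        (double X, double Y) p1, (double X, double Y) p2,   // left line
        (double X, double Y) p3, (double X, double Y) p4)   // right line
    {
        double m1 = (p2.Y - p1.Y) / (p2.X - p1.X);          // left slope
        double m2 = (p4.Y - p3.Y) / (p4.X - p3.X);          // right slope
        // p1.Y + m1*(x - p1.X) = p3.Y + m2*(x - p3.X), solved for x:
        double x = (p3.Y - p1.Y + m1 * p1.X - m2 * p3.X) / (m1 - m2);
        return (x, p1.Y + m1 * (x - p1.X));
    }
}
```

Feeding in the two points on each side of the gap from the list above reproduces x ≈ 439.729, y ≈ 16.863.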

Let's put that point in the middle and see what it looks like:

 
Fig. 2. Hmm, closer, but looks kinda fake to me.

Somewhat predictably, this looks as fake as it really is. I think this graph makes it clear just how bogus linear interpolation is here. Not completely bogus, though; it got us closer, right? We've ended up on the left side of the true peak, which we know should be exactly at 440 Hz. Why? The sampled points on the right side sit further from the true peak, so the line through them has a lower (absolute) slope; it is less steep than it would be closer to the peak. This is the weakness of linear interpolation here: we end up closer to the higher of the two near-peak points than we should.

To get closer, we might want to try a different type of interpolation. Maybe a cubic spline?