May | 2014 | Beginner's Mind

I decided to try a few more things with k-NN on the MNIST data, namely varying k and trying each setting of k with both distance and uniform weighting. For each experiment, I used 5-fold cross-validation. It turned out that setting k=6 with distance weighting gave the best overall performance, at just a smidgen under a respectable 97% accuracy (96.957% to be exact). The results of testing the various settings are shown in the plot below.
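A minimal sketch of this kind of sweep, assuming scikit-learn's `KNeighborsClassifier` (whose `weights` parameter accepts `'uniform'` or `'distance'`). To keep it quick to run, it uses the small `load_digits` dataset bundled with scikit-learn as a stand-in for MNIST, and the values of k are illustrative rather than the ones behind the plot.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn, not MNIST.
X, y = load_digits(return_X_y=True)

results = {}
for weights in ("uniform", "distance"):
    for k in (1, 3, 6, 10, 20):  # illustrative values of k
        clf = KNeighborsClassifier(n_neighbors=k, weights=weights)
        # 5-fold cross-validation; each fold is scored by accuracy.
        scores = cross_val_score(clf, X, y, cv=5)
        results[(weights, k)] = scores.mean()

best = max(results, key=results.get)
print("best (weights, k):", best, "accuracy: %.4f" % results[best])
```

Numbers from this stand-in dataset will not match the MNIST figures quoted above; the point is only the shape of the experiment.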

It's interesting to compare this plot to a similar plot on page 17 of Hastie et al. As expected, performance degrades with increasing k beyond 10 or so. However, there is no corresponding degradation with very low k, which would be a result of overfitting. My guess is that overfitting is happening, but isn't noticeable in this plot because the Curse of Dimensionality is preventing the "sweet spot" (which I would guess is in the k=4-8 range) from maximizing its effectiveness.

On a side note, it's also interesting to look at the confusion matrix resulting from one of these experiments. For those not familiar, a confusion matrix simply tabulates predicted labels (in our case the digits 0-9) against actual labels. That is, what the algorithm thought something was, versus what it actually is. As shown below, the algorithm correctly predicted '1' almost all of the time, while it had a bit more trouble with the digits 5 and 8.
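For anyone who wants to see the mechanics, here is a small sketch of how such a matrix can be built by hand. The row and column convention (rows for actual labels, columns for predicted) is an assumption of this sketch, and the labels are made up purely for illustration; they are not results from the experiments.

```python
def confusion_matrix(actual, predicted, n_classes=10):
    """Count matrix: rows are actual labels, columns are predicted labels."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    return cm

# Made-up labels for illustration only.
actual = [1, 1, 1, 5, 5, 8, 8, 8]
predicted = [1, 1, 1, 5, 3, 8, 3, 8]
cm = confusion_matrix(actual, predicted)

# Per-digit accuracy: diagonal entry divided by the row total.
for digit, row in enumerate(cm):
    total = sum(row)
    if total:  # skip digits that never appear in `actual`
        print(digit, row[digit] / total)
```

Correct predictions land on the diagonal, so off-diagonal entries show which digits get mistaken for which.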

In a future post, I'm planning to explore some dimensionality reduction techniques to see if it's possible to squeeze even more performance out of k-NN!

I recently entered a submission to the Kaggle digit recognition contest based on the scikit-learn implementation of the k-nearest neighbor algorithm, which resulted in a modest 96.6% accuracy. Hastie et al. define this algorithm as:

$\hat{Y}\left(x\right)=\frac{1}{k}\sum\limits_{x_i \in N_k\left(x\right)}y_i$

This simply says that we are looking at the k training samples closest to a test sample (its neighborhood, $N_k(x)$) and averaging their values to determine an estimate, $\hat{Y}$, for the test sample in question. I arbitrarily chose a neighborhood size of 10 and the default Minkowski distance with p=2 (Euclidean) as follows:
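A minimal sketch of that configuration, assuming scikit-learn's `KNeighborsClassifier`. The bundled `load_digits` data and the train/test split are stand-ins so the snippet runs on its own; the actual submission was trained on the Kaggle data. Note that the classifier takes a majority vote among the neighbors rather than literally averaging their labels.

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): the submission itself used the Kaggle MNIST files.
X, y = load_digits(return_X_y=True)
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

# k = 10 neighbors; Minkowski distance with p=2 is Euclidean distance.
clf = KNeighborsClassifier(n_neighbors=10, metric="minkowski", p=2)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print("accuracy: %.3f" % clf.score(X_test, y_test))
```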

Execution took about 20 minutes and 53 seconds on my computer's 3.4 GHz i7 processor: quite a bit of time, but perhaps not surprising as k-NN is sometimes referred to as a "Lazy Learner". In other words, it postpones generalization until the actual query is made. In this case, there are 28,000 such queries and 784 dimensions per neighbor in each query (one dimension per pixel), so that's a lot of computation to delay!
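To make the "lazy" part concrete, here is a brute-force sketch in NumPy. The class name `LazyKNN` and the tiny 2-D data are hypothetical, and this is not how scikit-learn implements it internally; it only shows that fitting stores the data while every query pays for a distance to every stored sample.

```python
import numpy as np

class LazyKNN:
    """Brute-force k-NN: all the real work happens at query time."""

    def __init__(self, k=10):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; nothing is generalized yet.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict_one(self, x):
        # One Euclidean distance per stored sample, summed over every dimension.
        diffs = self.X - np.asarray(x, dtype=float)
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = np.argsort(dists)[: self.k]  # indices of the k nearest samples
        labels, counts = np.unique(self.y[nearest], return_counts=True)
        return labels[np.argmax(counts)]  # majority vote among the neighbors

# Tiny made-up 2-D example.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]
model = LazyKNN(k=3).fit(X, y)
print(model.predict_one([0.2, 0.1]), model.predict_one([5.5, 5.2]))
```

With 784 pixels per image, the cost of each `predict_one` call grows with both the number of stored samples and the number of dimensions, which is why trimming dimensions is attractive.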

Obviously, the sensible approach in this case would be to find some way to reduce the number of dimensions if possible. One can look at the amount of information that each pixel encodes in the training examples. Looking around the web, I found one article on this same problem, so I attempted to follow the author's Exploratory Data Analysis technique by looking at the standard deviation and Coefficient of Variation across pixels within the training examples using the following Python code:
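A minimal sketch of the per-pixel standard deviation step, assuming the training pixels sit in a NumPy array `X` with one row per image and one column per pixel. The random data below is a stand-in (including the artificially blank first 50 pixels), not the Kaggle training set.

```python
import numpy as np

# Stand-in for the training pixels (assumption): 100 fake images, 784 pixels, values 0-255.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, 784)).astype(float)
X[:, :50] = 0.0  # mimic pixels that are blank in every image

# Per-pixel statistics across the training examples (axis 0 runs over images).
pixel_std = X.std(axis=0)
pixel_mean = X.mean(axis=0)

# Pixels that never vary carry no information for telling digits apart.
constant_pixels = np.flatnonzero(pixel_std == 0)
print(len(constant_pixels), "pixels have zero standard deviation")
```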

Plotting the standard deviation gave the following result, which matched the original author's work nicely:

And then calculating CoV:
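A minimal sketch of that calculation, assuming CoV is taken as the per-pixel standard deviation divided by the per-pixel mean. The helper name `coefficient_of_variation`, the choice to return NaN for pixels with zero mean, and the made-up data are all assumptions of this sketch.

```python
import numpy as np

def coefficient_of_variation(X):
    """Per-column std / mean; columns with zero mean come back as NaN."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    cov = np.full(mean.shape, np.nan)
    np.divide(std, mean, out=cov, where=mean != 0)
    return cov

# Made-up pixel columns: one well-used pixel, one almost always dark, one always dark.
X = np.array([
    [120.0, 0.0, 0.0],
    [130.0, 0.0, 0.0],
    [125.0, 0.0, 0.0],
    [135.0, 1.0, 0.0],
])
cov = coefficient_of_variation(X)
print(cov)
```

Even in this toy case, the nearly dark pixel ends up with a much larger CoV than the well-used one, because its mean is so close to zero.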

produced the following plot:

Whoa! That doesn't look right at all. Or does it? It turns out that quite a few of the pixels have means close to zero, and dividing by those means produces rather large numbers for something that should be a normalized quantity. Let's look at the same data plotted with a logarithmic y-axis:

Well that's a bit easier to make sense of. Unfortunately though, it doesn't seem like CoV is a very good statistic to use for our purposes here because so many of the values are large. I plan to return soon with more results!


Some variations on k-NN

A look at k-NN for digit recognition