
On the perils of learning in high dimensions


In the science fiction novel Dune, the author, Frank Herbert, describes a means of space travel called "folding space". The idea is that instead of trying to travel faster than light, you bend 3-dimensional space into a fourth or higher dimension and appear wherever you want to go without actually moving. The people who do this sort of thing in the book are highly trained and intelligent, because folding space is very dangerous.

Like the imaginary travel in Dune, machine learning poses its own set of dangers when confronted with high dimensions. This is known as the Curse of Dimensionality, and it is regarded (by Domingos, among others) as the second biggest challenge in ML after overfitting.

So why is this such a problem? Because as we increase the number of variables, and hence dimensions, the amount of data required to effectively train a learner grows exponentially. To convince you that this is true, I'll start with a brief informal discussion, followed by an empirical treatment using one of my favorite tools, the R programming language. For a formal, mathematical treatment, I refer you to a beautiful paper by Mario Koppen. What I aim to do here is introduce the topic in a slightly more approachable way for those from a computer science background like myself.

The first thing to consider is the vastness of high dimensional spaces. Just as there are an infinite number of planes that fill a cube, so too there are an infinite number of cubes that fill a hypercube. This problem of size makes sense intuitively and by itself might seem to explain the need for a lot more data in high dimensions. Size isn't the only thing going on, though. We also have to understand how the locations of points in a hypercube, relative to one another, change as the number of dimensions increases. To understand these relationships, we first need to think about how volumes change when moving from 3 to 4 dimensions and higher.

[Figure: a sphere inscribed in a cube]

Let's consider what happens to the ratio of the volumes of an n-ball and an n-cube as we increase the number of dimensions. Above, I've drawn a sphere inscribed in a cube so that each face of the cube touches the ball. The ball's radius, r, is one-half the side length of the cube, S. Assuming a unit cube with S=1, the formula for the volume of a sphere, V=\frac{4}{3}\pi r^3, gives a volume of about 0.524 for the inscribed sphere.

We might be tempted to think that this ratio would stay the same as the number of dimensions, n, increases, but we would be wrong. To demonstrate, let's look at some R code:
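(My snippet is in R; an equivalent sketch in Python, using the closed-form volume of an n-ball, V_n(r)=\frac{\pi^{n/2}}{\Gamma(n/2+1)}r^n, looks like this. Treat it as an illustration rather than the exact script I ran.)

    import math

    def n_ball_volume(n, r):
        """Volume of an n-dimensional ball of radius r."""
        return math.pi ** (n / 2) * r ** n / math.gamma(n / 2 + 1)

    # volume of the inscribed ball (r = 0.5) relative to the unit cube (volume 1)
    for n in (2, 3, 5, 10, 20):
        print(n, n_ball_volume(n, 0.5))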

This computes the volume of a hypersphere for any number of dimensions and radius. If we run this code for different values of n with r=0.5 and plot the results, we get the following:

[Figure: volume of the inscribed n-ball relative to the unit n-cube versus the number of dimensions]

Wow! So the volume of the hypersphere becomes vanishingly small while the volume of the hypercube with S=1 remains constant at 1. If we could somehow visualize what this situation looks like in high dimensions, we would see what Koppen referred to as a "spiking hypercube", with the corners of the cube extending further and further away from the center while the region near the center holds an ever smaller share of the volume. This means that with increasing dimensions, our data will almost never fall within the center of the volume. It will almost always be near the edges!

[Figure: Koppen's "spiking hypercube"]

Another problem is that the distribution of distances between randomly chosen points in the volume begins to change. We can simulate this with the following bit of R code:
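(Again, my snippet is in R; a rough Python equivalent, which draws pairs of points uniformly at random from the unit hypercube and plots the distribution of the Euclidean distances between them, looks like this.)

    import matplotlib.pyplot as plt
    import numpy as np

    def sample_distances(n_dims, n_samples=1000, seed=0):
        """Euclidean distances between random pairs of points in the unit hypercube."""
        rng = np.random.default_rng(seed)
        a = rng.random((n_samples, n_dims))
        b = rng.random((n_samples, n_dims))
        return np.linalg.norm(a - b, axis=1)

    # one histogram per dimensionality setting
    fig, axes = plt.subplots(1, 4, figsize=(16, 3))
    for ax, n in zip(axes, (2, 10, 100, 1000)):
        ax.hist(sample_distances(n), bins=30)
        ax.set_title(f"{n} dimensions")
    plt.show()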

Running this code for a few different settings of n, with 1000 samples each, generates the following histograms:

[Figure: histograms of pairwise distances for increasing numbers of dimensions]

What's happening here is that as the number of dimensions increases, the variance of the distances around the mean remains relatively unchanged even though the mean itself keeps growing. To illustrate this further, we can plot the mean and standard deviation of the distances between 1000 sample points for every number of dimensions from 2 to 1000.

[Figure: mean pairwise distance versus number of dimensions]

[Figure: standard deviation of pairwise distance versus number of dimensions]

This means that in high dimensions, other points tend to be clustered at similar distances from any given point. This is why instance-based methods like k-NN tend to have difficulty in high dimensions: as the number of dimensions increases, the distances to other points become increasingly similar until the choice of the k nearest points is essentially random.

Well, I hope this brief post helped you understand this very important challenge in ML. I feel like writing it helped improve my own understanding. If this topic interests you, I would definitely encourage you to read the papers by Koppen and Domingos that I linked to previously, as they give a much more formal and authoritative treatment. You can find the complete source code to reproduce these experiments here.

Classification of Multivariable Time Series

Pretty much all of the example applications of classification that I've seen in textbooks and courses like Coursera's deal with data sets where the response variable of each instance is (or is at least assumed to be) independent of the response variable of every other instance. This is true for problems like spam classification or sentiment analysis of product reviews, and that independence makes model selection via cross-validation much easier (among other things). In this kind of problem it's common to see a variable setup like this:

\hat{y}=\theta_0+\theta_1x_1+\dots+\theta_nx_n

Where the x_i's are the explanatory variables, \hat{y} is the predicted response, and the \theta_i's are the fitted model weights, with \theta_0 the intercept.

A lot of data in the world doesn't obey this convenient rule, though. Think of the problem of churn prediction at a phone company. In this case, it's convenient to measure both response and explanatory variables over fixed time intervals. For example, we might consider things like minutes used, MB of data used, overage charges and so forth as explanatory variables. For the response, we'd be interested in whether or not the client cancels their account during the following time interval. After all, it doesn't really do the company's sales staff any favors if the algorithm isn't giving them enough time to take preventive measures to save a client in danger of canceling. The model setup in this sort of example might look like this:

\hat{y}^{t+1}=\theta_0+\theta_1 minutes^{t}+\theta_2 minutes^{t-1}+\theta_3 data^{t}+\theta_4 data^{t-1}

Notice that here we have a vector of lagged time series values for each explanatory variable. In practice, it might be helpful to use longer lag vectors than the two periods shown here for convenience.
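As a concrete illustration, lagged features like these can be built with pandas. The column names below (client_id, period, minutes, data_mb, canceled) are made up for the example, not an actual schema.

    import pandas as pd

    def add_lags(df, cols, n_lags):
        """Append lagged copies of the given columns for each client.
        Assumes one row per (client_id, period) pair."""
        out = df.sort_values(["client_id", "period"]).copy()
        for col in cols:
            for lag in range(1, n_lags + 1):
                out[f"{col}_lag{lag}"] = out.groupby("client_id")[col].shift(lag)
        return out

    # usage = add_lags(usage, ["minutes", "data_mb"], n_lags=2)
    # response: did the client cancel in the *following* period?
    # usage["cancel_next"] = usage.groupby("client_id")["canceled"].shift(-1)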

One issue that tends to crop up in problems like this is leakage. Leakage is an unwanted effect where information from the training set contaminates the test set, and it's known as one of the top 10 mistakes in data mining. It happens when there is an association between examples in the test set and examples in the training set, and since we're dealing with time series data, our examples are by definition not independent.

What to do? In my experience, I've found that as long as the model setup follows the form given in the second equation above, things are generally OK. There are a couple of gotchas I've found, however. One is the inclusion of variables that are constant over time. For example, in our phone company churn problem, we may believe that certain geographic markets are more stable than others, and so we might want to include the latitude and longitude coordinates of our clients like so:

\hat{y}^{t+1}=\theta_0+\theta_1 minutes^{t}+\theta_2 minutes^{t-1}+\theta_3 data^{t}+\theta_4 data^{t-1}+\theta_5 latitude+\theta_6 longitude

This may not always be a problem, but if we're using a nonlinear model that produces very tight local fits (like random forests, for example), then it's entirely possible that examples in the training set correlate with examples in the test set in a way that just wouldn't be possible when the model is used in production. If the results in production are much less accurate than on the test set, it may be worth investigating the possibility of leakage.

Another case where leakage can occur is if we modify the time window over which \hat{y} is measured, for example if we want to predict over the next 3 time intervals, so that \hat{y}^{t+1} becomes \hat{y}^{t+1...t+3}. One might want to do this in order to mitigate noise in the response. This is usually fine during cross-validation, as the main effect it has is to oversample the positive class. However, when testing on the holdout examples, one needs to make sure there is an adequate time offset between the two groups. For example, if we have time intervals of one day and a prediction window of 3 days, then we would want to separate the training and holdout groups by at least 3 days.
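A minimal sketch of that kind of gapped split, assuming an integer day index in a hypothetical period column (illustrative only, not the exact procedure I used):

    def split_with_gap(df, cutoff_day, gap_days=3):
        """Train on periods before cutoff_day; hold out periods starting at least
        gap_days later, so the prediction windows of the two groups can't overlap."""
        train = df[df["period"] < cutoff_day]
        holdout = df[df["period"] >= cutoff_day + gap_days]
        return train, holdout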

Well, I hope this post has been helpful. Thank you for stopping by. I'm learning more all the time, so I'd be happy to hear any corrections or criticism in the comments. I've recently activated spam filtering in this blog, so hopefully they won't get lost!

PCA on the MNIST data

So far, k-NN has turned out to be a reasonably accurate, if rather slow, classifier on the MNIST dataset. At over 20 minutes to compute the results for the test data set on my iMac, and even longer once one takes into account cross-validation for debugging on the training data, it's clear that such a research approach isn't sustainable.

[Figure: "Write more clearly, people!!!" Example 1's from the training data set. Notice the wide variety of examples and how they are transformed: rotation, translation, fattening, etc.]

How might this be remedied? One possibility is through data compression. By compressing the image data in much the same way as everyday image files are compressed, we can greatly reduce both the amount of computation and space required to perform k-NN.

Previously, we saw that the distribution of variance across pixels is fairly wide. If there were some way to create a new feature set in which the variance is concentrated in a small subset of the features, then we might be able to achieve our goal. As it turns out, such techniques do indeed exist. The one we are interested in today is called Principal Component Analysis (PCA), a technique based on linear algebra. I'm not going to describe it in detail today, but the major idea is that it finds a new set of basis vectors in which most of the variance falls within or near a small number of dimensions. Shown below is the distribution of variance across the first 100 of the new features for the training examples.

[Figure: variance of the first 100 PCA-generated features]

Most of the variance seems to be compressed into the first 40 or so features. This is a big improvement: instead of dealing with 784 dimensions, we are probably looking at less than one tenth that number. It's important to keep in mind that these new features don't have any meaning in normal everyday terms; they certainly don't correspond to 'pixels' anymore. Also, the compression process is lossy for any number of new features less than our original number, so we are losing some information. As shown below, reconstructing an example digit from the new basis results in a digit that is "fuzzier".

[Figure: original example digit]

[Figure: reconstructed example digit]
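For anyone wanting to try this, a sketch of the projection and reconstruction with scikit-learn follows. It is not my exact script, and it assumes Kaggle's train.csv layout (a label column followed by 784 pixel columns).

    import pandas as pd
    from sklearn.decomposition import PCA

    train = pd.read_csv("train.csv")                   # Kaggle digit data
    X = train.drop(columns="label").values             # (n_samples, 784) pixel matrix

    pca = PCA(n_components=40)                         # keep the high-variance directions
    X_reduced = pca.fit_transform(X)                   # project onto the new basis
    X_restored = pca.inverse_transform(X_reduced)      # lossy, "fuzzier" reconstruction

    print(pca.explained_variance_ratio_.cumsum()[-1])  # fraction of variance retained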


So how did we do on the training data? After trying a few different values for the number of PCA components, N, I settled on 36 as being quite good. This value resulted in an execution time of around 37 seconds, a big improvement over before! It also allowed me to rapidly test a number of configurations for both N and k. Going to significantly fewer dimensions, however (16, for example), had the opposite effect and actually hurt accuracy. The results on the cross-validated training set are shown below.

[Figure: cross-validated performance with various numbers of components for PCA (lower is better)]
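The experiment behind this plot can be sketched as a PCA-plus-k-NN pipeline scored with cross-validation. This is an approximation of my setup rather than the code I actually ran, and it again assumes Kaggle's train.csv.

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    train = pd.read_csv("train.csv")
    X, y = train.drop(columns="label").values, train["label"].values

    for n_components in (16, 36):
        model = make_pipeline(PCA(n_components=n_components),
                              KNeighborsClassifier(n_neighbors=6, weights="distance"))
        scores = cross_val_score(model, X, y, cv=5)    # 5-fold cross-validation
        print(n_components, scores.mean())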

Surprisingly (or perhaps not), it turns out that using PCA not only improved the speed of computation, it also resulted in greater accuracy! My submission from this round of tweaks came in at a respectable 97.5% accuracy. Hooray!

In upcoming posts, I plan to delve into the details of how PCA actually works and why it's possible to gain not only speed but also accuracy by effectively removing variables from the data.

Some variations on k-NN

I decided to try a few more things with k-NN on the MNIST data, namely varying k and trying the same settings of k with both 'distance' and 'uniform' weighting. For each experiment, I used k-fold cross-validation with 5 folds. It turned out that setting k=6 with distance weighting gave the best overall performance, at just a smidgen under a respectable 97% accuracy (96.957% to be exact). The results of testing the various settings are shown in the plot below.

[Figure: cross-validated k-NN performance for various settings of k with uniform and distance weighting]
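The grid of experiments behind this plot can be sketched with scikit-learn's GridSearchCV; my actual script may have differed, and the data loading again assumes Kaggle's train.csv.

    import pandas as pd
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    train = pd.read_csv("train.csv")
    X, y = train.drop(columns="label").values, train["label"].values

    # try a range of k values with both weighting schemes, 5-fold CV each
    param_grid = {"n_neighbors": list(range(2, 21)),
                  "weights": ["uniform", "distance"]}
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)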

It's interesting to compare this plot to a similar plot on page 17 of Hastie et al. As expected, performance degrades as k increases beyond 10 or so. However, there is no corresponding degradation at very low k, which would be a sign of overfitting. My guess is that overfitting is happening, but it isn't noticeable in this plot because the Curse of Dimensionality is preventing the "sweet spot" (which I would guess is in the k=4-8 range) from showing its full effect.

On a side note, it's also interesting to look at the confusion matrix resulting from one of these experiments. For those not familiar, a confusion matrix simply tabulates predicted labels (in our case the digits 0-9) against actual labels; that is, what the algorithm thought something was versus what it actually is. As shown below, the algorithm correctly predicted '1' almost all of the time, while it had a bit more trouble with the digits 5 and 8.

[Figure: confusion matrix of predicted versus actual digits]
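Producing a matrix like this from cross-validated predictions is straightforward in scikit-learn; the sketch below uses the best settings found above and is illustrative rather than my exact code.

    import pandas as pd
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_predict
    from sklearn.neighbors import KNeighborsClassifier

    train = pd.read_csv("train.csv")
    X, y = train.drop(columns="label").values, train["label"].values

    y_pred = cross_val_predict(
        KNeighborsClassifier(n_neighbors=6, weights="distance"), X, y, cv=5)
    print(confusion_matrix(y, y_pred))   # rows: actual digit, columns: predicted digit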

In a future post, I'm planning to explore some dimensionality reduction techniques to see if it's possible to squeeze even more performance out of k-NN!


A look at k-NN for digit recognition

I recently entered a submission to the Kaggle digit recognition contest based on the scikit-learn implementation of the k-nearest neighbors algorithm, which resulted in a modest 96.6% accuracy. Hastie et al. define this algorithm as:

\hat{Y}\left(x\right)=\frac{1}{k}\sum\limits_{x_i \in N_k\left(x\right)}y_i

This simply says that we look at all of the training samples within a neighborhood of the query point and average their values to produce an estimate, \hat{Y}, for the test sample in question. I arbitrarily chose a neighborhood size of 10 and the default Minkowski distance with p=2 (Euclidean), as follows:

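(The sketch below approximates that setup; it is not my exact script, and it assumes Kaggle's train.csv and test.csv files.)

    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier

    train = pd.read_csv("train.csv")     # label column plus 784 pixel columns
    test = pd.read_csv("test.csv")       # 28,000 unlabeled test images
    X, y = train.drop(columns="label").values, train["label"].values

    # k = 10 neighbors, Minkowski distance with p = 2 (i.e. Euclidean)
    clf = KNeighborsClassifier(n_neighbors=10, metric="minkowski", p=2)
    clf.fit(X, y)
    predictions = clf.predict(test.values)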

Execution took about 20 minutes and 53 seconds on my computer's 3.4 GHz i7 processor: quite a bit of time, but perhaps not surprising, as k-NN is sometimes referred to as a "lazy learner". In other words, it postpones generalization until the actual query is made. In this case, there are 28,000 such queries and 784 dimensions per neighbor in each query (one dimension per pixel), so that's a lot of computation to delay!

Obviously, the sensible approach in this case would be to find some way to reduce the number of dimensions if possible. One can look at the amount of information that each pixel encodes across the training examples. Looking around the web, I found one article on this same problem, and so I attempted to follow the author's Exploratory Data Analysis technique by looking at the standard deviation and coefficient of variation (CoV) across pixels within the training examples, using the following Python code:

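(Roughly, the computation looks like the sketch below; my original notebook cell isn't reproduced exactly.)

    import matplotlib.pyplot as plt
    import pandas as pd

    train = pd.read_csv("train.csv")
    pixels = train.drop(columns="label")

    # per-pixel standard deviation across all training examples
    pixel_std = pixels.std()
    plt.plot(pixel_std.values)
    plt.xlabel("pixel index")
    plt.ylabel("standard deviation")
    plt.show()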

Plotting the standard deviation gave the following result, which matched the original author's work nicely:

[Figure: per-pixel standard deviation across the training examples]

And then calculating CoV:

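(Again, a sketch of the idea rather than my exact cell.)

    # continuing from the snippet above (pixels, pixel_std and plt already defined):
    # coefficient of variation = per-pixel standard deviation divided by the mean
    pixel_mean = pixels.mean()
    pixel_cov = pixel_std / pixel_mean

    plt.plot(pixel_cov.values)
    plt.xlabel("pixel index")
    plt.ylabel("coefficient of variation")
    plt.show()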


This produced the following plot:

[Figure: per-pixel coefficient of variation, linear y-axis]

Whoa! That doesn't look right at all. Or does it? It turns out that quite a few of the pixels have means close to zero, which results in rather large values for something that should be a normalized quantity. Let's look at the same data plotted with a logarithmic y-axis:

[Figure: per-pixel coefficient of variation, logarithmic y-axis]

Well, that's a bit easier to make sense of. Unfortunately, though, it doesn't seem like CoV is a very good statistic for our purposes here, because so many of the values are large. I plan to return soon with more results!