
On the perils of learning in high dimensions

[Image: space travel in Dune]

In the science fiction novel Dune, the author, Frank Herbert, describes a means of space travel called "folding space". The idea is that instead of trying to travel faster than light, you could bend 3-dimensional space into a 4th or higher dimension and appear wherever you wanted to go without actually moving. The people who do this sort of thing in the book are highly trained and intelligent because folding space is very dangerous.

Like the imaginary travel in Dune, machine learning poses its own set of dangers when confronted with high dimensions. This is known as the Curse of Dimensionality, and it is often regarded as the second biggest challenge in ML after overfitting.

So why is this such a problem? Because as we increase the number of variables, and hence dimensions, the amount of data required to effectively train a learner grows exponentially. To convince you that this is true, I'll start with a brief informal discussion, followed by an empirical treatment with one of my favorite tools, the R programming language. For a formal, mathematical treatment, I refer you to a beautiful paper by Mario Koppen. What I aim to do here is introduce the topic in a slightly more approachable way for those who, like me, come from a computer science background.

The first thing to consider is the vastness of high dimensional spaces. Just as there are an infinite number of planes that fill a cube, so too there are an infinite number of cubes that fill a hypercube. This problem of size makes sense intuitively and by itself might seem to explain the need for a lot more data in high dimensions. Size isn't the only thing going on, though. We also have to understand how the locations of points in a hypercube relative to one another change with increasing dimensions. To understand these relationships, we need to first think about how volumes change when moving from 3 to 4 dimensions and higher.

[Figure: a sphere inscribed in a cube]

Let's consider what happens to the ratio of the volumes of an n-ball and an n-cube as we increase the number of dimensions. Above, I've drawn a sphere inscribed in a cube such that each side of the cube touches the ball. The ball's radius, r, is one-half the width of the cube, S. Assuming a unit cube with S=1, the equation for the volume of a sphere, V=\frac{4}{3}\pi r^3, gives a volume of 0.524 for the inscribed sphere.
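
For general n, the volume of an n-ball of radius r is given by the gamma function:

V_n(r) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+1\right)} r^n

With r=\frac{1}{2} the numerator \pi^{n/2}r^n = \left(\frac{\pi}{4}\right)^{n/2} already shrinks toward zero while the \Gamma term keeps growing, so the ratio of ball to cube collapses quickly.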

We might be tempted to think that this ratio would stay the same as the number of dimensions, n, increases, but we would be wrong. To demonstrate, let's look at some R code:
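
The snippet boils down to the formula above; a minimal version might look something like this:

# Volume of an n-ball of radius r: pi^(n/2) / gamma(n/2 + 1) * r^n
nball_volume <- function(n, r = 0.5) {
  pi^(n / 2) / gamma(n / 2 + 1) * r^n
}

# Volumes of the inscribed ball (r = 0.5) for dimensions 1 through 20
sapply(1:20, nball_volume, r = 0.5)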

This computes the volume of a hypersphere for any number of dimensions and radius. If we run this code for different values of n with r=0.5 and plot the results, we get the following:

[Figure: volume of the inscribed hypersphere vs. number of dimensions]

Wow! So the volume of the hypersphere becomes vanishingly small while the volume of the hypercube with S=1 remains constant. If we could somehow visualize this situation in high dimensions, we would see what Koppen referred to as a "spiking hypercube", with the corners of the cube extending farther and farther from a center region that holds less and less of the volume. This means that with increasing dimensions, our data will almost never fall within the center of the volume. It will almost always be near the edges!

[Figure: the "spiking" hypercube]

Another problem is that the distribution of distances between randomly chosen points in the volume begins to change. We can simulate this with the following bit of R code:
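
Something along these lines does the trick (sample_distances is a hypothetical helper that draws pairs of points uniformly at random from the unit hypercube and measures their Euclidean separation):

# Euclidean distances between 'samples' random pairs of points
# drawn uniformly from the n-dimensional unit hypercube
sample_distances <- function(n, samples = 1000) {
  a <- matrix(runif(samples * n), nrow = samples)
  b <- matrix(runif(samples * n), nrow = samples)
  sqrt(rowSums((a - b)^2))
}

# Histograms for a few settings of n
for (n in c(2, 10, 100, 1000)) {
  hist(sample_distances(n), breaks = 30, main = paste("n =", n))
}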

Running this code for a few different settings of n, with 1000 samples each, generates the following histograms:

[Figure: histograms of pairwise distances for several values of n]

What's happening here is that as the number of dimensions increases, the variance of the distances around the mean remains relatively unchanged even though the mean itself keeps increasing. To illustrate further, we can plot the mean and standard deviation of the distances for 1000 sample pairs at every number of dimensions between 2 and 1000.
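
Reusing the hypothetical sample_distances helper from the sketch above, that computation might look like this:

dims  <- 2:1000
stats <- sapply(dims, function(n) {
  d <- sample_distances(n)   # 1000 random pairwise distances in n dimensions
  c(mean = mean(d), sd = sd(d))
})

plot(dims, stats["mean", ], type = "l", xlab = "dimensions", ylab = "mean distance")
plot(dims, stats["sd", ], type = "l", xlab = "dimensions", ylab = "sd of distance")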

[Figures: mean and standard deviation of pairwise distance vs. number of dimensions]

This means that in high dimensions, other points tend to be clustered at similar distances from any given point. This is why instance-based methods like k-NN tend to have difficulty in high dimensions. As the number of dimensions increases, the distances to other points become increasingly similar until the choice of the k nearest points is essentially random.

Well, I hope this brief post helped you understand this very important challenge in ML. I feel like writing it helped improve my own understanding. If this topic interests you, I would definitely encourage you to read the papers by Koppen and Domingos that I linked to previously, as they give a much more formal and authoritative treatment. You can find the complete source code to reproduce these experiments here.

Classification of Multivariable Time Series

Pretty much all of the example applications of classification that I've seen in textbooks and courses like Coursera deal with data sets where the response variable of each instance is (or is at least assumed to be) independent of the response variable of every other instance. This is true for problems like spam classification or sentiment analysis of product reviews, and it makes model selection via cross-validation much easier (among other things). In this kind of problem it's common to see a variable setup like this:

\hat{y}=\theta_0+\theta_1x_1 + \ldots + \theta_nx_n

Where x_i's are the explanatory variables, \hat{y} is the predicted response, and \theta_i's are the fitted model weights.

A lot of data in the world doesn't obey this convenient rule, though. Think of the problem of churn prediction at a phone company. In this case, it's convenient to measure both response and explanatory variables over fixed time intervals. For example, we might consider things like minutes used, MB of data used, overage charges and so forth as explanatory variables. For the response, we'd be interested in whether or not the client cancels their account during the following time interval. After all, it doesn't really do the company's sales staff any favors if the algorithm isn't giving them enough time to take preventive measures to save a client in danger of canceling. The model setup in this sort of example might look like this:

\hat{y}^{t+1}=\theta_0 + \theta_1 minutes^{t}+\theta_2 minutes^{t-1}+ \theta_3 data^{t}+\theta_4 data^{t-1}

Notice that here we have a vector of lagged time series values for each explanatory variable. In practice, it might be helpful to use more lags than the two shown here for convenience.
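
As a sketch of that setup (a toy table with hypothetical column names; a real project would use more clients, more intervals, and deeper lags):

# Toy monthly usage table: one row per client per time interval
usage <- data.frame(
  client_id = rep(1:2, each = 3),
  month     = rep(1:3, times = 2),
  minutes   = c(300, 280, 150, 90, 120, 110),
  data_mb   = c(500, 450, 200, 800, 820, 790)
)

# Shift each client's series by one interval so row t also carries the t-1 values
lag1 <- function(x, id) ave(x, id, FUN = function(v) c(NA, head(v, -1)))
usage$minutes_lag1 <- lag1(usage$minutes, usage$client_id)
usage$data_mb_lag1 <- lag1(usage$data_mb, usage$client_id)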

One issue that tends to crop up in problems like this is leakage. Leakage is an unwanted effect where information from the training set contaminates the test set, and it's known as one of the top 10 mistakes in data mining. This happens because somewhere there is an association between examples in the test set and examples in the training set, and since we're dealing with time series data, our examples are by definition not independent.

What to do? In my experience, I've found that as long as the model setup follows the form given in the second equation above, things are generally ok. There are a couple of gotchas I've found, however. One is the inclusion of variables that are constant over time. For example, in our phone company churn problem, we may believe that certain geographic markets are more stable than others, and so we might want to include the latitude and longitude coordinates of our clients like so:

\hat{y}^{t+1}=\theta_0 + \theta_1 minutes^{t}+\theta_2 minutes^{t-1}+ \theta_3 data^{t}+\theta_4 data^{t-1}+\theta_5 latitude+\theta_6 longitude

This may not always be a problem, but if we're using a nonlinear model that produces very tight local fits (like random forests, for example) then it's entirely possible that examples in the training set may correlate with the test set in a way that just wouldn't be possible when the model is being used in production. If the results in production are much less accurate than in the test set, it may be worth investigating the possibility of leakage.

Another case where leakage can occur is if we modify the time window over \hat{y}; for example, if we wanted to predict over the next 3 time intervals, \hat{y}^{t+1} becomes \hat{y}^{t+1 \ldots t+3}. One might want to do this in order to mitigate noise in the response. This is usually fine during cross-validation, as the main effect it has is to oversample the positive class. However, when testing on the holdout examples, one needs to make sure there is an adequate time offset between the two groups. For example, if we have time intervals of one day and a prediction interval of 3 days, then we would want to separate the training and holdout groups by at least 3 days.
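
One simple way to enforce that offset, assuming each example carries a date, is to leave a gap equal to the prediction window between the last training date and the first holdout date:

# Hypothetical example table with one row per client per day
examples <- data.frame(
  date = rep(seq(as.Date("2014-06-01"), as.Date("2014-07-31"), by = "day"), 2),
  y    = rbinom(122, 1, 0.1)
)

cutoff   <- as.Date("2014-06-30")
gap_days <- 3   # matches a 3-day prediction window

train   <- examples[examples$date <= cutoff, ]
holdout <- examples[examples$date > cutoff + gap_days, ]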

Well, I hope this post has been helpful. Thank you for stopping by. I'm learning more all the time, so I'd be happy to hear any corrections or criticism in the comments. I've recently activated spam filtering in this blog, so hopefully they won't get lost!

PCA on the MNIST data

So far, k-NN has turned out to be a reasonably accurate, if rather slow, classifier on the MNIST dataset. At over 20 minutes to compute the results for the test data set on my iMac, and even longer once you factor in cross-validation for debugging on the training data, it's clear that such a research approach isn't sustainable.

Write more clearly, people!!!

Examples of the digit 1 from the training data set. Notice the wide variety of examples and how they are transformed: rotation, translation, fattening, etc.

How might this be remedied? One possibility is through data compression. By compressing the image data in much the same way as everyday image files are compressed, we can greatly reduce both the amount of computation and space required to perform k-NN.

Previously, we saw that the distribution of variance across pixels is fairly wide. If there were some way to create a new feature set in which the variance is compressed into a small subset of the features, then we might be able to achieve our goal. As it turns out, such techniques do indeed exist. The one we are interested in today is called Principal Component Analysis (PCA). It is a technique based on linear algebra. I'm not going to describe it in detail today, but the major idea is that it finds a new set of basis vectors in which most of the variance falls within or near a small number of dimensions. Shown below is the distribution of variance for the training examples across the first 100 of the new features.

[Figure: variance of the first 100 PCA features]

Most of the variance seems to be compressed into the first 40 or so features. This is a big improvement: instead of dealing with 784 dimensions, we are probably looking at less than one tenth that number. It's important to keep in mind that these new features don't have any meaning in normal everyday terms. They certainly don't correspond to 'pixels' anymore. Also, the compression process is lossy for any number of new features less than our original number, so we are losing some information. As shown below, reconstructing an example digit from the new basis results in a digit that is "fuzzier".
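
As a rough sketch of the whole round trip in R (prcomp standing in for whatever PCA routine one actually uses, and a random matrix standing in for the real 784-pixel images):

# Stand-in for the training images: one 784-pixel row per example
X <- matrix(runif(1000 * 784), nrow = 1000)

pca <- prcomp(X, center = TRUE)

# Variance captured by each of the new features (principal components)
plot(pca$sdev[1:100]^2, type = "h", xlab = "component", ylab = "variance")

# Keep the first k components, then project back to pixel space
k     <- 36
Z     <- pca$x[, 1:k]                       # compressed representation
X_hat <- Z %*% t(pca$rotation[, 1:k])       # lossy reconstruction
X_hat <- sweep(X_hat, 2, pca$center, "+")   # undo the centering

On pure noise like this the variance is spread evenly across components; on the actual digit images it concentrates in the first few dozen, as in the plot above.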

[Figure: original example digit]

[Figure: reconstructed example digit]

So how did we do on the training data? After trying a few different values for the number of PCA components, N, I settled on a value of 36 as being quite good. This value resulted in an execution time of around 37 seconds, a big improvement from before! This allowed me to rapidly test a number of configurations for both N and k. Going to significantly fewer dimensions (16, for example), however, had the opposite effect and actually hurt accuracy. The results on the cross-validated training set are shown below.

[Figure: performance with various numbers of PCA components (lower is better)]

Surprisingly (or perhaps not), it turns out that using PCA not only improved the speed of computation, it also resulted in greater accuracy! My submission from this round of tweaks came in at a respectable 97.5% accuracy. Hooray!

In upcoming posts, I plan to delve into the details of how PCA actually works and why it's possible to gain not only speed but accuracy as well by effectively removing variables from the data.

Some variations on k-NN

I decided to try a few more things with k-NN on the MNIST data, namely varying k and trying the same settings of k for both distance-weighted and uniform-weighted voting. For each experiment, I used k-fold cross-validation with 5 folds. It turned out that setting k=6 with distance weighting gave the best overall performance, at just a smidgen under a respectable 97% accuracy (96.957% to be exact). The results of testing the various settings are shown in the plot below.

[Figure: k-NN cross-validation results for various settings of k and weighting]

It's interesting to compare this plot to a similar plot on page 17 of Hastie et al. As expected, performance degrades with increasing k beyond 10 or so. However, there is no corresponding degradation at very low k, which would be a sign of overfitting. My guess is that overfitting is happening, but isn't noticeable in this plot because the Curse of Dimensionality prevents the "sweet spot" (which I would guess is in the k=4-8 range) from showing its full effect.

On a side note, it's also interesting to look at the confusion matrix resulting from one of these experiments. For those not familiar, a confusion matrix simply tabulates predicted labels (in our case the digits 0-9) against actual labels. That is, what the algorithm thought something was, versus what it actually is. As shown below, the algorithm correctly predicted '1' almost all of the time, while it had a bit more trouble with the digits 5 and 8.
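
In R, for instance, a confusion matrix is just a cross-tabulation of the two label vectors (hypothetical values shown):

predicted <- c(1, 1, 5, 8, 3, 5, 8, 0)   # what the algorithm said
actual    <- c(1, 1, 5, 3, 3, 8, 8, 0)   # what the digits really were

table(predicted, actual)   # rows: predicted labels, columns: actual labels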

[Figure: confusion matrix of predicted vs. actual digit labels]

In a future post, I'm planning to explore some dimensionality reduction techniques to see if it's possible to squeeze even more performance out of k-NN!

 

A look at k-NN for digit recognition

I recently entered a submission to the Kaggle digit recognition contest based on the scikit-learn implementation of the k-nearest neighbor algorithm, which resulted in a modest 96.6% accuracy. Hastie et al. define this algorithm as:

\hat{Y}\left(x\right)=\frac{1}{k}\sum\limits_{x_i \in N_k\left(x\right)}y_i

This simply says that we are looking at all of the training samples within a neighborhood and averaging their values to determine an estimate, \hat{Y}, for the test sample in question. I arbitrarily chose a neighborhood size of 10 and the default Minkowski distance with p=2 (Euclidean) as follows:

[Code screenshot: k-NN classifier setup with scikit-learn]
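
For reference, the Minkowski distance between two images x and x' reduces to the familiar Euclidean distance when p=2:

d_p\left(x, x'\right)=\left(\sum\limits_{j=1}^{784}\left|x_j - x'_j\right|^p\right)^{1/p} \qquad d_2\left(x, x'\right)=\sqrt{\sum\limits_{j=1}^{784}\left(x_j - x'_j\right)^2}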

Execution took about 20 minutes and 53 seconds on my computer's 3.4 GHz i7 processor: quite a bit of time, but perhaps not surprising, as k-NN is sometimes referred to as a "Lazy Learner". In other words, it postpones generalization until the actual query is made. In this case, there are 28,000 such queries and 784 dimensions per neighbor in each query (one dimension per pixel), so that's a lot of computation to delay!
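
To put a rough number on it, the Kaggle training set holds about 42,000 images, and every one of the 28,000 queries compares its 784 pixels against each of them: roughly 28{,}000 \times 42{,}000 \times 784 \approx 9.2 \times 10^{11} elementary distance terms.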

Obviously, the sensible approach in this case would be to find some way to reduce the number of dimensions if possible. One place to start is the amount of information that each pixel encodes in the training examples. Looking around the web, I found one article on this same problem, and so I attempted to follow the author's Exploratory Data Analysis technique by looking at the standard deviation and Coefficient of Variation across pixels within the training examples using the following python code:

[Code screenshot: computing per-pixel descriptive statistics in IPython]

Plotting the standard deviation gave the following result, which matched the original author's work nicely:

[Figure: standard deviation across pixels]
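
For reference, the Coefficient of Variation is just the standard deviation normalized by the mean:

c_v = \frac{\sigma}{\mu}

so pixels whose mean intensity is near zero are about to cause trouble.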

And then calculating CoV:

[Code screenshot: computing the Coefficient of Variation in IPython]

produced the following plot:

[Figure: Coefficient of Variation per pixel, linear scale]

Whoa! That doesn't look right at all. Or does it? It turns out that quite a few of the pixels have means close to zero, which results in rather large values for something that should be a normalized quantity. Let's look at the same data plotted with a logarithmic y axis:

[Figure: Coefficient of Variation per pixel, logarithmic scale]

Well, that's a bit easier to make sense of. Unfortunately, though, it doesn't seem like CoV is a very good statistic for our purposes here because so many of the values are large. I plan to return soon with more results!

2014 reboot

I've been looking for a project to take on in my personal time and have been extremely indecisive about it for a while, but I think I've finally got an idea.

Over the Christmas holidays I spent some time playing a wonderful iOS game called QuestLord which has gotten me interested in iOS/graphics and C++ programming again. So I've dusted off my old C++ vector math library and started hacking things around a bit.

Some of my goals for this project:

1. Extremely fast compile time. Aside from simply working, this is the most important thing to me.

2. Processor and memory efficiency. These are also very important to me from a technical standpoint, except when they conflict with #1. I suspect this might happen when using, say, header files for implementing inline types such as vectors and matrices, but I'm not sure yet.

3. Elegant design of classes and function APIs with minimal code. Also very important, except when conflicting with 1 or 2 above. One possible case I can think of where there might be a conflict with #1 is over-reliance on templates, but I'll just have to see.

4. Achieve mastery of C++11 and refresh knowledge of C++ in general. This is not necessarily the least important thing on my list.

Hmmm.... well that's a lot of technical goals and no goals relating to algorithms or math yet, but I'm sure I'll think of some.

The New World

It is said that when Hernando Cortes landed in the Yucatan with only 600 men to confront the full might of the Aztec empire, he made the radical decision to burn the ships of his fleet, removing any possibility of retreat or return home save by acquiring new ships as spoils of war. It's hard to overstate the motivation one experiences when failure is no longer an option, and this is something I can say I've truly experienced in recent months.

The comfort and security of the familiar is like a warm campfire that lulls us to sleep. Then, later... perhaps much later, like the gunslinger in Stephen King's novels, we awaken to find the campfire has long since died out, we are much older than before, and lost in a strange new world.

But good or bad, change is the agent that restores our awareness. It brings color and vividness to our lives. It challenges us to grow, to become more than what we would be otherwise. As Frank Herbert wrote, "Without change something sleeps inside us and seldom awakens. The sleeper must awaken!" I believe the sleeper he was referring to is the mythical hero inside ourselves. It's the hero that Joseph Campbell wrote about. The hero that calls us on a journey away from the comfortable and familiar.

I feel like my own journey is beginning once again. No doubt there will be many adventures to come. I am truly thankful to all my friends. I hope you know who you are.

With that, I leave you with this song that to me evokes the spirit of the mythical hero.

Schiller - Night Flight

 

How deep is your family tree?

Today, my friends and I got to talking about the age of the human race under both Evolutionary and Biblical timelines. In particular, there was the question of whether or not the roughly 6,000 years of Biblical history allow sufficient time for the human population to grow from 2 to approximately 7 billion. I was curious about this and did some back-of-the-envelope calculations along with some SWAGs (emphasis on the 'WA').

For the population model I assumed that r^n=p, where p is the current population, n is the number of generations since Adam and Eve were banished from the garden, and r is the average number of children per parent who go on to produce children of their own. I assumed that r would be something like 1.5, since people have historically had big families, but things like war, plague, and famine substantially decrease the number of offspring that have children of their own. Typing this into wolfram alpha as log(1.5, 7000000000) results in just under 56 generations, which doesn't seem like that long. If one assumes that the average age of childbirth is 25, that comes to about 1,400 years. More than enough time to reach the current population.
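
Spelled out, the generation count is just a logarithm:

n = \log_{1.5}\left(7\times10^{9}\right) = \frac{\ln\left(7\times10^{9}\right)}{\ln 1.5} \approx 55.9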

What if we take that 25-year generation length and plug it into a model based on the Evolution timeline? According to wikipedia, the current iteration of humanity, also known as homo sapiens sapiens, has been around for about 200,000 years, which would mean about 8,000 generations. Although I haven't had any life sciences education since high school, that seems like a plausible number of generations within which to achieve the ethnic diversity currently present in humanity. Anyway, plugging these numbers back into our model, we want to solve r=p^{\frac{1}{n}}. Typing 7000000000^(1/8000) into wolfram alpha gives about 1.003.
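
Or, written out:

r = \left(7\times10^{9}\right)^{1/8000} \approx 1.0028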

This is an unsettlingly low number for r, but it seems plausible to me. For centuries, life for most people has been, as Thomas Hobbes put it, "nasty, brutish, and short".

I will do my best to make future blog posts for the new year more cheerful :-).

Finally gitting it

I decided to take a day off from work today and turn the 3-day Memorial Day weekend into a 4-day one in order to take care of some housekeeping. One of the things that's been on my to-do list has been to open source the code from my Master's project and thesis from a few years back. It's now available on GitHub under the MIT license. I probably should have done this a long time ago, but at the time, I wasn't sure how, or whether I wanted to continue development of it after graduation. Since I haven't done anything productive with it since, maybe it will be useful to someone as open source, and I figure this way it's also a lot less likely to get lost or forgotten. This article on open source scientific research was also quite persuasive.

I'm looking into releasing a couple more of my home-brew projects on GitHub, namely my 2012 Google AI Challenge entry and also my graphics programming "framework", so stay tuned.

Art and stuff

It's amazing how fast time goes by. I've had a bit of a hiatus and it's time to start blogging again. I wish I had something amazing to show for the past few months, but unfortunately I don't. I am doing hobby programming again, though, and that's a good thing.

I've just recently (as of about 2 weeks ago) started working on an iPhone game using the wonderful cocos2d engine. I've also just started trying to develop some rudimentary artistic skills, which will hopefully come in handy. So I thought I would post some of the drawings I've been working on... not so much for your appreciation, gentle reader (since it's pretty horrible art), but for my own personal record. I've just started working through Bert Dodson's excellent Keys to Drawing, so I'll be posting my projects from the book here, and hopefully there will be some improvement over the course of the next few weeks.

Here are the results from the first two projects:

Yes, those are my feet. No, they really aren't that ugly. Not quite anyway.

Here is the result from the second project (1b):

My hand (sort of), an orchid bloom, and some ivy.

So I've got a ways to go but we all have to start somewhere. One of the things that I've already noticed in these exercises is how much this kind of drawing feels like a meditation. Indeed, Dodson himself makes this Zen-like comment in the first chapter:

We should draw, for the time being at least, as if we know nothing, and were obedient only to what our eye tells us to draw. This is the key to natural, life-like drawing. To understand this is to understand that there is no such thing as knowing how to draw something. One hears, "Can you draw hands -- or horses -- or trees?" The answer is: we do not draw "things" at all, only lines.

Eloquent and inspiring words from a master to a beginner.