Jan 22 2011
You know what this picture tells me?
It tells me that BAC, KEY, MI, RF, SNV, and STI are related. They’re all acronyms, yes, but more than that. They are all banking stock symbols. The node that the arrow points to contains these values. All banks. The nodes on either side contain exclusively real estate holding companies and home builders respectively.
This puts a huge grin on my face.
If you’ve talked to me in the last several months, I’ve probably mentioned at some point that I’ve been learning some machine learning concepts. I’ve been watching and working through the examples in the Stanford machine learning course by Andrew Ng. On the one hand, the course is excellent. Andrew clearly knows his stuff and teaches toward underlying theory and principle. His goals match exactly how I like to approach a new area of study: always asking why, not just how. The concepts are compelling, but on the other hand the math is difficult for me, and the course lacks the kind of “proof is in the pudding” mentality that I’m used to as a programmer. I decided that I also need to approach this topic from the practical side. I’ve just discovered RapidMiner and have been playing with it recently.
My chosen problem set is stock data. I love the uncertainty inherent in the market; it’s a mass of data, action, and reaction. One problem (an easy one to start with) that I’ve always wanted to work on is stock correlation. Simply put, similar stocks move together. If two businesses are similar in industry and size, they will likely move together because they share a similar economic environment. News that affects one is much more likely to affect the other than a third, unrelated company in another industry. This relationship can be coaxed out of the data. For each stock, you should be able to compute a web of close “neighbors” that move similarly, and moving outward from there you may reach another “neighborhood” of related stocks. In machine learning this problem could be approached as a time series problem or as clustering. Since we don’t know the labels (the names of the clusters, so to speak), it’s not classification.
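The “stocks that are similar move together” idea can be made concrete by correlating daily returns rather than raw prices. Here is a minimal sketch in Python with NumPy, using synthetic price series (the `sector` trend and the ticker names are made up for illustration; this is not the actual process I used):

```python
import numpy as np

def daily_returns(prices):
    """Convert a price series to day-over-day fractional returns."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(prices) / prices[:-1]

def correlation(prices_a, prices_b):
    """Pearson correlation between the daily returns of two stocks."""
    ra, rb = daily_returns(prices_a), daily_returns(prices_b)
    return float(np.corrcoef(ra, rb)[0, 1])

# Synthetic data: two "banks" driven by the same sector-wide trend
# plus a little idiosyncratic noise, and one unrelated random walk.
rng = np.random.default_rng(0)
sector = np.cumsum(rng.normal(0, 1, 500))
bank_a = 100 + sector + rng.normal(0, 0.3, 500)
bank_b = 120 + sector + rng.normal(0, 0.3, 500)
other = 80 + np.cumsum(rng.normal(0, 1, 500))

print(correlation(bank_a, bank_b))  # high: the shared trend dominates
print(correlation(bank_a, other))   # near zero: independent walks
```

Returns matter here: two prices can both trend upward and look correlated even when their day-to-day moves are unrelated, so the comparison is done on the return series.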
After many false starts, I grabbed some stock data and loaded it into RapidMiner with its built-in JDBC tools. I then selected stocks from the last two years with no missing data points on days where volume was greater than zero, and fed the set into a hierarchical clustering algorithm. The hierarchical clusterer wraps a simpler, single-level clusterer (k-means) and applies it recursively and in parallel. I also had some promising results with a correlation matrix, which showed, for example, that INTC (Intel) and MU (Micron) are much more closely related than MU and KFT (Kraft).
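The correlation-matrix and hierarchical-clustering steps above can be sketched outside RapidMiner as well. This example uses SciPy’s agglomerative clustering (a different algorithm than RapidMiner’s recursive k-means, but it produces the same kind of neighborhood structure), with a small hand-made correlation matrix whose values are purely illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical correlations for four tickers: two chip makers that
# move together, a food company, and a bank (illustrative numbers).
tickers = ["INTC", "MU", "KFT", "BAC"]
corr = np.array([
    [1.00, 0.85, 0.10, 0.20],
    [0.85, 1.00, 0.05, 0.15],
    [0.10, 0.05, 1.00, 0.25],
    [0.20, 0.15, 0.25, 1.00],
])

# Turn correlation into a distance so highly correlated pairs are "close".
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)

# Agglomerative hierarchical clustering on the condensed distance matrix,
# then cut the tree at a distance threshold to get flat cluster labels.
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
clusters = dict(zip(tickers, labels))
print(clusters)  # INTC and MU end up in the same cluster
```

Cutting the dendrogram at different thresholds gives coarser or finer “neighborhoods,” which is the same knob a hierarchical clusterer exposes through its tree depth.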
It’s great to be able to test out some of the things I’ve been learning in an environment that lets me try a lot of ideas in a relatively short amount of time. Up next: better clustering algorithms, and using the class labels from my clustering to train a model for prediction.