Jan 22 2011

RapidMiner and Machine Learning

Published by jfrank at 11:54 pm under groovy, open source

You know what this picture tells me?

It tells me that BAC, KEY, MI, RF, SNV, and STI are related. They’re all acronyms, yes, but more than that. They are all banking stock symbols. The node that the arrow points to contains these values. All banks. The nodes on either side contain exclusively real estate holding companies and home builders respectively.

This puts a huge grin on my face.

If you’ve talked to me in the last several months, I’ve probably mentioned at some point that I’ve been learning some machine learning concepts. I’ve been watching the lectures and working through the examples in the Stanford machine learning course by Andrew Ng. On the one hand, the course is excellent. Andrew clearly knows his stuff and teaches toward underlying theory and principle. His goals match how I like to approach a new area of study: always asking why, not just how. The concepts are compelling, but on the other hand the math is difficult for me, and the course lacks the kind of “proof is in the pudding” mentality that I’m used to as a programmer. I decided that I also need to approach this topic from the practical side. I’ve just discovered RapidMiner and have been playing with it recently.

My chosen problem set is stock data. I love the uncertainty inherent in the market; it’s a mass of data, action and reaction. One problem (an easy one to start with) that I’ve always wanted to work on is stock correlation. Simply put, stocks that are similar move together. If you have two businesses that are similar in industry and size, they will likely move together because they share a similar economic environment. News that affects one is much more likely to affect the other than a third, unrelated company in another industry. This relationship can be coaxed out of the data. For each stock, you should be able to calculate a web of close “neighbors” that move similarly, and moving out from there you may approach another “neighborhood” of related stocks. In machine learning this problem could be approached as a time series problem or as clustering. Since we don’t know the labels (the names of the clusters, so to speak), it’s not classification.
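To make the “neighbors” idea concrete, here is a minimal sketch in plain Python rather than RapidMiner. It assumes that “moving together” is measured as the Pearson correlation of daily returns, which is one reasonable choice and not the only one. The symbols and prices are made up for illustration, and the function names (`daily_returns`, `pearson`, `nearest_neighbors`) are hypothetical helpers.

```python
from math import sqrt


def daily_returns(prices):
    """Convert a list of closing prices into day-over-day fractional returns."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def nearest_neighbors(closes, symbol, k=2):
    """Return the k symbols whose daily returns correlate most with `symbol`."""
    returns = {s: daily_returns(p) for s, p in closes.items()}
    scores = [
        (pearson(returns[symbol], r), s)
        for s, r in returns.items()
        if s != symbol
    ]
    scores.sort(reverse=True)  # highest correlation first
    return [s for _, s in scores[:k]]


# Made-up closing prices: two "banks" that move together and one unrelated stock.
closes = {
    "BANK_A": [10.0, 10.5, 10.2, 10.8, 10.6, 11.0],
    "BANK_B": [20.0, 20.8, 20.3, 21.4, 21.1, 21.9],
    "FOOD_C": [30.0, 29.9, 30.3, 30.1, 30.4, 30.2],
}

print(nearest_neighbors(closes, "BANK_A", k=1))
```

With this toy data the two bank-like series rise and fall on the same days, so BANK_B comes out as the closest neighbor of BANK_A. Correlating returns rather than raw prices keeps two stocks at very different price levels comparable.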

After many false starts, I grabbed some stock data and loaded it into RM with its built-in JDBC tools. I then selected stocks for the last two years without missing data points on days where volume was greater than zero and pulled the set into a hierarchical clustering algorithm. The hierarchical clusterer uses a simpler, one-level clusterer (k-means) internally and applies it recursively and in parallel. I also had some promising results with a correlation matrix, which showed, for example, that INTC (Intel) and MU (Micron) are related much more closely than MU and KFT (Kraft).
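The idea of building a hierarchy by applying a flat clusterer recursively can be sketched in a few lines of plain Python. This is not RapidMiner’s implementation: it assumes a split into two groups at each level (k = 2), a simple deterministic seeding, and made-up return vectors, and it runs sequentially rather than in parallel. The names `two_means` and `cluster_tree` are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def two_means(points, iterations=20):
    """Split {name: vector} into two groups with plain k-means (k = 2)."""
    names = sorted(points)
    # Deterministic seeding (an assumption): the first point and the point farthest from it.
    c0 = points[names[0]]
    c1 = points[max(names, key=lambda n: dist2(points[n], c0))]
    centers = [c0, c1]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for n in names:
            nearer_first = dist2(points[n], centers[0]) <= dist2(points[n], centers[1])
            groups[0 if nearer_first else 1].append(n)
        new_centers = [
            mean([points[n] for n in g]) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:  # assignments have stopped changing
            break
        centers = new_centers
    return groups


def cluster_tree(points, min_size=2):
    """Recursively split the points, returning nested lists of names."""
    names = sorted(points)
    if len(names) <= min_size:
        return names
    left, right = two_means(points)
    if not left or not right:  # could not split any further
        return names
    return [
        cluster_tree({n: points[n] for n in left}, min_size),
        cluster_tree({n: points[n] for n in right}, min_size),
    ]


# Made-up daily return vectors: two bank-like and two builder-like stocks.
points = {
    "BANK_A": [0.05, -0.03, 0.06],
    "BANK_B": [0.04, -0.02, 0.05],
    "BUILD_C": [-0.01, 0.04, -0.02],
    "BUILD_D": [-0.02, 0.05, -0.01],
}

print(cluster_tree(points))
```

Each leaf of the nested list plays the role of a node in the cluster tree: with this toy data the two bank-like symbols land in one leaf and the two builder-like symbols in the other.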

It’s great to be able to test out some of the things that I’ve been learning about in an environment that lets me try a lot of things in a relatively short amount of time. Up next: better clustering algorithms, and using class labels from my clustering to train a model for prediction.

4 Responses to “RapidMiner and Machine Learning”

  1. Seth on 25 Jan 2011 at 8:29 am

    Wow, this looks really interesting. Definitely something I imagine you’ve invested a lot of time into. What sort of math is used in these algorithms? You’ve got me curious.

  2. jfrank on 25 Jan 2011 at 9:08 am

    Mostly linear algebra, but that wasn’t something I took in school, so I’ve also been going through Gilbert Strang’s course on that topic (from iTunes U). With RapidMiner, though, it is about assembling the data and setting parameters. How well you do that is informed by how well you understand what’s going on underneath. I’m still very much a beginner. There are also bits of programming, of course, which is cool.

  3. Tom Ott on 28 Jan 2011 at 11:04 am

    Kudos to you for finding how powerful RM is. I do quite a bit of stock market modeling with RM. You should check out my tutorial section sometime.

    Good luck!

    Tom

  4. jfrank on 28 Jan 2011 at 12:04 pm

    Thanks Tom,

    I’ve subscribed to your feed. You’ve got some interesting stuff there, and I’m going to check it out for sure.
