Jan 22 2011

RapidMiner and Machine Learning

Published by jfrank under groovy, open source

You know what this picture tells me?

It tells me that BAC, KEY, MI, RF, SNV, and STI are related. They’re all acronyms, yes, but more than that. They are all banking stock symbols. The node that the arrow points to contains these values. All banks. The nodes on either side contain exclusively real estate holding companies and home builders respectively.

This puts a huge grin on my face.

If you’ve talked to me in the last several months, I’ve probably mentioned at some point that I’ve been learning some machine learning concepts. I’ve been watching and working through the examples in the Stanford machine learning course by Andrew Ng. On the one hand, the course is excellent. Andrew clearly knows his stuff and teaches toward underlying theory and principle. His goals are exactly how I like to approach a new area of study: always asking why, not how. The concepts are compelling, but on the other hand the math is difficult for me and the course lacks the kind of “proof is in the pudding” mentality that I’m used to as a programmer. I decided that I also need to approach this topic from the practical side. I’ve just discovered RapidMiner and have been playing with it recently.

My chosen problem set is stock data. I love the uncertainty inherent in the market; it’s a mass of data, action and reaction. One problem (an easy one to start with) that I’ve always wanted to work on is stock correlation. Simply put, stocks that are similar move together. If you have two businesses that are similar in industry and size, they will likely move together because they share a similar economic environment. News that affects one is much more likely to affect the other than a third, unrelated company in another industry. This relationship can be coaxed out of the data. For each stock, you should be able to calculate a web of close “neighbors” that move similarly, and moving out from there you may approach another “neighborhood” of related stocks. In machine learning this problem could be approached as a time series or as clustering. Since we don’t know the labels (names of the clusters, so to speak) it’s not classification.

After many false starts, I grabbed some stock data and loaded it into RM with its built-in JDBC tools. I then selected stocks for the last two years without missing data points on days where volume was greater than zero and pulled the set into a hierarchical clustering algorithm. The hierarchical clusterer uses a simpler one-level clusterer internally (k-means) and applies it recursively and in parallel. I also had some promising results with a correlation matrix, which showed for example that INTC (Intel) and MU (Micron) are related much more closely than MU and KFT (Kraft).
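To make the correlation matrix idea concrete, here is a minimal Python sketch. It is not the RapidMiner process described above: it assumes each stock is just a list of daily closing prices, converts them to day-over-day returns, and computes the Pearson correlation for every pair. The symbols `AAA`, `BBB`, `CCC` and all prices are made up for illustration.

```python
from math import sqrt

def daily_returns(closes):
    """Turn a list of closing prices into day-over-day fractional returns."""
    return [(b - a) / a for a, b in zip(closes, closes[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(prices):
    """Map each (symbol, symbol) pair to the correlation of their daily returns."""
    returns = {sym: daily_returns(closes) for sym, closes in prices.items()}
    return {(a, b): pearson(returns[a], returns[b])
            for a in returns for b in returns}

# Hypothetical closing prices, NOT real market data.
prices = {
    "AAA": [10.0, 10.5, 10.2, 10.8, 11.0],
    "BBB": [20.0, 21.0, 20.4, 21.6, 22.0],  # moves in lockstep with AAA
    "CCC": [30.0, 29.0, 30.5, 29.5, 31.0],  # tends to move the other way
}
matrix = correlation_matrix(prices)
```

Pairs with a coefficient near 1 are candidates for the same “neighborhood”. A clustering step could then use something like `1 - correlation` as a distance, though that choice is an assumption here rather than what the RapidMiner operators do.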

It’s great to be able to test out some of the things that I’ve been learning about in an environment that lets me try a lot of things in a relatively short amount of time. Up next: better clustering algorithms, and using class labels from my clustering to train a model for prediction.

4 responses so far

Aug 30 2010

Keeping Dry

Published by jfrank under coldfusion, open source

Did you ever wonder what is shared between all those CF frameworks? I put together a copy paste detector implementation of the interface for cfm and cfc files, as part of the PMD project. I also plugged it into hudson via the DRY visualizer of the output, but that is just candy on top.

Basically it works like this:

  1. It goes through your code line by line, character by character and builds up overlapping hashes of the content.
  2. Based on your tokenizer, it will ignore certain parts of the code (like whitespace), so the hashes can handle your 2 tabs vs my 3 spaces and still show that two sections of code are ‘the same’. The tokenizer I am using is the ‘anyTokenizer’, not optimized for CF at all.
  3. Based on a threshold of how big a token range the hashes should cover, it builds a report of all the code you give it and outputs it in a structured format.
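The three steps above can be sketched in a few lines of Python. This is a simplified illustration of the idea, not PMD’s actual implementation: the whitespace-only `tokenize` function, the `find_duplicates` name and the sample files are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(source):
    """Split source into non-whitespace tokens, so indentation differences vanish."""
    return re.findall(r"\S+", source)

def find_duplicates(files, min_tokens):
    """Group (filename, token_index) locations whose next min_tokens tokens match."""
    windows = defaultdict(list)
    for name, source in files.items():
        tokens = tokenize(source)
        # Slide an overlapping window across the token stream.
        for i in range(len(tokens) - min_tokens + 1):
            # The dict hashes the tuple for us; keying on the tuple itself
            # avoids reporting false matches on hash collisions.
            windows[tuple(tokens[i:i + min_tokens])].append((name, i))
    return [locations for locations in windows.values() if len(locations) > 1]

# Two hypothetical files that differ only in indentation and in their last line.
files = {
    "a.cfm": "<cfset x = 1>\n\t\t<cfset y = 2>\n<cfoutput>#x#</cfoutput>",
    "b.cfm": "<cfset x = 1>\n   <cfset y = 2>\n<cfdump var=\"#y#\">",
}
duplicates = find_duplicates(files, min_tokens=8)
```

With a threshold of 8 tokens, the two `cfset` lines are reported as a single duplicated run shared by both files. A real detector would also merge adjacent matching windows into one longer match and map token positions back to line numbers.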

Now presenting, all of the code duplication between the following ColdFusion projects, run with a token threshold of 200:

  • ColdBricks
  • Coldbox
  • ModelGlue
  • coldmock
  • mura
  • MangoBlog
  • coldspring
  • mxunit
  • machii
  • farcry
  • fusebox5

Well… I’m not going to put them all here. It’s a large set.

Here are a few samples:

Farcry and mura both use cfformprotect. Cool.

51,515: <file line="288" path="/home/jfrank/temp/codeprojects/farcry/core/webtop/cffp/cfformprotect/cffpVerify.cfc"/>
51,516: <file line="287" path="/home/jfrank/temp/codeprojects/mura-5.2.2709/www/requirements/cfformprotect/cffpVerify.cfc"/>

BlogCFC and MangoBlog both share xmlrpc bits. Nice.

53,956: <file line="141" path="/home/jfrank/temp/codeprojects/BlogCFC5/client/xmlrpc/xmlrpc.cfc"/>
53,957: <file line="122" path="/home/jfrank/temp/codeprojects/MangoBlog_1.5/api/xmlrpc.cfc"/>

The full results are not quite valid xml, because of encrypted cfms. They also contain a lot of boilerplate licenses as you can imagine. It would take a minor amount of cleanup of this output to make it parseable, and removing the licenses would make it much more compact. Also some of the duplication shown may be intentional due to code generation/plugin dependency domains.

Want to run it yourself?

Get the current PMD jar and grab my cfm-cpd.jar and this build file if you’d like to run it with ant.

Point Ant at a lib directory with those two jars, and run the build against your own code. Don’t forget the Dry Plugin if you run Hudson!

(Note, the cfm-cpd jar contains a couple overlapping classes with PMD, and relies on the fact that jars are loaded alphabetically and so it will win. This is lame, but so is PMD for making me hard code things in the ant task!)

2 responses so far

Feb 12 2010

Nginx Rocks at Proxying

Published by jfrank under internet, open source

I was pushing out the magnolia-railo-demo and was looking at the site performance. It was amazing by itself, sending back magnolia-railo built pages in 100ms including wire time from Portland to Dallas raw from Tomcat. On this box I have one IP though and I wanted to use a subdomain for the demo which means proxying vhosts.

When I proxied it through Apache’s httpd it took about twice as long for the page to load! (roughly 130-150ms for the base page) It seemed totally ridiculous… Now in all fairness I’m no Apache expert, but I decided to drop in nginx as a proxy on the front end and its cost as a proxy is nearly unnoticeable. I wouldn’t be able to tell it apart from hitting tomcat directly as far as speed is concerned.

I’ll have to give this a closer look for my other projects…

One slightly tricky thing is that if you need Apache’s ProxyPreserveHost in Nginx you need to use proxy_set_header Host like this:

location / {
   root   /usr/share/nginx/html;
   index  index.html index.htm;
   # pass the original Host header through to the proxied server
   proxy_set_header Host $host;
}

Sometimes I wonder what it will be like when apache’s gone. I will probably find myself cding into /etc/httpd/logs and cat access_log over and over… just to remember old times.

No responses yet

Feb 11 2010

Magnolia and Railo Part 2 Redux

Published by jfrank under coldfusion, magnolia, open source

Here is the example app live from Magnolia and Railo Part 2. I fixed something in the config that was blocking import as well… so if anyone tried it let me know.


If anyone is interested in logging in, let me know and I will give you some credentials to log in and poke around.

For some reason, this is performing crazy fast right now. On the order of 100 ms, cross-country network lag included, on an extremely underpowered Rackspace cloud machine (256 megs of ram, running mysql, php, httpd, nginx and tomcat for this app).

Turn on your firebug or chrome inspector and compare it to your CMS…

I defy any other pure Railo/CFML CMS to top that for a single uncached hit with three custom cf paragraphs…

No responses yet

Feb 08 2010

Magnolia and Railo Part 2: Templating and Paragraphs

Published by jfrank under coldfusion, magnolia, open source

This is part two in a three part series on how to make Magnolia CMS and Railo work together beautifully.

If you missed part one you’ll need to have read it and done all steps to get anything out of this.

Step One: Wipe out your repositories folder

We’re starting over with content, so at this point I am assuming you don’t have anything in your website tree that is not replaceable. If you have something you want to keep, export it first. The folder is /repositories in the webroot by default, and it is where magnolia has stored its derby database. Delete it.

Step Two: Get the magnolia-railo-sample files

Export some sample files and configuration with subversion.

svn export https://www.joshuafrankamp.com/svn/incubator/magnolia-railo-sample/trunk/ tempWebroot

Step Three: Merge the sample files

Copy the contents of tempWebroot into your webroot. This will overwrite the two files we wrote last time, and add many more.

Step Four: Start up your container

Bootstrapping of the app will occur. While it does, let’s go over some of the files that I have prepared for you in this sample.

  1. Adding a template renderer
    1. config.modules.templating.template-renderers.cfm.xml
    2. This is required to bind the cfm type to a backing class. We will use the JSP renderer, because it is a dispatch model that is native to Railo since it is a web app.
  2. Creating a template definition node, and backing cfm
    1. config.modules.templating.templates.sampleTemplate.xml
    2. template.cfm with some assets/ that it needs to render properly
    3. This is a specific template instance, that in a normal magnolia application would represent one of many template choices. For this example, we will create a simple template.
  3. Creating a cf include paragraph definition, dialog, and backing cfm
    1. config.modules.templating.paragraphs.cf.xml
    2. config.modules.templating.dialogs.cf.xml
    3. cfinclude.cfm
  4. Creating an example content style “What’s New” paragraph, dialog, and backing cfm
    1. config.modules.templating.paragraphs.whatsNew.xml
    2. config.modules.templating.dialogs.whatsNew.xml
    3. whatsnew.cfm
  5. Creating a page dialog, this will be an editor for page level properties
    1. config.modules.templating.dialogs.page-properties.xml
  6. Example Page
    1. website.index.xml
    2. This page uses the sampleTemplate, and pulls in the example paragraph types in paragraph instances that show them off.  It also allows access to the page properties dialog.
  7. Magnolia Tag Libraries
    1. cms-taglib.tld
    2. These are typically in the magnolia jars, but we need them out where CF can get ahold of them.

Step Five: Hit /index

Magnolia will render the node at the path /index. Since this node is configured in the magnolia administrator to use the sampleTemplate template, it will use the definition of that template to render the page. The templatePath for sampleTemplate is /template.cfm, which is a traditional CFML file on disk. At this point the standard Railo request lifecycle will start, including Application.cf(c|m). Then the file will execute, with one significant difference: the magnolia context has been set up, and some request scope variables have already been set for easy access to magnolia state information about the page.

A few notes about parts of the example template.cfm

  • The tag libraries give simple access to magnolia controls and data structures. These allow you to render magnolia chrome inline in CFML. Example:
    <cfimport taglib="/tags/cms" prefix="cmsmod" />
    <cfimport taglib="/META-INF/cms-taglib.tld" prefix="cms" />
  • There are three ways to read information from the content repository: the tag library, request scope context objects, and instantiation of singleton magnolia api objects. They all get the information from the same underlying jcr, but with varying levels of complexity and power. Example:
    <cms:setNode var="page" />...

    This pulls the properties from the current content node into a Java Map (struct) whose keys and values give you access to the properties. You can also get at the same information through request['state'], which you can dump out and traverse to find all kinds of wonderful things in the api.

  • Paragraphs are at the heart of everything in Magnolia. The single reason for having templates is the content you want to put on them. By default, Magnolia dispatches requests for paragraphs to paragraph handlers, which in turn, for Java technologies, dispatch them as includes through the web app container. This is the point of failure for a technology like Railo, because each request in Railo assumes that it is the beginning and end of that request. It tries to read multipart data on the inbound request, and it tries to set headers and response sizes on the outbound response. That is problematic for paragraphs and was a big sticking point in the integration between Railo and Magnolia, but it is not a problem anymore!
    Fortunately, Railo includes its own custom tag concept, and it is simple to implement a way around the problem: replace the container includes with calls to a reimplementation of <cms:includeTemplate /> (contrary to its name, this is the tag responsible for including paragraphs, not templates). Look for <cmsmod:includeTemplate /> custom tag calls, in a different namespace. That is the custom implementation of the <cms:includeTemplate /> that would have required going outside the current Railo request. Example:

    <div id="center">
    	<cms:contentNodeIterator contentNodeCollectionName="main">
    		<cmsmod:includeTemplate />
    	</cms:contentNodeIterator>
    	<cms:newBar contentNodeCollectionName="main" paragraph="cf,whatsNew" contentnodename="mgnlNew"/>
    	<hr />
    </div>

4 responses so far

Feb 07 2010

live blogging the superbowl

Published by jfrank under food

7:00 Etrade’s babies are creepy and not cool anymore. chunky.com is for soup.

6:48 Gold wins based on valor.

6:46 Imminent gatorade splashage

6:40 Blue leader just can’t do it! Gold holds them off right at the target!

6:38 Godaddy proves again that domains make women hot. Chickens scream again.

6:36 Blue leader is trying again. He is serious this time. Throws all the way down the green.

6:34 Announcer says “When moments are critical in a game, you go with what you do” Stupid men will even sit through a book club to get Bud Lite.

6:32 Doritos are worth killing for.

6:30 Gold took Blue leader’s ball and ran it all the way down the field!!!

6:26 Green dystopia prosecutes the non green for using plastic and incandescent bulbs. Unless you own a green car! Barkley raps taco bell.

6:24 Blue leader has angry eyes!!!

6:22 Small digital squirrels love cars. Chickens scream like people, so we are supposed to want to eat them. So unappetizing.

6:20 Gold has a challenge about some kind of tricky play. They win it and for cunning get 2 points.

6:16 There is a hell emulator as a game. Bulls and horses are friends.

6:12 Gold scores again. There is much valor!

6:10 Gold ruins Blue’s day, and now they are pushing back the other way.

6:06 Emerald nuts are crazy, producing unbelievable feats of athletics.

6:02 Blue is moving down the green.

5:58 Comcast wants you to know how much HD they have, their HD is way WAY bigger than satellite’s

5:56 CBS wants you to watch the ads more

5:54 Roundup is creepy. It will kill things even after it’s done killing things.

5:48 A bad tape job can ruin a man.

5:44 A man sleepwalks through Africa facing dangers but eventually gets a Coke. Babies are creepy.

5:42 A man with hearing loss would trade his wife for his awesome tires.

5:40 Men are really good exercisers and they drink michelob ultra

5:38 The Blue leader is so upset that he is driving down the field for revenge! And he gets it!

5:30 Girls post pictures of themselves on the internet, and men everywhere get in trouble for looking at them.

5:28 Gold ones score a quick win.

5:23 Gold ones did some tricky move on the kick off. They got their own ball.

5:22 JayZ wins the superbowl… ? At least he has the trophy, and he runs this town. Good luck with that one. It is a full on brawl on the field.

5:20 Ford vehicles are literally made of water, and are self assembling.

5:16 Gold has 6 and Blue has 10. tv.com wants to be hulu. Toyota is now safe and trusts you.

5:15 Lots of new tv shows. “From the guy who brought you all those boring CSI knockoffs”.

5:02 The half time band is a bunch of old guys, who are reminding the world about an archaic game called pinball! A more timely message could not be found for the United States. But it turns out it is a medley.

5:00 Halftime report is awesome. Five guys with stick mics yelling incoherently at each other. It is almost as if if they had head mics they wouldn’t know what to do with their hands.

4:56 30 minutes of play, in merely two hours!

4:52 Fred Meyer’s has the single worst commercials… ever. (Billboards too)

4:49 Robots can’t handle derisive opinions. Intel has robot employees.

4:47 Women eviscerate men, and remove their spine. But if men have a portable tv, they regain their spine.

4:46 Harry Freaking Potter is a real place.

4:42 Little people commercials back to back. Little people 2010!

4:38 They are putting one of those crime dramas in space. CSI Moon will be next.

4:36 Men have a terrible life because of the things they put up with and do for their women. But they use it to negotiate with their spouses to buy an expensive car.

4:32 Gold does a good job getting down the green.

4:26 Women have good ideas, but men find the beer and everyone loves them better because they are easy going. If you succeed at being a man you can use some dove specialty soaps.

4:23 Back to back pantsless commercials. Pants are out in 2010!

4:22 A dude has a message for women. He thinks they are important if they watch football. And wants them to know their heart attacks are different.

4:18 Nothing rallies a town like beer. People will form a human bridge to get the truck through!

4:14 Gold makes some serious forward motion with much valor. Perhaps points are in order?

4:09 A good bachelor party requires an endangered species (in this case a killer whale).

4:06 Bud Lite makes you sound like TPain, women get the groceries and you just have to party!

4:04 Domain names excite women!

4:03 We should be kind to mean rich people when they lose their money.

4:02 It’s not looking good for gold.

3:58 Somewhere in the past the blue ones were awarded three points for cunning and valor.

3:54 The blue ones have many tricky moves.

3:50 Slapping people on the back of the head is so cool.

3:46 Boost Mobile scares the !!!!! out of me. Dogs hurt people.

3:44 Focus on the Family hurts women. But they are really ok!?

3:42 The oldest man ever just kicked the ball.  Snickers hurts old people.

3:38 The blue ones seem pretty good at throwing, while the gold ones scurry about.

3:34 The gold ones had the ball, but now the blue ones have it.

3:31 Broadcast is also in Spanish.

3:29 The Hyundai Sonata is less than 20k. You can quote them on that.

3:36 Modern NBA players don’t know who Larry Bird is.

3:34 Wife says snarky comments should go to the Internet not her. Rogaine is gross.

3:21 PST Lots of sound problems so far. Queen Latifah had to take out her ear monitor. The intro videos for each team had a lot of mumbling.

One response so far

Jan 19 2010

Magnolia and Railo, Together at Last

Published by jfrank under coldfusion, magnolia, open source

Railo and Magnolia CMS are finally together. It’s been a bit of a bumpy road, what with issues arising immediately when they first met. A casual observer might think the relationship was doomed. But a funny thing happened over the course of the last year: Railo changed for the better, and for that matter so did Magnolia. I’m proud to say they’ve even moved in to the same building at JBoss (in a manner of speaking). From what I hear the landlord thinks the world of both of them, but they have still led separate lives.

Until now.

This will be the first in a series of magnolia-railo merged app tutorials. Part one will walk through the merging of the two web apps into one. Part two shows how to enable Railo templating and explore the interaction between Magnolia and Railo. Part three is about the dark side, potential problems and workarounds.

Before we dive in, I will address the question “why should these two technologies be merged?” It is simple really. Magnolia is a superior CMS to anything offered in the CFML world, and has amazing interoperability with other technologies. Railo on the other hand, blows the pants off any of the native templating options available in Magnolia and its rapid and powerful tag based language easily unlocks the power of Magnolia’s Jackrabbit JCR. In short, they were made for each other.

Let’s put these two together:

  1. Download the latest railo custom all os war, extract to /merge-target. I’ll refer to this directory as the location of our new merged app, however you can name it what you wish.
  2. Delete these files from /merge-target/WEB-INF/lib. They overlap with magnolia’s jars.
    1. apache-jakarta-commons-codec.jar
    2. apache-jakarta-commons-collections.jar
    3. apache-jakarta-commons-fileupload.jar
    4. apache-jakarta-commons-httpclient.jar
    5. apache-jakarta-commons-io.jar
    6. apache-jakarta-commons-lang.jar
    7. apache-jakarta-oro.jar
    8. apache-lucene.jar
    9. backport-util-concurrent.jar
    10. concurrent.jar
    11. PDFBox.jar
  3. Download the latest railo patch. Put it in /merge-target/WEB-INF/lib/railo-server/patches/
  4. Download the Magnolia Community Edition war and extract it to /magnolia-temp
  5. Copy /magnolia-temp/WEB-INF/lib/* to /merge-target/WEB-INF/lib/* There should be no file name conflicts.
  6. Copy /magnolia-temp/WEB-INF/config to /merge-target/WEB-INF/config
  7. Merge web.xmls. Use Magnolia’s xml header with namespaces! It doesn’t like anything else, because it reads it in at runtime. Here is the one I created:
    <?xml version="1.0" encoding="UTF-8"?>
    <web-app xmlns="http://java.sun.com/xml/ns/j2ee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd" version="2.4">
        <display-name>Magnolia global filters</display-name>
    		<description>CFML runtime Engine</description>
    	      <description>Configuraton directory</description>
    		<!-- init-param>
    	      <description>directory where railo root directory is stored</description>
    	    </init-param -->
    		<description>AMF Servlet for flash remoting</description>
    		<description>File Servlet for simple files</description>
  8. Create file /merge-target/WEB-INF/bootstrap/common/config.server.filters.bypasses.railo.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <sv:node sv:name="railo" xmlns:sv="http://www.jcp.org/jcr/sv/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <sv:property sv:name="jcr:primaryType" sv:type="Name">
      <sv:property sv:name="jcr:uuid" sv:type="String">
      <sv:property sv:name="class" sv:type="String">
      <sv:property sv:name="pattern" sv:type="String">
      <sv:node sv:name="MetaData">
        <sv:property sv:name="jcr:primaryType" sv:type="Name">
        <sv:property sv:name="mgnl:activated" sv:type="Boolean">
        <sv:property sv:name="mgnl:activatorid" sv:type="String">
        <sv:property sv:name="mgnl:authorid" sv:type="String">
        <sv:property sv:name="mgnl:creationdate" sv:type="Date">
        <sv:property sv:name="mgnl:lastaction" sv:type="Date">
        <sv:property sv:name="mgnl:lastmodified" sv:type="Date">
  9. Set magnolia.update.auto=true in /WEB-INF/config/default/magnolia.properties
  10. Start your container (tomcat, for example).
  11. Go to /.magnolia and log in with superuser/superuser
  12. Navigate to Configuration, then server->filters->bypasses. You will see that the railo node appears among the configured bypasses. If you inspect it you will see it is a uriStartsWithVoter that is configured with /railo-context. This node bypasses the magnolia filters for any url starting with /railo-context.
  13. Because of this you can hit /railo-context/admin/index.cfm and setup railo.

The bypass rule is required because Magnolia controls the url space with its chain of filters. You can right click on it and copy it to make another arbitrary bypass that is applied immediately. The filters are configurable and this gives you flexibility in fronting CFML apps with arbitrary urls configurable at runtime. At this point you can build anything you want using Magnolia’s built in templating. The bypasses allow you to run standalone CFML apps in the same JVM, under specified url paths (/myapp/*).
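As a rough illustration of what a starts-with bypass amounts to, here is a small Python sketch. It is not Magnolia’s implementation: `should_bypass` and the prefix list are hypothetical names, and the only behavior taken from the text above is that URLs starting with a configured prefix skip the filter chain.

```python
def should_bypass(uri, bypass_prefixes):
    """Return True when the uri starts with any configured bypass prefix."""
    # Plain prefix match: "/railo-context-foo" would also match "/railo-context".
    return any(uri.startswith(prefix) for prefix in bypass_prefixes)

# Hypothetical configuration: the railo bypass plus one copied for a CFML app.
bypasses = ["/railo-context", "/myapp/"]

admin_hit = should_bypass("/railo-context/admin/index.cfm", bypasses)
app_hit = should_bypass("/myapp/index.cfm", bypasses)
cms_hit = should_bypass("/index", bypasses)
```

Requests for which the check is true go straight to the CFML engine, while everything else (such as /index) continues through Magnolia’s filters.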

That is it for part one. The next article highlights how your templates and paragraphs can be built in standard CFML, and the interaction between Magnolia and Railo.

One response so far

Aug 06 2009

earl grey ice cream

Published by jfrank under food

Earl Grey Ice Cream

  • larger ziploc bag (1 gallon or 1 quart)
  • smaller ziploc bag (1 pint or 1 quart)
  • ice
  • salt (rock preferred)
  • 2 earl grey tea bags
  • 1/2 cup water
  • 1/4 cup whipping cream
  • 1/4 cup whole milk
  • 1 tablespoon sugar

Heat the water to boiling, pull it off the heat and put in the tea bags. Let it steep for a few minutes, pull out the tea bags. Squeeze the teabags and attempt to get all the flavour out into the liquid, it should be dark brown.

Put the concentrated tea, sugar, cream and milk in to the smaller bag and seal it.

Pack the larger bag 1/3 full of ice and salt, place the smaller sealed bag into the larger bag, place more ice and salt on top. The goal is to completely surround the smaller bag with ice.

Shake for about 10 minutes or until inner bag is ice cream.

Eat the best ice cream in the world!

One response so far

Jul 07 2009

svnant 1.3

Published by jfrank under svnant

You’ve been waiting for it! You’ve been porting to other solutions! But wait! here it is!

svnant 1.3 with subversion 1.6.X support.

Tigris happens to be doing maintenance now but it will be up there soon too.

9 responses so far

Mar 18 2009

how to count connections on windows command line

Published by jfrank under Uncategorized

At work I am troubleshooting an apache/tomcat/coldfusion/magnolia stack on windows. It is the most fun thing you can imagine. I was messing with apache’s configuration of mod_proxy_balancer and I needed a way to test if my settings were having the desired effect.

I ran into a *nix shell script that did this, but not one for windows. This is a super simple way to count connections against your localhost’s port 80 web server. It shows established and waiting connections.

Save this as a .bat file and it will output them. It could be optimized in a number of ways, but who cares!

@echo off
rem Count TCP connections that involve port 80, by state
echo port 80:
netstat -a -n | find "TCP " | find ":80 " | find /C "ESTAB"
netstat -a -n | find "TCP " | find ":80 " | find /C "TIME_WAIT"
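For anyone who wants the counts broken down by every state at once, the same idea can be sketched in Python. This is only an illustration that parses netstat-style text passed in as a string; it assumes the usual Windows column layout (Proto, Local Address, Foreign Address, State), and the `count_states` name and sample lines are made up.

```python
from collections import Counter

def count_states(netstat_output, port=80):
    """Count TCP connection states for lines whose local or remote address uses port."""
    counts = Counter()
    suffix = ":%d" % port
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: Proto, Local Address, Foreign Address, State
        if len(fields) >= 4 and fields[0] == "TCP":
            if fields[1].endswith(suffix) or fields[2].endswith(suffix):
                counts[fields[3]] += 1
    return counts

# Made-up sample in the shape of `netstat -a -n` output.
sample = """
  TCP    0.0.0.0:80         0.0.0.0:0          LISTENING
  TCP    10.0.0.5:80        10.0.0.9:51512     ESTABLISHED
  TCP    10.0.0.5:80        10.0.0.9:51513     TIME_WAIT
  TCP    10.0.0.5:80        10.0.0.7:40222     ESTABLISHED
  TCP    10.0.0.5:8080      10.0.0.7:40223     ESTABLISHED
  UDP    0.0.0.0:53         *:*
"""
states = count_states(sample)
```

Matching on the `:80` suffix of the address column keeps port 8080 out of the count, which a plain substring search for `80` would not.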

No responses yet
