Apr 12, 2008
I found an interesting site called MaxMind while looking for a database of country information.
They have a simple ISO 3166 country list,
http://www.maxmind.com/app/iso3166
and a database of world cities,
http://www.maxmind.com/app/worldcities
both free.
Feb 11, 2008
I have been tasked with upgrading Google Search Appliances, and while doing so I wanted to calculate some statistics.
Comparing Crawled Pages Across GSAs (or Minis)
You can use this to compare any two URL XML files exported from Crawl Diagnostics -> Export All Pages To a File.
I could have called them A and B, but I was comparing a Mini and a GSA at the time, hence the names in the script. It uses simple Python sets to see what was crawled on one machine but not the other. It expects two local files, mini-urls.xml and gsa-urls.xml; either could be a GSA or Mini export.
import xml.dom.minidom

def main():
    miniUrlSet = extractUrls(xml.dom.minidom.parse('mini-urls.xml'))
    gsaUrlSet = extractUrls(xml.dom.minidom.parse('gsa-urls.xml'))
    print 'mini', len(miniUrlSet)
    print 'gsa', len(gsaUrlSet)
    print 'intersections', len(miniUrlSet & gsaUrlSet)
    print 'mini is sub of gsa?', miniUrlSet <= gsaUrlSet
    gsaNotMini = gsaUrlSet - miniUrlSet
    print 'things in gsa but not mini:', len(gsaNotMini)
    for i in gsaNotMini:
        print i
    miniNotGsa = miniUrlSet - gsaUrlSet
    print 'things in mini but not gsa', len(miniNotGsa)
    for i in miniNotGsa:
        print i

def extractUrls(dom):
    nodelist = dom.getElementsByTagName("loc")
    urls = set()
    for node in nodelist:
        # I know all loc nodes have a single child text node; text nodes have a data property
        urls.add(node.firstChild.data)
    return urls

main()
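For anyone on a newer Python, here is a minimal sketch of the same set comparison in Python 3. It parses two small in-memory samples instead of files so it can run anywhere. The sample XML (including the `urlset` and `url` wrapper elements), the example.com URLs, and the names `extract_urls` and `compare` are all assumptions for illustration; only the `loc` element comes from the script above.

```python
import xml.dom.minidom

# Hypothetical sample exports; real files come from Crawl Diagnostics.
# The <urlset>/<url> wrappers are an assumption, only <loc> is relied on.
MINI_XML = """<urlset>
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""

GSA_XML = """<urlset>
  <url><loc>http://example.com/b</loc></url>
  <url><loc>http://example.com/c</loc></url>
</urlset>"""


def extract_urls(xml_text):
    """Return the set of text values found in <loc> elements."""
    dom = xml.dom.minidom.parseString(xml_text)
    urls = set()
    for node in dom.getElementsByTagName("loc"):
        # Skip empty <loc/> elements, which have no child text node.
        if node.firstChild is not None:
            urls.add(node.firstChild.data.strip())
    return urls


def compare(a, b):
    """Summarize how two URL sets overlap."""
    return {
        "both": a & b,
        "only_a": a - b,
        "only_b": b - a,
        "a_subset_of_b": a <= b,
    }


mini_urls = extract_urls(MINI_XML)
gsa_urls = extract_urls(GSA_XML)
report = compare(mini_urls, gsa_urls)

print("in both:", sorted(report["both"]))
print("mini only:", sorted(report["only_a"]))
print("gsa only:", sorted(report["only_b"]))
print("mini is subset of gsa?", report["a_subset_of_b"])
```

To use it on real exports, swapping `parseString` for `xml.dom.minidom.parse` with a file name should be enough.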
Calculating Search Keyword Density Over Time
This runs against an export from the Search Logs feature under Status and Reports. Export the timeframe you want to analyze, then run the script against the log file. You get the top 100 keywords that people searched for and a count of how many times each was searched. It expects a local file named log.log.
import re
from datetime import datetime
from operator import itemgetter

def getQueryCounts(f):
    words = {}
    # Captures the value of the q parameter; note the pattern only matches
    # when q is preceded and followed by '&' in the logged URL.
    qReg = re.compile('.*?&q=(.*?)&')
    for l in f:
        keyword = qReg.findall(l)
        # Skip lines with no q parameter or an empty query
        if keyword and keyword[0]:
            words[keyword[0]] = words.get(keyword[0], 0) + 1
    return words

start = datetime.now()
f = open('log.log')
words = getQueryCounts(f)
f.close()
# Sort by count, highest first, and keep the top 100
top = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:100]
print 'Top Words'
print '---------'
for word, num in top:
    print word, num
print 'runtime:', datetime.now() - start
raw_input("press enter")
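The same counting can be sketched in Python 3 with `collections.Counter`. This is a rough variant, not a drop-in replacement: it also accepts `q` as the first or last parameter and URL-decodes the value, so `vacation+policy` and `vacation%20policy` count as one keyword. The sample log lines and the names `Q_PARAM` and `count_queries` are made up for illustration; the real search log format may differ.

```python
import re
from collections import Counter
from urllib.parse import unquote_plus

# Matches q=... whether q is the first, a middle, or the last parameter.
Q_PARAM = re.compile(r'[?&]q=([^&\s"]*)')

# Hypothetical log lines; the actual export format is an assumption.
SAMPLE_LOG = [
    'GET /search?site=default&q=vacation+policy&client=default HTTP/1.1',
    'GET /search?q=payroll HTTP/1.1',
    'GET /search?site=default&q=vacation%20policy HTTP/1.1',
    'GET /search?site=default&q=&client=default HTTP/1.1',
    'GET /cluster?coutput=json HTTP/1.1',
]


def count_queries(lines):
    """Count decoded q= values, skipping lines with no or empty query."""
    counts = Counter()
    for line in lines:
        match = Q_PARAM.search(line)
        if match and match.group(1):
            counts[unquote_plus(match.group(1))] += 1
    return counts


counts = count_queries(SAMPLE_LOG)
top = counts.most_common(100)

print('Top Words')
print('---------')
for word, num in top:
    print(word, num)
```

For a real log, passing an open file object to `count_queries` works the same way, since it only iterates over lines.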
Dec 20, 2007
I finally got shared folders up and running on my virtual Fedora box. This required a little kernel and kernel-headers upgrading, plus compiling the VMware Tools for my box, but it works like a charm. It even gives me cut and paste with the Windows XP desktop, which is… cool, I guess.
I decided to go with Python, which has a plethora of tools. Pylons is a piecemeal web framework that is closest to my liking, and migrate is a library for schema migration that works nicely with SQLAlchemy, a monster ORM.
SQLAlchemy is cool because you can use parts of it totally independently. Coming from a CF background, I am used to having nice named, pooled connections that I don’t have to think about. The base layer of SQLAlchemy is exactly that: database abstraction and connection pooling. Beyond that you are free to go crazy with ORMish things or not; it’s up to you.
It is so reusable that many people have written layers on top of it for even more magical coding… but it’s nice to have all the options.
Migrate, a RoR knockoff, is the real find though. It looks young (as projects go), but I watched a demo of it used in another Python framework and it was exactly what I expected, like something we use at work for CF. It has a schema version table that holds the application’s current schema version, and version files with ‘up/down’ methods. My main issue with many of these ‘scafolmagic’ things is that no one bothered to mention how you get from one version to the next… or back again. You can’t build the model right the first time, and iterative programming is a fact of life. This library addresses that.
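To make the version table and up/down idea concrete, here is a minimal sketch using only Python 3's built-in `sqlite3`. It is not migrate's API; the `MIGRATIONS` list, the `schema_version` table name, the example tables, and the helper functions are all hypothetical, and it assumes each step can be written as a single SQL statement.

```python
import sqlite3

# Each hypothetical migration is (version, up_sql, down_sql).
MIGRATIONS = [
    (1, "CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)",
        "DROP TABLE post"),
    (2, "CREATE TABLE comment (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT)",
        "DROP TABLE comment"),
]


def current_version(conn):
    """Read the stored schema version, creating the table at version 0."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT version FROM schema_version").fetchone()
    if row is None:
        conn.execute("INSERT INTO schema_version (version) VALUES (0)")
        return 0
    return row[0]


def migrate_to(conn, target):
    """Step up or down one migration at a time until target is reached."""
    if not 0 <= target <= len(MIGRATIONS):
        raise ValueError("unknown schema version: %r" % target)
    version = current_version(conn)
    while version != target:
        if version < target:
            number, up_sql, _ = MIGRATIONS[version]
            conn.execute(up_sql)
            version = number
        else:
            number, _, down_sql = MIGRATIONS[version - 1]
            conn.execute(down_sql)
            version = number - 1
        conn.execute("UPDATE schema_version SET version = ?", (version,))
    conn.commit()
    return version


def table_names(conn):
    """List the tables currently in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
print(migrate_to(conn, 2), table_names(conn))  # upgrade to the latest schema
print(migrate_to(conn, 1), table_names(conn))  # roll one step back
```

The point is the same one the library makes: the database records which version it is at, so moving forward or back is just a matter of replaying the steps in between.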
Dec 19, 2007
Cookie name “CFAUTHORIZATION_SPLAT SPLAT” is a reserved token
The error occurred in administrator.cfc: line 116
Today I ran into a weird bug in the CF admin API. If you attempt to perform a login such as this:
<cfscript>
loggedin = createObject("component","cfide.adminapi.administrator").login('dsafdsafsad');
</cfscript>
It will bomb with the above error if your application name contains a space. The error is slightly different depending on whether you use Application.cfc or <cfapplication> style.
The fix, thanks to Barney, is to remove the space.
Dec 15, 2007
I am running a virtual Linux server via the free VMware Player on my development box, which runs Windows XP. I found several groups who produce free stock distributions packaged in virtual machine format, which means I can run literally the same stack locally as on my real server.
I haven’t gotten there quite yet, but I intend to use Eclipse on Windows, mapped into the virtual box via shared folders.
Because I have used Subversion up to this point, syncing the two servers’ configurations comes down to a few simple tasks.
Next up… Python or Ruby.
Dec 07, 2007
Hi everyone who isn’t there.
I have really enjoyed setting up my new server, and my first project was to get a blog up and running under my domain name. I have had this domain for years but never got around to developing anything on it. So here it is. It didn’t take me long to get it set up, and I version controlled the whole thing as I went. So if someone were to wipe out my server right now, I would still be able to regenerate this…
Er, I take it back.
I haven’t set up backups yet… and although what I said was true, because I have a working copy of the blog, I would lose this post because I don’t have MySQL backing up yet. What I meant was that I could regenerate this blog in a couple of commands.
Anyway, the first few posts are going to be about server setup and me learning Linux.