Nov 28, 2008
Python packaging is a pain in the ass. There are some tools to make it easy, so easy in fact that it becomes even worse…
easy_install is the easiest thing since sliced bread. What does it do? Everything. It’s so magic it probably installs itself recursively just for fun.
You want a package?
OK, just type this: easy_install sqlalchemy (the awesome ORM package for Python).
It magically goes and finds sqlalchemy, and installs it INTO your system Python’s install path.
Why is the standard assumption that if I want to use a Python package, say as a dependency for my project, I want to INSTALL IT INTO the Python running on my system?
What kind of crazy idea is this? It causes all kinds of issues. The first and most obvious: what if I have two programs that expect different versions of a given package? Since packages are installed into the runtime and not into my app, you have to know about this issue and work around it.
If packages were managed the Java way, the assumption would be that I want to install the package in the app that I am working on, not into /systemjdk/extensions/somePackage.
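For what it’s worth, the per-app idea can be approximated by hand. Here is a minimal sketch (written for Python 3), assuming a hypothetical layout where each app keeps its own copy of a package in a vendor directory and puts that directory at the front of sys.path. The directory names, the somepackage module and the version strings are all made up for illustration; this is not how easy_install works.

```python
import importlib
import os
import sys
import tempfile


def make_app_with_vendored_package(root, version):
    """Create a toy app directory holding its own copy of a package.

    The layout (root/vendor/somepackage.py) is hypothetical and exists
    only for this illustration.
    """
    vendor_dir = os.path.join(root, "vendor")
    os.makedirs(vendor_dir)
    with open(os.path.join(vendor_dir, "somepackage.py"), "w") as f:
        f.write("VERSION = %r\n" % version)
    return vendor_dir


def load_vendored(vendor_dir, name):
    """Import `name`, searching the app's own vendor directory first."""
    sys.path.insert(0, vendor_dir)
    try:
        sys.modules.pop(name, None)   # forget any copy imported earlier
        importlib.invalidate_caches()
        return importlib.import_module(name)
    finally:
        sys.path.remove(vendor_dir)


# In real life each app would be its own process; both are loaded here
# in one process only to keep the example self-contained.
with tempfile.TemporaryDirectory() as tmp:
    app_a = make_app_with_vendored_package(os.path.join(tmp, "app_a"), "0.4")
    app_b = make_app_with_vendored_package(os.path.join(tmp, "app_b"), "0.5")
    version_a = load_vendored(app_a, "somepackage").VERSION
    version_b = load_vendored(app_b, "somepackage").VERSION

print("app_a sees", version_a)
print("app_b sees", version_b)
```

Each app ends up with the version sitting in its own directory, and nothing is written into the system Python.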
The only argument FOR doing it this way that I can think of is saving disk space. Disk space is cheap.
/rant.
Ok so honestly, can anyone tell me why this is?
Feb 11, 2008
I have been tasked with upgrading Google Search Appliances, and in doing so I wanted to calculate some statistics.
Comparing Crawled Pages Across GSAs (or Minis)
You could use this to compare any two URL XML files exported from Crawl Diagnostics –> Export All Pages To a File.
I could have called them A and B, but I was comparing a Mini and a GSA at the time, hence the naming in the script. It uses simple Python sets to see what was crawled on one machine but not the other, and so on. It expects two local files, mini-urls.xml and gsa-urls.xml; either could be a GSA or a Mini export.
import xml.dom.minidom

def main():
    miniUrlSet = extractUrls(xml.dom.minidom.parse('mini-urls.xml'))
    gsaUrlSet = extractUrls(xml.dom.minidom.parse('gsa-urls.xml'))
    print 'mini', len(miniUrlSet)
    print 'gsa', len(gsaUrlSet)
    print 'intersections', len(miniUrlSet & gsaUrlSet)
    print 'mini is sub of gsa?', miniUrlSet <= gsaUrlSet
    gsaNotMini = gsaUrlSet - miniUrlSet
    print 'things in gsa but not mini:', len(gsaNotMini)
    for i in gsaNotMini:
        print i
    miniNotGsa = miniUrlSet - gsaUrlSet
    print 'things in mini but not gsa', len(miniNotGsa)
    for i in miniNotGsa:
        print i

def extractUrls(dom):
    nodelist = dom.getElementsByTagName("loc")
    urls = set()
    for node in nodelist:
        # I know all loc nodes have a single child text node; text nodes have a data property
        urls.add(node.firstChild.data)
    return urls

main()
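If you want to try the set logic without an export handy, here is a small self-contained sketch of the same idea, written for Python 3. The XML snippets are made up; the only thing taken from the real export is that URLs live in loc elements, and the urlset/url wrappers are an assumption.

```python
import xml.dom.minidom

# Made-up stand-ins for the two export files. Only the <loc> elements
# matter; the surrounding <urlset>/<url> structure is assumed.
MINI_XML = """<urlset>
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""

GSA_XML = """<urlset>
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
  <url><loc>http://example.com/c</loc></url>
</urlset>"""


def extract_urls(xml_text):
    """Return the set of URLs found in <loc> elements."""
    dom = xml.dom.minidom.parseString(xml_text)
    return {node.firstChild.data for node in dom.getElementsByTagName("loc")}


mini_urls = extract_urls(MINI_XML)
gsa_urls = extract_urls(GSA_XML)

print("in both:", sorted(mini_urls & gsa_urls))          # intersection
print("mini is subset of gsa?", mini_urls <= gsa_urls)   # subset test
print("in gsa but not mini:", sorted(gsa_urls - mini_urls))
print("in mini but not gsa:", sorted(mini_urls - gsa_urls))
```

The three operators do all the work: & for what both boxes crawled, - for what only one crawled, and <= to check whether one crawl is fully contained in the other.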
Calculating Search Keywords Density Over Time
This is calculated against an export from the search logs feature under status and reports. You export the timeframe you want to look at, then run this against the log file. You get the top 100 keywords that people searched for, along with how many times each was searched. It expects a local file named log.log.
import re
from datetime import datetime
from operator import itemgetter

def getQueryCounts(f):
    words = {}
    qReg = re.compile('.*?&q=(.*?)&')
    for l in f:
        keyword = qReg.findall(l)
        if keyword and keyword[0]:
            words[keyword[0]] = words.get(keyword[0], 0) + 1
    return words

start = datetime.now()
f = open('log.log')
words = getQueryCounts(f)
f.close()
top = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:100]
print 'Top Words'
print '---------'
for word, num in top:
    print word, num
print 'runtime:', datetime.now() - start
raw_input("press enter")
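The same counting can be sketched a little more compactly with collections.Counter. This version is for Python 3 and runs against a handful of made-up log lines; the line format is an assumption, and the only thing kept from the script above is the rule of capturing whatever sits between "&q=" and the next "&".

```python
import re
from collections import Counter

# Same rule as the script above: capture the text between "&q=" and the next "&".
Q_PATTERN = re.compile(r"&q=(.*?)&")


def count_queries(lines):
    """Count how often each non-empty q= value appears."""
    counts = Counter()
    for line in lines:
        match = Q_PATTERN.search(line)
        if match and match.group(1):   # skip lines with no query or an empty one
            counts[match.group(1)] += 1
    return counts


# Hypothetical log lines, invented for this example.
SAMPLE_LOG = [
    "GET /search?site=default&q=python&client=default HTTP/1.1",
    "GET /search?site=default&q=sqlalchemy&client=default HTTP/1.1",
    "GET /search?site=default&q=python&client=default HTTP/1.1",
    "GET /search?site=default&q=&client=default HTTP/1.1",
]

counts = count_queries(SAMPLE_LOG)
for word, num in counts.most_common(100):   # top 100, most searched first
    print(word, num)
```

Like the original, this misses a query that is the first or last parameter on the line, since the pattern wants an "&" on both sides.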
Dec 20, 2007
I finally got shared folders up and running on my virtual Fedora box. This required a little kernel/kernel-headers upgrading, and compiling the VMware Tools for my box, but it works like a charm. It even gives me cut and paste to the Windows XP desktop, which is… cool, I guess.
I decided to go with Python, which has a plethora of tools. Pylons is a piecemeal web framework that is closest to my liking; migrate is a library for schema migration that works nicely with sqlalchemy, a monster ORM.
sqlalchemy is cool because you can use parts of it totally independently. Coming from a CF background, I am used to having nice named/pooled connections that I don’t have to think about. The base layer of sqlalchemy is exactly that: a database type abstraction and pooling. Then you are free to go crazy with ORMish things or not; it’s up to you.
It is so reusable that many people have written layers on top of it for even more magical coding… but it’s nice to have all the options.
Migrate, a RoR knockoff, is the real find though. It looks young (as far as projects go), but I watched a demo of it used in another Python framework and it was exactly what I expected, like something we use at work for CF. It has a schema version table that holds the app’s current schema version, and version files with ‘up/down’ methods. My main issue with many of these ‘scaffold-magic’ things is that no one bothered to mention how you get from one version to the next… or back again. You can’t build the model right the first time, and iterative programming is a fact of life. This library addresses that.
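To make the version-table idea concrete, here is a toy sketch in Python 3 using only the standard sqlite3 module. It is not migrate’s API, just the general shape as I understand it: a one-row schema_version table plus a list of up/down steps. The table names, columns and function names are all invented for the example.

```python
import sqlite3

# Each migration is an (up_sql, down_sql) pair. Entry i takes the schema
# from version i to i + 1 (up) or from i + 1 back to i (down).
# The tables here are invented for the example.
MIGRATIONS = [
    ("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
     "DROP TABLE users"),
    ("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)",
     "DROP TABLE posts"),
]


def get_version(conn):
    """Read the current schema version, creating the table at version 0 if needed."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT version FROM schema_version").fetchone()
    if row is None:
        conn.execute("INSERT INTO schema_version (version) VALUES (0)")
        return 0
    return row[0]


def migrate_to(conn, target):
    """Step the schema up or down, one version at a time, until it reaches target."""
    if not 0 <= target <= len(MIGRATIONS):
        raise ValueError("no such schema version: %r" % target)
    current = get_version(conn)
    while current != target:
        if current < target:
            conn.execute(MIGRATIONS[current][0])       # run the 'up' step
            current += 1
        else:
            conn.execute(MIGRATIONS[current - 1][1])   # run the 'down' step
            current -= 1
        conn.execute("UPDATE schema_version SET version = ?", (current,))
    conn.commit()
    return current


def table_names(conn):
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
migrate_to(conn, 2)
tables_at_v2 = table_names(conn)
migrate_to(conn, 1)              # and back again
tables_at_v1 = table_names(conn)
print(tables_at_v2)
print(tables_at_v1)
```

Going up to version 2 and then back down to 1 drops the posts table again, which is the part the scaffolding tools never seem to talk about.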