
Feb 11 2008

GSA Statistics Python Scripts

Published by jfrank under python

I have been tasked with upgrading Google Search Appliances, and along the way I wanted to calculate some statistics.

Comparing Crawled Pages Across GSAs (or Minis)

You can use this to compare any two URL XML files exported from Crawl Diagnostics -> Export All Pages To a File.
I could have called them A and B, but I was comparing a Mini and a GSA at the time, hence the naming convention in the script. It uses simple Python sets to show what was crawled on one machine but not the other, and vice versa. It expects two local files, mini-urls.xml and gsa-urls.xml; either could be a GSA or a Mini export.

import xml.dom.minidom

def main():
    miniUrlSet = extractUrls(xml.dom.minidom.parse('mini-urls.xml'))
    gsaUrlSet = extractUrls(xml.dom.minidom.parse('gsa-urls.xml'))
    print('mini', len(miniUrlSet))
    print('gsa', len(gsaUrlSet))
    print('intersection', len(miniUrlSet & gsaUrlSet))
    print('mini is subset of gsa?', miniUrlSet <= gsaUrlSet)
    gsaNotMini = gsaUrlSet - miniUrlSet
    print('things in gsa but not mini:', len(gsaNotMini))
    for url in sorted(gsaNotMini):
        print(url)
    miniNotGsa = miniUrlSet - gsaUrlSet
    print('things in mini but not gsa:', len(miniNotGsa))
    for url in sorted(miniNotGsa):
        print(url)

def extractUrls(dom):
    # Collect the text of every <loc> element into a set of URLs.
    urls = set()
    for node in dom.getElementsByTagName("loc"):
        # Each loc node is expected to hold a single text child, whose
        # data property is the URL; skip any loc element that is empty.
        if node.firstChild is not None:
            urls.add(node.firstChild.data.strip())
    return urls

if __name__ == '__main__':
    main()
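
To see what the set operations report, here is a minimal sketch that runs the same kind of comparison on two tiny in-memory exports instead of files. The XML shape (a sitemap-style urlset with one loc element per URL), the example.com URLs, and the helper names urls_from_string and compare are assumptions for illustration, not taken from a real export. It is written for Python 3.

```python
import xml.dom.minidom

# Hypothetical export snippets; a real export may contain more elements.
MINI_XML = """<urlset>
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""

GSA_XML = """<urlset>
  <url><loc>http://example.com/b</loc></url>
  <url><loc>http://example.com/c</loc></url>
</urlset>"""


def urls_from_string(xml_text):
    """Return the set of URLs held in the <loc> elements of xml_text."""
    dom = xml.dom.minidom.parseString(xml_text)
    urls = set()
    for node in dom.getElementsByTagName("loc"):
        if node.firstChild is not None:  # skip empty <loc/> elements
            urls.add(node.firstChild.data.strip())
    return urls


def compare(a, b):
    """Summarise how two URL sets overlap."""
    return {
        "both": a & b,            # crawled by both machines
        "only_a": a - b,          # crawled by the first machine only
        "only_b": b - a,          # crawled by the second machine only
        "a_subset_of_b": a <= b,  # True if b crawled everything a did
    }


mini = urls_from_string(MINI_XML)
gsa = urls_from_string(GSA_XML)
report = compare(mini, gsa)
print('both:', sorted(report["both"]))
print('mini only:', sorted(report["only_a"]))
print('gsa only:', sorted(report["only_b"]))
print('mini is subset of gsa?', report["a_subset_of_b"])
```

Here each machine has one URL the other lacks, so neither set is a subset of the other. The subset check only comes back true when the second machine has crawled everything the first one did.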

Calculating Search Keyword Density Over Time

This runs against an export from the Search Logs feature under Status and Reports. Export the timeframe you want to analyze, then run the script against the log file. It prints the top 100 keywords that people searched for, along with how many times each was searched. It expects a local file named log.log.

import re
from datetime import datetime
from operator import itemgetter

def getQueryCounts(f):
    words = {}
    # Capture the value of the q parameter, whether it follows ? or &
    # and whether or not another parameter comes after it.
    qReg = re.compile(r'[?&]q=([^&\s]*)')
    for line in f:
        match = qReg.search(line)
        if match and match.group(1):
            keyword = match.group(1)
            words[keyword] = words.get(keyword, 0) + 1
    return words

start = datetime.now()
with open('log.log') as f:
    words = getQueryCounts(f)
# Sort by count, highest first, and keep the top 100.
top = sorted(words.items(), key=itemgetter(1), reverse=True)[:100]
print('Top Words')
print('---------')
for word, num in top:
    print(word, num)
print('runtime:', datetime.now() - start)
input("press enter")
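
One thing to watch for is that keywords in the log are still URL-encoded, so "annual+report" and "Annual%20Report" would be counted as different searches. Below is a minimal sketch of decoding and lower-casing the q parameter before counting, using the standard urllib.parse module. The sample log lines and the helper names query_from_line and count_queries are hypothetical, and the real export may be laid out differently. It is written for Python 3.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Hypothetical log lines; the real export may have a different layout.
SAMPLE_LINES = [
    '10.0.0.1 - - "GET /search?q=annual+report&site=default HTTP/1.1" 200',
    '10.0.0.2 - - "GET /search?site=default&q=Annual%20Report HTTP/1.1" 200',
    '10.0.0.3 - - "GET /search?q=holidays&site=default HTTP/1.1" 200',
    '10.0.0.4 - - "GET /favicon.ico HTTP/1.1" 404',
]


def query_from_line(line):
    """Return the decoded, lower-cased q parameter of a log line, or None."""
    for token in line.split():
        if "?" not in token:
            continue  # only tokens with a query string can carry q
        params = parse_qs(urlsplit(token).query)
        values = params.get("q")
        if values:
            return values[0].strip().lower()
    return None


def count_queries(lines):
    """Count how often each normalised keyword appears in lines."""
    counts = Counter()
    for line in lines:
        keyword = query_from_line(line)
        if keyword:
            counts[keyword] += 1
    return counts


counts = count_queries(SAMPLE_LINES)
for word, num in counts.most_common(100):
    print(word, num)
```

Whether to fold case is a judgment call. It merges "Holidays" and "holidays" into one entry, which is usually what you want for a top-keywords report, but it loses the distinction if you care about exactly what people typed.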
