Feb 11 2008
GSA Statistics Python Scripts
I have been tasked with upgrading Google Search Appliances, and in doing so I wanted to calculate some statistics.
Comparing Crawled Pages Across GSAs (or Minis)
You can use this to compare any two URL XML files exported from Crawl Diagnostics -> Export All Pages To a File.
I could have called the two inputs A and B, but I was comparing a mini and a GSA at the time, hence the names used in the script. It uses plain Python sets to find what was crawled on one machine but not the other, and so on. It expects two local files, mini-urls.xml and gsa-urls.xml; either one can be an export from a GSA or a mini.
import xml.dom.minidom

def main():
    miniUrlSet = extractUrls(xml.dom.minidom.parse('mini-urls.xml'))
    gsaUrlSet = extractUrls(xml.dom.minidom.parse('gsa-urls.xml'))
    print 'mini', len(miniUrlSet)
    print 'gsa', len(gsaUrlSet)
    print 'intersections', len(miniUrlSet & gsaUrlSet)
    print 'mini is sub of gsa?', miniUrlSet <= gsaUrlSet
    gsaNotMini = gsaUrlSet - miniUrlSet
    print 'things in gsa but not mini:', len(gsaNotMini)
    for i in gsaNotMini:
        print i
    miniNotGsa = miniUrlSet - gsaUrlSet
    print 'things in mini but not gsa', len(miniNotGsa)
    for i in miniNotGsa:
        print i

def extractUrls(dom):
    nodelist = dom.getElementsByTagName("loc")
    urls = set()
    for node in nodelist:
        # I know all loc nodes have a single child text node; text nodes have a data property
        urls.add(node.firstChild.data)
    return urls

main()
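To see what the set operations report without exporting anything first, here is a minimal sketch that runs the same comparison on two tiny inline documents. The sample XML and the example.com URLs are made up for illustration, and it only assumes what the script above assumes: that each URL sits inside a loc element. A real export may carry more elements around it. The snake_case names are mine, and the check for an empty loc element is an extra guard the original does not have.

```python
import xml.dom.minidom

# Hypothetical sample data standing in for the two export files.
MINI_XML = """<urlset>
  <url><loc>http://example.com/a</loc></url>
  <url><loc>http://example.com/b</loc></url>
</urlset>"""

GSA_XML = """<urlset>
  <url><loc>http://example.com/b</loc></url>
  <url><loc>http://example.com/c</loc></url>
</urlset>"""


def extract_urls(dom):
    """Collect the text of every <loc> element into a set."""
    urls = set()
    for node in dom.getElementsByTagName("loc"):
        # Skip empty <loc/> elements, which have no child text node.
        if node.firstChild is not None:
            urls.add(node.firstChild.data.strip())
    return urls


mini_urls = extract_urls(xml.dom.minidom.parseString(MINI_XML))
gsa_urls = extract_urls(xml.dom.minidom.parseString(GSA_XML))

both = mini_urls & gsa_urls            # crawled by both machines
gsa_only = gsa_urls - mini_urls        # crawled by the gsa only
mini_only = mini_urls - gsa_urls       # crawled by the mini only
mini_is_subset = mini_urls <= gsa_urls # True only if the gsa has everything the mini has
```

With this sample data the two machines share one URL and each has one the other lacks, so the subset check comes out false.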
Calculating Search Keyword Density Over Time
This runs against an export from the Search Logs feature under Status and Reports. Export the timeframe you want to analyze, then run the script against the log file. It prints the 100 most-searched keywords along with how many times each was searched. Note that it only counts lines where the q parameter sits between two other parameters, since the pattern it looks for is &q=...&. Expects a local file named log.log.
import re
from datetime import datetime
from operator import itemgetter

def getQueryCounts(f):
    words = {}
    qReg = re.compile(r'.*?&q=(.*?)&')
    for l in f:
        keyword = qReg.findall(l)
        # skip lines with no q parameter or an empty one
        if keyword and keyword[0]:
            words[keyword[0]] = words.get(keyword[0], 0) + 1
    return words

start = datetime.now()
f = open('log.log')
words = getQueryCounts(f)
f.close()

# sort by count, highest first, and keep the top 100
top = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:100]
print 'Top Words'
print '---------'
for word, num in top:
    print word, num
print 'runtime:', datetime.now() - start
raw_input("press enter")
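The counting step can also be tried without a real log. This is a minimal sketch, written for Python 3 rather than the Python 2 of the script above, that applies the same regular expression to a few made-up lines. The sample lines are hypothetical and are not meant to reproduce the real search log format. Two things here are my own additions: the captured value is decoded with urllib.parse.unquote_plus, so that annual+report and annual%20report count as the same keyword, and collections.Counter stands in for the dict and the sort.

```python
import re
from collections import Counter
from urllib.parse import unquote_plus

# Same pattern as the script above: the q parameter must be preceded
# and followed by '&'.
Q_RE = re.compile(r'.*?&q=(.*?)&')

# Hypothetical request lines; a real search log has more fields.
SAMPLE_LINES = [
    "GET /search?site=default&q=annual+report&btnG=Search",
    "GET /search?site=default&q=annual%20report&btnG=Search",
    "GET /search?site=default&q=holidays&btnG=Search",
    "GET /search?site=default&q=&btnG=Search",  # empty query, skipped
    "GET /favicon.ico",                          # no query, skipped
]


def count_queries(lines):
    """Count decoded q values, ignoring lines with no or empty q."""
    counts = Counter()
    for line in lines:
        match = Q_RE.search(line)
        if match and match.group(1):
            # Decode '+' and %XX escapes (an addition, not in the original).
            counts[unquote_plus(match.group(1))] += 1
    return counts


counts = count_queries(SAMPLE_LINES)
top = counts.most_common(100)  # list of (keyword, count), highest first
```

Whether decoding is wanted depends on how the keywords will be read afterwards; leaving the raw values, as the original script does, keeps them identical to what appears in the log.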