Thursday, May 28, 2009

How to fetch a complete site from the Google cache

After accidentally deleting one of my blogs on Blogger, I wrote this script to retrieve all of its posts from the Google cache.

The script is written in Python and uses the BeautifulSoup library and the wget command-line program. Replace "viaje-china.blogspot.com" with the URL of the site you want to fetch from the Google cache, and adjust the list [0, 10, 20 ...] to match the number of result pages you need: Google returns ten results per page, so each entry is a start offset.

import os
import httplib
from BeautifulSoup import BeautifulSoup

# Each offset fetches one page of ten Google results for the site: query.
for i in [0, 10, 20]:
    conn = httplib.HTTPConnection("www.google.com")
    conn.request("GET", "/search?q=site%3Aviaje-china.blogspot.com&start=" + str(i))
    html = conn.getresponse().read()
    conn.close()

    soup = BeautifulSoup(html)

    # Mirror every link on the results page that points into the cache.
    for tag in soup.findAll('a'):
        href = tag.get('href', '')  # not every <a> tag has an href attribute
        if href.find("q=cache:") != -1:
            os.system("wget -t 5 -r -l 2 -k --user-agent=\"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1\" \"" + href + "\"")
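
If you would rather not hard-code the offsets, they can be derived from the number of result pages, since Google's start parameter advances in steps of ten. A minimal sketch, where num_pages is a hypothetical value you pick yourself (it is not part of the original script):

# Hypothetical helper: build the start offsets for num_pages result pages.
# Google paginates ten results at a time, so page n begins at offset n * 10.
num_pages = 3  # e.g. 3 pages gives [0, 10, 20]
offsets = [n * 10 for n in range(num_pages)]

The loop header above then becomes "for i in offsets:".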
