Tuesday, March 14, 2006

Webcrawler using libxml, libcurl and tidy

Contrary to my write-up in the last post, where I said wget might be the best way to crawl the web and fetch files to a local cache, my thoughts have now changed.

You can use the following libraries to build a decent webcrawler:

1. Tidy: Use tidylib to clean up your HTML pages and convert them to XHTML. tidylib's webpage has sample code that is good enough for converting HTML to XHTML - just make sure you save the result to a file using tidySaveFile().

libxml has problems parsing HTML, even if used with xmlRecoverFile() rather than xmlParseFile().
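
For reference, the tidy step comes down to something like this (a minimal sketch, not my exact code: the file names are placeholders, and newer tidy releases ship the buffer header as <tidybuffio.h> instead of <buffio.h>):

    #include <stdio.h>
    #include <tidy.h>
    #include <buffio.h>   /* TidyBuffer; named tidybuffio.h in newer releases */

    /* Clean one HTML file and save the XHTML result straight to disk. */
    int main(int argc, char **argv)
    {
        TidyDoc tdoc;
        TidyBuffer errbuf = {0};
        int rc = -1;

        if (argc < 3) {
            fprintf(stderr, "usage: %s in.html out.xhtml\n", argv[0]);
            return 1;
        }

        tdoc = tidyCreate();
        tidyOptSetBool(tdoc, TidyXhtmlOut, yes);   /* ask for XHTML output */
        tidySetErrorBuffer(tdoc, &errbuf);         /* collect the warnings */

        rc = tidyParseFile(tdoc, argv[1]);
        if (rc >= 0)
            rc = tidyCleanAndRepair(tdoc);
        if (rc >= 0)
            rc = tidySaveFile(tdoc, argv[2]);      /* save to a file, as noted above */

        if (rc < 0)
            fprintf(stderr, "tidy failed on %s\n%s", argv[1],
                    errbuf.bp ? (char *) errbuf.bp : "");

        tidyBufFree(&errbuf);
        tidyRelease(tdoc);
        return rc < 0 ? 1 : 0;
    }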

2. libxml: Parse the XHTML, pull the URLs out of the element attributes (hrefs and any others you need), and pass them on to libcurl to download. Need I say more?

Well, actually I should. libxml's API is a little hard to understand, and sample code that does what you want is hard to find. I had to do quite a bit of searching, looking up sample programs and then reading the API, to figure out how things worked.
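
To save you some of that searching, the core of it is a recursive tree walk like the one below (a sketch with assumptions: it only grabs the href attributes of anchor elements and just prints them, where a real crawler would hand them to libcurl):

    #include <stdio.h>
    #include <libxml/parser.h>
    #include <libxml/tree.h>

    /* Recursively walk the element tree and print every href found on
     * an anchor element. A real crawler would queue these for download. */
    static void collect_links(xmlNode *node)
    {
        xmlNode *cur;

        for (cur = node; cur != NULL; cur = cur->next) {
            if (cur->type == XML_ELEMENT_NODE &&
                xmlStrcmp(cur->name, BAD_CAST "a") == 0) {
                xmlChar *href = xmlGetProp(cur, BAD_CAST "href");
                if (href != NULL) {
                    printf("%s\n", (char *) href);
                    xmlFree(href);
                }
            }
            collect_links(cur->children);
        }
    }

    int main(int argc, char **argv)
    {
        xmlDoc *doc;

        if (argc < 2) {
            fprintf(stderr, "usage: %s page.xhtml\n", argv[0]);
            return 1;
        }

        doc = xmlParseFile(argv[1]);   /* the tidied XHTML from step 1 */
        if (doc == NULL) {
            fprintf(stderr, "could not parse %s\n", argv[1]);
            return 1;
        }
        collect_links(xmlDocGetRootElement(doc));
        xmlFreeDoc(doc);
        xmlCleanupParser();
        return 0;
    }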

3. curl: Or rather libcurl. To retrieve files from the Net. Again, need I say more?
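
The download side is only a handful of lines with the easy interface (again just a sketch; the URL and output file name are placeholders, and with no write callback set libcurl simply fwrite()s the body to the FILE * you pass as CURLOPT_WRITEDATA):

    #include <stdio.h>
    #include <curl/curl.h>

    /* Fetch one URL and write the response body to a local file. */
    static int fetch(const char *url, const char *path)
    {
        CURL *curl;
        FILE *out;
        CURLcode res;

        out = fopen(path, "wb");
        if (out == NULL)
            return -1;

        curl = curl_easy_init();
        if (curl == NULL) {
            fclose(out);
            return -1;
        }

        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);     /* default callback writes here */
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L); /* follow redirects */
        res = curl_easy_perform(curl);

        curl_easy_cleanup(curl);
        fclose(out);
        return res == CURLE_OK ? 0 : -1;
    }

    int main(void)
    {
        curl_global_init(CURL_GLOBAL_ALL);
        fetch("http://www.example.com/", "example.html");   /* placeholder URL and file */
        curl_global_cleanup();
        return 0;
    }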

Life would have been simpler if curl had a recursive download function ... or wget had a library I could use ... but then, that's why we computer engineers and students have a life!
