You can use the following libraries to build a decent web crawler:
1. Tidy: Use tidylib to clean up your HTML pages and convert them to XHTML. The sample code on tidylib's webpage is good enough for converting HTML to XHTML; just make sure you save the result to a file using tidySaveFile().
This step matters because libxml has problems parsing HTML, even if used with xmlRecoverFile() rather than xmlParseFile().
2. libxml: Parse the XHTML, get all elements'
Well, actually, I should say more. libxml is a little hard to understand from its API documentation, and sample code to do what you want is hard to find. I had to do quite a bit of searching, study sample programs, and then read the API reference to figure out how things worked.
3. curl: Or rather, libcurl, to retrieve files from the Net. Again, need I say more?
Life would have been simpler if curl had a recursive download function ... or wget had a library I could use ... but then, that's why we computer engineers and students have a life!
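Since curl has no recursive download, the recursion has to live in your own code. Here is a minimal sketch of that loop in plain C, under some loud assumptions: the names crawler_add, crawler_run, crawler_free, fake_visit, and the MAX_URLS cap are all made up for illustration, and fake_visit only pretends to be a three-page site. In a real crawler, the visitor callback is where libcurl would fetch the page, tidylib would convert it to XHTML, and libxml would pull out the links.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_URLS 1024 /* hypothetical cap on how many pages we track */

struct crawler {
    char *urls[MAX_URLS]; /* every URL seen so far; doubles as the work queue */
    size_t count;         /* number of URLs stored */
    size_t next;          /* index of the next URL to visit */
};

/* Hypothetical visitor: fetch `url`, find its links, and pass each link to
 * crawler_add(). The real fetch/tidy/parse work would happen in here. */
typedef void (*visit_fn)(struct crawler *c, const char *url);

/* Queue a URL unless it was seen before or the table is full.
 * Returns 1 if the URL was added, 0 otherwise. */
int crawler_add(struct crawler *c, const char *url)
{
    assert(c != NULL && url != NULL);
    for (size_t i = 0; i < c->count; i++)
        if (strcmp(c->urls[i], url) == 0)
            return 0; /* already seen: this is what stops link cycles */
    if (c->count == MAX_URLS)
        return 0;
    char *copy = malloc(strlen(url) + 1);
    if (copy == NULL)
        return 0;
    strcpy(copy, url);
    c->urls[c->count++] = copy;
    return 1;
}

/* Visit the start URL and everything reachable from it, breadth-first.
 * Returns the number of pages visited. */
size_t crawler_run(struct crawler *c, const char *start, visit_fn visit)
{
    assert(c != NULL && start != NULL && visit != NULL);
    crawler_add(c, start);
    while (c->next < c->count) {
        visit(c, c->urls[c->next]); /* may append new URLs to the queue */
        c->next++;
    }
    return c->next;
}

/* Release the stored URL copies and reset the crawler. */
void crawler_free(struct crawler *c)
{
    for (size_t i = 0; i < c->count; i++)
        free(c->urls[i]);
    c->count = 0;
    c->next = 0;
}

/* Stand-in visitor for a made-up three-page site whose pages link in a
 * cycle (a -> b -> c -> a), so the duplicate check gets exercised. */
void fake_visit(struct crawler *c, const char *url)
{
    if (strcmp(url, "a.html") == 0)
        crawler_add(c, "b.html");
    else if (strcmp(url, "b.html") == 0)
        crawler_add(c, "c.html");
    else if (strcmp(url, "c.html") == 0)
        crawler_add(c, "a.html");
}
```

Calling crawler_run() on a zero-initialized struct crawler with "a.html" and fake_visit visits each of the three fake pages once, even though the last page links back to the first. The linear duplicate scan is fine for a small crawl; a hash table would be the obvious swap for anything bigger.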