Monday, November 07, 2005

libxml for parsing HTML

Long time no see. I have been busy preparing for my EE PhD qualifiers!

Well anyway, now I am to create an HTTP retriever-and-parser to get files for the 7DS multicast query system. It has to fetch not only the result set for 7DS queries but also the files themselves, along with associated elements such as images.

I found several HTML parsers for C (after long searches) such as ekhtml (no documentation), tidy (the library does not build properly) and LibWWW (supposed to be very complicated) ... and have settled on using libxml's built-in HTML parsing tools.

Sad that open source code has very little documentation ... hey, but neither does 7DS yet!
