Friday, March 31, 2006

Always use sizeof() when malloc()ing

I've had some real problems implementing my webpage retriever, and I fixed them once I realized that I had to include sizeof() every time I called malloc().

So remember to use it every time you malloc(), say for a string:


newstring = malloc ((strlen(oldstring)+1) * sizeof(char));


Remember to add 1 as well, as I did above: C strings terminate with a '\0', which takes up one extra character.
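
The same idiom matters even more for types wider than char. For example, to allocate an array of ints (count here is just a stand-in for however many elements you need):

numbers = malloc (count * sizeof (int));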

Thursday, March 30, 2006

Memory leak for null string assignment

This is really weird... the following piece of code in my 7DS system results in a huge memory leak, gobbling up memory really fast.


if (0 == hits) {
    /* No results, empty xmlResult and return */
    sprintf (xmlResult, "");
    return 0;
}


Disabling it solves the memory problem - I wonder why?
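
For what it's worth, if xmlResult points at a valid buffer, the string can be emptied without calling sprintf() at all. A guess at a safer alternative (I haven't verified this against 7DS):

if (0 == hits) {
    /* No results: make xmlResult the empty string and return */
    xmlResult[0] = '\0';
    return 0;
}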

Tuesday, March 28, 2006

malloc() and free() error for dynamic strings solved

OK, I am a newbie at dynamic strings in C, so please forgive my silliness.

I have been getting "*** glibc detected *** free(): invalid next size (fast)" errors in my application, which has to create dynamic path names, and I couldn't figure out why.

Finally I did a Google search and found the solution here: http://www.eskimo.com/~scs/cclass/int/sx7.html

Guess what I had done? malloc()'d the string one character shorter than needed, like this:
escapedURL = malloc (strlen (URL));

This is CORRECT:
escapedURL = malloc (strlen (URL) + 1);

Because C strings end with a '\0' character.
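
Putting the whole pattern together (copy_string is just a name I made up for illustration; POSIX systems also provide strdup(), which does the same job):

#include <stdlib.h>
#include <string.h>

/* Return a freshly malloc()'d copy of src, or NULL on failure */
char *
copy_string (const char *src)
{
  char *dst = malloc (strlen (src) + 1);  /* +1 for the '\0' */
  if (dst != NULL)
    strcpy (dst, src);
  return dst;
}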

Monday, March 27, 2006

Parsing filename and directories out of given path in C

Sample code for how to parse the directory structure and path names out of a given path string in C. I will use this for the webpage retriever project I am working on.


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

/* This program parses the path given in the argument into directories
 * and a filename, creates the directory structure and creates an
 * empty file as well */
int
main (int argc, char **argv)
{
  const char delimiters[] = "/\\";  /* Path delimiters */
  char *token, *oldtoken, *cp;
  char *program_directory = getcwd (NULL, 0);  /* Program directory */

  if (argc < 2)
    {
      fprintf (stderr, "Usage: %s path\n", argv[0]);
      return 1;
    }

  /* Create a copy of the path string (+1 for the '\0') */
  cp = malloc (strlen (argv[1]) + 1);
  strcpy (cp, argv[1]);

  /* Split the string into tokens */
  token = strtok (cp, delimiters);

  printf ("Directory = ");

  /* While there are tokens left */
  while (token != NULL)
    {
      oldtoken = malloc (strlen (token) + 1);
      strcpy (oldtoken, token);           /* Copy the current token */
      token = strtok (NULL, delimiters);  /* Go to the next token */
      /* If the next token is NULL, this is the last part and assumed
       * to be a filename */
      if (token == NULL)
        {
          printf ("\nFilename = %s\n", oldtoken);
          /* Create an empty file of that name */
          FILE *fp = fopen (oldtoken, "w");
          if (fp != NULL)
            fclose (fp);
        }
      /* Otherwise it is a directory */
      else
        {
          printf ("%s ", oldtoken);
          mkdir (oldtoken, 0755);  /* Create the directory */
          chdir (oldtoken);        /* Go there */
        }
      free (oldtoken);
    }
  printf ("\n");

  /* Free the remaining memory and return to the starting directory */
  free (cp);
  chdir (program_directory);
  free (program_directory);
  return 0;
}
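
To try it out, compile and run it on a sample path (pathparse is just what I call the file locally):

gcc -o pathparse pathparse.c
./pathparse docs/notes/todo.txt

It prints the directory parts, then the filename, and leaves the created directory tree and the empty file behind.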

Friday, March 17, 2006

URL or URI parsing in libxml and C

I found out that libxml has URI functions that let you parse URIs.

Here's an example program:


#include <stdio.h>
#include <libxml/uri.h>

int main(int argc, char **argv) {

    /* Parse the user-supplied URI, or a default one */
    xmlURIPtr url = xmlParseURI ( (argc <= 1) ? "http://www.theepochtimes.com/" : argv[1]);
    if (url == NULL) {
        fprintf (stderr, "Could not parse the URI\n");
        return 1;
    }

    /* Print all the respective fields (unset ones are NULL) */
    printf ("scheme = %s\n", url->scheme ? url->scheme : "(none)");
    printf ("opaque = %s\n", url->opaque ? url->opaque : "(none)");
    printf ("authority = %s\n", url->authority ? url->authority : "(none)");
    printf ("server = %s\n", url->server ? url->server : "(none)");
    printf ("user = %s\n", url->user ? url->user : "(none)");
    printf ("port = %d\n", url->port);
    printf ("path = %s\n", url->path ? url->path : "(none)");
    printf ("query = %s\n", url->query ? url->query : "(none)");
    printf ("fragment = %s\n", url->fragment ? url->fragment : "(none)");
    printf ("cleanup = %d\n", url->cleanup);

    /* Free the parsed URI */
    xmlFreeURI (url);
    return 0;
}
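
To build it, link against libxml2; I use xml2-config for the flags, something like gcc `xml2-config --cflags --libs` uritest.c -o uritest (uritest is only my name for the file).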

Parsing HTML using tidy and tidylib

It's so hard to find a C program on the web that can parse HTML! Yes, you can find parsers written in Perl and other languages, but not C!

So I might as well share what I've learnt so far. I am writing the 7DS HTML parser with libxml, but I experimented with tidy and tidylib as well, and here's what that code looks like:


#include <tidy.h>
#include <buffio.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <assert.h>

/**
* Dump the list of nodes and their attributes
* Modified from tidylib documentation
*/
void dumpNode( TidyNode tnod, int indent )
{
    TidyNode child;

    for ( child = tidyGetChild(tnod); child; child = tidyGetNext(child) )
    {
        ctmbstr name = tidyNodeGetName( child );
        if ( !name )
        {
            switch ( tidyNodeGetType(child) )
            {
            case TidyNode_Root:    name = "Root";                   break;
            case TidyNode_DocType: name = "DOCTYPE";                break;
            case TidyNode_Comment: name = "Comment";                break;
            case TidyNode_ProcIns: name = "Processing Instruction"; break;
            case TidyNode_Text:    name = "Text";                   break;
            case TidyNode_CDATA:   name = "CDATA";                  break;
            case TidyNode_Section: name = "XML Section";            break;
            case TidyNode_Asp:     name = "ASP";                    break;
            case TidyNode_Jste:    name = "JSTE";                   break;
            case TidyNode_Php:     name = "PHP";                    break;
            case TidyNode_XmlDecl: name = "XML Declaration";        break;

            case TidyNode_Start:
            case TidyNode_End:
            case TidyNode_StartEnd:
            default:
                assert( name != NULL ); // Shouldn't get here
                break;
            }
        }
        assert( name != NULL );

        /* Build the indentation string (+1 so indent == 0 is safe) */
        char whitespace[indent + 1];
        memset( whitespace, ' ', indent );
        whitespace[indent] = '\0';
        // printf( "%sNode: %s\n", whitespace, name );

        /* Get the first attribute of the node */
        TidyAttr tattr = tidyAttrFirst( child );
        while ( tattr != NULL ) {
            /* Print the node and its attribute (the value may be NULL) */
            ctmbstr aval = tidyAttrValue( tattr );
            printf( "%s %s %s= %s\n", whitespace, name,
                    tidyAttrName( tattr ), aval ? aval : "" );
            /* Get the next attribute */
            tattr = tidyAttrNext( tattr );
        }
        dumpNode( child, indent + 4 );
    }
}

/* Dump the whole document */
void dumpDoc( TidyDoc tdoc )
{
dumpNode( tidyGetRoot(tdoc), 0 );
}

/* Dump only the body */
void dumpBody( TidyDoc tdoc )
{
dumpNode( tidyGetBody(tdoc), 0 );
}

int main(int argc, char **argv )
{
    /* Input file: either the first argument or "../test.html" */
    const char* input = (argc > 1) ? argv[1] : "../test.html";
    TidyBuffer output = {0};
    TidyBuffer errbuf = {0};
    int rc = -1;
    Bool ok;

    TidyDoc tdoc = tidyCreate();                    // Initialize "document"
    printf( "Tidying:\t%s\n", input );

    ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes ); // Convert to XHTML
    if ( ok )
        rc = tidySetErrorBuffer( tdoc, &errbuf );   // Capture diagnostics
    if ( rc >= 0 )
        rc = tidyParseFile( tdoc, input );          // Parse the input HTML file
    if ( rc >= 0 )
        rc = tidyCleanAndRepair( tdoc );            // Tidy it up!
    if ( rc >= 0 )
        rc = tidyRunDiagnostics( tdoc );            // Kvetch
    if ( rc > 1 )                                   // If error, force output.
        rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );
    if ( rc >= 0 )
        rc = tidySaveBuffer( tdoc, &output );       // Pretty-print

    if ( rc >= 0 )
    {
        if ( rc > 0 )
            printf( "\nDiagnostics:\n\n%s", errbuf.bp );
        printf( "\nAnd here is the result:\n\n%s", output.bp );
    }
    else
        printf( "A severe error (%d) occurred.\n", rc );

    tidyBufFree( &output );
    tidyBufFree( &errbuf );

    /* Now parse and print the tags in the HTML document */
    dumpDoc( tdoc );

    tidyRelease( tdoc );
    return rc;
}
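
To build this one, link against tidylib with -ltidy, e.g. gcc tidytest.c -o tidytest -ltidy (tidytest is again just my local name for the file; add -I/-L flags if tidy.h and the library live somewhere non-standard on your system).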

Tuesday, March 14, 2006

Webcrawler using libxml, libcurl and tidy

Contrary to my writeup in the last post, where I said wget might be the best way to crawl the web and fetch files into a local cache, my thinking has now changed.

You can use the following libraries to build a decent webcrawler (a rough sketch of how they fit together follows the list):

1. Tidy: Use tidylib to clean up your HTML pages and make them XHTML. tidylib's webpage has sample code that is good enough for converting HTML to XHTML - just make sure you save to a file using tidySaveFile().

libxml has problems parsing raw HTML, even when used with xmlRecoverFile() rather than xmlParseFile().

2. libxml: Parse the XHTML, get all the <a> elements' href attributes (and any other URLs you need) and pass the URLs on to libcurl to download. Need I say more?

Well, actually I should. libxml is a little hard to understand from the API documentation alone, and sample code that does what you want is hard to find. I had to do quite a bit of searching, looking up sample programs, and then reading the API to figure out how things worked.

3. curl: Or rather libcurl. To retrieve files from the Net. Again, need I say more?
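
To give an idea of how the three pieces could fit together, here is a rough sketch. This is not 7DS code: clean_to_xhtml, collect_hrefs, fetch_url and the file names are made up for illustration, error handling is minimal, and the extracted links would still need to be resolved against the base URL before fetching.

#include <stdio.h>
#include <tidy.h>
#include <buffio.h>
#include <curl/curl.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

/* Sketch: how tidy, libxml and libcurl could cooperate (names illustrative) */

/* 1. tidy: clean up raw HTML and save it as XHTML */
static int
clean_to_xhtml (const char *htmlfile, const char *xhtmlfile)
{
    TidyDoc tdoc = tidyCreate ();
    int rc;

    tidyOptSetBool (tdoc, TidyXhtmlOut, yes);  /* Convert to XHTML */
    rc = tidyParseFile (tdoc, htmlfile);
    if (rc >= 0)
        rc = tidyCleanAndRepair (tdoc);
    if (rc >= 0)
        rc = tidySaveFile (tdoc, xhtmlfile);   /* Save the cleaned page */
    tidyRelease (tdoc);
    return rc;
}

/* 2. libxml: walk the XHTML tree and print every <a href> found */
static void
collect_hrefs (xmlNodePtr node)
{
    for (; node != NULL; node = node->next) {
        if (node->type == XML_ELEMENT_NODE
            && xmlStrcasecmp (node->name, BAD_CAST "a") == 0) {
            xmlChar *href = xmlGetProp (node, BAD_CAST "href");
            if (href != NULL) {
                /* A real crawler would queue this for fetch_url() */
                printf ("Found link: %s\n", (char *) href);
                xmlFree (href);
            }
        }
        collect_hrefs (node->children);
    }
}

/* 3. libcurl: fetch one URL into a local file */
static int
fetch_url (const char *url, const char *outfile)
{
    CURL *curl = curl_easy_init ();
    FILE *fp = fopen (outfile, "w");
    CURLcode res = CURLE_FAILED_INIT;

    if (curl != NULL && fp != NULL) {
        curl_easy_setopt (curl, CURLOPT_URL, url);
        /* The default write callback fwrite()s into fp */
        curl_easy_setopt (curl, CURLOPT_WRITEDATA, fp);
        res = curl_easy_perform (curl);
    }
    if (fp != NULL)
        fclose (fp);
    if (curl != NULL)
        curl_easy_cleanup (curl);
    return (res == CURLE_OK) ? 0 : -1;
}

int
main (void)
{
    curl_global_init (CURL_GLOBAL_ALL);

    /* Fetch a start page, tidy it, then mine it for more links */
    if (fetch_url ("http://www.example.com/", "page.html") == 0
        && clean_to_xhtml ("page.html", "page.xhtml") >= 0) {
        xmlDocPtr doc = xmlParseFile ("page.xhtml");
        if (doc != NULL) {
            collect_hrefs (xmlDocGetRootElement (doc));
            xmlFreeDoc (doc);
        }
    }

    curl_global_cleanup ();
    return 0;
}

Compile with the flags from xml2-config and curl-config, plus -ltidy.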

Life would have been simpler if curl had a recursive download function ... or wget had a library I could use ... but then, that's why we computer engineers and students have a life!