Parsing HTML documents with Simple Html DOM

I had to get some information from an HTML page. The information I needed was in nested tables, so I started working on it, and my first approach was to extract it using regular expressions.

After 2 hours I had created a set of regexes that worked pretty well, but when I received a new document the layout was a bit different and my regexes didn't work as expected. The content was in the same table, but this time it contained nested tables, which were a really big problem.
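To illustrate why nested tables trip up regular expressions, here is a minimal sketch. The markup and the pattern are hypothetical stand-ins, not the ones I actually used: a lazy match from an opening table tag to the nearest closing tag stops at the inner table's end, so the outer table is cut short.

```php
<?php
// Hypothetical markup: an outer table with one nested table inside it.
$html = '<table id="outer"><tr><td><table id="inner"><tr><td>x</td></tr></table></td>'
      . '<td>price</td></tr></table>';

// Illustrative pattern (an assumption, not the original one):
// lazily match from "<table" up to the nearest "</table>".
preg_match('~<table\b.*?</table>~s', $html, $m);

// The match ends at the inner table's closing tag, so the "price" cell
// of the outer table is not part of the match.
$truncated = strpos($m[0], 'price') === false;
var_dump($truncated);
```

A greedy pattern has the opposite problem: it runs to the last closing tag in the document, swallowing any sibling tables along the way.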

I had already used the PHP DOM extension in the past, but only on XML files, so I started working with it, and within an hour I had the thing working. This time the changes in the document didn't affect the scraping.
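For reference, a DOM-extension version of this kind of extraction could look roughly like the sketch below. It is not my original class: the sample markup, the column positions and the XPath expression are assumptions made up for illustration, using only the built-in DOMDocument and DOMXPath classes.

```php
<?php
// Hypothetical sample markup: one data row, plus a row that only holds a nested table.
$html = <<<'HTML'
<table cellpadding="2">
  <tr><td><a href="#" title="Old camera">Old camera</a></td><td></td><td>$10.00</td><td>3</td></tr>
  <tr><td><table><tr><td>nested</td></tr></table></td></tr>
</table>
HTML;

$doc = new DOMDocument();
// Real pages are rarely valid HTML, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = [];

// Select rows that have a titled link directly inside one of their own cells.
$rows = $xpath->query('//table[starts-with(@cellpadding, "2")]//tr[td/a[@title]]');
foreach ($rows as $tr) {
    $link  = $xpath->query('td/a[@title]', $tr)->item(0);
    $cells = $xpath->query('td', $tr);
    // item() returns null when a row has fewer cells than expected.
    $price = $cells->item(2);
    $bids  = $cells->item(3);
    $items[] = [
        'item'  => trim($link->getAttribute('title')),
        'price' => $price ? trim($price->textContent) : '',
        'bids'  => $bids ? trim($bids->textContent) : '',
    ];
}

print_r($items);
```

Matching on `td/a[@title]` rather than on any link below the row keeps an outer row from being picked up just because a table nested inside it contains a link.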

In about an hour I had a larger class with several methods to get all the required elements, but then I was presented with another challenge, again with nested tables: sometimes a row had fewer child elements than expected. I experimented with several things until I found Simple Html Dom. It is pretty straightforward and does an excellent job of scraping HTML documents. All the methods I had written were replaced by this:

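require_once 'simple_html_dom.php';

// $ch is a cURL handle created earlier with CURLOPT_RETURNTRANSFER enabled.
$html = curl_exec($ch);
$dom = new simple_html_dom();
$dom->load($html);
$items = array();

// Walk every table whose cellpadding attribute starts with "2".
foreach ($dom->find('table[cellpadding^=2]') as $table) {
    foreach ($table->find('tr') as $tr) {
        // Rows without a titled link (headers, spacers) are skipped.
        $a = $tr->find('a', 0);
        $link = $a ? trim($a->title) : '';
        if ($link) {
            // children() returns null when the row has fewer cells than expected.
            $price = $tr->children(2);
            $bids = $tr->children(3);
            $items[] = array(
                'item' => $link,
                'price' => $price ? trim($price->plaintext) : '',
                'bids' => $bids ? trim($bids->plaintext) : '',
            );
        }
    }
}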