Using PHP DOM Functions to Parse HTML and Find Links

Page last updated on 2011 / 04 / 09

When developing websites, there are a million and one reasons that you will find yourself needing to parse some HTML to find snippets of information. On the face of it, most of the time a simple regular expression will do the trick, particularly when you are in control of the HTML you are fetching.

When parsing other people's HTML, you soon find that the tag soup that makes up the World Wide Web produces situations and code segments your regular expression was never built to accommodate, resulting in false positives, false negatives... and generally the unexpected.
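As a rough illustration of the problem, the sketch below runs a deliberately naive pattern and the DOM parser over the same markup. The sample HTML and the pattern are made up for this example; a more careful regular expression would catch more of these cases, but each new variation tends to need another special case.

```php
<?php
// Hypothetical sample markup: an unquoted attribute, an upper-case tag
// with single quotes, and an href that is not the first attribute.
$html = '<p><a href=/one>One</a> <A HREF=\'/two\'>Two</A> <a class="x" href="/three">Three</a></p>';

// A naive pattern that only expects lower-case tags with a double-quoted
// href as the first attribute.
preg_match_all('/<a href="([^"]*)"/', $html, $matches);
$regexLinks = $matches[1];

// The DOM parser normalises the markup before we query it.
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);

$domLinks = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    $domLinks[] = $link->getAttribute('href');
}

echo 'Regex found: ', count($regexLinks), "\n";
echo 'DOM found:   ', count($domLinks), "\n";
```

With this particular input the pattern misses every link, while the DOM loop sees all three.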

PHP's DOM functions are made specifically for parsing XML and (X)HTML. So, when you need to parse SGML-style markup, turn to these functions and stay away from regular expressions. Beyond reading a document, the DOM library's suite of functions can also add, edit and delete any attribute, tag or text within tags.
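A minimal sketch of the add, edit and delete operations mentioned above. The HTML fragment and the attribute values are hypothetical and only serve to show the calls involved.

```php
<?php
// Hypothetical fragment used only for illustration.
$html = '<div><a href="/old" target="_blank">Old link</a><span>remove me</span></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);

$link = $dom->getElementsByTagName('a')->item(0);
$link->setAttribute('href', '/new');    // edit an existing attribute
$link->setAttribute('rel', 'nofollow'); // add a new attribute
$link->removeAttribute('target');       // delete an attribute
$link->nodeValue = 'New link';          // replace the text inside the tag

$span = $dom->getElementsByTagName('span')->item(0);
$span->parentNode->removeChild($span);  // delete a tag and its contents

// Serialise just the <div> rather than the whole generated document.
$div = $dom->getElementsByTagName('div')->item(0);
$result = $dom->saveHTML($div);
echo $result, "\n";
```

Passing a node to saveHTML() needs PHP 5.3.6 or later; on older versions, saveXML($div) is a possible substitute.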

The following example shows how easy it is to collect hyperlinks from a page or file without the problem of broken HTML, attributes with missing/no quotes, or any other hassle that may impede the collection of links:

<?php
/*
  Using PHP's DOM functions to
  fetch hyperlinks and their anchor text
*/
libxml_use_internal_errors(true); // Collect parser warnings about malformed HTML instead of printing them
$dom = new DOMDocument;
$dom->loadHTML(file_get_contents('http://www.innvo.com/')); // Fetch innvo.com's home page

// echo links and their anchor text
echo '<pre>';
echo "Link\tAnchor\n";
foreach ($dom->getElementsByTagName('a') as $link)
{
    $href = $link->getAttribute('href');
    $anchor = $link->nodeValue;
    // Escape the values so they display as text inside the <pre> block
    echo htmlspecialchars($href), "\t", htmlspecialchars($anchor), "\n";
}
echo '</pre>';
?>

You may want to use the URL Fetching script from a previous post if you are unable to use the file_get_contents() function shown above.
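The parsing step only needs a string of HTML, so it can be separated from whatever does the fetching. The sketch below is not the script from the previous post; fetch_html() and extract_links() are hypothetical helper names, and the cURL options are just one reasonable set of defaults, assuming the cURL extension is installed.

```php
<?php
// Hypothetical helper: fetch a page with cURL.
// Returns the response body as a string, or false on failure.
function fetch_html($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ));
    return curl_exec($ch);
}

// Hypothetical helper: return array(href, anchor text) pairs
// for every <a> element found in an HTML string.
function extract_links($html)
{
    $links = array();
    if ($html === false || $html === '') {
        return $links; // nothing was fetched, so there is nothing to parse
    }

    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $link) {
        $links[] = array($link->getAttribute('href'), trim($link->nodeValue));
    }
    return $links;
}

// Usage: $links = extract_links(fetch_html('http://www.innvo.com/'));
```

Keeping the two functions apart means the link extraction can be tried against a local string or file before any network code is involved.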

