Broken Link Checker Using PHP and cURL

Page last updated on 2011 / 04 / 09

Whether you run a commercial site, a directory, or a personal site, it is important to ensure you do not have 'dead' links on your website. Broken links, meaning links that point to inactive domains or 404 pages, are of little use to your site visitors and may jeopardise any good search engine rankings you have, as broken links suggest that your site is not well maintained.

To head off the problem, use a script to check the links on your pages periodically, so you can quickly alter or remove links that are no longer active or useful.

The following script does this for you, using PHP and the cURL command-line tool, with a simple HTML parser to find links on a page. Simply enter a URL into the form, and the results will appear in an iframe on the same page.
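As a rough illustration of the cURL half of that approach, here is a minimal sketch of how one link's status might be requested and interpreted. The function names (build_head_command, parse_write_out, status_colour) are hypothetical and not part of the script, and the use of /dev/null assumes a Unix-like server.

```php
<?php
// Hypothetical helpers for illustration; the names are not from the script.

// Build a curl command line that sends a HEAD request and prints
// the final URL, HTTP status code and total time, separated by tabs.
function build_head_command($url)
{
    return 'curl -I -s -m 5 -o /dev/null -w "%{url_effective}\t%{http_code}\t%{time_total}" '
        . escapeshellarg($url);
}

// Split curl's tab-separated write-out line into named fields.
// Returns null when the line does not contain the three expected fields.
function parse_write_out($line)
{
    $parts = explode("\t", trim((string)$line));
    if (count($parts) < 3) {
        return null;
    }
    return array('url' => $parts[0], 'code' => $parts[1], 'time' => $parts[2]);
}

// Map a status code to a colour: 2xx green, 3xx yellow, anything else red.
function status_colour($code)
{
    $first = substr((string)$code, 0, 1);
    if ($first === '2') {
        return 'green';
    }
    if ($first === '3') {
        return 'yellow';
    }
    return 'red';
}

// Interpret a sample write-out line without touching the network.
$sample = "http://example.com/\t200\t0.251";
$fields = parse_write_out($sample);
echo $fields['url'], ' ', status_colour($fields['code']), "\n";
```

To try it against a live URL, the sample line could be swapped for shell_exec(build_head_command('http://example.com/')), provided PHP is permitted to run shell commands.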

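For the script to work correctly, ensure the following:

  - The curl command-line tool is installed and on the web server's path.
  - PHP is allowed to run shell commands (the script uses the backtick operator).
  - PHP's DOM extension is enabled.
  - The web server can write to /tmp.
  - The iframe's src attribute (website.php in the listing) matches the filename you save the script under.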

<?php
class html_parser
{
    // Convert a relative link to an absolute link, using $base as the page it was found on
    public function rel2abs($rel, $base)
    {
        $p = @parse_url($rel);
        if (!$rel)
            return $base;
        if (isset($p['scheme']) && $p['scheme'])
        {
            if (!isset($p['path']))
            {
                if (isset($p['query']))
                    $rel = preg_replace('#\?#', '/?', $rel, 1);
                else
                    $rel .= '/';
            }
            return $rel; /* return if already absolute URL */
        }
        if ($rel[0] == '#' || $rel[0] == '?') return $base.$rel; /* queries and anchors */
        $scheme = 'http'; $host = ''; $path = ''; /* defaults in case the base URL lacks a component */
        extract(parse_url($base)); /* parse base URL and convert to local variables: $scheme, $host, $path */
        $path = preg_replace('#/[^/]*$#', '', $path); /* remove non-directory element from path */
        if ($rel[0] == '/') $path = ''; /* destroy path if relative url points to root */
        $abs = "$host$path/$rel"; /* dirty absolute URL */
        $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#'); /* replace '//' or '/./' or '/foo/../' with '/' */
        for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {}
        return $scheme.'://'.$abs; /* absolute URL is ready! */
    }

    // Find every $tag element in the document and check the URL held in its $attr attribute
    public function parse_for_links($dom, $url, $tag, $attr, &$i)
    {
        $done = array(); // URLs already checked for this tag
        foreach ($dom->getElementsByTagName($tag) as $link)
        {
            if ($i >= 100) // Limiting to 100 URLs, you can change this to suit your needs.
                break;
            $href = $link->getAttribute($attr);
            if (!strlen($href) || $href[0] == '#' || preg_match("'^javascript'i", $href))
                continue;
            // Make the link absolute, then strip the scheme and any fragment
            $href = preg_replace(array("'^[^:]+://'", "'#.+$'"), '', $this->rel2abs($href, $url));
            if (isset($done[$href]))
                continue;
            $done[$href] = true;
            // HEAD request; curl prints the final URL, status code and total time, tab-separated
            $string = 'curl -I -A "Broken Link Checker" -s --max-redirs 5 -m 5 --retry 1 --retry-delay 10 -w "%{url_effective}\t%{http_code}\t%{time_total}" -o /tmp/temp2.txt '.escapeshellarg($href);
            $string = array_pad(explode("\t", (string)`$string`), 3, '');
            $class = substr($string[1], 0, 1); // first digit of the status code
            if ($class == '2')
                $color = 'green';
            elseif ($class == '3')
                $color = 'yellow';
            else
                $color = 'red';
            echo (++$i).'. '.str_pad(htmlspecialchars($string[0]), 50, ' ', STR_PAD_RIGHT)." <font color=\"$color\">".$string[1]."</font> ".$string[2]."s\n";
            flush();
        }
    }
}

// Loads up an iframe with some default text
if (isset($_GET['iframe']))
{
    echo 'Results will appear here';
    exit(0);
}
// You have submitted a URL to check
if (isset($_POST['url'], $_POST['choice']))
{
    $url = @parse_url($_POST['url']);
    if (!isset($url['host']))
        echo 'The URL you provided was invalid, please submit a valid URL';
    else
    {
        // Prepare the command to send to cURL; the page body is saved to /tmp/temp.txt
        $string = 'curl -A "Broken Link Checker" -s --max-redirs 5 -m 5 --retry 1 --retry-delay 10 -w "%{url_effective}\t%{http_code}\t%{size_download}\t%{time_total}" -o /tmp/temp.txt '.escapeshellarg($_POST['url']);
        // Check the HTTP response
        $string = array_pad(explode("\t", (string)`$string`), 4, '');
        $class = substr($string[1], 0, 1); // first digit of the status code
        if ($class == '2')
            $color = 'green';
        elseif ($class == '3')
            $color = 'yellow';
        else
            $color = 'red';
        echo '<sup>Fetched '.htmlspecialchars($string[0]).' ('.$string[2].' bytes) in '.$string[3].' seconds and returned a <font color="'.$color.'">'.$string[1].'</font> response</sup>';
        echo '<pre><br />';
        $_html_parser = new html_parser;
        $dom = new DOMDocument;
        @$dom->loadHTML(file_get_contents('/tmp/temp.txt'));
        $i = 0;
        if ($_POST['choice'] == 'Check Links') // Checking <a> and <area> references
        {
            $_html_parser->parse_for_links($dom, $_POST['url'], 'a', 'href', $i);
            $_html_parser->parse_for_links($dom, $_POST['url'], 'area', 'href', $i);
        }
        elseif ($_POST['choice'] == 'Check Files') // Checking <link>, <script> and <img> references
        {
            $_html_parser->parse_for_links($dom, $_POST['url'], 'link', 'href', $i);
            $_html_parser->parse_for_links($dom, $_POST['url'], 'script', 'src', $i);
            $_html_parser->parse_for_links($dom, $_POST['url'], 'img', 'src', $i);
        }

        // Clean up the temporary files written by cURL
        if (is_file('/tmp/temp.txt'))
            unlink('/tmp/temp.txt');
        if (is_file('/tmp/temp2.txt'))
            unlink('/tmp/temp2.txt');
    }
    exit(0);
}
// Introductory text
echo '<p>Use the tool below to see if there are any broken links on your site.</p>
<p>The URL you provide will be fetched and then parsed for links and images. Each of those will then be fetched with a 5 second limit, which is ample time. Any requests that aren\'t completed in that time should be classed as broken or currently unavailable. A typical request usually takes a quarter of a second. A maximum of 100 links/files will be checked.</p>
<p><font color="green">Green</font> indicates a healthy link, <font color="yellow">yellow</font> indicates a redirect, which may or may not lead to a healthy page. Finally, <font color="red">red</font> indicates a broken link, by either pointing to a defunct page or to a server that is unresponsive.</p>
<p>If the URL you submit is a redirect and returns one URL, try using that URL instead.</p>

<form method="post" target="iframe">
<p>Enter a URL: <input type="text" name="url" value="" style="width:50%" />
<select name="choice"><option>Check Links</option><option>Check Files</option></select>
<input type="submit" value="Check It!" />
</p></form>
<div align="center"><iframe id="iframe" name="iframe" style="width:95%" src="website.php?iframe=1"></iframe></div>
<p>Please bear in mind that it is possible that different users receive different responses from the links you see above. The results you see should be indicative only and double-checked manually should further investigation be required.</p>';
?>

Note that you can also search for scripts, images and CSS files, rather than simply hyperlinks.
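The 'Check Files' option works by looking at a different set of tags and attributes. As a minimal sketch of that idea, the snippet below collects references from an HTML string using a tag-to-attribute map. The function name collect_references and the sample markup are assumptions made for illustration, and the script itself goes on to request each URL rather than simply listing it.

```php
<?php
// Hypothetical helper for illustration; not part of the script.
// Returns the value of $attr for every $tag element, in the order the map lists the tags.
function collect_references($html, array $tagToAttr)
{
    $dom = new DOMDocument;
    // Suppress the warnings libxml raises for imperfect real-world markup.
    @$dom->loadHTML($html);
    $found = array();
    foreach ($tagToAttr as $tag => $attr) {
        foreach ($dom->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr);
            if ($value !== '') {
                $found[] = $value;
            }
        }
    }
    return $found;
}

// Sample markup, made up for this example.
$html = '<html><head><link rel="stylesheet" href="style.css">'
      . '<script src="app.js"></script></head>'
      . '<body><a href="/about">About</a><img src="logo.png" alt=""></body></html>';

// Same tag/attribute pairs the 'Check Files' branch uses.
$files = collect_references($html, array('link' => 'href', 'script' => 'src', 'img' => 'src'));
// Same tag/attribute pairs the 'Check Links' branch uses.
$links = collect_references($html, array('a' => 'href', 'area' => 'href'));

print_r($files);
print_r($links);
```

Keeping the pairs in one array makes it easy to extend the check to other elements, such as iframe with src, without touching the loop.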

