I recently commented on Alan’s blog about getting a favicon from a URL. The simplest way is to take the domain of the URL and append "favicon.ico" to it. Problems arise if:
- the favicon is not in the root of the host
- it has an uncommon name
- it is not in the MS ICO format (nowadays PNG is very common).
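For reference, the "simplest way" could look roughly like the sketch below. The function name rootFaviconUrl is made up for illustration, and it only builds the conventional address; it does not check that a file actually exists there.

```php
<?php
// Hypothetical helper (not part of the original script): build the
// conventional root favicon address for any page URL on a host.
function rootFaviconUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null; // not an absolute URL we can work with
    }
    // Assumption: fall back to plain http when the URL carries no scheme.
    $scheme = $parts['scheme'] ?? 'http';
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    return $scheme . '://' . $parts['host'] . $port . '/favicon.ico';
}
```

Calling rootFaviconUrl('http://osebje.famnit.upr.si/~mkljun/') gives the address on the host root, which is exactly the case where the three problems above kick in.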
Finding the <link rel="icon"> or <link rel="shortcut icon"> in the DOM of an external URL is hardly possible in JavaScript for security reasons (look at XMLHttpRequest for the possibilities). One way of doing it is to access external URLs through a proxy.
A simpler way is to use a server-side scripting language such as PHP and a DOM (HTML) parser. I just finished the script and it worked on the few URLs I tried (keeping in mind that it could be improved). I deliberately look at the DOM of the document first (see the first bullet above) because, for example, a personal web page such as http://osebje.famnit.upr.si/~mkljun/ might have a different favicon than the main server page http://osebje.famnit.upr.si/. Here’s the code:
<?php
function getFavicon ($url) {
    $file_headers = @get_headers($url);
    $found = FALSE;

    // 1. CHECK THE DOM FOR THE <link> TAG
    // check if the url exists - if the request succeeded and the header returned is not 404
    if ($file_headers && $file_headers[0] != 'HTTP/1.1 404 Not Found') {
        $dom = new DOMDocument();
        $dom->strictErrorChecking = FALSE;
        $loaded = @$dom->loadHTMLfile($url); //@ to discard all the warnings of malformed htmls
        if (!$loaded) {
            $error[] = 'Error parsing the DOM of the file';
        } else {
            $domxml = simplexml_import_dom($dom);
            //check for the historical rel="shortcut icon"
            if ($domxml->xpath('//link[@rel="shortcut icon"]')) {
                $path = $domxml->xpath('//link[@rel="shortcut icon"]');
                $faviconURL = (string) $path[0]['href'];
                $found = TRUE;
                return $faviconURL;
            //check for the HTML5 rel="icon"
            } else if ($domxml->xpath('//link[@rel="icon"]')) {
                $path = $domxml->xpath('//link[@rel="icon"]');
                $faviconURL = (string) $path[0]['href'];
                $found = TRUE;
                return $faviconURL;
            } else {
                $error[] = "The URL does not contain a favicon <link> tag.";
            }
        }

        // 2. CHECK DIRECTLY FOR favicon.ico OR favicon.png FILE
        // the two seem to be most common
        if ($found == FALSE) {
            $parse = parse_url($url);
            $favicon_headers = @get_headers("http://".$parse['host']."/favicon.ico");
            if ($favicon_headers && $favicon_headers[0] != 'HTTP/1.1 404 Not Found') {
                $faviconURL = "/favicon.ico";
                $found = TRUE;
                return $faviconURL;
            }
            $favicon_headers = @get_headers("http://".$parse['host']."/favicon.png");
            if ($favicon_headers && $favicon_headers[0] != 'HTTP/1.1 404 Not Found') {
                $faviconURL = "/favicon.png";
                $found = TRUE;
                return $faviconURL;
            }
            if ($found == FALSE) {
                $error[] = "Files favicon.ico and .png do not exist on the server's root.";
            }
        }
    // if the URL does not exist ...
    } else {
        $error[] = "URL does not exist";
    }

    if ($found == FALSE && isset($error)) {
        return $error;
    }
}

// URL in one line
$tempurl = 'http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454';
$result = getFavicon ($tempurl);
// the function returns an array of messages when no favicon was found
echo is_array($result) ? implode("\n", $result) : $result;
?>
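One thing that could be improved is the comparison against the exact string 'HTTP/1.1 404 Not Found', which misses servers answering with HTTP/1.0, other error codes, or a redirect followed by an error. Below is a rough sketch of a more tolerant check. The helper name statusCodeFromHeaders is my own invention; it takes whatever get_headers() returned and picks out the last numeric status code.

```php
<?php
// Hypothetical helper: extract the numeric status code from the value
// returned by get_headers(), or null if the request failed.
function statusCodeFromHeaders($headers): ?int
{
    if (!is_array($headers)) {
        return null; // get_headers() returns false on failure
    }
    $code = null;
    // When redirects are followed, several status lines are listed;
    // keep overwriting so the last one (the final response) wins.
    foreach ($headers as $key => $line) {
        if (is_int($key) && is_string($line)
            && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}
```

With this, the test in the function could become something like "status code is not null and below 400" instead of a string comparison, though I have not rerun the timings with it.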
However, the script is very slow, and parsing badly structured DOMs returns a bucketful of warnings, hence the @ before $dom->loadHTMLfile($url).
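Instead of the @ operator, libxml can be told to collect parse errors internally. Here is a minimal sketch that works on an HTML string rather than a URL, so it can be tried without a network connection. The function name faviconHrefFromHtml and the token-based rel matching are my assumptions, not what the script above does.

```php
<?php
// Hypothetical helper: parse possibly malformed HTML without emitting
// warnings and return the href of the first <link> whose rel attribute
// contains the token "icon" (covers "icon" and "shortcut icon").
function faviconHrefFromHtml(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }
    $previous = libxml_use_internal_errors(true); // collect instead of print
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    // translate() lowercases the letters of "ICON" so the match ignores case.
    $nodes = $xpath->query(
        '//link[contains(concat(" ", normalize-space(translate(@rel, "ICON", "icon")), " "), " icon ")]'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $href = $nodes->item(0)->getAttribute('href');

    return $href !== '' ? $href : null;
}
```

The token match means rel="apple-touch-icon" is not picked up, which seemed the safer default to me.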
Although the slowness of the script can be attributed to waiting for the server to respond, I wondered whether the computing time could be improved (see the measured times below).
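To see where the time actually goes, each step could be wrapped in a small timer. This is only a sketch: the helper name timed is hypothetical, and it measures wall-clock time, so network waits are included.

```php
<?php
// Hypothetical helper: run a callable and return [result, elapsed seconds].
function timed(callable $fn): array
{
    $start = microtime(true);
    $result = $fn();

    return [$result, microtime(true) - $start];
}

// Possible usage (not executed here, since it needs network access):
// list($headers, $seconds) = timed(function () use ($url) {
//     return @get_headers($url);
// });
// echo "get_headers took {$seconds}s\n";
```

Timing get_headers(), the page download, and the parsing separately would show whether the parsing matters at all next to the server response.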
Another way of finding the appropriate <link> tag is to read the file line by line (assuming the link tag fits on one line). I know, I know … but the <link rel="icon"> is near the beginning of the file and we can exit the loop as soon as we find it. Here’s the solution, echoing the result (note that this is just the changed if statement from the function above):
//check if the url exists
if ($file_headers && $file_headers[0] != 'HTTP/1.1 404 Not Found') {
    //open the pointer to the file
    $handle = @fopen($url, "r");
    //while the file was opened and is not at end of file
    while ($handle && !feof($handle)) {
        //read next line
        $buffer = fgets($handle, 4096);
        //stop on a read error
        if ($buffer === FALSE) {
            break;
        }
        if (strstr($buffer, '<link')) {
            if (strstr($buffer, 'icon')) {
                $doc = new DOMDocument();
                //@ to discard warnings about a malformed line
                @$doc->loadHTML('<html><head>'.$buffer.'</head><body></body></html>');
                $domxml = simplexml_import_dom($doc);
                $path = $domxml->xpath('//link');
                $faviconURL = $path[0]['href'];
                $found = TRUE;
                echo $faviconURL;
                //exit the loop
                break;
            }
        }
    }
    //close the pointer if it was opened
    if ($handle) {
        fclose($handle);
    }
}
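A possible refinement of the loop is to give up once the closing </head> tag has passed, since a favicon link should not appear after it. The sketch below works on any opened stream, so it can be tried on a php://memory stream instead of a remote URL. The function name findIconLinkLine and the $maxLines cap are assumptions of mine.

```php
<?php
// Hypothetical helper: scan an already opened stream line by line and
// return the first line containing both "<link" and "icon", or null.
// Stops at </head> or after $maxLines lines, whichever comes first.
function findIconLinkLine($handle, int $maxLines = 500): ?string
{
    if (!is_resource($handle)) {
        return null; // e.g. fopen() failed and returned false
    }
    for ($i = 0; $i < $maxLines; $i++) {
        $line = fgets($handle, 4096);
        if ($line === false) {
            break; // end of file or read error
        }
        if (stripos($line, '<link') !== false && stripos($line, 'icon') !== false) {
            return trim($line);
        }
        if (stripos($line, '</head>') !== false) {
            break; // past the head, nothing more to find
        }
    }
    return null;
}
```

It shares the weakness of the version above: a <link> tag split over several lines, or a line longer than the 4096 byte buffer, will be missed.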
This version was a bit faster (see the user and system times below). I also thought, why not give regular expressions a try? I know, I know … regular expressions are not meant to parse HTML. But since we know what we are looking for …
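//check if the url exists
if ($file_headers && $file_headers[0] != 'HTTP/1.1 404 Not Found') {
    $handle = @fopen($url, "r");
    //while the file was opened and is not at end of file
    while ($handle && !feof($handle)) {
        $buffer = fgets($handle, 4096);
        //stop on a read error
        if ($buffer === FALSE) {
            break;
        }
        if (strstr($buffer, '<link')) {
            if (strstr($buffer, 'icon')) {
                //capture the value of every href attribute on the line
                preg_match_all('/href=["\']([^"\']*)["\']/i', $buffer, $array);
                print_r($array);
                break;
            }
        }
    }
    if ($handle) {
        fclose($handle);
    }
}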
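The pattern above grabs every href on the line, including those of stylesheets that merely mention "icon" somewhere. A slightly stricter variant is sketched below. The function name iconHrefFromLine is hypothetical, and it assumes quoted attribute values.

```php
<?php
// Hypothetical helper: return the href of the first <link ...> tag in
// $line whose rel attribute contains "icon" as a whole word, or null.
function iconHrefFromLine(string $line): ?string
{
    // Collect every <link ...> tag on the line.
    if (!preg_match_all('/<link\b[^>]*>/i', $line, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        // rel="icon" or rel="shortcut icon", but not rel="apple-touch-icon".
        if (!preg_match('/\brel\s*=\s*(["\'])(?:[^"\']*\s)?icon(?:\s[^"\']*)?\1/i', $tag)) {
            continue;
        }
        // The href may come before or after rel inside the tag.
        if (preg_match('/\bhref\s*=\s*(["\'])(.*?)\1/i', $tag, $m)) {
            return $m[2];
        }
    }
    return null;
}
```

It is still regular expressions on HTML, so unquoted attributes or a > inside an attribute value will trip it up.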
The third solution is comparable to the second. However, the response time from the server was shorter?!? It is still slow, though … maybe I’m missing something … but I have no time at the moment … Also, the number of runs per script (around 20) is too low to draw any conclusions.
Here are some measured times of running these scripts:
mkljun@pim:~$ time php getFavicon.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m26.881s
user 0m0.048s
sys 0m0.036s

mkljun@pim:~$ time php getFavicon.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m21.531s
user 0m0.052s
sys 0m0.028s

mkljun@pim:~$ time php getFavicon.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m31.562s
user 0m0.052s
sys 0m0.024s

mkljun@pim:~$ time php getFavicon2.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m26.080s
user 0m0.044s
sys 0m0.008s

mkljun@pim:~$ time php getFavicon2.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m25.918s
user 0m0.024s
sys 0m0.028s

mkljun@pim:~$ time php getFavicon2.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m25.984s
user 0m0.032s
sys 0m0.020s

mkljun@pim:~$ time php getFavicon3.php
[0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m20.954s
user 0m0.028s
sys 0m0.024s

mkljun@pim:~$ time php getFavicon3.php
[0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m26.077s
user 0m0.032s
sys 0m0.020s

mkljun@pim:~$ time php getFavicon3.php
[0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m20.884s
user 0m0.028s
sys 0m0.028s