PHP – get a favicon from a URL

I recently commented on Alan’s blog about getting a favicon from a URL. The simplest way is to take the domain of the URL and append "/favicon.ico" to it. Problems arise if:

  • the favicon is not in the root of the host
  • it has an uncommon name
  • it is not in the MS ICO format (nowadays PNG is very common).
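
For reference, the simplest approach mentioned above might be sketched as follows. The function name naiveFaviconUrl is made up for illustration, and the sketch only builds the guessed URL; it does not check that the file exists.

```php
<?php
// Naive guess: assume the icon lives at /favicon.ico on the host of $url.
// naiveFaviconUrl is a hypothetical name used for illustration only.
function naiveFaviconUrl($url) {
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // not an absolute URL we can work with
    }
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    return $scheme . '://' . $parts['host'] . '/favicon.ico';
}

echo naiveFaviconUrl('http://osebje.famnit.upr.si/~mkljun/'), "\n";
```

This guess fails in exactly the three cases listed above, which is why the <link> tag has to be inspected as well.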

Finding the <link rel="icon"> or <link rel="shortcut icon"> in the DOM of an external URL is hardly possible in JavaScript for security reasons (the same-origin policy; look at XMLHttpRequest for the possibilities). One way around it is to access external URLs through a proxy.

The simpler way is to use a server-side scripting language such as PHP and a DOM (HTML) parser. I just finished the script and it worked on the few URLs I tried (keeping in mind that it could be improved). I deliberately look at the DOM of the document first (see the first bullet above) because, for example, a personal web page such as http://osebje.famnit.upr.si/~mkljun/ might have a different favicon than the main http://osebje.famnit.upr.si/ server page. Here’s the code:

<?php
function getFavicon ($url) {
    $file_headers = @get_headers($url);
    $found = FALSE;
    // 1. CHECK THE DOM FOR THE <link> TAG
    // check if the url exists - get_headers() returns FALSE on failure and
    // the status line may be HTTP/1.0 or HTTP/1.1, so look for the code only
    if ($file_headers && strpos($file_headers[0], ' 404') === FALSE) {
        $dom = new DOMDocument();
        $dom->strictErrorChecking = FALSE;
        $loaded = @$dom->loadHTMLfile($url);  //@ to discard all the warnings of malformed htmls
        // loadHTMLfile() returns FALSE on failure; $dom itself is always an object
        if (!$loaded) {
            $error[]='Error parsing the DOM of the file';
        } else {
            $domxml = simplexml_import_dom($dom);
            //check for the historical rel="shortcut icon"
            if ($path = $domxml->xpath('//link[@rel="shortcut icon"]')) {
                // cast the attribute (a SimpleXMLElement) to a plain string
                $faviconURL = (string) $path[0]['href'];
                $found = TRUE;
                return $faviconURL;
            //check for the HTML5 rel="icon"
            } else if ($path = $domxml->xpath('//link[@rel="icon"]')) {
                $faviconURL = (string) $path[0]['href'];
                $found = TRUE;
                return $faviconURL;
            } else {
                $error[]="The URL does not contain a favicon <link> tag.";
            }
        }

        // 2. CHECK DIRECTLY FOR favicon.ico OR favicon.png FILE
        // the two seem to be most common
        if ($found == FALSE) {
            $parse = parse_url($url);
            $favicon_headers = @get_headers("http://".$parse['host']."/favicon.ico");
            if ($favicon_headers && strpos($favicon_headers[0], ' 404') === FALSE) {
                $faviconURL = "/favicon.ico";
                $found = TRUE;
                return $faviconURL;
            }
            $favicon_headers = @get_headers("http://".$parse['host']."/favicon.png");
            if ($favicon_headers && strpos($favicon_headers[0], ' 404') === FALSE) {
                $faviconURL = "/favicon.png";
                $found = TRUE;
                return $faviconURL;
            }
            if ($found == FALSE) {
                $error[]= "Files favicon.ico and .png do not exist on the server's root.";
            }
        }
    // if the URL does not exists ...
    } else {
        $error[]="URL does not exist";
    }

    if ($found == FALSE && isset($error) ) {
        return $error;
    }
}

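// URL in one line
$tempurl = 'http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454';
$result = getFavicon ($tempurl);
// getFavicon() returns an array of error messages when nothing was found
echo is_array($result) ? implode("\n", $result) : $result;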
?>
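
One thing the function does not do is turn the href it finds into a full URL: a <link> tag may carry an absolute URL, a root-relative path such as /favicon.ico, or a path relative to the page. A rough sketch of resolving it follows. resolveFaviconUrl is a hypothetical helper, not part of the script above, and it deliberately ignores "../" segments and the <base> tag.

```php
<?php
// Hypothetical helper: resolve a favicon href against the URL of the page
// it was found on. Simplified: no "../" handling, no <base> tag support.
function resolveFaviconUrl($pageUrl, $href) {
    // already absolute: http://... or https://...
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $p = parse_url($pageUrl);
    $scheme = isset($p['scheme']) ? $p['scheme'] : 'http';
    $origin = $scheme . '://' . $p['host'];
    if (isset($p['port'])) {
        $origin .= ':' . $p['port'];
    }
    // protocol-relative: //cdn.example.com/favicon.ico
    if (substr($href, 0, 2) === '//') {
        return $scheme . ':' . $href;
    }
    // root-relative: /favicon.ico
    if (substr($href, 0, 1) === '/') {
        return $origin . $href;
    }
    // page-relative: keep the page path up to and including its last "/"
    $path = isset($p['path']) ? $p['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolveFaviconUrl('http://osebje.famnit.upr.si/~mkljun/', 'favicon.png'), "\n";
```

With the personal page from above, a bare favicon.png resolves into the ~mkljun directory, while /favicon.ico resolves to the server root.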

However, the script is very slow and parsing badly structured DOMs returns a bucketful of warnings. Hence the @ before $dom->loadHTMLfile($url).
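
An alternative to the @ operator is to let libxml collect the parser errors internally instead of printing them. The sketch below uses a made-up, deliberately sloppy HTML string rather than a real page; libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() are standard PHP functions.

```php
<?php
// Sketch: parse sloppy HTML without warnings and without the @ operator.
// $html is a made-up sample, not a real page.
$html = '<html><head><link rel="icon" href="/img/fav.png"><body><p>unclosed';

$previous = libxml_use_internal_errors(true); // collect errors instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$errors = libxml_get_errors();         // array of LibXMLError objects, if you want to inspect them
libxml_clear_errors();                 // free the collected errors
libxml_use_internal_errors($previous); // restore the earlier setting

// read the href of the first <link> element, if there is one
$links = $dom->getElementsByTagName('link');
$href = $links->length > 0 ? $links->item(0)->getAttribute('href') : null;
echo $href, "\n";
```

The same switch should work in front of loadHTMLfile($url) as well; the advantage over @ is that the errors stay available for logging.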

Although the slowness of the script can be attributed to waiting for the server to respond, I wondered whether the computing time could be improved (see the measured times below).
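
To see where the wall-clock time actually goes, each step could be wrapped in a small timer. This is only a sketch: timeStep is a hypothetical helper, and the step timed here is a short sleep standing in for a network call such as get_headers($url).

```php
<?php
// Hypothetical helper: run $step, print how long it took in wall-clock
// seconds, and pass its return value through unchanged.
function timeStep($label, callable $step) {
    $start = microtime(true);
    $result = $step();
    $elapsed = microtime(true) - $start;
    printf("%s: %.3f s\n", $label, $elapsed);
    return $result;
}

// Stand-in for a slow step; in the script this would be a closure around
// get_headers($url) or $dom->loadHTMLfile($url).
$value = timeStep('sleep 10 ms', function () {
    usleep(10000);
    return 'done';
});
```

Timing get_headers(), loadHTMLfile() and the XPath lookup separately would show whether the seconds are spent on the network or in the parser.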

Another way of finding the appropriate <link> tag is to read the file line by line (assuming the link tag is on one line). I know, I know … but the <link rel="icon"> is near the beginning of the file and we can exit the loop as soon as we find it. Here’s the solution echoing the result (note that this is just the changed if statement from the above function):

    //check if the url exists
    if ($file_headers && strpos($file_headers[0], ' 404') === FALSE) {
        //open the pointer to the file 
        $handle = @fopen($url, "r");
        //while the file was opened and is not at end of file
        while ($handle && !feof($handle)) {
            //read next line
            $buffer = fgets($handle, 4096);
            if ($buffer !== FALSE && strstr($buffer, '<link') && strstr($buffer, 'icon')) {
                $doc=new DOMDocument();
                //@ to discard warnings about the HTML fragment
                @$doc->loadHTML('<html><head>'.$buffer.'</head><body></body></html>');
                $domxml=simplexml_import_dom($doc); 
                $path=$domxml->xpath('//link');
                if (!$path) {
                    continue;  //the line did not parse into a <link> element
                }
                $faviconURL = (string) $path[0]['href'];
                $found = TRUE;
                echo $faviconURL;
                //exit the loop
                break;
            }
        }
        if ($handle) {
            fclose($handle);
        }
    }

This version was a bit faster (see the user and system times below). I also thought, why not give regular expressions a try? I know, I know … regular expressions are not meant to parse HTML. But as we know what we are looking for …

    //check if the url exists
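    if ($file_headers && strpos($file_headers[0], ' 404') === FALSE) {
        $handle = @fopen($url, "r");
        //loop only if the file could be opened
        while ($handle && !feof($handle)) {
            $buffer = fgets($handle, 4096);
            if ($buffer !== FALSE && strstr($buffer, '<link') && strstr($buffer, 'icon')) {
                //$array[1] holds the captured href values
                preg_match_all('/href=["\']([^"\']*)["\']/i',$buffer, $array);
                print_r($array);
                break;
            }
        }
        if ($handle) {
            fclose($handle);
        }
    }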
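
The regular-expression step can be tried in isolation on a single line of HTML. In this sketch hrefFromLinkLine is a hypothetical helper; it uses preg_match() (first match only) and the case-insensitive stripos() instead of the preg_match_all() and strstr() calls above, and the sample line is made up around the favicon URL from the measurements below.

```php
<?php
// Hypothetical helper: pull the href value out of one line of HTML that
// contains a <link ... icon ...> tag; returns null when there is none.
function hrefFromLinkLine($line) {
    if (stripos($line, '<link') === false || stripos($line, 'icon') === false) {
        return null;
    }
    // same pattern as above: href="..." or href='...', case-insensitive
    if (preg_match('/href=["\']([^"\']*)["\']/i', $line, $m)) {
        return $m[1];
    }
    return null;
}

$line = '<link rel="shortcut icon" href="http://cdn.sstatic.net/stackoverflow/img/favicon.ico">';
echo hrefFromLinkLine($line), "\n";
```

Like the loop above, this only works when the whole tag sits on one line and the href is quoted.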

The third solution is comparable to the second. However, the response time from the server was quicker?!? Albeit still slow … maybe I’m missing something … but I have no time at the moment … Also, the number of times I ran each script (around 20) is too low to draw any conclusions.

Here are some measured times of running these scripts:

mkljun@pim:~$ time php getFavicon.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m26.881s
user    0m0.048s
sys    0m0.036s
mkljun@pim:~$ time php getFavicon.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m21.531s
user    0m0.052s
sys    0m0.028s
mkljun@pim:~$ time php getFavicon.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m31.562s
user    0m0.052s
sys    0m0.024s
mkljun@pim:~$ time php getFavicon2.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m26.080s
user    0m0.044s
sys    0m0.008s
mkljun@pim:~$ time php getFavicon2.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m25.918s
user    0m0.024s
sys    0m0.028s
mkljun@pim:~$ time php getFavicon2.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m25.984s
user    0m0.032s
sys    0m0.020s
mkljun@pim:~$ time php getFavicon3.php 
            [0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m20.954s
user    0m0.028s
sys    0m0.024s
mkljun@pim:~$ time php getFavicon3.php 
            [0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m26.077s
user    0m0.032s
sys    0m0.020s
mkljun@pim:~$ time php getFavicon3.php 
            [0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m20.884s
user    0m0.028s
sys    0m0.028s
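
To compare the three variants at a glance, the "real" times listed above can be averaged per script. This is just arithmetic on the nine runs shown; with three runs each it says little about which variant is really faster.

```php
<?php
// "real" times in seconds, copied from the runs listed above
$runs = array(
    'getFavicon.php'  => array(26.881, 21.531, 31.562),
    'getFavicon2.php' => array(26.080, 25.918, 25.984),
    'getFavicon3.php' => array(20.954, 26.077, 20.884),
);

$means = array();
foreach ($runs as $script => $times) {
    // arithmetic mean of the wall-clock times for this script
    $means[$script] = array_sum($times) / count($times);
    printf("%-16s mean %.3f s  (min %.3f, max %.3f)\n",
        $script, $means[$script], min($times), max($times));
}
```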