PHP - get favicon from a URL

I recently commented on Alan's blog about getting a favicon from a URL. The simplest way is to take the domain of the URL and append "/favicon.ico" to it. Problems arise if:

  • the favicon is not in the root of the host
  • it has an uncommon name
  • it is not in the MS ICO format (nowadays PNG is very common).
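
For reference, the simple guess described above can be sketched as follows. The function name guessRootFavicon is made up for this illustration, and the sketch only builds the URL; it does not check that the file actually exists on the server.

```php
<?php
// Sketch of the "domain + /favicon.ico" guess. The function name is
// hypothetical; it does not check that the file actually exists.
function guessRootFavicon(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null; // not an absolute URL we can work with
    }
    $scheme = $parts['scheme'] ?? 'http';
    return $scheme . '://' . $parts['host'] . '/favicon.ico';
}

echo guessRootFavicon('http://osebje.famnit.upr.si/~mkljun/'), "\n";
```

All three problems listed above apply to this guess: it ignores the path, the file name and the file format.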

Finding the <link rel="icon"> or <link rel="shortcut icon"> in the DOM of an external URL is hardly possible in JavaScript for security reasons (look at XMLHttpRequest for the possibilities). One way of doing it is to access external URLs through a proxy.

A simpler way is to use a server-side scripting language such as PHP and a DOM (HTML) parser. I just finished the script and it worked on the few URLs I tried (keeping in mind that it could be improved). I deliberately look at the DOM of the document first (see the first bullet above) because, for example, a personal web page such as http://osebje.famnit.upr.si/~mkljun/ might have a different favicon than the main server page http://osebje.famnit.upr.si/. Here's the code:

<?php
function getFavicon ($url) {
    $file_headers = @get_headers($url);
    $found = FALSE;
    // 1. CHECK THE DOM FOR THE <link> TAG
    // check if the url exists - if headers were returned and the status is not 404
    if ($file_headers && strpos($file_headers[0], ' 404') === false) {
        $dom = new DOMDocument();
        $dom->strictErrorChecking = FALSE;
        // @ to discard all the warnings of malformed htmls
        $loaded = @$dom->loadHTMLFile($url);
        if (!$loaded) {
            $error[]='Error parsing the DOM of the file';
        } else {
            $domxml = simplexml_import_dom($dom);
            //check for the historical rel="shortcut icon"
            if ($domxml->xpath('//link[@rel="shortcut icon"]')) {
                $path = $domxml->xpath('//link[@rel="shortcut icon"]');
                $faviconURL = (string) $path[0]['href'];
                $found = TRUE;
                return $faviconURL;
            //check for the HTML5 rel="icon"
            } else if ($domxml->xpath('//link[@rel="icon"]')) {
                $path = $domxml->xpath('//link[@rel="icon"]');
                $faviconURL = (string) $path[0]['href'];
                $found = TRUE;
                return $faviconURL;
            } else {
                $error[]="The URL does not contain a favicon <link> tag.";
            }
        }

        // 2. CHECK DIRECTLY FOR favicon.ico OR favicon.png FILE
        // the two seem to be most common
        if ($found == FALSE) {
            $parse = parse_url($url);
            $favicon_headers = @get_headers("http://".$parse['host']."/favicon.ico");
            if ($favicon_headers && strpos($favicon_headers[0], ' 404') === false) {
                $faviconURL = "/favicon.ico";
                $found = TRUE;
                return $faviconURL;
            }
            $favicon_headers = @get_headers("http://".$parse['host']."/favicon.png");
            if ($favicon_headers && strpos($favicon_headers[0], ' 404') === false) {
                $faviconURL = "/favicon.png";
                $found = TRUE;
                return $faviconURL;
            }
            if ($found == FALSE) {
                $error[]= "Files favicon.ico and .png do not exist on the server's root.";
            }
        }
    // if the URL does not exists ...
    } else {
        $error[]="URL does not exist";
    }

    if ($found == FALSE && isset($error) ) {
        return $error;
    }
}

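// the URL must be on one line
$tempurl = 'http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454';
$result = getFavicon ($tempurl);
// on failure the function returns an array of error messages
echo is_array($result) ? implode("\n", $result) : $result;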
?>
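
One thing to keep in mind: the function returns the href exactly as it appears in the <link> tag, so the result may be a relative path. A minimal sketch of turning it into an absolute URL could look like this. The helper resolveFaviconHref is hypothetical (it is not part of the script above), and it does not normalise "../" segments.

```php
<?php
// Hypothetical helper (not part of the original script): turn the href found
// in a <link> tag into an absolute URL, using the page URL as the base.
function resolveFaviconHref(string $href, string $pageUrl): string
{
    $base = parse_url($pageUrl);
    $scheme = $base['scheme'] ?? 'http';
    $host = $base['host'] ?? '';
    $origin = $scheme . '://' . $host . (isset($base['port']) ? ':' . $base['port'] : '');

    // Already absolute, e.g. http://cdn.example.com/favicon.ico
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }
    // Protocol-relative, e.g. //cdn.example.com/favicon.ico
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }
    // Root-relative, e.g. /img/favicon.png
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }
    // Path-relative: append to the directory part of the page path.
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolveFaviconHref('img/fav.png', 'http://osebje.famnit.upr.si/~mkljun/'), "\n";
```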

However, the script is very slow and parsing badly structured DOMs returns a bucketful of warnings. Hence the @ before $dom->loadHTMLfile($url).
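
As an alternative to the @ operator, libxml can be told to collect its parse errors internally, so they can be inspected or simply thrown away. The sketch below parses a hard-coded, deliberately malformed string instead of a remote page; the markup and the variable names are only for illustration.

```php
<?php
// Deliberately malformed markup, standing in for a real page.
$html = '<html><head><link rel="icon" href="/img/fav.png"></head>'
      . '<body><p>unclosed<div></p></span></body>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$errors = libxml_get_errors();          // array of LibXMLError objects
libxml_clear_errors();                  // empty the error buffer
libxml_use_internal_errors($previous);  // restore the previous setting

$xpath = new DOMXPath($dom);
$links = $xpath->query('//link[@rel="icon"]');
$href = $links->length > 0 ? $links->item(0)->getAttribute('href') : null;

echo "href: ", $href, "\n";
echo "parse problems collected: ", count($errors), "\n";
```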

Although the slowness of the script can be attributed to waiting for the server to respond, I wondered whether the computing time could be improved (see the measured times below).

Another way of finding the appropriate <link> tag is to read the file line by line (assuming the link tag is on one line). I know, I know ... but the <link rel="icon"> is at the beginning of the file and we can exit the loop as soon as we find it. Here's the solution, echoing the result (note that only the changed if statement from the function above is shown):

    //check if the url exists
    if ($file_headers && strpos($file_headers[0], ' 404') === false) {
        //open the pointer to the file
        $handle = @fopen($url, "r");
        //while the file was opened and is not at end of file
        while ($handle && !feof($handle)) {
            //read next line
            $buffer = fgets($handle, 4096);
            if (strstr($buffer, '<link')) {
                if (strstr($buffer, 'icon')) {
                    $doc=new DOMDocument();
                    //@ to discard warnings about the fragment
                    @$doc->loadHTML('<html><head>'.$buffer.'</head><body></body></html>');
                    $domxml=simplexml_import_dom($doc);
                    $path=$domxml->xpath('//link');
                    $faviconURL = (string) $path[0]['href'];
                    $found = TRUE;
                    echo $faviconURL;
                    //exit the loop
                    break;
                }
            }
        }
        if ($handle) {
            fclose($handle);
        }
    }
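
The loop above exits early only when an icon link is found; on a page without one it still reads the whole file. A possible refinement, sketched here under the assumption that icon links only appear inside <head>, is to stop at the closing </head> tag as well. The helper findIconLinkLine is hypothetical, and the demonstration reads from an in-memory stream rather than a remote URL.

```php
<?php
// Hypothetical helper: scan an already opened stream line by line and return
// the first line that looks like an icon <link>, or null if </head> or the
// end of the stream is reached first.
function findIconLinkLine($handle): ?string
{
    while (($line = fgets($handle, 4096)) !== false) {
        if (stripos($line, '<link') !== false && stripos($line, 'icon') !== false) {
            return trim($line);
        }
        if (stripos($line, '</head>') !== false) {
            break; // assumption: icon links live in <head>, so stop reading here
        }
    }
    return null;
}

// Demonstration on an in-memory stream instead of a remote URL.
$handle = fopen('php://memory', 'r+');
fwrite($handle, "<html>\n<head>\n<link rel=\"stylesheet\" href=\"a.css\">\n"
    . "<link rel=\"shortcut icon\" href=\"/favicon.png\">\n</head>\n<body></body>\n</html>\n");
rewind($handle);

$line = findIconLinkLine($handle);
fclose($handle);
echo $line, "\n";
```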

This version was a bit faster (see the user and system times below). I also thought, why not give regular expressions a try? I know, I know ... regular expressions are not meant for parsing HTML. But as we know what we are looking for ...

    //check if the url exists
    if ($file_headers && strpos($file_headers[0], ' 404') === false) {
        $handle = @fopen($url, "r");
        while ($handle && !feof($handle)) {
            $buffer = fgets($handle, 4096);
            if (strstr($buffer, '<link')) {
                if (strstr($buffer, 'icon')) {
                    //$array[1] holds the captured href values
                    preg_match_all('/href=["\']([^"\']*)["\']/i',$buffer, $array);
                    print_r($array);
                    break;
                }
            }
        }
        if ($handle) {
            fclose($handle);
        }
    }
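
To see what the regular expression captures without fetching anything, it can be tried on a single hard-coded line. The sample line below is made up for the illustration. The sketch uses preg_match instead of preg_match_all, since only the first href is needed, and reads the URL from capture group 1.

```php
<?php
// Sample line standing in for what fgets() would return.
$buffer = '<link rel="shortcut icon" href="http://cdn.sstatic.net/stackoverflow/img/favicon.ico">';

// preg_match() stops at the first match; group 1 is the text between the quotes.
if (preg_match('/href=["\']([^"\']*)["\']/i', $buffer, $match)) {
    $faviconURL = $match[1];
    echo $faviconURL, "\n";
}
```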

The third solution is comparable to the second. However, the response time from the server was quicker?!? Albeit still slow ... maybe I'm missing something ... but I have no time at the moment ... Also, the number of runs per script (around 20) is too low to draw any conclusions.

Here are some measured times of running these scripts:

mkljun@pim:~$ time php getFavicon.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m26.881s
user    0m0.048s
sys    0m0.036s
mkljun@pim:~$ time php getFavicon.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m21.531s
user    0m0.052s
sys    0m0.028s
mkljun@pim:~$ time php getFavicon.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m31.562s
user    0m0.052s
sys    0m0.024s
mkljun@pim:~$ time php getFavicon2.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m26.080s
user    0m0.044s
sys    0m0.008s
mkljun@pim:~$ time php getFavicon2.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m25.918s
user    0m0.024s
sys    0m0.028s
mkljun@pim:~$ time php getFavicon2.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m25.984s
user    0m0.032s
sys    0m0.020s
mkljun@pim:~$ time php getFavicon3.php 
            [0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m20.954s
user    0m0.028s
sys    0m0.024s
mkljun@pim:~$ time php getFavicon3.php 
            [0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m26.077s
user    0m0.032s
sys    0m0.020s
mkljun@pim:~$ time php getFavicon3.php 
            [0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m20.884s
user    0m0.028s
sys    0m0.028s
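
The same split between elapsed time and CPU time can also be measured from inside a script. The following is only a sketch: cpuSeconds is a hypothetical helper, and usleep stands in for the wait on a remote server, so the numbers it prints say nothing about the scripts above.

```php
<?php
// CPU seconds (user + system) used by this process so far.
function cpuSeconds(): float
{
    $u = getrusage();
    return $u['ru_utime.tv_sec'] + $u['ru_utime.tv_usec'] / 1e6
         + $u['ru_stime.tv_sec'] + $u['ru_stime.tv_usec'] / 1e6;
}

$wallStart = microtime(true);
$cpuStart = cpuSeconds();

usleep(200000); // stand-in for waiting on a remote server (0.2 s)

$wall = microtime(true) - $wallStart;
$cpu = cpuSeconds() - $cpuStart;

printf("wall: %.3fs, cpu: %.3fs\n", $wall, $cpu);
```

Wrapping the get_headers, fopen and parsing steps separately in this way would show which of them accounts for the elapsed time.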



Comments


Alan Dix on :

For faster performance, it is probably worth just dropping down to a text search for the link tag. I'd suggest the SAX streaming XML parser, as this would then only parse until the link tag rather than the entire document, but it is not good for HTML. There does seem to be a PEAR HTML SAX-like parser: XML_HTMLSax.

Alternatively, you could use the embed.ly JSONP API. For some reason it failed for the stackoverflow example in your code, but it worked with one of my meandeviation pages and gave the correct favicon as 'thumbnail url':
http://embed.ly/docs/explore/oembed?url=http%3A%2F%2Fwww.meandeviation.com%2Fbooknotes%2F

I assume behind the scenes it will be parsing the page etc., although I'm sure it is caching the result.

BTW, if you use my client-side js alongside your server-side script, you can make sure the page renders with a default icon even if the backend script takes a while.


Alan

Matjaž on :

Hi Alan ... I edited the post live and you commented before I finished. I didn't realize it :).

Good points. Thx ... will have to try some other possibilities when I get a chance.

Alan Dix on :

Looking at the timings, there is lots of real elapsed time but virtually no actual computation time, so it is not the parse time, just the waiting for web access, that takes time. Maybe substituting curl for the separate get_headers and fopen calls would drop one HTTP request. After that I'd AJAX-ify the script you have so that, while the first bit needs to be server side, the latency happens client-side and the last bit gets managed by the img.onload event, as I did in my simple js script.
