Programming

Persistent Inappeasable Mind


Sunday, April 15. 2012

Bash - looping or capturing a multiple line command output

Programming

I tried to loop over the output of a command that returns a multi-line result in a Bash script. The problem was that the quoted command substitution gave me ONE line instead of each line separately: the quotes turn the whole output into a single word, so the loop runs only once, and the unquoted $line in echo then collapses the newlines into spaces. Without quotes I got each "word" on a new line, because the shell splits the unquoted output on blanks, tabs and newlines and the loop runs once per word. Here is the "faulty" script:

#for line in `ls -l`
#for line in $(ls -l)
for line in "$(ls -l)" 
do
    echo $line"XXX"    
done

This is ordinary shell word splitting rather than something specific to the OS X (or BSD) environment. I got around it by piping the command to a while loop, which reads one line at a time:

# IFS= and -r keep leading whitespace and backslashes intact
ls -l | while IFS= read -r line
do
    echo "${line}xxx"
done

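or, using process substitution (which keeps the loop in the current shell, so variables set inside it survive):

while IFS= read -r line; do
    echo "${line}xxx"
done < <(ls -l)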

Tuesday, March 13. 2012

PHP - get favicon from a URL

Programming

I recently commented on Alan's blog about getting a favicon from a URL. The simplest way is to take the domain of the URL and append "favicon.ico" to it. Problems arise if:

  • the favicon is not in the root of the host
  • it has an uncommon name
  • it is not in the MS ICO format (nowadays PNG is very common).

Finding the <link rel="icon"> or <link rel="shortcut icon"> in the DOM of an external URL is hardly possible in JavaScript for security reasons: the browsers' same-origin policy restricts cross-domain requests (look at XMLHttpRequest for the possibilities). One way around it is to access external URLs through a proxy.

A simpler way is to use a server-side scripting language such as PHP and a DOM (HTML) parser. I just finished the script and it worked on the few URLs I tried (keeping in mind that it could be improved). I deliberately look at the DOM of the document first (see the first bullet above), as, for example, a personal web page http://osebje.famnit.upr.si/~mkljun/ might have a different favicon than the main http://osebje.famnit.upr.si/ server page. Here's the code:

<?php
function getFavicon ($url) {
    $file_headers = @get_headers($url);
    $found = FALSE;
    // 1. CHECK THE DOM FOR THE <link> TAG
    // check if the url exists - get_headers() succeeded and the status is not 404
    if ($file_headers && strpos($file_headers[0], ' 404') === false) {
        $dom = new DOMDocument();
        $dom->strictErrorChecking = FALSE;
        // @ to discard all the warnings of malformed htmls;
        // loadHTMLFile() returns FALSE on failure, so check its result, not $dom
        $loaded = @$dom->loadHTMLFile($url);
        if (!$loaded) {
            $error[]='Error parsing the DOM of the file';
        } else {
            $domxml = simplexml_import_dom($dom);
            //check for the historical rel="shortcut icon"
            if ($domxml->xpath('//link[@rel="shortcut icon"]')) {
                $path = $domxml->xpath('//link[@rel="shortcut icon"]');
                $faviconURL = $path[0]['href'];
                $found = TRUE;
                return $faviconURL;
            //check for the HTML5 rel="icon"
            } else if ($domxml->xpath('//link[@rel="icon"]')) {
                $path = $domxml->xpath('//link[@rel="icon"]');
                $faviconURL = $path[0]['href'];
                $found = TRUE;
                return $faviconURL;
            } else {
                $error[]="The URL does not contain a favicon <link> tag.";
            }
        }

        // 2. CHECK DIRECTLY FOR favicon.ico OR favicon.png FILE
        // the two seem to be most common
        if ($found == FALSE) {
            $parse = parse_url($url);
            $favicon_headers = @get_headers("http://".$parse['host']."/favicon.ico");
            if ($favicon_headers && strpos($favicon_headers[0], ' 404') === false) {
                $faviconURL = "/favicon.ico";
                $found = TRUE;
                return $faviconURL;
            }
            $favicon_headers = @get_headers("http://".$parse['host']."/favicon.png");
            if ($favicon_headers && strpos($favicon_headers[0], ' 404') === false) {
                $faviconURL = "/favicon.png";
                $found = TRUE;
                return $faviconURL;
            }
            if ($found == FALSE) {
                $error[]= "Files favicon.ico and .png do not exist on the server's root.";
            }
        }
    // if the URL does not exists ...
    } else {
        $error[]="URL does not exist";
    }

    if ($found == FALSE && isset($error) ) {
        return $error;
    }
}

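// URL in one line
$tempurl = 'http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454';
$result = getFavicon ($tempurl);
// on failure the function returns an array of error messages
echo is_array($result) ? implode("\n", $result) : $result;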
?>

However, the script is very slow, and parsing badly structured DOMs produces a flood of warnings. Hence the @ before $dom->loadHTMLfile($url).

Although the slowness of the script can be attributed to waiting for the server to respond, I wondered if the computing time could be improved (see the measured times below).

Another way of finding the appropriate <link> tag is to read the file line by line (assuming the link tag is on one line). I know, I know ... but the <link rel="icon"> is at the beginning of the file and we can exit the loop as soon as we find it. Here's the solution, echoing the result (note that this is just the changed if statement from the function above):

    //check if the url exists
    if ($file_headers && strpos($file_headers[0], ' 404') === false) {
        //open the pointer to the file
        $handle = @fopen($url, "r");
        //while the file was opened and is not at end of file
        while ($handle && !feof($handle)) {
            //read next line
            $buffer = fgets($handle, 4096);
            if (strstr($buffer, '<link')) {
                if (strstr($buffer, 'icon')) {
                    $doc=new DOMDocument();
                    //@ to discard warnings if the line holds malformed markup
                    @$doc->loadHTML('<html><head>'.$buffer.'</head><body></body></html>');
                    $domxml=simplexml_import_dom($doc);
                    $path=$domxml->xpath('//link');
                    $faviconURL = $path[0]['href'];
                    $found = TRUE;
                    echo $faviconURL;
                    //exit the loop
                    break;
                }
            }
        }
        //release the handle if it was opened
        if ($handle) {
            fclose($handle);
        }
    }

This version was a bit faster (see the user and system times below). I also thought, why not give regular expressions a try? I know, I know ... regular expressions are not meant to parse HTML. But as we know what we are looking for ...

    //check if the url exists
    if ($file_headers && strpos($file_headers[0], ' 404') === false) {
        $handle = @fopen($url, "r");
        //loop only if the file was opened
        while ($handle && !feof($handle)) {
            $buffer = fgets($handle, 4096);
            if (strstr($buffer, '<link')) {
                if (strstr($buffer, 'icon')) {
                    //$array[1] holds the captured href values
                    preg_match_all('/href=["\']([^"\']*)["\']/i',$buffer, $array);
                    print_r($array);
                    break;
                }
            }
        }
        if ($handle) {
            fclose($handle);
        }
    }

The third solution is comparable to the second, although in these runs the server seemed to respond more quickly. It is still slow, though; maybe I'm missing something, but I have no time to dig into it at the moment. Also, the number of runs per script (around 20) is too low to draw any conclusions.
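
As a rough illustration of the regular-expression approach, here is a minimal sketch in JavaScript (the language of the Mootools entry below) rather than PHP. It applies the same href pattern to a single line and, as an extra step the PHP versions above do not take, resolves a relative href against the page URL with the standard URL class. The function name faviconFromLine and the sample lines are made up for this example.

```javascript
// Hypothetical helper: pull the href out of one <link ... icon ...> line and
// resolve it against the page URL. A sketch, not a general HTML parser.
function faviconFromLine(line, pageUrl) {
  // same quick (case-sensitive) filters as the PHP loop
  if (line.indexOf('<link') === -1 || line.indexOf('icon') === -1) {
    return null;
  }
  // same pattern as in the PHP version; group 1 is the href value
  const match = /href=["']([^"']*)["']/i.exec(line);
  if (!match) {
    return null;
  }
  // new URL(relative, base) resolves relative and root-relative paths
  return new URL(match[1], pageUrl).href;
}

const page = 'http://osebje.famnit.upr.si/~mkljun/';
console.log(faviconFromLine('<link rel="shortcut icon" href="favicon.ico">', page));
console.log(faviconFromLine('<link rel="stylesheet" href="style.css">', page));
```

Resolving against the page URL matters for the personal-page case mentioned earlier: a bare favicon.ico found on http://osebje.famnit.upr.si/~mkljun/ should not end up pointing at the server root.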

Here are some measured times of running these scripts:

mkljun@pim:~$ time php getFavicon.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m26.881s
user    0m0.048s
sys    0m0.036s
mkljun@pim:~$ time php getFavicon.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m21.531s
user    0m0.052s
sys    0m0.028s
mkljun@pim:~$ time php getFavicon.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m31.562s
user    0m0.052s
sys    0m0.024s
mkljun@pim:~$ time php getFavicon2.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m26.080s
user    0m0.044s
sys    0m0.008s
mkljun@pim:~$ time php getFavicon2.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m25.918s
user    0m0.024s
sys    0m0.028s
mkljun@pim:~$ time php getFavicon2.php 
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m25.984s
user    0m0.032s
sys    0m0.020s
mkljun@pim:~$ time php getFavicon3.php 
            [0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m20.954s
user    0m0.028s
sys    0m0.024s
mkljun@pim:~$ time php getFavicon3.php 
            [0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m26.077s
user    0m0.032s
sys    0m0.020s
mkljun@pim:~$ time php getFavicon3.php 
            [0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real    0m20.884s
user    0m0.028s
sys    0m0.028s


Monday, March 12. 2012

Javascript (Mootools) random date generator

Programming

I needed a generator of random dates between two given dates. This is easy with Mootools. Both dates are first converted to the unix timestamp format (the number of seconds since 1.1.1970), then a random number between the two unix times is generated and converted back to a date.



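function randomDate(date1, date2) {
   // convert both dates to unix timestamps (seconds)
   var minD = new Date().parse(date1).format('%s');
   var maxD = new Date().parse(date2).format('%s');
   var random = Number.random(parseInt(minD, 10), parseInt(maxD, 10));
   // append "000" to get milliseconds, then format the date
   var randomDate = new Date().parse(random+"000").format('db');
   // return the result so the caller can use it
   return randomDate;
}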

It can be called like this:

var randomDateTmp = randomDate('1999-06-08 16:34:52', new Date()); 

Set the format according to http://mootools.net/docs/more/Types/Date#Date:format

This can be easily accomplished in other programming languages.
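
For example, a minimal sketch in plain JavaScript without Mootools could look like the following. It assumes the dates are either Date objects or strings that the built-in Date constructor understands (the ISO form with a "T" is used here); randomDateBetween and its optional rng parameter are names invented for this sketch.

```javascript
// Hypothetical helper, not part of Mootools: returns a random Date
// between from and to (both inclusive, millisecond resolution).
function randomDateBetween(from, to, rng) {
  // rng is an assumed optional parameter so the random source can be swapped in tests
  const random = rng || Math.random;
  let min = new Date(from).getTime(); // milliseconds since 1970-01-01 UTC
  let max = new Date(to).getTime();
  if (isNaN(min) || isNaN(max)) {
    throw new Error('Invalid date');
  }
  if (min > max) {
    // accept the two dates in either order
    const tmp = min;
    min = max;
    max = tmp;
  }
  // random() is in [0, 1), so the offset is an integer in [0, max - min]
  return new Date(min + Math.floor(random() * (max - min + 1)));
}

const sample = randomDateBetween('1999-06-08T16:34:52', new Date());
console.log(sample.toISOString());
```

Unlike the Mootools version, this works on milliseconds directly, so no conversion to seconds and back is needed.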