Sunday, April 15. 2012
Bash - looping or capturing a multiple line command output
I tried to loop over the output of a command that returns a multi-line result in a bash script. The problem was that the quoted command substitution was handed to the loop as ONE item instead of one item per line, so I ended up with a single line holding the whole multi-line result, although I expected the quotes to preserve the line structure. Without quotes I got each "word" on a new line, although I expected everything on one line (an unquoted substitution is split on blanks, tabs and newlines). The reason is that for iterates over words, not lines: a quoted substitution is a single word, an unquoted one is split into many. Here is the "faulty" script:
#for line in `ls -l`
#for line in $(ls -l)
for line in "$(ls -l)"
do
  echo $line"XXX"
done
This is ordinary shell word splitting rather than something specific to the OS X (or BSD) environment. I got around it by piping the command into a while loop, which reads the output line by line:
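# IFS= and -r keep leading whitespace and backslashes intact;
# quoting the variable preserves the spacing inside each line
ls -l | while IFS= read -r line
do
  echo "${line}xxx"
done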
or
# process substitution needs bash (not plain sh); the loop runs in the current shell
while IFS= read -r line; do
  echo "${line}xxx"
done < <(ls -l)
Tuesday, March 13. 2012
PHP - get favicon from a URL
I recently commented on Alan's blog about getting a favicon from a URL. The simplest way is to take the domain of the URL and append "favicon.ico" to it. Problems arise if:
- the favicon is not in the root of the host
- it has an uncommon name
- it is not in the MS ICO format (PNG is very common nowadays).
Finding the <link rel="icon"> or <link rel="shortcut icon"> in the DOM of an external URL is hardly possible in JavaScript running in the browser because of the same-origin policy (look at XMLHttpRequest for the possibilities). One way of doing it is to access external URLs through a proxy.
The simpler way is to use a server-side scripting language such as PHP and a DOM (HTML) parser. I just finished the script and it worked on the few URLs I tried (keeping in mind that it could be improved). I deliberately look at the DOM of the document first (see the first bullet above) because, for example, a personal web page such as http://osebje.famnit.upr.si/~mkljun/ might have a different favicon than the main http://osebje.famnit.upr.si/ server page. Here's the code:
<?php
function getFavicon ($url) {
    $file_headers = @get_headers($url);
    $found = FALSE;
    // 1. CHECK THE DOM FOR THE <link> TAG
    // check if the url exists - the request succeeded and the status is not 404
    // (get_headers() returns FALSE when the request itself fails)
    if ($file_headers && strpos($file_headers[0], ' 404') === FALSE) {
        $dom = new DOMDocument();
        $dom->strictErrorChecking = FALSE;
        // @ to discard all the warnings of malformed htmls;
        // loadHTMLFile() returns FALSE when the document cannot be loaded
        if (!@$dom->loadHTMLFile($url)) {
            $error[] = 'Error parsing the DOM of the file';
        } else {
            $domxml = simplexml_import_dom($dom);
            //check for the historical rel="shortcut icon"
            if ($domxml->xpath('//link[@rel="shortcut icon"]')) {
                $path = $domxml->xpath('//link[@rel="shortcut icon"]');
                // cast the attribute to a plain string
                $faviconURL = (string) $path[0]['href'];
                $found = TRUE;
                return $faviconURL;
            //check for the HTML5 rel="icon"
            } else if ($domxml->xpath('//link[@rel="icon"]')) {
                $path = $domxml->xpath('//link[@rel="icon"]');
                $faviconURL = (string) $path[0]['href'];
                $found = TRUE;
                return $faviconURL;
            } else {
                $error[] = "The URL does not contain a favicon <link> tag.";
            }
        }
        // 2. CHECK DIRECTLY FOR favicon.ico OR favicon.png FILE
        // the two seem to be most common
        if ($found == FALSE) {
            $parse = parse_url($url);
            $favicon_headers = @get_headers("http://".$parse['host']."/favicon.ico");
            if ($favicon_headers && strpos($favicon_headers[0], ' 404') === FALSE) {
                $faviconURL = "/favicon.ico";
                $found = TRUE;
                return $faviconURL;
            }
            $favicon_headers = @get_headers("http://".$parse['host']."/favicon.png");
            if ($favicon_headers && strpos($favicon_headers[0], ' 404') === FALSE) {
                $faviconURL = "/favicon.png";
                $found = TRUE;
                return $faviconURL;
            }
            if ($found == FALSE) {
                $error[] = "Files favicon.ico and .png do not exist on the server's root.";
            }
        }
    // if the URL does not exist ...
    } else {
        $error[] = "URL does not exist";
    }
    if ($found == FALSE && isset($error)) {
        return $error;
    }
}
// URL in one line
$tempurl = 'http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454';
$result = getFavicon ($tempurl);
// on failure $result is an array of error messages
echo is_array($result) ? implode("\n", $result) : $result;
?>
However, the script is very slow, and parsing badly structured DOMs returns a bucketful of warnings, hence the @ before $dom->loadHTMLfile($url).
Although the slowness of the script can be attributed to waiting for the server to respond, I wondered if the computing times could be improved (see the measured times below).
Another way of finding the appropriate <link> tag is to read the file line by line (assuming the link tag is on one line). I know, I know ... but the <link rel="icon"> is near the beginning of the file and we can exit the loop as soon as we find it. Here's the solution echoing the result (note that this is just the changed if statement from the above function):
//check if the url exists
if ($file_headers && strpos($file_headers[0], ' 404') === FALSE) {
    //open the pointer to the file
    $handle = @fopen($url, "r");
    //while the file could be opened and is not at end of file
    while ($handle && !feof($handle)) {
        //read next line
        $buffer = fgets($handle, 4096);
        if (strstr($buffer, '<link')) {
            if (strstr($buffer, 'icon')) {
                $doc = new DOMDocument();
                //@ to discard warnings about malformed markup
                @$doc->loadHTML('<html><head>'.$buffer.'</head><body></body></html>');
                $domxml = simplexml_import_dom($doc);
                $path = $domxml->xpath('//link');
                $faviconURL = (string) $path[0]['href'];
                $found = TRUE;
                echo $faviconURL;
                //exit the loop
                break;
            }
        }
    }
    //release the file pointer
    if ($handle) {
        fclose($handle);
    }
}
This version was a bit faster (see the user and system times below). I also thought: why not give regular expressions a try? I know, I know ... regular expressions are not meant to parse HTML. But as we know what we are looking for ...
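//check if the url exists
if ($file_headers && strpos($file_headers[0], ' 404') === FALSE) {
    $handle = @fopen($url, "r");
    //loop only if the file could be opened
    while ($handle && !feof($handle)) {
        $buffer = fgets($handle, 4096);
        if (strstr($buffer, '<link')) {
            if (strstr($buffer, 'icon')) {
                //$array[1] holds the captured href values
                preg_match_all('/href=["\']([^"\']*)["\']/i', $buffer, $array);
                //print_r() prints by itself; echoing its return value would append a stray "1"
                print_r($array);
                break;
            }
        }
    }
}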
The third solution is comparable to the second. However, the response time from the server was quicker?!? Albeit still slow ... maybe I'm missing something ... but I have no time at the moment ... Also, the number of times I ran each script (around 20) is too low to draw any conclusions.
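To see what the pattern above actually captures, the same expression can be tried outside PHP. The following is a minimal JavaScript sketch, not part of the original script: the extractHref helper and the sample lines are made up for illustration, and it only mirrors the two substring checks and the regular expression from the third version.

```javascript
// Hypothetical helper (an assumption for illustration): returns the first href
// value found in a line that looks like a favicon <link> tag, or null otherwise.
function extractHref(line) {
  // same pre-filter as the PHP version: the line must contain both substrings
  if (!line.includes('<link') || !line.includes('icon')) {
    return null;
  }
  // same pattern as in the preg_match_all() call; group 1 is the href value
  const match = /href=["']([^"']*)["']/i.exec(line);
  return match ? match[1] : null;
}

// made-up sample lines
console.log(extractHref('<link rel="shortcut icon" href="/img/favicon.ico">'));
console.log(extractHref("<link rel='icon' type='image/png' href='favicon.png'>"));
console.log(extractHref('<link rel="stylesheet" href="style.css">'));
```

Note that the 'icon' substring test is loose: a line such as <link rel="stylesheet" href="icons.css"> would also pass it, which is one more reason to treat the regular expression version as a shortcut rather than a parser.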
Here are some measured times of running these scripts:
mkljun@pim:~$ time php getFavicon.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m26.881s
user 0m0.048s
sys 0m0.036s
mkljun@pim:~$ time php getFavicon.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m21.531s
user 0m0.052s
sys 0m0.028s
mkljun@pim:~$ time php getFavicon.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m31.562s
user 0m0.052s
sys 0m0.024s
mkljun@pim:~$ time php getFavicon2.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m26.080s
user 0m0.044s
sys 0m0.008s
mkljun@pim:~$ time php getFavicon2.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m25.918s
user 0m0.024s
sys 0m0.028s
mkljun@pim:~$ time php getFavicon2.php
http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m25.984s
user 0m0.032s
sys 0m0.020s
mkljun@pim:~$ time php getFavicon3.php
[0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m20.954s
user 0m0.028s
sys 0m0.024s
mkljun@pim:~$ time php getFavicon3.php
[0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m26.077s
user 0m0.032s
sys 0m0.020s
mkljun@pim:~$ time php getFavicon3.php
[0] => http://cdn.sstatic.net/stackoverflow/img/favicon.ico
real 0m20.884s
user 0m0.028s
sys 0m0.028s
Monday, March 12. 2012
Javascript (Mootools) random date generator
I needed a random date generator between two dates, which is easy with Mootools. Both dates are first converted to Unix timestamps (number of seconds since 1.1.1970), then a random number between the two timestamps is generated and converted back to a date.
function randomDate(date1, date2) {
    // '%s' formats a date as a Unix timestamp in seconds
    var minD = new Date().parse(date1).format('%s');
    var maxD = new Date().parse(date2).format('%s');
    var random = Number.random(parseInt(minD, 10), parseInt(maxD, 10));
    // append "000" to get milliseconds, then return the formatted date
    return new Date().parse(random + "000").format('db');
}
It can be called like this:
var randomDateTmp = randomDate('1999-06-08 16:34:52', new Date());
Set the format according to http://mootools.net/docs/more/Types/Date#Date:format
This can be easily accomplished in other programming languages.
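As a rough sketch of the same idea without Mootools, here is a version that relies only on the built-in Date object. The name randomDateBetween and the optional rng parameter are assumptions added for illustration (the latter makes the function easy to test); the function returns a Date object, so formatting is left to the caller.

```javascript
// Hypothetical helper (not a Mootools function): returns a random Date
// between two dates, both ends included.
// rng is an assumption added for testability; it must return a number in [0, 1).
function randomDateBetween(date1, date2, rng = Math.random) {
  // getTime() gives milliseconds since 1.1.1970, so no "000" suffix is needed
  const t1 = new Date(date1).getTime();
  const t2 = new Date(date2).getTime();
  if (Number.isNaN(t1) || Number.isNaN(t2)) {
    throw new RangeError('both arguments must be valid dates');
  }
  // accept the two dates in either order
  const min = Math.min(t1, t2);
  const max = Math.max(t1, t2);
  return new Date(min + Math.floor(rng() * (max - min + 1)));
}

// an ISO 8601 string without an offset is read as local time
const randomDateTmp = randomDateBetween('1999-06-08T16:34:52', new Date());
console.log(randomDateTmp.toISOString());
```

The date string here uses the ISO form with a "T" because parsing of other formats by the built-in Date is implementation-dependent.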

