April 30th, 2008 – See Popular Posts for a sample iPhone web page that uses the web server log scanning described here to create a clickable top ten list of the most popular weblog titles.
***
In an earlier post, Log Parsing II, I described scanning the Apache access log with Perl to build an HTML file containing Google searches of your site. Here’s a link to the complete example script:
Example Script – Google Searches
The example script runs on a Mac or Windows PC. It downloads the latest access logs to a local folder containing an archive of previously downloaded logs, then scans every log in that folder. References to Google searches are written to an output HTML file. When the local archive folder has been completely scanned, the output file is uploaded to the service provider’s host. The script should be run once per day: many service providers keep only the current access log plus one or two rotated logs, so older entries are lost if they are not downloaded in time.
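
The step that recognizes a Google search in a log entry can be sketched roughly as follows. This is not the code from the example script; google_query() is a hypothetical helper, and it assumes the search terms arrive in the q parameter of a Google referrer URL.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from the example script): return the
# search terms from a Google referrer URL, or nothing if the referrer
# is not a Google search.
sub google_query {
    my ($referrer) = @_;
    return unless defined $referrer;

    # Only consider referrers whose host is a Google domain.
    return unless $referrer =~ m{^https?://(?:www\.)?google\.[a-z.]+/}i;

    # Assumption: the search terms travel in the "q" query parameter.
    return unless $referrer =~ /[?&]q=([^&#]*)/;
    my $terms = $1;

    # Undo URL encoding: "+" means space, %XX is a hex-encoded byte.
    $terms =~ tr/+/ /;
    $terms =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

    return $terms;
}

# Made-up referrer for illustration.
print google_query('http://www.google.com/search?hl=en&q=perl+log+parsing'), "\n";
```

A scanning loop would call a helper like this on the referrer field of each log line and write any terms it returns to the output file.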
***
Before running the script, change the hostname, login name, and password as needed. Also rename the script from googleSearches_pl.txt to googleSearches.pl. If you don’t have Perl, you can download it free from ActiveState; Mac and Windows versions are available. With Perl installed locally, the command to run the log scanning script is perl googleSearches.pl.
***
Output from the script should look something like [this]. The script contains a function named htmlBegin(). Use this function to set “page title”, “body title”, and “banner image” to any values desired.
***
Improving the Script
The obvious place to improve the script is to replace the multi-line regex with a single-line regex for parsing the Apache log. A web page named A Simple Apache Log Parser contains an example of a single-line regex for parsing Apache log lines, which looks promising. More to come...
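
As a rough idea of what a single-line regex could look like, here is a minimal sketch. It is not the regex from the page mentioned above, and it assumes the server writes Apache’s “combined” log format. The names $LOG_LINE and parse_log_line() are hypothetical, the sample line is made up, and quoted fields containing escaped double quotes are not handled.

```perl
use strict;
use warnings;

# One regex for the Apache "combined" log format (an assumption):
# host ident user [time] "request" status bytes "referer" "user-agent"
my $LOG_LINE = qr/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/;

# Hypothetical helper: split one log line into named fields, or
# return nothing if the line does not match the format.
sub parse_log_line {
    my ($line) = @_;
    chomp $line;
    my @fields = $line =~ $LOG_LINE or return;
    my %entry;
    @entry{qw(host ident user time request status bytes referer agent)} = @fields;
    return \%entry;
}

# Made-up sample line for illustration.
my $sample = '192.0.2.1 - - [30/Apr/2008:10:15:32 -0700] "GET /blog/log-parsing HTTP/1.1" 200 5120 "http://www.google.com/search?q=perl+log+parsing" "Mozilla/5.0"';
if (my $entry = parse_log_line($sample)) {
    print "$entry->{status} $entry->{request}\n";
    print "referer: $entry->{referer}\n";
}
```

Capturing every field in one pass means the referrer and the request path are available by name, instead of being picked out by separate matches.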
***
About the Script
Reducing Noise – Access logs contain lots of “noise” that’s created when a web site is accessed by robots, or when a single page view triggers additional requests for images or JavaScript include files. Reducing noise provides a more realistic view of visits to the site.
***
Weblog noise reduction can be done by searching for and rejecting lines in the log that contain words or strings we don’t really care about. How do you determine which words or strings indicate a line should be rejected? The best way may be to just look through the unfiltered access log. Simply eyeballing the unfiltered log will reveal plenty of lines that can be classified as noise.
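
If the log is too long to eyeball comfortably, a quick tally of requested paths can point at likely noise. The sketch below is only an aid and is not part of the example script. tally_requests() and print_tally() are hypothetical names, the sample lines are made up, and the request is assumed to be the first quoted field on each line.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each requested path appears
# in a list of access log lines.
sub tally_requests {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        # Assumption: the request is quoted as "METHOD path PROTOCOL".
        next unless $line =~ /"[A-Z]+ (\S+)[^"]*"/;
        $count{$1}++;
    }
    return \%count;
}

# Print the paths, most frequent first (ties in alphabetical order).
sub print_tally {
    my ($count) = @_;
    for my $path (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
        printf "%6d  %s\n", $count->{$path}, $path;
    }
}

# Made-up sample lines for illustration.
my @sample = (
    '192.0.2.1 - - [30/Apr/2008:10:15:32 -0700] "GET /favicon.ico HTTP/1.1" 200 318',
    '192.0.2.1 - - [30/Apr/2008:10:15:33 -0700] "GET /blog/post HTTP/1.1" 200 5120',
    '192.0.2.7 - - [30/Apr/2008:10:16:01 -0700] "GET /favicon.ico HTTP/1.1" 200 318',
);
print_tally(tally_requests(@sample));
```

Paths that show up far more often than real pages, such as icons, stylesheets, and theme images, are good candidates for the discard list.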
***
Once noise lines are identified, choose words or strings that occur only in lines of noise and load them into an array or hash. Then build a function around it. The sample function below is from iPhone Cafe. Character strings are pushed onto the @discard array, and each one is used as a case-insensitive regex against the input line, which is why dots and slashes in the strings are escaped. If any string matches, the function returns true, and the calling script skips the current line and reads another line from the access log.
#-----------------------------------------------------------------
# Function: discard
# Purpose: Return true if the input line contains a character
#          string indicating we don't care about the line of text.
#-----------------------------------------------------------------
sub discard
{
    my $lineOfText = shift;
    my $weDontCare = 0;   # initialize return variable
    my @discard;          # initialize array to hold discard strings

    # Load the discard array. Each entry is used as a regex,
    # so literal dots and slashes are escaped.
    push(@discard, "ocadia");
    push(@discard, "themes");
    push(@discard, '24\.18\.');
    push(@discard, '76\.114\.206');
    push(@discard, '\.css');
    push(@discard, '\/js');
    push(@discard, '\.png');
    push(@discard, 'favicon\.ico');
    push(@discard, '\/image\/background');
    push(@discard, 'wp-admin\/images');

    # Compare the input line to the list of discard strings
    foreach my $discardString (@discard) {
        if ($lineOfText =~ /$discardString/i) {
            $weDontCare = 1;   # this line contains a string that eliminates it
            last;              # one match is enough, stop checking
        }
    }
    return $weDontCare;
}
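
For completeness, here is one way the calling side might look. This is a sketch, not the code from the example script: is_noise() and kept_lines() are hypothetical names, and the discard strings are joined into a single regex that is compiled once instead of being rebuilt on every call.

```perl
use strict;
use warnings;

# Same discard strings as in the function above, joined into one
# alternation and compiled once.
my @discard = (
    'ocadia', 'themes', '24\.18\.', '76\.114\.206', '\.css', '\/js',
    '\.png', 'favicon\.ico', '\/image\/background', 'wp-admin\/images',
);
my $discard_re = do {
    my $alternation = join '|', @discard;
    qr/$alternation/i;
};

# Hypothetical variant of discard(): true if the line is noise.
sub is_noise {
    my ($lineOfText) = @_;
    return $lineOfText =~ $discard_re ? 1 : 0;
}

# Hypothetical calling loop: read lines from an open filehandle and
# keep only the ones that are not noise.
sub kept_lines {
    my ($fh) = @_;
    my @kept;
    while (my $line = <$fh>) {
        next if is_noise($line);   # skip noise and read the next line
        push @kept, $line;
    }
    return @kept;
}
```

With a log opened as `open my $fh, '<', $logFile`, where $logFile is a placeholder for one of the downloaded logs, `kept_lines($fh)` returns only the lines worth scanning for Google searches.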