Extract search strings from Apache log files

  • Tip Category - Searching
  • Tip Added By - MartinKaufmann - 17 Jun 2010 - 13:41
  • Extensions Used - None
  • Useful To - Beginners
  • Tip Status - New
  • Related Topics -

Problem

Knowing what search strings are used to search for content can help (a) to understand what information is missing from your knowledge management system, or (b) to instruct your users on how to use the search.

Context

Usability analysis

Solution

The following description assumes Apache running on Linux. The search strings are stored in the Apache access log and can be retrieved with the following script:
#!/bin/bash
OUTPUT=search_strings.txt
# delete a previous output file, if any
if [ -e "$OUTPUT" ]; then rm "$OUTPUT"; fi
for i in access.log*
do
        # use zcat for rotated logs with a "gz" extension, plain cat otherwise
        if [ "${i##*.}" = "gz" ]
        then
                command=zcat
        else
                command=cat
        fi
        # take the 7th column (the request path after GET), keep only lines containing
        # "WebSearch?search=", strip everything before and after the actual search string
        # (1st sed), urldecode it (echo -e plus 2nd sed) and append it to the output file
        $command "$i" | awk '{print $7}' | grep "WebSearch?search=" | sed 's/.*search=//g; s/&.*//g' | echo -e "$(sed 's/+/ /g; s/%/\\x/g')" >> "$OUTPUT"
done
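As a side note, the URL decoding at the end of the pipeline works by turning + into spaces and every %XX escape into a \xXX sequence, which echo -e then expands. A minimal sketch of that step on its own (the sample string is made up):
# hypothetical input, just to illustrate the decoding trick
echo -e "$(echo 'knowledge+management%20system' | sed 's/+/ /g; s/%/\\x/g')"
# prints: knowledge management system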
Copy the script above to a file in the folder where the Apache log files are stored. Running it will generate a file named search_strings.txt which contains all the search strings. To get some statistics, run the following command:
awk -F '\n' '{count[$1]++} END {for (w in count) print count[w] ": " w}' search_strings.txt | sort -nr > search_strings_sorted.txt
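An alternative that should give essentially the same counts (only the formatting of the count column differs) is the classic sort/uniq combination:
# same idea with sort and uniq; output lines look like "      4 search term"
sort search_strings.txt | uniq -c | sort -nr > search_strings_sorted.txt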

-- 17 Jun 2010 - 13:41:56 - MartinKaufmann

OlivierRaginel wrote a Perl script that does more or less the same:
#!/usr/bin/env perl
use strict;
use warnings;

# count search strings per web: $h{web}{search string} = number of hits
my %h;
while ( my $f = shift ) {
    my $fh;
    # read rotated, gzip-compressed logs through gzip -dc, plain logs directly
    if ( $f =~ /\.gz$/ ) {
        open( $fh, "-|", "gzip -dc $f" ) or die "Cannot gzip -dc $f: $!";
    }
    else {
        open( $fh, "<", $f ) or die "Cannot open $f: $!";
    }
    while (<$fh>) {
        # skip everything except GET requests for .../WebSearch?search=...
        next
          unless m#^(?:\S+ ){5}"GET /(\S+)/WebSearch\?search=([^&]*)\S* HTTP\S+#;
        my ( $web, $string ) = ( $1, $2 );
        # urldecode the search string
        $string =~ s/%([0-9A-F]{2})/chr(hex($1))/gei;
        $string =~ s/\+/ /go;
        $h{$web}{$string}++;
    }
    close $fh;
}

# print one line per web, search strings sorted by descending hit count
for my $web ( sort keys %h ) {
    print "$web - ",
      join( ", ",
        map    { "$_: $h{$web}{$_}" }
          sort { $h{$web}{$b} <=> $h{$web}{$a} } keys %{ $h{$web} } ),
      "\n";
}

This version differentiates between webs, or at least it shows from which web the search was made. I had written another part to extract which webs were searched, but I'm not sure it's useful. Just run the script, passing it all the files you want as arguments, and it will output something like:
Web1 - searchterm1: 4, search term 2: 3
Web3 - "Another search term": 1
The output format is defined in the print statement at the end, so you can tune it to fit your needs.
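For reference, a typical invocation might look like this (the script file name is made up):
# save the script under any name, e.g. extract_searches.pl, and pass it the logs to analyse
perl extract_searches.pl access.log access.log.1 access.log.2.gz > search_strings_by_web.txt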

-- 17 Jun 2010 - 16:25:21 - OlivierRaginel

Known Uses

Known Limitations

The first script does not differentiate between webs (if the search is limited to an individual web), nor does it distinguish between successful and unsuccessful searches.
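If the web name is of interest but switching to the Perl version is not an option, one possible (untested) tweak is to keep the path segment in front of WebSearch in the first script's pipeline, assuming search URLs have the form .../<Web>/WebSearch?search=...:
# hypothetical variation of the pipeline inside the loop: prefix each
# (still URL-encoded) search string with the web it was run from
$command "$i" | awk '{print $7}' | grep "WebSearch?search=" | sed 's#.*/\([^/]*\)/WebSearch?search=#\1 #; s/&.*//' >> "$OUTPUT"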

See Also
