There are times when I need to analyse Apache's access logs, some of which can be quite large. Here I'll compare a straightforward single-threaded pipeline with a parallel divide-and-conquer approach to see which is faster.
Single thread, one-liner
$ time cat access-combined.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -10
 100395 128.199.111.237
  14677 69.162.124.229
   4054 46.118.124.104
    555 2600:3c03::f03c:91ff:fe93:baf8
    218 176.36.80.39
     76 107.21.1.8
     69 192.99.54.14
     64 208.115.113.85
     63 46.161.9.17
     59 178.137.90.202

real	0m0.227s
user	0m0.225s
sys	0m0.036s
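As an aside, the first `sort` in the pipeline can be avoided by tallying IPs in an awk associative array, so only the (much smaller) set of distinct IPs needs sorting. A minimal sketch, using a hypothetical `sample.log` in Apache's combined format:

```shell
# Hypothetical sample log; the only assumption is that the first
# whitespace-separated field is the client IP, as in Apache's formats.
printf '%s\n' \
  '10.0.0.1 - - [10/Oct/2024] "GET / HTTP/1.1" 200 123' \
  '10.0.0.1 - - [10/Oct/2024] "GET /a HTTP/1.1" 200 45' \
  '10.0.0.2 - - [10/Oct/2024] "GET / HTTP/1.1" 200 123' > sample.log

# Single pass: count each IP in an awk array, then sort only the
# distinct IPs by count.
awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' sample.log \
  | sort -nr | head -10
```

On large logs this saves sorting every single line just to feed `uniq -c`.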
Splitting the log file into pieces and processing them in parallel
$ time ./analyze-log.sh access-combined.log 3
100395|128.199.111.237
14677|69.162.124.229
4054|46.118.124.104
555|2600:3c03::f03c:91ff:fe93:baf8
218|176.36.80.39
76|107.21.1.8
69|192.99.54.14
64|208.115.113.85
63|46.161.9.17
59|178.137.90.202

real	0m0.479s
user	0m0.440s
sys	0m0.303s
Divide and conquer is actually slower here because the input file is relatively small: the overhead of splitting the file and consolidating the results through a database outweighs whatever is gained from parallelism.
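Part of that overhead is the database round-trip. The per-piece counts could instead be merged with a second awk pass that sums the first column per IP. A sketch with two hypothetical chunk-count files (same "count ip" shape the split pieces produce):

```shell
# Hypothetical per-chunk outputs, each already in "count ip" form.
printf '5 10.0.0.1\n2 10.0.0.2\n' > chunk1.tmp
printf '3 10.0.0.1\n1 10.0.0.3\n' > chunk2.tmp

# Merge without a database: sum column 1 keyed by the IP in column 2.
awk '{sum[$2] += $1} END {for (ip in sum) print sum[ip], ip}' chunk1.tmp chunk2.tmp \
  | sort -nr | head -10
```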
Script to split and process in parallel
#!/usr/bin/env bash

# Count requests per source IP in one piece of the log.
function topSources() {
    awk '{print $1}' "$1" | sort | uniq -c | sort -nr -k1
}
export -f topSources

LOGFILE=$1
THREADS=$2

# Split the log into roughly equal pieces, one per worker.
LINES=$(wc -l < "$LOGFILE")
SPLITCOUNT=$((LINES / THREADS))
split -l "$SPLITCOUNT" "$LOGFILE" SPLIT-

# Process each piece in parallel.
ls -1 SPLIT* | xargs -n1 -P"$THREADS" bash -c 'topSources "$@" > "$1.tmp"' _

# Consolidate the per-piece counts with SQLite.
rm -f result.db
sqlite3 result.db "create table summary(counts integer, ip varchar(20))"

ls -1 *.tmp | while read -r file; do
    sed -i -e 's/^[ ]*//' "$file"
    sqlite3 result.db <<EOF
.separator " "
.import $file summary
EOF
done

sqlite3 result.db "select sum(counts), ip from summary group by ip order by 1 desc limit 10"

rm -f *.tmp SPLIT*
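To see the consolidation step in isolation, here is a standalone sketch of the SQLite import and aggregation (assuming the `sqlite3` CLI is installed; the file names are made up for the demo):

```shell
# Hypothetical per-chunk counts in "count ip" form, space-separated.
printf '5 10.0.0.1\n3 10.0.0.1\n2 10.0.0.2\n' > demo.tmp

rm -f demo.db
sqlite3 demo.db "create table summary(counts integer, ip varchar(20))"

# .separator tells .import how to split each line into columns.
sqlite3 demo.db <<EOF
.separator " "
.import demo.tmp summary
EOF

# Sum the counts per IP across all imported chunks.
sqlite3 demo.db "select sum(counts), ip from summary group by ip order by 1 desc"
```

The `group by ip` is what stitches the per-piece partial counts back into a global top-ten.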