There are times when I need to analyse Apache's access logs, some of which can be quite large. Here I'll compare a straightforward single-threaded pipeline with a parallel divide-and-conquer approach to see which is faster.
Single thread, one-liner
$ time cat access-combined.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -10
 100395 128.199.111.237
  14677 69.162.124.229
   4054 46.118.124.104
    555 2600:3c03::f03c:91ff:fe93:baf8
    218 176.36.80.39
     76 107.21.1.8
     69 192.99.54.14
     64 208.115.113.85
     63 46.161.9.17
     59 178.137.90.202

real	0m0.227s
user	0m0.225s
sys	0m0.036s
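As an aside, the first `sort` in the pipeline can be avoided by tallying IPs in an awk associative array, so only the (much smaller) set of distinct IPs needs sorting. A minimal sketch, using a hypothetical `sample.log` in Apache's combined format:

```shell
# Hypothetical sample log; the only assumption is that the first
# whitespace-separated field is the client IP, as in Apache's formats.
printf '%s\n' \
  '10.0.0.1 - - [10/Oct/2024] "GET / HTTP/1.1" 200 123' \
  '10.0.0.1 - - [10/Oct/2024] "GET /a HTTP/1.1" 200 45' \
  '10.0.0.2 - - [10/Oct/2024] "GET / HTTP/1.1" 200 123' > sample.log

# Single pass: count each IP in an awk array, then sort only the
# distinct IPs by count.
awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' sample.log \
  | sort -nr | head -10
```

On large logs this saves sorting every single line just to feed `uniq -c`.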
Splitting the log file into pieces and processing them in parallel
$ time ./analyze-log.sh access-combined.log 3
100395|128.199.111.237
14677|69.162.124.229
4054|46.118.124.104
555|2600:3c03::f03c:91ff:fe93:baf8
218|176.36.80.39
76|107.21.1.8
69|192.99.54.14
64|208.115.113.85
63|46.161.9.17
59|178.137.90.202

real	0m0.479s
user	0m0.440s
sys	0m0.303s
Divide and conquer is actually slower here because the input file is relatively small: the overhead of splitting the file and consolidating the results through a database outweighs whatever is gained from parallelism.
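Part of that overhead is the database round-trip. The per-piece counts could instead be merged with a second awk pass that sums the first column per IP. A sketch with two hypothetical chunk-count files (same "count ip" shape the split pieces produce):

```shell
# Hypothetical per-chunk outputs, each already in "count ip" form.
printf '5 10.0.0.1\n2 10.0.0.2\n' > chunk1.tmp
printf '3 10.0.0.1\n1 10.0.0.3\n' > chunk2.tmp

# Merge without a database: sum column 1 keyed by the IP in column 2.
awk '{sum[$2] += $1} END {for (ip in sum) print sum[ip], ip}' chunk1.tmp chunk2.tmp \
  | sort -nr | head -10
```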
Script to split and process in parallel
#!/usr/bin/env bash

# Count requests per source IP in one piece of the log.
function topSources() {
    awk '{print $1}' "$1" | sort | uniq -c | sort -nr -k1
}
export -f topSources

LOGFILE=$1
THREADS=$2

# Split the log into roughly equal pieces, one per worker.
LINES=$(wc -l < "$LOGFILE")
SPLITCOUNT=$((LINES / THREADS))
split -l "$SPLITCOUNT" "$LOGFILE" SPLIT-

# Process each piece in parallel.
ls -1 SPLIT* | xargs -n1 -P"$THREADS" bash -c 'topSources "$@" > "$1.tmp"' _

# Consolidate the per-piece counts with SQLite.
rm -f result.db
sqlite3 result.db "create table summary(counts integer, ip varchar(20))"

ls -1 *.tmp | while read -r file; do
    sed -i -e 's/^[ ]*//' "$file"
    sqlite3 result.db <<EOF
.separator " "
.import $file summary
EOF
done

sqlite3 result.db "select sum(counts), ip from summary group by ip order by 1 desc limit 10"

rm -f *.tmp SPLIT*
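To see the consolidation step in isolation, here is a standalone sketch of the SQLite import and aggregation (assuming the `sqlite3` CLI is installed; the file names are made up for the demo):

```shell
# Hypothetical per-chunk counts in "count ip" form, space-separated.
printf '5 10.0.0.1\n3 10.0.0.1\n2 10.0.0.2\n' > demo.tmp

rm -f demo.db
sqlite3 demo.db "create table summary(counts integer, ip varchar(20))"

# .separator tells .import how to split each line into columns.
sqlite3 demo.db <<EOF
.separator " "
.import demo.tmp summary
EOF

# Sum the counts per IP across all imported chunks.
sqlite3 demo.db "select sum(counts), ip from summary group by ip order by 1 desc"
```

The `group by ip` is what stitches the per-piece partial counts back into a global top-ten.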