blog.headdesk.me

Divide and conquer

Posted on 2016/04/28, updated 2016/08/24

There are times when I need to analyse Apache’s logs, and some of them can be quite large. Here, I’ll compare two ways of listing the top client IPs and see whether splitting the work makes it any faster.

Single thread, one-liner

$ time cat access-combined.log  | awk '{print $1}' | sort | uniq -c | sort -nr | head -10
100395 128.199.111.237
14677 69.162.124.229
4054 46.118.124.104
 555 2600:3c03::f03c:91ff:fe93:baf8
 218 176.36.80.39
  76 107.21.1.8
  69 192.99.54.14
  64 208.115.113.85
  63 46.161.9.17
  59 178.137.90.202

real	0m0.227s
user	0m0.225s
sys	0m0.036s

Splitting the log file into pieces and processing them in parallel

$ time ./analyze-log.sh access-combined.log 3
100395|128.199.111.237
14677|69.162.124.229
4054|46.118.124.104
555|2600:3c03::f03c:91ff:fe93:baf8
218|176.36.80.39
76|107.21.1.8
69|192.99.54.14
64|208.115.113.85
63|46.161.9.17
59|178.137.90.202

real	0m0.479s
user	0m0.440s
sys	0m0.303s

Divide and conquer is actually slower here because the input file is relatively small: the elapsed time roughly doubles (0.479s versus 0.227s). Splitting the file and consolidating the results through a database both add overhead.
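One way to see how much of that overhead comes from the database would be to merge the per-piece counts without it. Below is a minimal sketch, assuming each piece has already been reduced to `uniq -c` style "count ip" lines. `merge_counts` is a hypothetical helper, not part of the script in this post, and it has not been timed here.

```shell
#!/usr/bin/env bash

# merge_counts (hypothetical helper): read "count ip" lines from any number
# of files, add up the counts per IP, and print "total ip", largest first.
# awk's default field splitting ignores the leading padding from uniq -c.
merge_counts() {
	awk '{ total[$2] += $1 } END { for (ip in total) print total[ip], ip }' "$@" |
		sort -k1,1nr -k2,2
}
```

It could be called as `merge_counts SPLIT-*.tmp | head -10` once the per-piece files exist.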

Script to split and process in parallel

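#!/usr/bin/env bash
# Usage: ./analyze-log.sh LOGFILE THREADS

# Count requests per client IP (first field) in one file, highest count first.
function topSources() {
	awk '{print $1}' "$1" | sort | uniq -c | sort -nr -k1
}

# Export the function so the bash processes started by xargs can call it.
export -f topSources

LOGFILE=$1
THREADS=$2
# Named TOTAL rather than LINES, which bash itself uses for the terminal height.
TOTAL=$(wc -l < "$LOGFILE")
# Round up so that split produces at most $THREADS pieces.
SPLITCOUNT=$(( (TOTAL + THREADS - 1) / THREADS ))
split -l "$SPLITCOUNT" "$LOGFILE" SPLIT-
# Process up to $THREADS pieces at once, writing each result to <piece>.tmp.
ls -1 SPLIT* | xargs -n1 -P"$THREADS" bash -c 'topSources "$1" > "$1.tmp"' _
rm -f result.db
sqlite3 result.db "create table summary(counts integer, ip varchar(20))"
# Load the per-piece counts into SQLite.
ls -1 *.tmp | while read -r file; do
	# uniq -c pads the count with spaces; strip them so the import sees two fields.
	sed -i -e 's/^[ ]*//' "$file"
	( cat <<EOF
.separator " "
.import $file summary
EOF
	) | sqlite3 result.db
done

# Add up the partial counts for each IP and print the top 10.
sqlite3 result.db "select sum(counts), ip from summary group by ip order by 1 desc limit 10"

rm -f *.tmp
rm -f SPLIT*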
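
How the lines-per-piece value is rounded decides how many files `split -l` produces, and therefore whether the number of pieces matches the number of threads. The sketch below only does the arithmetic; the function names are hypothetical and it does not call `split` itself.

```shell
#!/usr/bin/env bash

# All three helpers are hypothetical names used for illustration only.

# Lines per piece when total/threads is rounded down.
floor_size() { echo $(( $1 / $2 )); }

# Lines per piece when total/threads is rounded up.
ceil_size() { echo $(( ($1 + $2 - 1) / $2 )); }

# Number of files produced when a file of $1 lines is cut into
# pieces of $2 lines each (the last piece may be shorter).
pieces() { echo $(( ($1 + $2 - 1) / $2 )); }
```

For a 10-line file and 3 threads, `pieces 10 "$(floor_size 10 3)"` gives 4 pieces, while `pieces 10 "$(ceil_size 10 3)"` gives 3.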
