blog.headdesk.me

Divide and conquer

Posted on 2016/04/28, updated 2016/08/24

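Every so often I need to analyse Apache logs, and some of them are quite large. Here I compare two ways of listing the top source IPs, a single-threaded one-liner and a script that splits the log and processes the pieces in parallel, to see whether the parallel approach is any faster.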

Single thread, one-liner

$ time cat access-combined.log  | awk '{print $1}' | sort | uniq -c | sort -nr | head -10
100395 128.199.111.237
14677 69.162.124.229
4054 46.118.124.104
 555 2600:3c03::f03c:91ff:fe93:baf8
 218 176.36.80.39
  76 107.21.1.8
  69 192.99.54.14
  64 208.115.113.85
  63 46.161.9.17
  59 178.137.90.202

real	0m0.227s
user	0m0.225s
sys	0m0.036s
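
One variation on the one-liner, sketched here and not timed against the log above: awk can keep the per-IP counts in an associative array, so only the unique addresses reach sort instead of every line of the log. The function name top_sources_awk and its arguments are made up for this example.

```shell
#!/usr/bin/env bash

# top_sources_awk LOGFILE [N]
# Count requests per source IP (first field) in a single awk pass,
# then print the N most frequent as "<count> <ip>" (N defaults to 10).
# Hypothetical helper for illustration, not part of the original post.
top_sources_awk() {
    local logfile="$1"
    local top="${2:-10}"
    awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' "$logfile" |
        sort -k1,1nr -k2,2 |
        head -n "$top"
}
```

For example, top_sources_awk access-combined.log 10 should report the same counts as the pipeline above, printed as "count ip" without the padding that uniq -c adds. Whether it is actually faster on a given log is something to measure.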

Splitting the log file into pieces and processing them in parallel

$ time ./analyze-log.sh access-combined.log 3
100395|128.199.111.237
14677|69.162.124.229
4054|46.118.124.104
555|2600:3c03::f03c:91ff:fe93:baf8
218|176.36.80.39
76|107.21.1.8
69|192.99.54.14
64|208.115.113.85
63|46.161.9.17
59|178.137.90.202

real	0m0.479s
user	0m0.440s
sys	0m0.303s

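Divide and conquer is actually slower here (0.479s versus 0.227s of real time) because the input file is relatively small. Splitting the file and consolidating the results with the help of a database add overhead that outweighs whatever is gained from processing the pieces in parallel.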
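
Part of the splitting overhead is that the script below reads the whole log once with wc -l just to work out a chunk size. GNU split can divide a file into N pieces on line boundaries by itself with -n l/N, so a sketch like the following skips that pass. This assumes GNU coreutils; the function name and the default SPLIT- prefix are only for illustration, and the pieces are balanced by size in bytes, not by line count.

```shell
#!/usr/bin/env bash

# split_into_chunks LOGFILE CHUNKS [PREFIX]
# Split LOGFILE into CHUNKS pieces without breaking any line.
# Relies on the -n l/N option of GNU split (assumption: GNU coreutils).
# Hypothetical helper for illustration, not part of the original script.
split_into_chunks() {
    local logfile="$1"
    local chunks="$2"
    local prefix="${3:-SPLIT-}"
    split -n "l/${chunks}" "$logfile" "$prefix"
}
```

For example, split_into_chunks access-combined.log 3 would leave three files named SPLIT-aa, SPLIT-ab and SPLIT-ac in the current directory.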

Script to split and process in parallel

#!/usr/bin/env bash

# Count requests per source IP in one file, highest count first.
function topSources() {
	awk '{print $1}' "$1" | sort | uniq -c | sort -nr -k1
}

# Export the function so the bash processes started by xargs can call it.
export -f topSources

LOGFILE=$1
THREADS=$2
# Not named LINES, because bash may set that variable itself.
NUMLINES=$(wc -l < "$LOGFILE")
# Round up so that split produces at most $THREADS pieces.
SPLITCOUNT=$(( (NUMLINES + THREADS - 1) / THREADS ))
split -l "$SPLITCOUNT" "$LOGFILE" SPLIT-
# Process the pieces in parallel, one bash process per piece.
ls -1 SPLIT-* | xargs -n1 -P"$THREADS" bash -c 'topSources "$@" > "$1.tmp"' _
rm -f result.db
sqlite3 result.db "create table summary(counts integer, ip varchar(45))"
ls -1 *.tmp | while read -r file; do
	# Strip the leading spaces that uniq -c adds, so .import sees two columns.
	sed -i -e 's/^[ ]*//' "$file"
	sqlite3 result.db <<EOF
.separator " "
.import $file summary
EOF
done

# Add up the per-piece counts and print the top 10.
sqlite3 result.db "select sum(counts), ip from summary group by ip order by 1 desc limit 10"

rm -f *.tmp
rm -f SPLIT-*
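
The sqlite3 stage above is only there to add up the per-piece counts, and awk can do the same sum directly. A minimal sketch, assuming the input lines look like the "count ip" output of uniq -c; merge_counts is a hypothetical helper, not part of the script above, and I have not timed it against the database version.

```shell
#!/usr/bin/env bash

# merge_counts [N] [FILE...]
# Sum partial "<count> <ip>" lines (as written by uniq -c) per IP and
# print the N largest totals as "<total> <ip>" (N defaults to 10).
# Reads the named files, or stdin when no files are given.
# Hypothetical helper for illustration only.
merge_counts() {
    local top=10
    if [ "$#" -gt 0 ]; then
        top="$1"
        shift
    fi
    # awk's default field splitting ignores the leading spaces from uniq -c.
    awk '{ total[$2] += $1 } END { for (ip in total) print total[ip], ip }' "$@" |
        sort -k1,1nr -k2,2 |
        head -n "$top"
}
```

In the script this would stand in for everything from rm -f result.db to the final select, for example merge_counts 10 *.tmp. The output uses a space instead of | between the count and the address.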
