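AWK is a text-processing tool commonly available on Linux. It is very good at slicing and filtering text, so it is often used to filter and split logs. There are plenty of AWK tutorials online; the two I recommend are AWK 简明教程 and awk 入门教程. Once you have worked through them you should understand the basics, and the rest is practice. Below are my practice notes.
Suppose we have an Nginx log file, access.log, that we need to filter by various conditions: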
```
59.36.132.240 - - [30/Mar/2020:22:54:24 +0800] "GET http://152.136.45.36/hudson HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /TP/public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /TP/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /thinkphp/html/public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /html/public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /TP/html/public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /elrekt.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET / HTTP/1.1" 200 1675 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
182.254.52.17 - - [30/Mar/2020:23:09:12 +0800] "GET http://152.136.45.36/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0"
182.254.52.17 - - [30/Mar/2020:23:09:52 +0800] "GET http://152.136.45.36/admin/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0"
182.254.52.17 - - [30/Mar/2020:23:12:48 +0800] "GET http://152.136.45.36/public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0"
94.102.51.8 - - [30/Mar/2020:23:15:15 +0800] "GET /index.html HTTP/1.1" 200 1675 "-" "python-requests/2.23.0"
```
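All the commands in these notes rely on how awk splits such a line into fields. As a quick orientation, here is a small sketch that feeds one entry to awk and prints the fields used most often. The entry is a copy of a sample line with the user-agent string shortened, and the `ip=`, `time=` labels are only for illustration.

```shell
# One sample entry (user agent shortened for readability).
line='106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0"'

# awk splits on runs of blanks by default, so quoted strings are NOT kept
# together: the timestamp is spread over $4 and $5, the request over $6..$8.
fields=$(printf '%s\n' "$line" |
  awk '{print "ip=" $1, "time=" $4, "path=" $7, "status=" $9, "bytes=" $10}')
echo "$fields"
```

Note that `$4` keeps its leading `[`, which matters whenever the timestamp is compared against a literal string.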
Get the line number and IP of every log entry longer than 1000 characters, printed as tab-separated columns

```shell
$ awk 'length>1000 {print NR, $1}' OFS='\t' access.log
```
Read line 5 of the log
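One way to do this, sketched on the assumption that the same `NR` pattern used in the neighbouring examples applies. The demo file and its contents are hypothetical; in practice the last argument would be access.log.

```shell
# Hypothetical 7-line demo file standing in for access.log.
demo=$(mktemp)
printf 'entry %d\n' 1 2 3 4 5 6 7 > "$demo"

# NR is the current record (line) number. Printing and then exiting
# avoids reading the rest of a large log once line 5 has been found.
fifth=$(awk 'NR==5 {print; exit}' "$demo")
echo "$fifth"

rm -f "$demo"
```

The shorter `awk 'NR==5' access.log` prints the same line but keeps scanning to the end of the file.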
Read lines 5 through 20 of the log, each prefixed with its line number

```shell
$ awk 'NR>=5 && NR<=20 {print NR, $0}' access.log
```
Read the first column of line 5

```shell
$ awk 'NR==5 {print $1}' access.log
```
Read the first 10 requests whose HTTP status code is 500

```shell
$ awk '$9==500' access.log | head -n 10
```
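Count every status code in the log and list them in descending order of frequency

```shell
$ awk '{print $9}' access.log | sort | uniq -c | sort -nr
```

uniq only collapses adjacent duplicate lines, so the status codes are sorted first; the final sort -nr then orders the counts from highest to lowest.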
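The same tally can also be kept inside awk with an associative array, which removes the need to sort before counting. A minimal sketch, assuming status codes are in `$9`; the five entries are sample lines with the referer and user-agent fields dropped, and the array name `n` is arbitrary.

```shell
# Trimmed copies of sample entries (referer and user agent removed).
log=$(mktemp)
cat > "$log" <<'EOF'
59.36.132.240 - - [30/Mar/2020:22:54:24 +0800] "GET http://152.136.45.36/hudson HTTP/1.1" 404 146
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /index.php HTTP/1.1" 404 146
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET / HTTP/1.1" 200 1675
182.254.52.17 - - [30/Mar/2020:23:09:52 +0800] "GET http://152.136.45.36/admin/index.php HTTP/1.1" 404 146
94.102.51.8 - - [30/Mar/2020:23:15:15 +0800] "GET /index.html HTTP/1.1" 200 1675
EOF

# n[code] counts occurrences of each status code. The order of "for (c in n)"
# is unspecified, so the result is still piped through sort for ranking.
counts=$(awk '{n[$9]++} END {for (c in n) print n[c], c}' "$log" |
  sort -k1,1nr -k2,2n)
echo "$counts"

rm -f "$log"
```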
Get the IP with the most requests on a given day (30 Jul 2020 here)

```shell
$ awk '$4 ~ /30\/Jul\/2020/ {print $1}' access.log | sort | uniq -c | sort -nr | head -n 1
```
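Get the most frequent HTTP status code within a given time range

```shell
$ awk '$4>"[15/Aug/2020:22:22:40" && $4<"[17/Aug/2020:22:22:40" {print $9}' access.log | sort | uniq -c | sort -nr | head -n 1
```

The bounds start with `[` because `$4` does. This is a plain string comparison, so it is only reliable when both bounds fall within the same month and year.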
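For ranges that cross a month boundary, one option is to convert each timestamp into a sortable `YYYYMMDDhhmmss` number before comparing. The following is a sketch under stated assumptions, not the only approach: the helper name `key`, the variables `from` and `to`, and the five log entries are all made up for illustration.

```shell
# Hypothetical entries spanning the end of July and the start of August.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [31/Jul/2020:10:00:00 +0800] "GET / HTTP/1.1" 200 1675
10.0.0.2 - - [31/Jul/2020:23:00:00 +0800] "GET /a HTTP/1.1" 404 146
10.0.0.3 - - [01/Aug/2020:08:30:00 +0800] "GET /b HTTP/1.1" 404 146
10.0.0.1 - - [01/Aug/2020:09:00:00 +0800] "GET / HTTP/1.1" 200 1675
10.0.0.4 - - [02/Aug/2020:07:00:00 +0800] "GET /c HTTP/1.1" 500 0
EOF

top=$(awk -v from=20200731120000 -v to=20200802000000 '
# Turn "[31/Jul/2020:23:00:00" into the string "20200731230000".
function key(ts,   p, m) {
    split(substr(ts, 2), p, "[/:]")   # day, month name, year, hh, mm, ss
    m = (index("JanFebMarAprMayJunJulAugSepOctNovDec", p[2]) + 2) / 3
    return sprintf("%04d%02d%02d%02d%02d%02d", p[3], m, p[1], p[4], p[5], p[6])
}
{ k = key($4) + 0 }                        # +0 forces a numeric comparison
k > from + 0 && k < to + 0 { n[$9]++ }     # count status codes inside the range
END { for (c in n) print n[c], c }
' "$log" | sort -k1,1nr | head -n 1)
echo "$top"

rm -f "$log"
```

Here three entries fall inside the range (two 404s and one 200), so the top line is the 404 count. A direct string comparison of `$4` would misorder `[01/Aug` and `[31/Jul`.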
Count the files in the current directory whose names end in .log

```shell
$ ls -l | awk '$9 ~ /\.log$/' | wc -l
```
Count the page views (PV) for 25 August

```shell
$ awk '$4 ~ /25\/Aug\/2020/' access.log | wc -l
```
Count the unique visitors (UV) for 25 August, treating each distinct IP as one visitor

```shell
$ awk '$4 ~ /25\/Aug\/2020/ {print $1}' access.log | sort -u | wc -l
```
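PV and UV can also be computed together in a single awk pass, using an array to remember which IPs have already been seen. A minimal sketch: the entries are trimmed copies of sample lines, so the date pattern is 30/Mar/2020 rather than 25 August, and the names `seen`, `pv` and `uv` are arbitrary.

```shell
# Trimmed copies of sample entries (referer and user agent removed).
log=$(mktemp)
cat > "$log" <<'EOF'
59.36.132.240 - - [30/Mar/2020:22:54:24 +0800] "GET http://152.136.45.36/hudson HTTP/1.1" 404 146
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /index.php HTTP/1.1" 404 146
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET / HTTP/1.1" 200 1675
182.254.52.17 - - [30/Mar/2020:23:09:52 +0800] "GET http://152.136.45.36/admin/index.php HTTP/1.1" 404 146
94.102.51.8 - - [30/Mar/2020:23:15:15 +0800] "GET /index.html HTTP/1.1" 200 1675
EOF

# seen[ip]++ is 0 (false) the first time an IP appears, so !seen[ip]++
# is true exactly once per distinct IP.
stats=$(awk '$4 ~ /30\/Mar\/2020/ { pv++; if (!seen[$1]++) uv++ }
             END { print "PV=" (pv + 0), "UV=" (uv + 0) }' "$log")
echo "$stats"

rm -f "$log"
```

The `+ 0` in the `END` block makes the output `PV=0 UV=0` rather than empty values when no line matches the date.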