Getting Started with AWK

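AWK is a text-processing tool on Linux. It is very capable at text processing, so it is commonly used for tasks such as filtering and slicing logs. There are many AWK tutorials online; the two I recommend are "AWK 简明教程" and "awk 入门教程". After working through them you should understand the basics, and the rest comes down to practice. Below are my practice notes.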

Suppose we have the following Nginx log file, access.log, and need to filter it by various conditions:

59.36.132.240 - - [30/Mar/2020:22:54:24 +0800] "GET http://152.136.45.36/hudson HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /TP/public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /TP/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /thinkphp/html/public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /html/public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /TP/html/public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /elrekt.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET /index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
106.54.40.23 - - [30/Mar/2020:23:06:22 +0800] "GET / HTTP/1.1" 200 1675 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0;en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6)"
182.254.52.17 - - [30/Mar/2020:23:09:12 +0800] "GET http://152.136.45.36/vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0"
182.254.52.17 - - [30/Mar/2020:23:09:52 +0800] "GET http://152.136.45.36/admin/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0"
182.254.52.17 - - [30/Mar/2020:23:12:48 +0800] "GET http://152.136.45.36/public/index.php HTTP/1.1" 404 146 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0"
94.102.51.8 - - [30/Mar/2020:23:15:15 +0800] "GET /index.html HTTP/1.1" 200 1675 "-" "python-requests/2.23.0"
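
The exercises below rely on how awk splits each log line on whitespace. As a quick sanity check, the sketch below prints the fields they use from one line of the sample log. The variable names `line` and `fields` are only illustrative, and the field positions assume the log format shown above.

```shell
# One line copied from the sample log above.
line='94.102.51.8 - - [30/Mar/2020:23:15:15 +0800] "GET /index.html HTTP/1.1" 200 1675 "-" "python-requests/2.23.0"'

# With the default whitespace separator:
#   $1 = client IP, $4 = timestamp (note the leading "["),
#   $7 = request path, $9 = status code, $10 = response size in bytes.
fields=$(printf '%s\n' "$line" | awk '{
    print "ip=" $1
    print "time=" $4
    print "path=" $7
    print "status=" $9
    print "bytes=" $10
}')
printf '%s\n' "$fields"
```

Note that `$4` keeps the opening bracket of the timestamp, which matters for any comparison against that field.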
  1. Print the line number and IP of every log line longer than 1000 characters, as tab-separated columns

    $ awk 'length>1000 {print NR, $1}' OFS='\t' access.log
  2. Read line 5 of the log

    $ awk NR==5 access.log
  3. Read lines 5 through 20 of the log

    $ awk 'NR>=5 && NR<=20 {print NR, $0}' access.log
  4. Read the first column of line 5

    $ awk 'NR==5 {print $1}' access.log
  5. Read the first 10 requests whose HTTP status code is 500

    $ awk '$9==500 {print $0}' access.log | head
  6. Count every status code in the log and sort by count in descending order

    $ awk '{print $9}' access.log | sort -nr | uniq -c | sort -nr  

    Because uniq only collapses adjacent duplicate lines, the values are sorted first.

  7. Get the IP with the most requests on a given day

    $ awk '$4 ~ /30\/Jul\/2020/ {print $1}' access.log | sort | uniq -c | sort -nr | head -n 1
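  8. Get the most frequent HTTP status code within a given time range

    $ awk '$4>"[15/Aug/2020:22:22:40" && $4<"[17/Aug/2020:22:22:40" {print $9}' access.log | sort | uniq -c | sort -nr | head -n 1

    $4 begins with "[", so the bounds need it too. This is a plain string comparison, so it is only reliable when both bounds fall in the same month.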
  9. Count the files in the current directory whose names end in .log

    $ ls -l | awk '$NF ~ /\.log$/' | wc -l
  10. Count the page views (PV) on August 25

    $ awk '$4 ~ /25\/Aug\/2020/ {print $0}' access.log | wc -l
  11. Count the unique visitors (UV) on August 25

    $ awk '$4 ~ /25\/Aug\/2020/ {print $1}' access.log | sort | uniq -c | wc -l
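
The PV and UV counts can also be computed in a single pass with an awk associative array instead of piping through sort and uniq. The following is a minimal sketch under stated assumptions: the four log lines are made up for illustration (they use documentation-range IPs and are not from the sample log above), and the names `log`, `summary`, `day`, and `seen` are hypothetical.

```shell
# Hypothetical sample data written to a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [25/Aug/2020:10:00:00 +0800] "GET / HTTP/1.1" 200 1675 "-" "curl/7.0"
192.0.2.1 - - [25/Aug/2020:10:00:05 +0800] "GET /a HTTP/1.1" 404 146 "-" "curl/7.0"
192.0.2.2 - - [25/Aug/2020:11:30:00 +0800] "GET / HTTP/1.1" 200 1675 "-" "curl/7.0"
192.0.2.3 - - [26/Aug/2020:09:00:00 +0800] "GET / HTTP/1.1" 200 1675 "-" "curl/7.0"
EOF

# index() is a plain substring match, so the slashes in the date need no escaping.
# seen[$1]++ evaluates to 0 only the first time an IP appears,
# so uv is incremented once per distinct IP.
summary=$(awk -v day='25/Aug/2020' '
    index($4, day) { pv++; if (!seen[$1]++) uv++ }
    END { printf "PV=%d UV=%d\n", pv, uv }
' "$log")
echo "$summary"   # prints: PV=3 UV=2

rm -f "$log"
```

Doing the counting inside awk avoids sorting the whole log, at the cost of holding one array entry per distinct IP in memory.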