Bash scripts to process logs from the website

Counting the requests

What is the HTTP request?

The HTTP request is the web request with method (GET, HEAD, POST, etc.), the path, and respond code, response size.

Examples

1xx.1xx.1.61 - - [14/Oct/2010:17:29:22 +0800] "GET / HTTP/1.1" 200 1251 "-" "python-requests/2.21.0" "1xx.1xx.1.61"

-> 1 GET request, with http return code = 200, payload size = 1251


66.2xx.73.71 - - [29/Nov/2010:12:19:57 +0800] "GET /wp-content/plugins/wp-syntaxhighlighter/syntaxhighlighter3/scripts/shAutoloader.js?ver=3.0 HTTP/1.1" 404 125 "https://blog.ccp.net/2016/07/postgresql-drop-database/324/" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36" "66.2xx.73.71"

-> 1 GET request, with http return code = 404, payload size = 125

Usages

chmod +x ./count_http_requests.sh
bash ./count_http_requests.sh test.log
./q1/count_http_requests.sh test.log

Examples output:

bash ./count_http_requests.sh ./data/test.log 
================================================================================
Running script: ./count_http_requests.sh ./data/test.log
--------------------------------------------------------------------------------
There are    86400 http requests in the logfile

Most important lines:

num_direct_requests=$(ps -ef | awk '{print $6" "$7}' $logfile | tr -d \" | wc -l )
other_links=$(ps -ef | awk 'match($0,"\"https?://[^\"]+") {print substr($0,RSTART+1,RLENGTH-1) }' $logfile | wc -l )
numrequests=$(($num_direct_requests + 0))

For full script, please let me know.

Finding top x requests during the period of time

Notes

  • Command date with -d option is not working in zsh (MacOS’s terminal) so you need to install qdate before running the script
  • The input dates need to follow the convention

Usages

chmod +x ./find_top10_during_pot.sh
bash ./find_top10_during_pot.sh test.log '2010-10-14 00:00:00' '2010-11-14 23:59:59'

Output:

bash ./find_top10_during_pot.sh ./data/test.log '2010-10-14 00:00:00' '2010-11-14 23:59:59'
================================================================================
Running script: ./find_top10_during_pot.sh ./data/test.log 2010-10-14 00:00:00 2010-11-14 23:59:59
--------------------------------------------------------------------------------
Top 10 hosts with most requests from: 14/Oct/2010:00:00:00 to and including: 14/Nov/2010:23:59:59
#Occurences ---- HostIP
530            1xx.24.71.239
530            x.222.44.52
523            2xx.29.129.76
386            1xx.251.xxx.137
340            9x.216.38.1xx
340            1xx.243.70.151
337            x.189.xxx.208
337            2xx.239.xx.194
336            x.9.71.2xx
306            x.9.108.xx4

For the full script, please let me know.

Finding the country with the most requests

Notes

  • Neeed to install jq to search and print the output in JSON format
  • ipinfo.io and hostinfo.io might not working all the time.
  • There is no country information in the log, so the best is to map from the IP address to the country. However, user can use proxy server/VPN to make requests, which lead to inaccurate country information

Usages

chmod +x ./find_country_mostrequests.sh
bash ./find_country_mostrequests.sh test.log

Example output:

bash ./find_country_mostrequests.sh ./data/test.log 
================================================================================
Running script: ./find_country_mostrequests.sh ./data/test.log
--------------------------------------------------------------------------------
Find the country that made the most HTTP requests
Country code: "DE"
Country name: "GERMANY"

For the full script, please let me know.

Happy bash scripts coding!

Comment Disabled for this post!