Bash scripts to process logs from the website
Counting the requests
What is the HTTP request?
The HTTP request is the web request with method (GET, HEAD, POST, etc.), the path, and respond code, response size.
Examples
1xx.1xx.1.61 - - [14/Oct/2010:17:29:22 +0800] "GET / HTTP/1.1" 200 1251 "-" "python-requests/2.21.0" "1xx.1xx.1.61"
-> 1 GET request, with http return code = 200, payload size = 1251
66.2xx.73.71 - - [29/Nov/2010:12:19:57 +0800] "GET /wp-content/plugins/wp-syntaxhighlighter/syntaxhighlighter3/scripts/shAutoloader.js?ver=3.0 HTTP/1.1" 404 125 "https://blog.ccp.net/2016/07/postgresql-drop-database/324/" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Safari/537.36" "66.2xx.73.71"
-> 1 GET request, with http return code = 404, payload size = 125
Usages
chmod +x ./count_http_requests.sh
bash ./count_http_requests.sh test.log
./q1/count_http_requests.sh test.log
Examples output:
bash ./count_http_requests.sh ./data/test.log
================================================================================
Running script: ./count_http_requests.sh ./data/test.log
--------------------------------------------------------------------------------
There are 86400 http requests in the logfile
Most important lines:
num_direct_requests=$(ps -ef | awk '{print $6" "$7}' $logfile | tr -d \" | wc -l )
other_links=$(ps -ef | awk 'match($0,"\"https?://[^\"]+") {print substr($0,RSTART+1,RLENGTH-1) }' $logfile | wc -l )
numrequests=$(($num_direct_requests + 0))
For full script, please let me know.
Finding top x requests during the period of time
Notes
- Command
date
with -d option is not working in zsh (MacOS’s terminal) so you need to installqdate
before running the script - The input dates need to follow the convention
Usages
chmod +x ./find_top10_during_pot.sh
bash ./find_top10_during_pot.sh test.log '2010-10-14 00:00:00' '2010-11-14 23:59:59'
Output:
bash ./find_top10_during_pot.sh ./data/test.log '2010-10-14 00:00:00' '2010-11-14 23:59:59'
================================================================================
Running script: ./find_top10_during_pot.sh ./data/test.log 2010-10-14 00:00:00 2010-11-14 23:59:59
--------------------------------------------------------------------------------
Top 10 hosts with most requests from: 14/Oct/2010:00:00:00 to and including: 14/Nov/2010:23:59:59
#Occurences ---- HostIP
530 1xx.24.71.239
530 x.222.44.52
523 2xx.29.129.76
386 1xx.251.xxx.137
340 9x.216.38.1xx
340 1xx.243.70.151
337 x.189.xxx.208
337 2xx.239.xx.194
336 x.9.71.2xx
306 x.9.108.xx4
For the full script, please let me know.
Finding the country with the most requests
Notes
- Neeed to install
jq
to search and print the output in JSON format - ipinfo.io and hostinfo.io might not working all the time.
- There is no country information in the log, so the best is to map from the IP address to the country. However, user can use proxy server/VPN to make requests, which lead to inaccurate country information
Usages
chmod +x ./find_country_mostrequests.sh
bash ./find_country_mostrequests.sh test.log
Example output:
bash ./find_country_mostrequests.sh ./data/test.log
================================================================================
Running script: ./find_country_mostrequests.sh ./data/test.log
--------------------------------------------------------------------------------
Find the country that made the most HTTP requests
Country code: "DE"
Country name: "GERMANY"
For the full script, please let me know.
Happy bash scripts coding!