Catching 404 Errors

CCarter

In my never-ending quest for perfection I created a script that helps me catch 404 errors quickly. It simply reads the recent access logs and then sends me a list of bad requests once a day (via a cronjob at 8:30 AM).

It'll get one-off stuff here and there but it will also help you see constant problems that need addressing.

Some of these simple fixes, especially on large sites, can translate into recovering lost revenue from users hitting bad pages. Here is the PHP version of the code:

404-error.php:

Code:
<?php

$files = ["/var/log/apache2/access.log", "/var/log/apache2/access.log.1"];
$missingPages = [];

foreach ($files as $file) {
    if (file_exists($file) && is_readable($file)) {
        $lines = file($file);
        foreach ($lines as $line) {
            // Strict comparison: strpos() returns false when not found, and a bare
            // truthiness check would also miss a match at position 0
            if (strpos($line, '" 404 ') !== false) {
                preg_match('/"GET (.+?) HTTP/', $line, $matches);
                if (isset($matches[1])) {
                    if (!isset($missingPages[$matches[1]])) {
                        $missingPages[$matches[1]] = 0;
                    }
                    $missingPages[$matches[1]]++;
                }
            }
        }
    } else {
        echo "Cannot read file: $file\n";
    }
}

// Sort descending by hit count so the most-hit missing pages come first
arsort($missingPages);

// You can email this to yourself once a day in the morning or have it posted to a Slack channel or other communication channel to monitor

function sendEmailReport($missingPages) {
    $to = 'myemail@compuserve.com';
    $subject = '404 Error Report';
    $message = "404 Error Pages Report:\n\n";
    foreach ($missingPages as $page => $count) {
        $message .= $page . " - Hits: " . $count . "\n";
    }
    $headers = 'From: main_site@example.com' . "\r\n" .
               'X-Mailer: PHP/' . phpversion();

    if (mail($to, $subject, $message, $headers)) {
        echo "Email sent successfully to $to\n";
    } else {
        echo "Failed to send email.\n";
    }
}

// Generate and send the report
sendEmailReport($missingPages);

?>
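
To get the daily 8:30 AM email mentioned above, a crontab entry along these lines works (the script path is just a placeholder, and your PHP binary may live elsewhere):

Code:
30 8 * * * /usr/bin/php /path/to/404-error.php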

Results:

[Screenshot: the emailed 404 report output]

It tells me that 11 attempts (all me) went to /help/, which doesn't exist. If this were a live site, I'd use the following command line to figure out what's calling these pages:

cat /var/log/apache2/access.log | grep "404" | grep "help"

That should help narrow down your hunt.
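
To go one step further and see where the hits come from, you can pull the referrer field out. This assumes the standard Apache combined log format, where splitting on quotes with awk makes $4 the referrer:

grep '" 404 ' /var/log/apache2/access.log | grep "/help/" | awk -F'"' '{print $4}' | sort | uniq -c | sort -rn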

Also, instead of emailing it to yourself, you can have it post to a Slack channel called #404-Errors so your whole team can see problems as they come in.
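
If you go the Slack route, here's a minimal sketch using an incoming webhook (the webhook URL is a placeholder you'd generate in your Slack workspace, and sendSlackReport is just an illustrative name):

Code:
<?php

// Hypothetical webhook URL -- generate a real one under your Slack workspace's
// "Incoming Webhooks" settings
$webhookUrl = 'https://hooks.slack.com/services/XXX/YYY/ZZZ';

function sendSlackReport($missingPages, $webhookUrl) {
    $text = "404 Error Pages Report:\n";
    foreach ($missingPages as $page => $count) {
        $text .= $page . " - Hits: " . $count . "\n";
    }

    // Incoming webhooks take a JSON body; Slack answers with a plain "ok" on success
    $ch = curl_init($webhookUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode(["text" => $text]));
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);

    return $response === 'ok';
}

// Call it in place of sendEmailReport(), e.g. sendSlackReport($missingPages, $webhookUrl);

?>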

From the trenches,
CC
 
I just implemented a similar script based on this idea that also includes the IP, User Agent, and Page Referrer in the report, with a daily email sent to the admin. It is extremely useful, so thanks for the idea.
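
For anyone wanting to do the same, here's a rough sketch of pulling those extra fields out of a combined-format log line (the regex, sample line, and field names are illustrative, and it assumes GET requests in the standard Apache combined format):

Code:
<?php

// Sample combined-format log line (documentation IP, made-up values)
$line = '203.0.113.7 - - [01/Jan/2024:08:30:00 +0000] "GET /help/ HTTP/1.1" 404 196 "https://example.com/old-link" "Mozilla/5.0"';

// Capture IP, timestamp, path, referrer, and user agent from a 404 line
$pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "GET (.+?) HTTP[^"]*" 404 \S+ "([^"]*)" "([^"]*)"/';

if (preg_match($pattern, $line, $m)) {
    $hit = [
        'ip'       => $m[1],
        'date'     => $m[2],
        'page'     => $m[3],
        'referrer' => $m[4],
        'agent'    => $m[5],
    ];
    print_r($hit);
}

?>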

How much do you think 404 errors from Googlebot on nonexistent pages (never existed, never will, sometimes irrelevant) could be affecting a site's SEO?

Do you think it is a negative SEO tactic to point spam links at non-existent URLs to artificially inflate 404 errors and get Googlebot to waste resources on a site? And perhaps to make the site seem "irrelevant" or dilute its authority?

When you find an IP that repeatedly tries to access a bunch of different PHP files and wp-admin files, do you simply block that IP? And block every IP that does this every day? Any further steps you would take?
 
How much do you think 404 errors from Googlebot on nonexistent pages (never existed, never will, sometimes irrelevant) could be affecting a site's SEO?
I think it has zero bearing on SEO. It's something that's out of the webmaster's hands. You can't stop users on the internet from typing out or pasting malformed URLs.

Do you think it is a negative SEO tactic to point spam links at non-existent URLs to artificially inflate 404 errors and get Googlebot to waste resources on a site?
If this were possible and this easy, it'd be a never-ending attack vector for bad actors, and it's not. While I'm sure there have been attempts, I've never once seen this employed anywhere. I feel assured that someone would have tried it, discovered how simple it was, destroyed a ton of sites' rankings, and word would have spread like wildfire.

When you find an IP that repeatedly tries to access a bunch of different PHP files and wp-admin files, do you simply block that IP? And block every IP that does this every day? Any further steps you would take?
You're better off just blocking access to these files than trying to catch a bunch of rotating proxies. For wp-admin you can set up a second layer of authentication, and block offending IPs with fail2ban so they can't even get to your login page, which is where most wp-admin URLs should be redirecting them anyway. That stops them from trying to brute-force their way into your backend, which otherwise inadvertently becomes a DDoS attack. Search for how to create an .htpasswd file: it involves generating a username and password that's encrypted and stored in a file, and checking it uses nearly zero resources to stop all these requests.
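
As a sketch of that approach (file paths here are examples; adjust for your server), the Apache side looks something like this once you've generated the password file with the htpasswd utility:

Code:
# One-time setup on the server (path is an example):
#   htpasswd -c /etc/apache2/.htpasswd someuser

# Then in your Apache config or .htaccess:
<Files "wp-login.php">
    AuthType Basic
    AuthName "Restricted"
    AuthUserFile /etc/apache2/.htpasswd
    Require valid-user
</Files>

Apache answers these requests with a 401 before WordPress ever loads, which is why it costs nearly zero resources.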
 
When you find an IP that repeatedly tries to access a bunch of different PHP files and wp-admin files, do you simply block that IP? And block every IP that does this every day? Any further steps you would take?

Use fail2ban at the server level. It works by blocking repeated SSH attempts and a ton of other abusive behavior.

Since I'm not on WordPress, I use this to find bad actors trying to access files and block their IP addresses for 4-8+ weeks (my setup is excessive). Within fail2ban you can create a custom setup that lets you decide who gets blocked based on what they're trying to access.

For example, most hackers/crackers attempt to access files like ".env", ".git", "phpmyadmin" and other paths to see what's going on within your server.

When you view the 404 errors being emailed to you, you'll notice patterns from hackers/crackers, which you can then block with fail2ban like I do - as follows...

I've created a custom fail2ban filter called "wp-login.conf" that's as follows:

Code:
[Definition]
failregex = ^<HOST> -.*"(GET|POST) .*/wp-login\.php
            ^<HOST> -.*"(GET|POST) .*/wp-cron\.php
            ^<HOST> -.*"(GET|POST) .*/wp-admin
            ^<HOST> -.*"(GET|POST) .*/wp-includes
            ^<HOST> -.*"(GET|POST) .*/wp-content
            ^<HOST> -.*"(GET|POST) .*/.*xmlrpc.*
            ^<HOST> -.*"(GET|POST) .*\.(env|git|vscode|DS_Store|well-known)
            ^<HOST> -.*"paloaltonetworks
            ^<HOST> -.*"curl/
            ^<HOST> -.*"SemrushBot
            ^<HOST> -.*"Expanse
            ^<HOST> -.*"AhrefsBot
            ^<HOST> -.*"(GET|POST) .*"$  # Matches requests with a blank user-agent
            ^<HOST> -.*"(?i)(GET|POST) .*?(phpmyadmin|myadmin|sqlmanager|dbadmin|mailman|actuator|fckeditor|wysiwyg|filemanager|webroot|administrator|backoffice|slim.min.js|radcontrols|torrent|hetong.js).*"
ignoreregex =

Since I don't have WordPress installed, all attempts at reaching WordPress directories are clearly bad actors, so they get banned for 4-8+ weeks. If you have WordPress you need to remove the WordPress directories/files from the filter above.

You then have to put this in your jail.local file (don't use jail.conf, since that gets overwritten upon updates; clone that file and call it jail.local):

Code:
#
# Bad Bots
#

[wp-login]
enabled = true
port = http,https
filter = wp-login
logpath = /var/log/apache2/access.log
        /var/log/apache2/randomdomain.access.log
maxretry = 0
# 4838400 seconds = 8 weeks
bantime = 4838400
findtime = 3600
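
After adding the filter and the jail, reload fail2ban and confirm the jail is running (standard fail2ban-client commands):

Code:
fail2ban-client reload
fail2ban-client status wp-login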

I also block Amazon AWS's whole IP range, IPv4 and IPv6, from accessing my servers - that alone cuts down on hacking attempts by 75%.

The list of Amazon AWS's IP ranges is available here in the download section: AWS IP Ranges
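
Amazon publishes the current ranges as JSON at ip-ranges.amazonaws.com, so a short script can pull out the CIDR blocks to feed into your firewall. A minimal sketch, printing one IPv4 CIDR per line (the IPv6 ranges live under ipv6_prefixes in the same file):

Code:
<?php

// AWS publishes its current IP ranges at this URL
$json = file_get_contents('https://ip-ranges.amazonaws.com/ip-ranges.json');
$data = json_decode($json, true);

// One IPv4 CIDR per line, ready to feed into ipset/iptables or your firewall of choice
foreach ($data['prefixes'] as $prefix) {
    echo $prefix['ip_prefix'] . "\n";
}

?>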

I block other data center providers too but I'd rather not give away the whole farm.

It's a cat-and-mouse game trying to block by IP address, since most residential IP addresses are no longer assigned to individual houses/homes due to the IPv4 shortage. Don't do it. You'll find that whole city blocks are working off a single IPv4 address, so don't block innocent people who might have typed in a URL wrong.

This is also why fail2ban defaults to short bans of hours rather than weeks: IP addresses change hands a lot.

Also, a website owner may have simply mistyped a URL in a link to your website that is sending you a ton of traffic, and if you start automatically banning IP addresses based off a massive number of 404s, you might be blocking would-be customers/clients/audience members.

Look for the bad 404 errors, and if you find mistyped backlinks, try reaching out to the website owner or set up a redirect to the correct URL. You don't want to miss out on going viral because your firewall and security are too tight.
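
For example, if another site typo'd a link to /help/ as /hlep/, one line of Apache config recovers that traffic (the paths here are made up):

Code:
# Hypothetical mistyped inbound link: send /hlep/ to the real page
Redirect 301 /hlep/ /help/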
 