[WIP] Bad Bots

General Information

What are bad bots?

Bad bots are automated agents that attack websites at the application level. They crawl the website and create unwanted, avoidable load.
To avoid this load, we recommend blocking all bots that aren't needed.

Difference between good and bad bots

Good bots observe the Robots Exclusion Standard: they read the robots.txt and only crawl the areas it allows.
Bad bots ignore the robots.txt and may steal data, break into user accounts, submit junk data through online forms, and perform other malicious activities.
Types of bad bots include credential stuffing bots, content scraping bots, spam bots, and click fraud bots.

Robots.txt

Advantages and disadvantages

The advantage of the robots.txt is that good bots adhere to it, so you can control which areas of your site they crawl.
You can use your robots.txt to keep duplicate content out of search engine results pages or to keep well-behaved crawlers out of entire sections of a website.

The disadvantage is that bad bots ignore the robots.txt.
Therefore, the robots.txt can only restrict the access of good bots to the site.

How to use the robots.txt

Create a robots.txt file in your document root, as in the following example:

User-agent: examplebot
Crawl-delay: 120
Disallow: /mobile/
Disallow: /api/
Disallow: /plugin/

User-agent: foobot
Disallow: /

User-agent: barbot
Crawl-delay: 30

For details on the robots.txt syntax, visit the robotstxt.org site. Note that Crawl-delay is a non-standard directive and not every crawler honors it.
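As a rough illustration of how a well-behaved crawler might interpret rules like these, the following sketch feeds the example above into Python's standard `urllib.robotparser` module. The queried paths (`/api/orders`, `/shop/`) and the `describe` helper are made up for illustration, and real crawlers may implement the matching differently.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above; the bot names are the article's placeholders.
ROBOTS_TXT = """\
User-agent: examplebot
Crawl-delay: 120
Disallow: /mobile/
Disallow: /api/
Disallow: /plugin/

User-agent: foobot
Disallow: /

User-agent: barbot
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def describe(agent: str, path: str) -> str:
    """Summarize what a rule-abiding bot would conclude for one path."""
    allowed = parser.can_fetch(agent, path)
    delay = parser.crawl_delay(agent)
    return f"{agent} {path}: allowed={allowed}, crawl_delay={delay}"


# Hypothetical paths, chosen to hit both disallowed and unrestricted areas.
for agent, path in [
    ("examplebot", "/api/orders"),
    ("examplebot", "/shop/"),
    ("foobot", "/shop/"),
    ("barbot", "/api/orders"),
]:
    print(describe(agent, path))
```

This only shows what a cooperative bot would do with the file. It says nothing about bots that never read it.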

Deny List via .htaccess

To block specific bots via your .htaccess, you can use rewrite rules as in the following example:

#====================================================================================================
# Block BadBots
#====================================================================================================
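RewriteEngine On
# Match any of these names anywhere in the User-Agent header, ignoring case
RewriteCond %{HTTP_USER_AGENT} (SCspider|Textbot|s2bot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AnotherBot [NC]
# Answer matching requests with 403 Forbidden
RewriteRule .* - [F,L]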
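To get a feeling for which user agents a deny list like this would catch before deploying it, the name patterns can be tried against sample strings. The sketch below uses Python's `re` module as a stand-in for Apache's regex matching and assumes case-insensitive matching for every name. The sample user-agent strings and the `is_blocked` helper are hypothetical.

```python
import re

# Same bot names as in the rewrite conditions above. Matching is
# case-insensitive for every name here (an assumption, like the [NC] flag).
BAD_BOT_PATTERN = re.compile(r"SCspider|Textbot|s2bot|AnotherBot", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the deny list pattern occurs anywhere in the string."""
    return BAD_BOT_PATTERN.search(user_agent) is not None


# Hypothetical User-Agent strings for illustration only.
samples = [
    "Mozilla/5.0 (compatible; s2bot/1.0)",
    "anotherbot",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0",
]
for user_agent in samples:
    print(f"{user_agent!r}: blocked={is_blocked(user_agent)}")
```

Because the names match as substrings, a short name can also hit user agents you did not intend to block, so it is worth trying a few legitimate strings as well.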

Block bots with nginx

nginx does not support .htaccess files, so if you use nginx as your web server, we have to block these bots for you.
To request this, create a ticket in our ticket system or send an email to service@root360.de.

We already have a ready-made pattern for this scenario, which we can adapt and set up for your environment.

Projects with Content Delivery Network

If you use a CDN that is not a Media CDN, we advise using a Web Application Firewall (WAF) to block these bots and to protect your application from other attacks such as SQL injection or cross-site scripting. To find out the full range of functions of the WAF, visit AWS WAF.

The Web Application Firewall is the preferred solution for this scenario and has the advantage that blocked requests never reach the web server.

If you are interested in a WAF, create a ticket or send an email to service@root360.de.
We will advise you on the WAF service and build an optimal solution for your project.

Application Plugins

WordPress

If you use WordPress, you can use Blackhole for Bad Bots.
This plugin lets you block bad bots directly in the application.

Magento 2

If you use Magento 2, you can use Spam Bot Blocker by Nublue.
With this blocker you can block user agents, single IP addresses or ranges of IP addresses.
You can manage it from the backend without having to use the command line.

These plugins work at the application level. Note that the request has already passed through the web server and, in most cases, PHP-FPM, so resources have already been consumed before the request is blocked.
We do not recommend this approach but mention it for the sake of completeness.

Always test your work

To verify that a bot is blocked, request your website with curl and set the blocked user agent with the -A option.
A blocked user agent should receive a 403 response, while an accepted one should receive a 200, as in the following examples:

Blocked user agent:

curl -I https://www.domain.tld -A "SemrushBot -BA" 

HTTP/2 403 
date: Tue, 01 Sep 2020 15:03:24 GMT
content-type: text/html
content-length: 162
server: nginx

Accepted user agent:

curl -I https://www.domain.tld -A "googlebot" 

HTTP/2 200 
date: Tue, 01 Sep 2020 15:03:05 GMT
content-type: text/html; charset=utf-8
server: nginx
vary: Accept-Encoding
set-cookie: eZSESSID=rr9kmbeqanomb9v1ht6ame3kn6; expires=Mon, 28-Jun-2021 15:03:05 GMT; Max-Age=25920000; path=/
expires: Mon, 26 Jul 1997 05:00:00 GMT
last-modified: Tue, 01 Sep 2020 15:03:05 GMT
cache-control: no-cache, must-revalidate
pragma: no-cache
served-by: *
content-language: en-GB
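
The same check can be scripted, for example to repeat it after every change to the deny list. The following is a minimal sketch using only Python's standard library. `status_for_user_agent` is a hypothetical helper, and the small local server with its one-entry deny list exists only so that the example runs without network access. Against a real site you would pass your own URL instead.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOCKED = ("semrushbot",)  # hypothetical deny list for the local demo server


def status_for_user_agent(url: str, user_agent: str) -> int:
    """Send a HEAD request with the given User-Agent and return the status."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the status code is what we want to see.
        return error.code


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in for a web server with a bot deny list (local demo only)."""

    def do_HEAD(self):
        agent = self.headers.get("User-Agent", "").lower()
        blocked = any(name in agent for name in BLOCKED)
        self.send_response(403 if blocked else 200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def start_demo_server():
    """Start the demo server on a free local port and return (server, url)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}/"


if __name__ == "__main__":
    demo_server, base_url = start_demo_server()
    for agent in ("SemrushBot -BA", "googlebot"):
        print(agent, "->", status_for_user_agent(base_url, agent))
    demo_server.shutdown()
    demo_server.server_close()
```

A HEAD request mirrors `curl -I`, so only the status and headers are transferred.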