What are bad bots?

Bad bots are automated agents that attack websites at the application level. They crawl the website, creating unwanted and avoidable load. To avoid this load, we recommend blocking these bots.

Difference between good and bad bots

Good bots observe the robots exclusion standard: they read the robots.txt and only access the areas that are allowed to be scanned.
Bad bots don't observe robots.txt and can steal data, break into user accounts, submit junk data through online forms, and perform other malicious activities.
Types of bad bots include credential stuffing bots, content scraping bots, spam bots, and click fraud bots.

Robots.txt

Advantages and disadvantages

The advantage of robots.txt is that good bots stick to the robots exclusion standard: they read the robots.txt and only scan the areas they are allowed to access.
You can use your robots.txt to prevent duplicate content from appearing in search engine result pages or to keep entire sections of a website private from good bots.

The disadvantage is that bad bots ignore robots.txt files.
Therefore, the robots.txt can only be used to restrict the access of good bots to the site.

How to use the robots.txt

Create a file called robots.txt in your document root. To learn how to use it, visit the official robotstxt.org site. Here is an example robots.txt that manages access for some bots, identified by their user agent strings:

Code Block
# examplebot: wait 120 seconds between requests and stay out of these paths
User-agent: examplebot
Crawl-delay: 120
Disallow: /mobile/
Disallow: /api/
Disallow: /plugin/

# foobot: not allowed to crawl anything
User-agent: foobot
Disallow: /

# barbot: only asked to wait 30 seconds between requests
User-agent: barbot
Crawl-delay: 30

Block bots when using Apache: deny list via .htaccess

If your environment uses Apache as its web server, you can block specific bots via your .htaccess files, as in this example:

Code Block
#====================================================================================================
# Block BadBots
#====================================================================================================
RewriteEngine On
# Block requests whose user agent contains one of these strings ([OR] chains this condition with the next one)
RewriteCond %{HTTP_USER_AGENT} ^.*(SCspider|Textbot|s2bot).*$ [OR]
# [NC] makes this second match case-insensitive
RewriteCond %{HTTP_USER_AGENT} ^.*AnotherBot.*$ [NC]
# Answer matching requests with 403 Forbidden and stop processing further rules
RewriteRule .* - [F,L]

Block bots when using CloudFront as a full page cache

When you use CloudFront as a full page cache to deliver all of your application’s assets (in contrast to only using it with a subdomain as a media content delivery network), we advise using AWS’ Web Application Firewall (WAF) to block these bots and to protect your application against other attacks such as SQL injection or cross-site scripting. To find out the full range of functions of the WAF, visit WAF (AWS Web Application Firewall).

A WAF is our preferred solution for this scenario and has the advantage that blocked requests cannot reach the application instances.
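
For illustration, here is a minimal sketch of such a user-agent based blocking rule, created with boto3 against the AWS WAFv2 API. The web ACL name, rule name and blocked user agent string are placeholders rather than part of our standard setup, and the resulting web ACL would still need to be associated with your CloudFront distribution.

Code Block
import boto3

# Web ACLs with CLOUDFRONT scope must be created in us-east-1
waf = boto3.client("wafv2", region_name="us-east-1")

waf.create_web_acl(
    Name="block-bad-bots",                 # placeholder name
    Scope="CLOUDFRONT",
    DefaultAction={"Allow": {}},           # allow everything that no rule matches
    Rules=[
        {
            "Name": "block-scspider",      # placeholder rule name
            "Priority": 0,
            "Statement": {
                "ByteMatchStatement": {
                    # placeholder bot string; add one rule per bad bot you want to block
                    "SearchString": b"SCspider",
                    "FieldToMatch": {"SingleHeader": {"Name": "user-agent"}},
                    "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                    "PositionalConstraint": "CONTAINS",
                }
            },
            "Action": {"Block": {}},       # blocked requests never reach the application instances
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "block-scspider",
            },
        }
    ],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "block-bad-bots",
    },
)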

If you are interested in setting up a WAF, create a ticket or send an email to service@root360.de.

Using application plugins to block bots

WordPress

If you use WordPress, you can use the Blackhole for Bad Bots plugin. With this tool you can block bad bots directly from within the application.

Magento 2

If you use Magento 2, you can use Spam Bot Blocker by Nublue. With this extension you can block user agents, single IP addresses, or ranges of IP addresses, and you can manage everything from the backend instead of working on the command line.

Info

These solutions work at the application level. Please note that by the time the request is blocked, it has already passed through the web server and, in most cases, has been processed by PHP-FPM, so resources have already been consumed.
We do not recommend these solutions, but mention them for the sake of completeness.

Verify blocking rules

To check whether a bot is blocked or still allowed, you can use curl to access your website with a custom user agent, as shown in the following examples:

Blocked user agent:

Code Block
curl -I https://www.domain.tld -A "SemrushBot -BA" 

HTTP/2 403 
date: Tue, 01 Sep 2020 15:03:24 GMT
content-type: text/html
content-length: 162
server: nginx

Accepted user agent:

Code Block
curl -I https://www.domain.tld -A "googlebot" 

HTTP/2 200 
date: Tue, 01 Sep 2020 15:03:05 GMT
content-type: text/html; charset=utf-8
server: nginx
vary: Accept-Encoding
set-cookie: eZSESSID=rr9kmbeqanomb9v1ht6ame3kn6; expires=Mon, 28-Jun-2021 15:03:05 GMT; Max-Age=25920000; path=/
expires: Mon, 26 Jul 1997 05:00:00 GMT
last-modified: Tue, 01 Sep 2020 15:03:05 GMT
cache-control: no-cache, must-revalidate
pragma: no-cache
served-by: *
content-language: en-GB
