Not all traffic is human. A large share of most websites' requests comes from bots: some good (Googlebot, Bingbot), many bad (content scrapers, vulnerability scanners, spam crawlers, and aggressive AI harvesters). Bad bots inflate your bandwidth bill, skew analytics, hammer the database, and probe for exploits. This guide shows you how to identify them and block them at two layers: .htaccess on the server and Cloudflare at the network edge.
Good Bots vs Bad Bots
The goal isn’t to block all automated traffic — you want search engines to crawl you. The targets are the bots that ignore robots.txt, fake their identity, or request pages far faster than any human would.
| Sign of a bad bot | Why it matters |
|---|---|
| Ignores robots.txt | Crawls pages you asked it to skip |
| Empty or fake user-agent | Hides its identity |
| Hundreds of requests per second | Acts as an unintentional DoS |
| Hits wp-login.php, xmlrpc.php, /.env | Probing for vulnerabilities |
| Requests from data-centre IP ranges posing as browsers | Scraping at scale |
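The signs in the table can be checked against your own access logs. The following is a minimal Python sketch, assuming the log uses the Apache/Nginx "combined" format. The function name `summarize` and the `PROBE_PATHS` list are illustrative choices, not part of any existing tool.

```python
import re
from collections import Counter

# Combined log format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Paths from the table above that attackers commonly probe (illustrative list).
PROBE_PATHS = ("/wp-login.php", "/xmlrpc.php", "/.env")


def summarize(lines):
    """Return three Counters keyed by IP: total requests, probe hits, empty user-agents."""
    per_ip = Counter()
    probes = Counter()
    no_agent = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        ip = match["ip"]
        per_ip[ip] += 1
        parts = match["request"].split()
        path = parts[1].split("?")[0] if len(parts) >= 2 else ""
        if path in PROBE_PATHS:
            probes[ip] += 1
        # A missing user-agent is logged as "-" (or an empty string).
        if match["agent"] in ("", "-"):
            no_agent[ip] += 1
    return per_ip, probes, no_agent
```

To use it, pass an open log file, for example `summarize(open("access.log"))`, then inspect `per_ip.most_common(10)` for the busiest addresses. The file name here is a placeholder; use whatever path your server writes to.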
Layer 1: Blocking with .htaccess (Apache/LiteSpeed)
The quickest server-side defence is to reject known-bad user-agents. Add this to the .htaccess in your site root:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|Bytespider) [NC]
RewriteRule .* - [F,L]
# Block requests with no user-agent at all
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F,L]
</IfModule>
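Before deploying a pattern like this, it can help to see which user-agent strings it would catch. The sketch below mirrors the two RewriteCond tests using Python's `re` module. Python's regex engine is not the one Apache uses, but for a plain alternation and an empty-string test the two behave the same. The function name `would_block` is hypothetical.

```python
import re

# Same alternation as the RewriteCond; re.IGNORECASE plays the role of [NC].
BAD_AGENTS = re.compile(
    r"(AhrefsBot|SemrushBot|MJ12bot|DotBot|PetalBot|Bytespider)",
    re.IGNORECASE,
)


def would_block(user_agent: str) -> bool:
    """Return True if the .htaccess rules above would answer with 403."""
    if user_agent == "":
        return True  # second rule: ^$ matches an empty user-agent
    # The pattern is unanchored, so it matches anywhere in the string.
    return BAD_AGENTS.search(user_agent) is not None
```

Because the pattern is unanchored and case-insensitive, it matches the bot name anywhere in the header, so check new additions against a sample of real user-agents from your logs to avoid accidental matches.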
You can also lock down the files attackers love to probe:
# Disable XML-RPC (a common brute-force and DDoS vector)
<Files xmlrpc.php>
Require all denied
</Files>
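# Block access to hidden/sensitive files
<FilesMatch "^\.(env|git|htaccess)">
Require all denied
</FilesMatch>
# FilesMatch tests file names only, so it does not cover files inside a .git/ directory.
# Refuse anything under .git/ as well (requires mod_alias):
RedirectMatch 404 /\.git(/|$)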
A caution: user-agent strings are trivially faked, so this stops lazy bots but not determined ones. It’s a useful first filter, not a complete solution — which is where the edge comes in.
Layer 2: Cloudflare at the Edge
Blocking at Cloudflare stops bad traffic before it ever reaches your server, saving CPU and bandwidth. The most powerful tool is a WAF custom rule built from expressions. Examples:
- Block by user-agent: (http.user_agent contains "Bytespider") → Action: Block.
- Challenge logins: (http.request.uri.path eq "/wp-login.php") → Action: Managed Challenge.
- Block bad countries/ASNs: match on ip.geoip.country or ip.geoip.asnum for ranges you never expect legitimate users from.
Cloudflare’s built-in Bot Fight Mode (free) and Super Bot Fight Mode (paid) use behavioural fingerprinting to catch bots that fake their user-agent — something .htaccess can’t do. Enable these under Security » Bots.
Rate Limiting
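Rate limiting caps how many requests a single IP can make in a time window, which makes it well suited to stopping scrapers and login-form brute-forcing. In Cloudflare, create a rate-limiting rule such as “more than 20 requests to /wp-login.php in 1 minute → block for 1 hour.” On Nginx you can apply a similar per-IP limit natively; the example below allows 10 requests per minute with a burst of 5. If the origin sits behind Cloudflare, restore the visitor's real address first (for example with the realip module and the CF-Connecting-IP header), otherwise every request appears to come from a Cloudflare IP: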
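# Goes in the http {} context: track clients by IP, allow 10 requests per minute
limit_req_zone $binary_remote_addr zone=login:10m rate=10r/m;
# Goes in the server {} block
location = /wp-login.php {
    limit_req zone=login burst=5 nodelay;
    # This exact-match location takes precedence over a generic "location ~ \.php$" block,
    # so repeat your PHP handler directives (fastcgi_pass etc.) here or the file will not be executed
}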
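To build intuition for what `rate=10r/m burst=5 nodelay` means, here is a simplified leaky-bucket model in Python. It is a sketch of the general technique, not Nginx's actual implementation. The class name `LeakyBucket` and the integer scaling are assumptions made for illustration.

```python
class LeakyBucket:
    """Simplified model of a per-client rate limit with a burst allowance."""

    SCALE = 60  # one request = 60 units, so N r/m leaks N units per second

    def __init__(self, rate_per_minute: int, burst: int):
        self.leak_per_second = rate_per_minute
        self.capacity = burst * self.SCALE
        self.excess = 0      # how far the client is ahead of the allowed rate
        self.last = None     # timestamp (whole seconds) of the last accepted request

    def allow(self, now: int) -> bool:
        """Return True if a request arriving at time `now` is accepted."""
        if self.last is None:
            self.last = now
            return True  # the first request is always accepted
        leaked = self.leak_per_second * (now - self.last)
        excess = max(0, self.excess - leaked) + self.SCALE
        if excess > self.capacity:
            return False  # over the burst allowance: reject, state unchanged
        self.excess = excess
        self.last = now
        return True
```

In this model, with `LeakyBucket(10, 5)`, a client that fires ten requests in the same second has the first six accepted and the rest rejected; one more request is accepted roughly every six seconds after that. A human logging in never gets close to this, while a brute-force script hits the limit almost immediately.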
Verify You Didn’t Block the Wrong Thing
After adding rules, confirm legitimate crawlers still get through. Check your access logs for Googlebot and Bingbot 200 responses, and test a rule before trusting it:
# Simulate a blocked bot — should return 403
curl -A "MJ12bot" -I https://example.com/
# Confirm a normal browser still gets 200
curl -A "Mozilla/5.0" -I https://example.com/
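If you maintain more than a handful of rules, the same check can be scripted so it is repeatable after every change. The following is a minimal sketch using only Python's standard library. The function names `fetch_status` and `check_rules` are hypothetical, and the URL in the usage note is a placeholder.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent, timeout=10):
    """Send a HEAD request with the given user-agent and return the HTTP status code."""
    request = urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we want to compare.
        return err.code


def check_rules(url, expectations, fetch=fetch_status):
    """Compare actual status codes with expected ones.

    `expectations` maps a user-agent string to the status code you expect.
    Returns a list of (user_agent, expected, actual) tuples for every mismatch.
    """
    failures = []
    for user_agent, expected in expectations.items():
        actual = fetch(url, user_agent)
        if actual != expected:
            failures.append((user_agent, expected, actual))
    return failures
```

For example, `check_rules("https://example.com/", {"MJ12bot": 403, "Mozilla/5.0": 200})` returns an empty list when both rules behave as intended. Note that `urlopen` follows redirects, so the status reported is that of the final response.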
Conclusion
Defend in layers: use .htaccess to filter obvious bad user-agents and lock down sensitive files, then let Cloudflare’s bot detection and rate limiting handle the sophisticated traffic at the edge. The payoff is lower server load, a smaller bandwidth bill, cleaner analytics, and far fewer probes reaching your application — all while real search engines keep crawling normally.
