If you run a website, a chunk of your traffic is not human. Depending on who you ask and how you measure it, somewhere between 30% and 50% of all web traffic is automated. Vulnerability scanners probing for .env files. Scrapers pulling content. Credential stuffing bots hammering login pages. Crawlers that serve no purpose anyone can identify.
Most of this traffic is harmless in isolation. A single bot request takes a few milliseconds to process. But multiply that across thousands of IPs, millions of requests, and every website on a hosting platform, and it adds up fast. On our servers, automated traffic accounted for over 35% of total CPU time during peak hours, all spent serving requests that had no legitimate purpose.
On most websites we host, this system stops 20-40% of all incoming traffic before it ever reaches the application. That is traffic that was consuming CPU, memory, and I/O for no reason.
This post explains what we built to deal with it. We will be specific about what works, honest about what does not, and clear about the tradeoffs we made.
Signs bot traffic is wasting your server resources#
Before we get into the system itself, here is what bot traffic actually looks like from the server side. If any of these sound familiar, automated traffic is likely a factor:
- Unexplained CPU spikes that do not correlate with real visitor growth. Bots hammering PHP endpoints consume the same resources as real users.
- Slow page loads under moderate traffic. If your site struggles with 200 concurrent visitors when the hardware should handle 2,000, something else is eating capacity.
- Excessive wp-login.php or xmlrpc.php requests in your access logs. Credential stuffing bots target these endpoints relentlessly.
- Hundreds of 404 errors for paths that do not exist on your site, like /solr/admin, /.env, or /wp-config.php.bak. These are vulnerability scanners.
- High bandwidth usage with low engagement. JavaScript-based analytics only count visitors who execute scripts, so if your analytics report 10,000 daily page views but your server logs show 50,000 page requests, much of the gap is bots.
- Database connection exhaustion during what should be off-peak hours. Bots do not sleep.
None of these are definitive on their own. But if you are seeing multiple symptoms, the odds are good that automated traffic is consuming a significant share of your server resources.
How do you stop bot traffic at the origin?#
To stop bot traffic at the origin server level before it reaches your application, you need several layers working together:
- Real-time rate detection that catches volume-based attacks in microseconds
- Behavioral scoring across multiple independent signals like request patterns, cookie behavior, and TLS fingerprints
- Honeypot traps that instantly identify vulnerability scanners by the paths they probe
- Proof-of-work challenges that verify a real browser is making the request
- Threat intelligence feeds from curated blocklists, updated daily
- Cross-server reputation sharing so a ban on one server applies everywhere within 30 seconds
- Automatic score decay to reduce false positives and release clean IPs over time
That is the system we built. The rest of this post explains how each piece works and where the limits are.
Is this a replacement for Cloudflare?#
No. This is not a replacement for Cloudflare, Sucuri, or any other CDN-based security product. Those services work at a different layer. They sit in front of your origin server and filter traffic before it reaches you. They have massive global networks, decades of threat data, and teams of security researchers. We are not competing with that and we are not trying to.
What we are doing is solving a different problem. If your site uses Cloudflare, great – though make sure Cloudflare is properly configured to reach your origin. The traffic that reaches us has already been filtered by them. But even filtered traffic contains bots. CDN bot detection is designed for a broad range of websites and threat profiles. It cannot know the specific patterns that matter for your particular site on our particular infrastructure. And if your site does not use a CDN at all, then every request hits your origin server directly, and that is where our system picks up.
This is origin-level bot protection. It runs on our edge servers, the same servers that host your websites, using signals that are only visible at the origin. It blocks bot traffic using patterns that are specific to our platform, patterns that a CDN sitting thousands of miles away simply cannot see.
How we decide what is a bot#
Every IP that sends requests to websites on our platform gets a threat score between 0 and 100. The score is computed from a dozen independent signals:
| Signal | What it catches |
|---|---|
| Rate limit violations | IPs repeatedly exceeding request rate thresholds |
| Scanner patterns | Requests for .env, wp-config.php, phpinfo, and other probe paths |
| Honeypot traps | Requests for paths no legitimate visitor would ever access |
| Multi-target spread | Same IP hitting multiple unrelated websites |
| Repeat offender history | IPs that were previously banned and came back |
| Regional traffic patterns | Requests from regions with historically elevated bot ratios |
| User agent analysis | Suspicious patterns like a single UA making hundreds of requests |
| Login targeting | Repeated attempts on wp-login.php, xmlrpc.php, admin panels |
| Blocklist presence | IP appears on curated threat intelligence feeds |
| VPN/proxy range | IP belongs to a known VPN or proxy provider |
| Cookie behavior | Whether the client accepts and returns cookies |
| Challenge outcomes | Whether the IP has solved or failed proof-of-work challenges |
Each signal has a maximum contribution. No single signal can push an IP past an action threshold on its own. We designed it this way deliberately. Someone using a VPN should not get challenged just because they are on a VPN. But a VPN IP that is also hitting rate limits, scanning for config files, and ignoring cookies? That is a different story.
Based on the total score, each IP falls into one of four categories:
- 0-20 (Clean): No action taken.
- 21-50 (Suspicious): Rate limiting applied, full request logging enabled.
- 51-80 (Malicious): Proof-of-work challenge served before the request reaches your site.
- 81-100 (Blocked): Banned across all servers. Flat 403 Forbidden.
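As a rough sketch, here is how capped per-signal contributions and the tier mapping could fit together. The signal names, caps, and point values below are illustrative placeholders, not our production configuration; only the tier thresholds match the list above:

```python
# Illustrative caps: the maximum number of points each signal can contribute.
# Names and values are assumptions for this sketch.
SIGNAL_CAPS = {
    "rate_violations": 25,
    "scanner_patterns": 30,
    "honeypot_hits": 30,
    "cookie_ignored": 15,
    "vpn_range": 10,
}

def threat_score(signals: dict) -> int:
    """Sum each signal's contribution, capped so no single signal dominates."""
    total = sum(min(points, SIGNAL_CAPS.get(name, 0))
                for name, points in signals.items())
    return min(total, 100)

def tier(score: int) -> str:
    """Map a 0-100 score to one of the four enforcement tiers."""
    if score <= 20:
        return "clean"
    if score <= 50:
        return "suspicious"
    if score <= 80:
        return "malicious"
    return "blocked"
```

Note how the caps encode the design rule from above: a VPN range alone maxes out at a few points and stays in the clean tier, but a VPN IP that is also rate limited, scanning, and ignoring cookies accumulates enough to cross a threshold.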
The scoring engine runs on a five-minute cycle. It aggregates signals, recalculates scores, and pushes updated enforcement lists to every server in the fleet within 30 seconds.
In practice, most automated traffic accumulates enough signals to get stopped within a few scoring cycles, before it has a chance to waste meaningful resources.
What happens at the edge#
The scoring engine is powerful but it is inherently reactive. It needs data to accumulate before it can make a decision. That is fine for persistent threats, but it means a brand new IP can send its first few requests without the backend even knowing it exists.
So we run a second layer of detection directly on each web server, inline with every request, before PHP or any application code runs. This is not a database lookup. It runs in memory, evaluating each request in microseconds using signals that are available right now:
- Current request rate for this IP
- Whether standard browser headers are present
- Whether the User-Agent matches known automation libraries
- The TLS fingerprint of the connection
- Whether the requested path matches known vulnerability probes
- Whether the client has returned our security cookie
If enough of these signals fire together, the server serves a proof-of-work challenge on the spot. It never bans based on this layer alone. The worst case for a false positive is a 2-second delay while the visitor’s browser solves the challenge.
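To make the idea concrete, here is a minimal sketch of an inline check that counts independent signals and challenges only when several fire together. The header names, automation prefixes, rate threshold, and the three-signal rule are all assumptions for illustration, not our actual edge logic:

```python
# Hypothetical automation markers and probe paths for this sketch.
AUTOMATION_UAS = ("curl/", "python-requests", "go-http-client", "wget/")
PROBE_PATHS = ("/.env", "/.git/config", "/wp-config.php.bak")

def should_challenge(req: dict, rate_per_min: int) -> bool:
    """Count signals for one request; headers are assumed lowercase-keyed.

    This layer only ever challenges -- it never bans on its own.
    """
    signals = 0
    signals += rate_per_min > 120                       # elevated request rate
    signals += "accept-language" not in req["headers"]  # missing browser header
    ua = req["headers"].get("user-agent", "").lower()
    signals += ua.startswith(AUTOMATION_UAS)            # known automation library
    signals += req["path"] in PROBE_PATHS               # vulnerability probe
    signals += "security_cookie" not in req["cookies"]  # never returned our cookie
    return signals >= 3  # challenge only when several fire together
```

A curl client blasting `/.env` with no browser headers trips most of these at once; a normal browser with a returned cookie trips none.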
This edge layer also feeds data back to the scoring engine. When it challenges or escalates an IP, that event is recorded and factored into the next scoring cycle. The two layers reinforce each other: the edge handles the immediate response, the backend builds the long-term picture.
The proof-of-work challenge#
When the system decides an IP needs to be verified, it serves a challenge page instead of your website. The page contains a computational puzzle that runs in the browser. A real browser with JavaScript solves it automatically in about 2 seconds. The visitor sees a brief loading screen, the puzzle completes, a signed cookie is set, and they are redirected to the page they originally requested.
This is not a CAPTCHA. There are no distorted letters, no traffic lights to click, no “prove you are human” checkboxes. It runs silently in the background.
Most bots cannot solve this. Simple HTTP clients like curl, wget, or Python’s requests library do not execute JavaScript at all. They see the challenge page as HTML and either give up or keep requesting the same page in a loop, which only makes their score climb faster. If a request cannot execute JavaScript, it never reaches your site.
[SCREENSHOT: The challenge page as seen by a visitor]
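The underlying mechanics are similar in spirit to hash-based proof-of-work: the server hands out a seed, the client searches for a nonce whose hash meets a difficulty target, and the server verifies with a single hash. This sketch uses SHA-256 and an arbitrary difficulty for illustration; our production challenge differs in the details:

```python
import hashlib

def solve(seed: str, difficulty_bits: int = 16) -> int:
    """Brute-force a nonce; the real browser does this in JavaScript."""
    target = 2 ** (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{seed}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(seed: str, nonce: int, difficulty_bits: int = 16) -> bool:
    """Verification is a single hash, so the server-side cost is negligible."""
    digest = hashlib.sha256(f"{seed}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < 2 ** (256 - difficulty_bits)
```

The asymmetry is the point: solving costs the client thousands of hash attempts on average, verifying costs the server exactly one.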
What about headless browsers?#
This is the question that comes up most often in technical discussions, and it deserves an honest answer.
Yes, tools like Puppeteer and Playwright can solve JavaScript challenges. They run a real browser engine, they execute JavaScript, and they can interact with pages in ways that look human. A well-configured headless Chrome instance with the right stealth plugins will solve our proof-of-work challenge just as easily as a regular browser.
We are not going to pretend otherwise.
But here is the practical reality. The vast majority of bot traffic we see on our platform is not headless browsers. It is curl. It is Python scripts. It is custom HTTP clients written in Go or Java that blast requests as fast as possible. It is vulnerability scanners running through lists of known exploit paths. These tools do not run JavaScript, and the proof-of-work challenge stops them completely.
For the small percentage of traffic that does use headless browsers, the challenge alone is not enough. But the challenge is not the only layer. The scoring engine still sees the behavioral patterns: the request timing, the path distribution, the cookie behavior, the TLS fingerprint. A headless browser that scrapes 500 pages in 10 minutes still triggers rate detection. A Puppeteer instance probing /.env still hits a honeypot trap. And running a headless browser is expensive. It consumes real compute resources on the attacker’s side. Most bot operators are not going to spin up a full Chrome instance for every request when they can hit thousands of other sites with a simple HTTP client for free.
The challenge raises the cost. The scoring engine catches the patterns that survive it. Together they block bot traffic that we actually see on our servers. We are not claiming to stop every possible attack vector. We are claiming to stop the ones that actually waste your server resources in practice. For a practical example of how these layers escalate against a real attack, see our WordPress login protection walkthrough.
Honeypot traps#
There are paths on any web server that a legitimate visitor would never request. No human types /.env or /.git/config into their browser. If someone requests these paths, they are probing for vulnerabilities.
We maintain a list of these trap paths, split into two tiers. Critical traps trigger on the first request. If an IP requests /.env, we do not wait to see what it does next. That single request tells us everything we need to know. Standard traps require two hits in a time window before triggering, covering paths that are suspicious but could theoretically be accidental.
When a trap triggers, the IP is immediately served a challenge, regardless of its current score. Intent matters. One request to /.git/config is more telling than a hundred normal page views.
WordPress paths like wp-login.php and wp-admin are explicitly excluded. Those are real paths that real people use.
How this protects WordPress sites#
WordPress bot protection requires understanding the specific attack surface of the platform. Most of the websites we host run WordPress, and that means we see every common WordPress attack pattern daily.
How to stop wp-login brute force attacks
Automated credential stuffing using leaked password lists is one of the most common attacks we see. These generate hundreds of POST requests per hour from rotating IPs. The scoring engine detects repeated wp-login.php failures through the login targeting signal, and combined with rate detection, the attacking IPs are challenged or banned within a few scoring cycles. We walk through a full credential stuffing attack scenario, from first request to platform-wide ban, in What Happens When Bots Find Your WordPress Login Page.
How to block xmlrpc.php abuse
The XML-RPC endpoint is a favorite for amplification attacks and brute force because it allows multiple login attempts in a single request. Our rate detection catches the volume patterns, and the behavioral scoring flags the repetitive POST-only traffic pattern. These IPs accumulate signals fast. You can also disable xmlrpc.php entirely from the WordPress Security tab in your control panel – see our WooCommerce security guide for the full list of server-level WordPress hardening options.
Plugin scanning and wp-cron abuse
Beyond login attacks, we see automated tools cycling through known CVEs for popular plugins, probing paths like /wp-content/plugins/revslider/ or /wp-content/plugins/wp-file-manager/. These hit honeypot traps and scanner detection. Bots also hammer wp-cron.php repeatedly, triggering unnecessary background processing. Both patterns generate scoring signals that compound quickly.
Protecting legitimate WordPress admins
Importantly, legitimate WordPress admin sessions are explicitly protected. Logged-in users bypass rate limiting and edge detection entirely. We identify them by their WordPress authentication cookies, so admins clicking through the dashboard at normal speed are never affected.
This server-level layer is designed to complement application-level security plugins like Wordfence, not replace them. For a detailed breakdown of what each layer handles and why both matter, see Wordfence and server-level security: why you need both.
The cookie test#
Most real browsers accept and return cookies. Most bots do not bother.
Every response includes a small security cookie. On subsequent requests, we check whether it comes back. An IP that sends hundreds of requests and never returns the cookie is almost certainly not running a real browser. This signal contributes to the overall score.
This is a strictly necessary security cookie used to verify browser behavior. It does not contain names, emails, or other direct identifiers, and is used only for abuse prevention.
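The signing side of such a cookie can be sketched with a standard HMAC: the cookie carries only the IP, a timestamp, and a signature, so it is verifiable without storing anything per visitor. The secret, cookie format, and field choices below are placeholders, not our actual scheme:

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # placeholder; a real secret is random and rotated

def issue_cookie(ip: str, issued_at: int) -> str:
    """Build a content-free cookie: payload plus an HMAC over it."""
    payload = f"{ip}:{issued_at}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def cookie_valid(cookie: str, ip: str) -> bool:
    """Check the signature and that the cookie was issued to this IP."""
    try:
        cookie_ip, issued_at, sig = cookie.rsplit(":", 2)
    except ValueError:
        return False  # malformed cookie
    expected = hmac.new(SECRET, f"{cookie_ip}:{issued_at}".encode(),
                        hashlib.sha256).hexdigest()
    return cookie_ip == ip and hmac.compare_digest(sig, expected)
```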
Threat intelligence feeds#
We pull in external data from several curated sources and sync them daily:
- FireHOL Level 1: IPs actively involved in attacks and abuse
- Tor exit nodes: Tor traffic is not inherently malicious, but combined with other signals it is worth tracking
- VPN provider ranges: About 7,000 CIDR ranges from commercial VPN services. This gets a lower score weight than the others because VPN usage is common among legitimate users
These feeds contribute points to the score but cannot trigger enforcement on their own. An IP on the FireHOL list that is browsing your site normally will pick up some points but will not get challenged unless other signals also fire.
Updated threat data reaches every server within 30 seconds of being processed.
Search engine protection#
Bot detection systems that accidentally block Googlebot are worse than useless. We spent significant time on this.
We maintain a list of over 25 verified search engine bots and crawlers. But we do not trust User-Agent headers alone. Anyone can set their User-Agent to “Googlebot”. We verify both the User-Agent claim and the source IP against the officially published IP ranges for each bot.
Only IPs that pass both checks are exempted. A fake Googlebot from a random datacenter IP gets treated like any other request.
The verified ranges are synced daily from a community-maintained database. If the source data looks corrupted or suspiciously small, the update is rejected automatically. We go deeper into why this matters for your SEO and what happens to fake bots in How We Verify Search Engine Bots.
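The dual check itself is straightforward to sketch with Python's `ipaddress` module. The ranges below are tiny illustrative samples, not the full published lists:

```python
import ipaddress

# Sample ranges only -- real verification uses each bot's full published list.
VERIFIED_BOTS = {
    "googlebot": ["66.249.64.0/19"],
    "bingbot": ["157.55.39.0/24"],
}

def is_verified_bot(user_agent: str, ip: str) -> bool:
    """Exempt only when the UA claim AND the source IP both check out."""
    ua = user_agent.lower()
    for bot, ranges in VERIFIED_BOTS.items():
        if bot in ua:
            addr = ipaddress.ip_address(ip)
            return any(addr in ipaddress.ip_network(r) for r in ranges)
    return False  # no bot claim in the UA: normal visitor path
```

A spoofed Googlebot UA from a datacenter IP outside the published ranges fails the second check and falls through to regular detection.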
Edge rate detection#
Before any scoring happens, our servers run real-time rate detection per IP. This catches the fast, obvious attacks that do not need behavioral analysis:
- High volume: Proof-of-work challenge served
- Very high volume: 15-minute ban
- Extreme volume: 60-minute ban
This runs entirely in memory. No database calls, no waiting for a scoring cycle. When an IP is blasting hundreds of requests per second, you want it stopped in microseconds, not minutes.
Logged-in WordPress users bypass this layer. We do not want admins getting rate limited for clicking around their dashboard. Rate detection is essential but it is only one layer. We explain why rate limiting alone is not enough and what fills the gaps in a separate post.
Adaptive scoring#
The rule-based scoring engine handles the majority of bot traffic effectively. But rules are static. They encode our understanding of what bot traffic looks like at a specific point in time.
To handle the cases that fall between the rules, the system includes an adaptive scoring layer that learns from real outcomes. When an IP solves a challenge, that is evidence of a real browser. When an IP is served a challenge and never solves it, that is evidence of a bot. Over time, these outcomes refine the system’s understanding of which combinations of signals actually predict bot behavior versus human behavior.
This layer can adjust an IP’s score by up to 20 points in either direction. It cannot ban an IP that the rules think is clean, and it cannot protect an IP that the rules think is dangerous. The rules always have the final word at the extremes. The adaptive layer operates in the gray zone where behavioral patterns matter most.
It also has a kill switch. If anything goes wrong, it can be disabled with a single configuration change and the system falls back to pure rule-based scoring immediately.
We are deliberately understating this component. It works, it is improving, and it catches patterns that static rules miss. But it is not magic. It is statistical pattern matching trained on real traffic data, with hard limits on how much influence it can have. The rule engine does the heavy lifting.
How the layers reinforce each other#
The most effective part of the system is not any single layer. It is how they interact.
A new IP visits your site. The system has never seen it. It passes the edge checks and is randomly selected for a challenge (we sample a percentage of unknown traffic to build behavioral data). If the visitor is human, their browser solves the challenge in 2 seconds and they continue normally.
If it is a bot, it fails the challenge. The failed challenge is recorded. On the next scoring cycle, the IP picks up points for having an unsolved challenge. That pushes it into the suspicious tier, which means every subsequent request from that IP gets challenged. Not randomly. Every time.
Now the bot has two choices. Solve the challenge (most cannot) or keep failing (which makes the score climb further). More failures, more signals, higher score, stricter enforcement. Eventually it crosses into the blocked tier and gets banned across every server.
The system tightens automatically. It does not loosen until the traffic proves it should.
Per-account IP whitelisting#
Sometimes the scoring engine flags a legitimate IP. Maybe it is your office IP, a monitoring service, or a payment processor callback. You can whitelist specific IPs from the control panel.
Whitelisted IPs bypass all enforcement on your websites. They still get scored globally (an IP legitimate for your site might be attacking someone else’s), but enforcement is skipped for your domains.
You can also whitelist /24 subnets for office networks or services that rotate through a block of addresses.
Lockdown mode#
When your site is under active attack and you need to buy time, lockdown mode challenges every single visitor. No exceptions based on score. Everyone gets a proof-of-work challenge.
This is intentionally aggressive. You would not leave it on for normal operations. But when the scoring engine is still catching up to a coordinated attack, lockdown stops everything at the door while the system builds its picture.
You can add exemptions for specific IPs, countries, or verified search engine bots.
Cross-server propagation#
When an IP gets banned on one server, that ban propagates to every server in the fleet within 30 seconds. Attackers cannot escape a ban by targeting a different server.
The same applies to challenges and threat scores.
Score decay#
Scores are not permanent. An IP that stops sending suspicious traffic sees its score decay:
- After 24-72 hours of inactivity: 20% reduction
- After 72+ hours: 50% reduction
This ensures legitimate IPs that triggered a false positive recover within days. Dynamic IPs that get reassigned do not carry the previous user’s penalty indefinitely.
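The decay schedule above reduces to a few lines; the exact implementation details are assumed:

```python
def decayed_score(score: int, hours_inactive: float) -> int:
    """Apply the inactivity decay tiers: 20% off after 24h, 50% off after 72h."""
    if hours_inactive >= 72:
        return round(score * 0.5)
    if hours_inactive >= 24:
        return round(score * 0.8)
    return score
```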
What this means for you#
All of this runs automatically. You do not need to configure anything.
From your control panel, you can see the threat details for any IP including which signals contributed to its score. You can whitelist IPs or subnets if you believe a legitimate visitor was incorrectly challenged. Your dashboard shows a traffic quality percentage that tells you how much of your traffic the system identified as bot versus human.
[SCREENSHOT: Signal breakdown for a specific IP]
What we do not catch#
No bot detection system catches everything. Ours does not either. Here is what gets through:
Sophisticated headless browsers with stealth configurations that solve challenges, rotate IPs, and mimic human browsing patterns. These exist. They are expensive to operate and rare in practice, but they exist.
Low-and-slow bots that send one or two requests per hour with perfect browser signatures. If a bot looks exactly like a human and behaves exactly like a human, there is no signal to detect.
Zero-day attack patterns that the rules have not been written for yet. The adaptive scoring layer helps here, but it takes time to learn new patterns.
We are not going to claim we stop 100% of bot traffic. That would be dishonest. What we stop is the 90-95% that wastes measurable server resources: the scanners, the brute forcers, the scrapers, the credential stuffers, and the bulk of automated traffic that every hosting platform deals with.
That is the traffic that slows down your sites, consumes server capacity, and costs real money. And that is what this system is built to handle.
Privacy and data use#
Our bot detection system processes IP addresses and request metadata to protect websites and infrastructure from abuse and automated attacks. This processing is based on legitimate interests (security) and is limited strictly to threat detection and mitigation.
Some decisions, such as rate limiting, challenges, or temporary blocking, are made automatically based on behavioral signals. These actions are designed to be reversible and are continuously refined to reduce false positives. Website owners can whitelist specific IPs at any time to override automated decisions for their domains.
Threat intelligence data is shared across our infrastructure solely for security purposes. We do not sell this data or use it for advertising or cross-context behavioral profiling. Threat scores automatically decay after inactivity, and security signals and related logs are retained only as long as needed for security and troubleshooting.
Frequently asked questions#
Does the proof-of-work challenge hurt SEO?
No. Verified search engine bots (Google, Bing, and over 25 others) are identified by both their User-Agent and their source IP against officially published ranges. They are exempted before any detection layer runs. A spoofed Googlebot from a random IP does not get this exemption. We explain the full verification process and why it matters for SEO in a separate post.
Can bots solve the JavaScript challenge?
Simple bots (curl, Python scripts, Go HTTP clients) cannot. Headless browsers like Puppeteer and Playwright can. But headless browsers are expensive to operate at scale, and the scoring engine still catches their behavioral patterns even if they solve the challenge. We cover this in detail in the headless browser section above.
Will this block VPN users?
No. VPN status alone cannot trigger enforcement. It contributes a small number of points to the overall score, but an IP needs multiple independent signals firing before any action is taken. A VPN user browsing your site normally will never be challenged.
Does the challenge page affect site speed?
Only for challenged visitors, and only on the first request. The challenge solves in about 2 seconds, sets a signed cookie, and redirects to the original page. Subsequent requests within the cookie’s validity period are not challenged again. Clean traffic is never affected.
What if a legitimate visitor gets challenged by mistake?
They solve the challenge in 2 seconds and continue browsing. If you know the IP is legitimate (your office, a monitoring service, a payment processor), you can whitelist it from the control panel. Whitelisted IPs bypass all enforcement on your domains.
Do I need to configure anything?
No. The system runs automatically on every Hostney server. You can optionally whitelist IPs, enable lockdown mode during attacks, or review threat details from your control panel, but no configuration is required for the protection to work.
Wrapping up#
This system exists because bot traffic is a real operational problem, not a theoretical one. Every request a bot makes is a request your actual visitors are sharing server resources with. We built this to stop bot traffic where it matters most: at the server, before it consumes the resources your real visitors need.
If your site is seeing unexplained CPU spikes, credential stuffing attempts, or scraping traffic, this system is already running on every Hostney server. You can see exactly what it is catching from the Firewall section in your control panel.
If you have questions about how the system handles your specific traffic patterns, reach out to our support team. We are happy to walk through the data with you.