When (Fake) Googlebots Attack Your Rails App
We’ve all scraped a site or two (rite?) but what do you do when bad actors start taking up a non-trivial amount of resources on your app? Further, what if bad actors are masking their user agent to appear as though they were googlebot?
Kickstarter open sourced their rack middleware designed to protect your web app from bad clients, named rack-attack.
It works by cutting off requests early in the process, a couple of milliseconds in, and returning a 429 Too Many Requests status.
In a recent project, web utilization was crossing 90% with database utilization approaching 60%. Memory utilization was creeping up and web response times were crossing 5 seconds. Things were not happy in Rails land.
An inspection of the logs and Skylight performance monitoring brought three things to my attention:
- There were quite a few requests per second to the product listing and searching routes
- There were three IP addresses that seemed to appear over and over and over again. They linked to bluehost.com servers and other not-very-legit addresses.
- The user agent being used was Googlebot’s
How to rate limit
This is my first line of defense – can I limit a user to a request per second? The rate you choose is pretty much up to you; I went with 300 requests over a 5-minute period. That averages out to 1 request per second per IP, while still allowing short bursts.
This particular app has two types of requests I didn’t want to rate limit:
- Assets like images, stylesheets, webfonts, javascripts, etc
- The load balancer (HAProxy) runs a /check on the server to check if it’s still online. We’ll let those go through unimpeded.
Rack::Attack.throttle('req/ip', limit: 300, period: 5.minutes) do |req|
  # Returning nil exempts the request: skip assets and the health check.
  req.remote_ip unless ['/assets', '/check'].any? { |path| req.path.start_with?(path) }
end
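To make the 300-per-5-minutes arithmetic concrete, here is a minimal sketch of a fixed-window counter keyed by IP. It illustrates the idea only and is not rack-attack's actual implementation: FixedWindowThrottle, throttle_key and UNTHROTTLED_PREFIXES are hypothetical names, and the in-memory hash never evicts old windows, so it is not something to run in production.

```ruby
# Hypothetical in-memory fixed-window counter; a sketch, not rack-attack's code.
class FixedWindowThrottle
  def initialize(limit:, period:)
    @limit = limit          # max requests allowed per window
    @period = period        # window length in seconds
    @counts = Hash.new(0)   # never evicted here; a real store would expire keys
  end

  # Counts the request and returns true once the key is over the limit.
  def throttled?(key, now: Time.now.to_i)
    window = now / @period  # integer index of the current window
    @counts[[key, window]] += 1
    @counts[[key, window]] > @limit
  end
end

# Paths that should never count against the limit (assets and the health check).
UNTHROTTLED_PREFIXES = ['/assets', '/check'].freeze

# Returns the key to count against, or nil when the request is exempt.
def throttle_key(path, ip)
  return nil if UNTHROTTLED_PREFIXES.any? { |prefix| path.start_with?(prefix) }
  ip
end

throttle = FixedWindowThrottle.new(limit: 300, period: 300)
results = Array.new(301) { throttle.throttled?('203.0.113.7', now: 1_000) }
puts results.count(true)  # only the 301st request in the window is throttled
puts 300.0 / 300          # average allowed rate, in requests per second
```

Because the window is fixed, a client can still burst all 300 requests in a few seconds and then wait out the rest of the window; the limit caps the average, not the instantaneous rate.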
How to block an IP address
For item #2 above, I blocked the bad actor IP addresses outright. The addresses below are placeholders, because of obvs reasons (obvs).
# Newer rack-attack releases call this method blocklist.
Rack::Attack.blacklist('block bad actors') do |req|
  ['10.1.1.1', '10.1.1.2'].include? req.ip
end
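If the offenders hop between neighbouring addresses, matching whole ranges may be easier than listing single IPs. The sketch below uses Ruby's standard IPAddr class for that. The ranges are documentation placeholders, not the real offenders, and BLOCKED_RANGES and blocked_ip? are hypothetical names introduced for illustration.

```ruby
require 'ipaddr'

# Placeholder ranges from the documentation address blocks, not real offenders.
BLOCKED_RANGES = ['192.0.2.0/24', '198.51.100.17'].map { |cidr| IPAddr.new(cidr) }.freeze

# True when the address falls inside any blocked range.
# Unparseable input is treated as "not blocked" in this sketch.
def blocked_ip?(ip)
  address = IPAddr.new(ip.to_s)
  BLOCKED_RANGES.any? { |range| range.include?(address) }
rescue IPAddr::Error
  false
end

puts blocked_ip?('192.0.2.55')
puts blocked_ip?('203.0.113.9')
```

Inside the block rule, the body would then shrink to a single call such as blocked_ip?(req.ip).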
Why would they pretend they’re googlebot?
I’ve seen rate limiters let Googlebot through by default. Because why block Google when you want Google to visit your site literally as often as possible because SEO?
Knowing this, it’s likely that as an evil-doer-scraper you’ll set your user-agent to match Googlebot’s to maximize the chance you’ll be let in, rate limits be forsaken and ignored.
How to reverse lookup user agents to verify Googlebots are actually googlebots?
It’s vitally important to let actual Googlebots through to your site, but what about fake lying liar googlebots? Those we want to block.
Google helpfully published Verifying Googlebot which states the following:
- There is no list of valid IPs to allow for Googlebot
- If you run host the-ip-address, it will return the host for that IP, such as crawl-66-249-66-1.googlebot.com
- All googlebots will end in googlebot.com or google.com
- If you run host crawl-66-249-66-1.googlebot.com, it should match the-ip-address you started with
Soooooo, if an HTTP request proclaims itself as a Googlebot user-agent, we could use the Resolv library in Ruby to verify it. Resolv is a thread-aware resolver written in Ruby, so a lookup does not block the world \o/
require 'resolv'

Rack::Attack.blacklist('googlebots who are not googlebots') do |req|
  if req.user_agent =~ /Googlebot/i
    begin
      # Reverse lookup, then forward lookup of the returned name.
      name = Resolv.getname(req.ip.to_s)
      reversed_ip = Resolv.getaddress(name)
      # Leading dot so lookalikes such as evilgooglebot.com do not pass.
      resolves_correctly = name.end_with?(".googlebot.com") || name.end_with?(".google.com")
      reverse_resolves = reversed_ip == req.ip.to_s
      is_google = resolves_correctly && reverse_resolves
      !is_google
    rescue Resolv::ResolvError
      # No DNS answer means we cannot verify the claim, so block it.
      true
    end
  end
end
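The lookup logic is easier to test once it lives in a plain method with the resolver passed in. The following is a sketch under that assumption: verified_googlebot?, GOOGLE_CRAWLER_SUFFIXES and FakeResolver are hypothetical names, and the fake only mimics the two Resolv calls used here (getname and getaddress) so the check can run without touching DNS.

```ruby
require 'resolv'

GOOGLE_CRAWLER_SUFFIXES = ['.googlebot.com', '.google.com'].freeze

# Forward-confirmed reverse DNS: the PTR name must be a Google host,
# and that name must resolve back to the IP we started with.
# `resolver` is any object that answers getname/getaddress like Resolv does.
def verified_googlebot?(ip, resolver: Resolv)
  name = resolver.getname(ip.to_s)
  return false unless GOOGLE_CRAWLER_SUFFIXES.any? { |suffix| name.end_with?(suffix) }
  resolver.getaddress(name).to_s == ip.to_s
rescue Resolv::ResolvError
  false
end

# Hypothetical stand-in for DNS, used only to exercise the logic offline.
class FakeResolver
  def initialize(ptr:, a:)
    @ptr = ptr  # ip => hostname
    @a = a      # hostname => ip
  end

  def getname(ip)
    @ptr.fetch(ip) { raise Resolv::ResolvError, "no name for #{ip}" }
  end

  def getaddress(name)
    @a.fetch(name) { raise Resolv::ResolvError, "no address for #{name}" }
  end
end

fake = FakeResolver.new(
  ptr: { '66.249.66.1' => 'crawl-66-249-66-1.googlebot.com',
         '192.0.2.10'  => 'host.evilgooglebot.com' },
  a:   { 'crawl-66-249-66-1.googlebot.com' => '66.249.66.1' }
)
puts verified_googlebot?('66.249.66.1', resolver: fake)
puts verified_googlebot?('192.0.2.10', resolver: fake)
```

With this in place, the block rule could reduce to checking the user agent and returning !verified_googlebot?(req.ip).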
How to make sure this still works behind your Proxy / Load Balancer
If you use HAProxy, this code is for you! Generally, your load balancer (HAProxy, Heroku, etc.) might present itself as the IP address the request is coming from, which makes every visitor look like the same client. You do not want to rate limit based on requests from the proxy.
We’ll change all req.ip to req.remote_ip and add this code, which looks for the HTTP_X_FORWARDED_FOR header added by most load balancers. If not found, it will default to the IP.
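class Rack::Attack
  class Request < ::Rack::Request
    # Prefer the address reported by the load balancer; fall back to the socket IP.
    # Note: X-Forwarded-For can hold a comma-separated list when several proxies are involved.
    def remote_ip
      @remote_ip ||= (env['HTTP_X_FORWARDED_FOR'] || ip).to_s
    end
  end
end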
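One caveat: X-Forwarded-For can carry several comma-separated addresses, and clients can send their own value. The sketch below shows one way to pick a single address out of it. It assumes exactly one trusted proxy sits in front of the app and that the proxy appends the connecting address as the last entry; check how your own load balancer behaves before relying on that. forwarded_client_ip is a hypothetical helper name.

```ruby
require 'ipaddr'

# Hypothetical helper: pick one client IP out of an X-Forwarded-For value.
# Assumption: a single trusted proxy appends the connecting address last,
# so earlier entries are client-supplied and not trustworthy.
def forwarded_client_ip(header_value, fallback_ip)
  candidates = header_value.to_s.split(',').map(&:strip).reject(&:empty?)
  candidate = candidates.last
  return fallback_ip.to_s if candidate.nil?

  # Round-trip through IPAddr so malformed values fall back to the socket IP.
  IPAddr.new(candidate).to_s
rescue IPAddr::Error
  fallback_ip.to_s
end

puts forwarded_client_ip('198.51.100.1, 203.0.113.7', '10.0.0.5')
puts forwarded_client_ip(nil, '10.0.0.5')
```

In the remote_ip override, this would stand in for the raw header read, with ip passed as the fallback.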
The Code
Final Code to make the awesome happen: