When (Fake) Googlebots Attack Your Rails App

We’ve all scraped a site or two (rite?), but what do you do when bad actors start taking up a non-trivial amount of resources on your app? Further, what if those bad actors are masking their user agent to appear as though they were Googlebot?

Kickstarter open sourced rack-attack, a Rack middleware designed to protect your web app from bad clients.

It works by cutting off requests early in the process, a couple of milliseconds in, and returning a 429 Too Many Requests status.
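
If you’re on Rails, getting rack-attack into the stack is only a couple of lines. A minimal sketch (MyApp is a placeholder for your application module; depending on your rack-attack version, the middleware may be inserted for you automatically, making the config.middleware line unnecessary):

# Gemfile
gem 'rack-attack'

# config/application.rb
module MyApp
  class Application < Rails::Application
    # Put rack-attack in the middleware stack so it can reject bad requests early
    config.middleware.use Rack::Attack
  end
end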

In a recent project, web utilization was crossing 90% with database utilization approaching 60%. Memory utilization was creeping up, and web response times were crossing 5 seconds. Things were not happy in Rails land.

An inspection of the logs and Skylight performance monitoring brought three things to my attention:

  1. There were quite a few requests per second to the product listing and search routes
  2. There were three IP addresses that seemed to appear over and over and over again. They linked to bluehost.com servers and other not-very-legit addresses.
  3. The user agent being used was Googlebot’s

How to rate limit

This is my first line of defense: can I limit a user to roughly a request per second? The rate you choose is pretty much up to you; I went with 300 requests over a 5 minute period, which averages out to 1 request per second per IP.

This particular app has two types of requests I didn’t want to rate limit:

  1. Assets like images, stylesheets, webfonts, javascripts, etc.
  2. The Load Balancer (HAProxy) runs a /check on the server to check if it’s still online. We’ll let those go through unimpeded.

Rack::Attack.throttle('req/ip', limit: 300, period: 5.minutes) do |req|
  # Count the request (keyed by IP) unless it's an asset or the health check
  req.remote_ip unless ['/assets', '/check'].any? { |path| req.path.start_with?(path) }
end
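
Throttle counters have to be stored somewhere; rack-attack defaults to Rails.cache, and you can point it at any ActiveSupport cache store instead. A minimal sketch (the MemoryStore here is just for illustration; in production you’d probably want a store shared across processes, such as a Redis- or memcached-backed cache, so every web worker sees the same counts):

# config/initializers/rack_attack.rb (a common spot for this configuration)
# Keep throttle counters in an in-process memory store instead of Rails.cache
Rack::Attack.cache.store = ActiveSupport::Cache::MemoryStore.new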

How to block an IP address

For #2 above, I blocked the bad actor IP addresses outright. IP addresses blurred because of obvs reasons (obvs).

Rack::Attack.blacklist('block bad actors') do |req|
  ['10.1.1.1', '10.1.1.2'].include? req.ip
end
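
A side note: newer versions of rack-attack rename blacklist to blocklist (and whitelist to safelist). If your version has dropped the old name, the same rule looks like this:

# Equivalent rule using the newer blocklist API
Rack::Attack.blocklist('block bad actors') do |req|
  ['10.1.1.1', '10.1.1.2'].include? req.ip
end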

Why would they pretend they’re Googlebot?

I’ve seen rate limiters let Googlebot through by default. Because why block Google when you want Google to visit your site literally as often as possible, because SEO?

Knowing this, it’s likely that as an evil-doer-scraper you’ll set your user agent to match Googlebot’s to maximize the chance you’ll be let in, rate limits be forsaken and ignored.

How to reverse-lookup IPs to verify Googlebots are actually Googlebots

It’s vitally important to let actual Googlebots through to your site, but what about fake lying liar Googlebots? Those we want to 429.

Google helpfully published Verifying Googlebot, which states the following:

  1. There is no list of valid IPs to allow for Googlebot
  2. If you run host the-ip-address, it will return the hostname for that IP, such as crawl-66-249-66-1.googlebot.com
  3. All Googlebot hostnames will end in googlebot.com or google.com
  4. If you then run host crawl-66-249-66-1.googlebot.com, it should resolve back to the-ip-address you started with

Soooooo, if an HTTP request proclaims itself as a Googlebot user agent, we can use the Resolv library in Ruby to verify it. Resolv is concurrent and does not block the world \o/

require 'resolv'

Rack::Attack.blacklist('googlebots who are not googlebots') do |req|
  if req.user_agent =~ /Googlebot/i
    begin
      # Reverse lookup (IP -> hostname), then forward lookup (hostname -> IP)
      name = Resolv.getname(req.ip.to_s)
      reversed_ip = Resolv.getaddress(name)

      # Leading dots so a hostname like notgooglebot.com can't sneak through
      resolves_correctly = name.end_with?(".googlebot.com") || name.end_with?(".google.com")
      reverse_resolves = reversed_ip == req.ip.to_s

      is_google = resolves_correctly && reverse_resolves

      # Blacklist (true) only when the claim doesn't check out
      !is_google
    rescue Resolv::ResolvError
      # Couldn't resolve at all, so it's definitely not Googlebot
      true
    end
  end
end

How to make sure this still works behind your Proxy / Load Balancer

If you use HAProxy, this code is for you! Generally, your load balancer (HAProxy, Heroku, etc.) might present itself as the IP address the request is coming from. You do not want to rate limit based on the proxy’s IP.

We’ll change all req.ip calls to req.remote_ip and add this code, which looks for the HTTP_X_FORWARDED_FOR header added by most load balancers. If the header isn’t present, it falls back to the IP Rack sees.

class Rack::Attack
  class Request < ::Rack::Request
    def remote_ip
      @remote_ip ||= (env['HTTP_X_FORWARDED_FOR'] || ip).to_s
    end
  end
end
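
One wrinkle worth noting (this variant is mine, not part of the original code): X-Forwarded-For can carry a comma-separated chain when the client sends its own header and HAProxy appends the address it actually saw. In that case the whole string isn’t a single IP, and the client-supplied entries can be forged. A sketch that trusts only the last entry, assuming HAProxy is the single proxy in front of the app and is configured to append the client IP (option forwardfor):

class Rack::Attack
  class Request < ::Rack::Request
    def remote_ip
      # X-Forwarded-For may look like "forged-by-client, ..., ip-seen-by-haproxy".
      # The last entry was appended by our own proxy, so it's the one to trust.
      @remote_ip ||= (env['HTTP_X_FORWARDED_FOR'].to_s.split(',').last || ip).to_s.strip
    end
  end
end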

The Code

Final Code to make the awesome happen:
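
(Assembled from the snippets above; assuming it all lives in config/initializers/rack_attack.rb, a common spot for this kind of configuration.)

require 'resolv'

class Rack::Attack
  class Request < ::Rack::Request
    def remote_ip
      @remote_ip ||= (env['HTTP_X_FORWARDED_FOR'] || ip).to_s
    end
  end
end

# Rate limit everything except assets and the load balancer health check
Rack::Attack.throttle('req/ip', limit: 300, period: 5.minutes) do |req|
  req.remote_ip unless ['/assets', '/check'].any? { |path| req.path.start_with?(path) }
end

# Block the known bad actor IPs outright
Rack::Attack.blacklist('block bad actors') do |req|
  ['10.1.1.1', '10.1.1.2'].include? req.remote_ip
end

# Block clients claiming to be Googlebot whose IPs don't verify via reverse DNS
Rack::Attack.blacklist('googlebots who are not googlebots') do |req|
  if req.user_agent =~ /Googlebot/i
    begin
      name = Resolv.getname(req.remote_ip)
      reversed_ip = Resolv.getaddress(name)

      resolves_correctly = name.end_with?(".googlebot.com") || name.end_with?(".google.com")
      reverse_resolves = reversed_ip == req.remote_ip

      !(resolves_correctly && reverse_resolves)
    rescue Resolv::ResolvError
      true
    end
  end
end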