Webserver Bandwidth Limiting in Apache

by Marion Bates <mbates at whoopis.com>

Running a website with a large quantity of useful content is a double-edged sword. You get a lot of hits, your search rank goes up, and people like your site. That's wonderful, but for every ten legitimate users, there's at least one LUSER who decides that he/she wants the whole site. This is bad enough under normal circumstances, but it's even worse when you're on a cohosted link and/or a shared server co-op, because it ruins the experience for the other clients AND the other servers.

Such was the situation recently, when I was made aware that a section of my website was generating the vast majority of network traffic on a shared machine on a shared network. I scanned quickly for runaway/zombie procs and saw nothing except an unusually large number of httpd processes. I checked my access log with tail -f and saw, scrolling down the screen faster than I could read them, log entries like

[a-luser-ip] - - [07/Feb/2005:12:15:15 -0500] "HEAD / HTTP/1.1" 301 - "http://www.mysite.com/" "SiteSucker/1.6.4"
[another-luser-ip] - - [07/Feb/2005:12:47:50 -0500] "HEAD / HTTP/1.0" 200 - "http://www.mysite.com/" "Wget/1.7"

...in other words, some users were running programs designed specifically to retrieve the entire content of a website (as in the case of "SiteSucker" -- how subtle), or they were using other tools to achieve the same result (as in the case of wget, a very useful little utility which is frequently abused for this sort of thing).

Well, of course I didn't want to shut down a site that so many people find useful, but this abuse couldn't continue. I added a couple of the offending IPs to my firewall ban list, but I needed a more long-term, comprehensive solution. So I started to google around for bandwidth-throttling measures specific to web traffic, and I came across bw_mod by Ivan "Bruce" Barrera. It, along with a couple of built-in directives in Apache, looks to be the easiest solution to my problem, and here's how I implemented it.
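For the record, the stopgap firewall ban can be done with a couple of iptables rules. A minimal sketch (the IPs below are placeholders, not the actual offenders; adjust for your own firewall setup):

```shell
# Emit iptables rules to drop traffic from known-abusive IPs.
# Placeholder IPs -- substitute the offenders from your own logs.
BANLIST="10.0.0.1 10.0.0.2"
for ip in $BANLIST; do
    # -I puts each rule at the top of the chain, ahead of any ACCEPT rules.
    # Remove the "echo" to actually apply the rules (as root).
    echo iptables -I INPUT -s "$ip" -j DROP
done
```

This is a dry run; it only prints the commands so you can eyeball them before applying.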

Part I: bandwidth mod | Part II: BrowserMatch

System information: RedHat Linux 9 for x86, Apache 2.0.40 from stock RPMs, 10MBit network link shared by approximately 20 other linux servers. PHP, SSL, and realm authentication all enabled.

Part I: There's very little I need to add about bandwidth mod (bw_mod), since its author's readme explains everything you need to know in a very easy-to-follow, non-technical way. I'll simply condense it and leave out the "if this doesn't work, try that" sections which he kindly included.

  1. Get httpd-devel. If you don't already have it, install it. I did so from the RPM:
    rpm -Uvh /home/mbates/rh9rpms/httpd-devel-2.0.40-21.i386.rpm 
  2. Download bw_mod. Download the current version of bw_mod from the website (note that the version may change; his last update was dated only about two weeks ago). As of this writing, it's v. 0.5rc1: http://www.ivn.cl/apache/bw_mod-0.5rc1.tgz If the version number has changed, you will need to modify the following commands to match.

    Untar it with

    tar -xvzf bw_mod-0.5rc1.tgz

    This will create a directory called "bw_mod-0.5".

  3. Install bw_mod. As root, do:
    cd bw_mod-0.5
    /usr/sbin/apxs -i -a -c bw_mod-0.5rc1.c

    NOTE: your binary may be called "apxs2", and/or may have a different path.

    You should see output like:

    /usr/lib/httpd/build/libtool --silent --mode=compile gcc -prefer-pic -O2 -g -pipe -march=i386 -mcpu=i686 -I/usr/kerberos/include -DAP_HAVE_DESIGNATED_INITIALIZER -DLINUX=2 -D_REENTRANT -D_XOPEN_SOURCE=500 -D_BSD_SOURCE -D_SVID_SOURCE -D_GNU_SOURCE -pthread -DNO_DBM_REWRITEMAP -I/usr/include/httpd  -c -o bw_mod-0.5rc1.lo bw_mod-0.5rc1.c && touch bw_mod-0.5rc1.slo
    /usr/lib/httpd/build/libtool --silent --mode=link gcc -o bw_mod-0.5rc1.la -rpath /usr/lib/httpd/modules -module -avoid-version   bw_mod-0.5rc1.lo
    /usr/lib/httpd/build/instdso.sh SH_LIBTOOL='/usr/lib/httpd/build/libtool' bw_mod-0.5rc1.la /usr/lib/httpd/modules
    /usr/lib/httpd/build/libtool --mode=install cp bw_mod-0.5rc1.la /usr/lib/httpd/modules/
    cp .libs/bw_mod-0.5rc1.so /usr/lib/httpd/modules/bw_mod-0.5rc1.so
    cp .libs/bw_mod-0.5rc1.lai /usr/lib/httpd/modules/bw_mod-0.5rc1.la
    cp .libs/bw_mod-0.5rc1.a /usr/lib/httpd/modules/bw_mod-0.5rc1.a
    ranlib /usr/lib/httpd/modules/bw_mod-0.5rc1.a
    chmod 644 /usr/lib/httpd/modules/bw_mod-0.5rc1.a
    PATH="$PATH:/sbin" ldconfig -n /usr/lib/httpd/modules
    Libraries have been installed in:
    If you ever happen to want to link against installed libraries
    in a given directory, LIBDIR, you must either use libtool, and
    specify the full pathname of the library, or use the `-LLIBDIR'
    flag during linking and do at least one of the following:
       - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
         during execution
       - add LIBDIR to the `LD_RUN_PATH' environment variable
         during linking
       - use the `-Wl,--rpath -Wl,LIBDIR' linker flag
       - have your system administrator add LIBDIR to `/etc/ld.so.conf'
    See any operating system documentation about shared libraries for
    more information, such as the ld(1) and ld.so(8) manual pages.
    chmod 755 /usr/lib/httpd/modules/bw_mod-0.5rc1.so
    [activating module `bw' in /etc/httpd/conf/httpd.conf]
  4. Check/correct httpd.conf. This is the only part of the installation where I had trouble, and I'm not sure why, because I couldn't reproduce the result: the installation seemed to have stuck the new directive in the wrong part of httpd.conf, and it wasn't working until I moved it.

    It started off with the new LoadModule line inside the worker.c block, like this (the bw_module line is the new one):

    <IfModule worker.c>
    LoadModule bw_module          /usr/lib/httpd/modules/bw_mod-0.5rc1.so
    LoadModule cgid_module modules/mod_cgid.so

    After some troubleshooting, I deleted it from there, and placed it a few lines up, at the end of the long LoadModule section.

  5. Turn it on and create your bandwidth settings. Just below the LoadModule line, I added the directive to enable the module and configured the bandwidth limits for my specific needs. So the whole section looks like:
    LoadModule proxy_http_module modules/mod_proxy_http.so
    LoadModule proxy_connect_module modules/mod_proxy_connect.so
    ##### NEW added Feb 7 2005 to limit bandwidth
    # First load the module
    LoadModule bw_module          /usr/lib/httpd/modules/bw_mod-0.5rc1.so
    # Now enable it
    BandWidthModule On
    # Now set the default, which is no limit -- 
    # we will tweak it later.
    BandWidth all 0
    # PDFs larger than 1MB go at 10k/sec max
    LargeFileLimit .pdf 1000 10000
    # No more than 40 connections
    MaxConnection all 40
    ##### end bandwidth limit section
    <IfModule prefork.c>
    LoadModule cgi_module modules/mod_cgi.so

Note that there is much more to bw_mod than this. In my particular case, the files that users (and lusers) are after all happen to be PDFs, and one of bw_mod's features is the ability to limit based on filename. More options and examples can be found in the readme enclosed with the program.
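For illustration, here is roughly what a few of those other options look like. (The partial-IP form of BandWidth and the MinBandWidth directive are from my reading of the readme, not from my own running config -- double-check the exact syntax against the copy enclosed with your version.)

```apache
# Throttle one problem network to 2 KB/sec; partial IPs match as prefixes
BandWidth 192.168.100 2048
# Everyone else: no limit
BandWidth all 0
# Any file over 500 KB gets capped at 20 KB/sec, whatever its type
LargeFileLimit * 500 20000
# But guarantee every client at least 5 KB/sec
MinBandWidth all 5000
```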

Don't forget to restart apache (service httpd restart) after applying any of these changes.

Part II: Apache has a built-in directive called BrowserMatch, and its sister BrowserMatchNoCase. These let you set an environment variable based on a client's UserAgent string, which you can then use to deny access -- no mod_rewrite required. The UserAgent string can be spoofed, but this will take care of the vast majority of your would-be bandwidth hogs.

  1. To httpd.conf, add the following:
    BrowserMatchNoCase ^NameOfBadProgram1 nameofenv
    BrowserMatchNoCase ^NameOfBadProgram2 nameofenv
    BrowserMatchNoCase ^NameOfBadProgram3 nameofenv
    Use the same "nameofenv" value for all of the agents you want to block. I added this section to some preexisting BrowserMatch directives that had to do with forcing HTTP responses to certain browser versions. Here's what mine looks like right now; I will be adding to it as my logs reveal new twits:
    # NEW Feb 7 2005 anti-bandwidth-sucker measures
    BrowserMatchNoCase ^wget suckers
    BrowserMatchNoCase ^SiteSucker suckers
    BrowserMatchNoCase ^iGetter suckers
    BrowserMatchNoCase ^larbin suckers
    BrowserMatchNoCase ^LeechGet suckers
    BrowserMatchNoCase ^RealDownload suckers
    BrowserMatchNoCase ^Teleport suckers
    BrowserMatchNoCase ^Webwhacker suckers
    BrowserMatchNoCase ^WebDevil suckers
    BrowserMatchNoCase ^Webzip suckers
    BrowserMatchNoCase ^Attache suckers
    BrowserMatchNoCase ^SiteSnagger suckers
    BrowserMatchNoCase ^WX_mail suckers
    BrowserMatchNoCase ^EmailCollector suckers
    BrowserMatchNoCase ^WhoWhere suckers
    BrowserMatchNoCase ^Roverbot suckers
    BrowserMatchNoCase ^ActiveAgent suckers
    BrowserMatchNoCase ^EmailSiphon suckers

  2. Now, inside the Directory blocks for each directory you want to apply these to, put
    deny from env=suckers 
    You could also add this to individual directories' .htaccess files, but I have not tried this yet. I simply added it at the end of my main Directory block:
    <Directory /home/www.whoopis.com/html>
    		Options Indexes FollowSymLinks
    		AllowOverride AuthConfig
    		deny from env=suckers
    </Directory>

You can create other environments for things like known email-harvesting bots, known-evil web spiders, etc. and do more creative things based on which type of malicious visitors they are. I am content to just refuse connections from all of them.
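For instance, here is a sketch of how that separation might look. (The "harvesters" class name and the per-class handling are purely illustrative; the directives are the same ones used above, plus Apache's standard Order/Allow access controls.)

```apache
# Classify email harvesters separately from download accelerators
BrowserMatchNoCase ^EmailSiphon harvesters
BrowserMatchNoCase ^EmailCollector harvesters
BrowserMatchNoCase ^wget suckers

<Directory /home/www.whoopis.com/html>
    Order allow,deny
    Allow from all
    # Harvesters get nothing at all...
    Deny from env=harvesters
    # ...and download accelerators are banned from this tree too
    Deny from env=suckers
</Directory>
```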

Don't forget to restart apache (service httpd restart) after applying any of these changes.

Looking for web spiders and site suckers: Here's a shortcut for identifying all user agents in your logs.

cat access.log | awk -F '"' '{print $6}' | sort | uniq | grep -v Mozilla

Translation: run the logfile through awk, treating double-quotes as field separators; pick out the 6th field (the UserAgent field); sort the resulting list; toss out duplicates; and don't show anything containing "Mozilla". NOTE that many programs include "Mozilla" in their useragent strings; also, some download accelerators operate as plugins to the regular browser and append their useragent specifics to the browser's string. So, if you want to be more thorough, leave off everything after "uniq" and you'll see it all -- including stuff like

Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; CDSource=v13b.08; FunWebProducts)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Alexa Toolbar; mxie; Maxthon)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Sgrunt|V104|615|S-132489664; SV1; InterFREE Kit; .NET CLR 1.1.4322)

which may or may not be legit, but are definitely unusual.

This will take a while for large logs. I filter out "Mozilla" because it appears in the useragent of most normal browsers, including IE.

Also, this assumes you are using "combined" log format. If you don't get the results you expect (see below), then try replacing the "$6" with another number -- your log format may order the fields differently.
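A related trick: if you also want to see how many hits each agent accounts for, have uniq count the duplicates instead of discarding them (same combined-format assumption as above):

```shell
# Tally hits per UserAgent, busiest first. Assumes "combined" log
# format, where the agent is the 6th double-quote-delimited field.
cat access.log | awk -F '"' '{print $6}' | sort | uniq -c | sort -rn
```

The busiest agents bubble to the top, which makes the site suckers easy to spot even when they hide among legitimate browsers.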

Sample results:

bash2-2.05b# cat /var/log/httpd/access.log | awk -F '"' '{print $6}' | sort | uniq | grep -v Mozilla

Advanced Browser (http://www.avantbrowser.com)
Avant Browser (http://www.avantbrowser.com)
CoralWebPrx/0.1.12 (See http://coralcdn.org/)
DA 5.3
ELinks/0.11.CVS (textmode; Linux 2.6.10 i686; 142x68-3)
FDM 1.x
Googlebot/2.1 (+http://www.google.com/bot.html)
HAM version
Html Link Validator (www.lithopssoft.com)
IRLbot/1.0 (+http://irl.cs.tamu.edu/crawler)
Iltrovatore-Setaccio/1.2 (It-bot; http://www.iltrovatore.it/bot.html; info@iltrovatore.it)
LeechGet 2004 (www.leechget.net)
Links (2.1pre15; Linux 2.6.7-hardened-r16 i686; 80x40)
Lynx/2.8.4rel.1 libwww-FM/2.14
Opera/7.50 (X11; Linux i386; U)  [en]
SIE-M55/10 UP.Browser/ (GUI) MMP/1.0 (Google WAP Proxy/1.0)
SafariBookmarkChecker/1.26 (+http://www.coriolis.ch/)
Space Bison/0.02 [fu] (Win67; X; SK)
SurveyBot/2.3 (Whois Source)
appie 1.1 (www.walhello.com)
curl/7.10.2 (powerpc-apple-darwin7.0) libcurl/7.10.2 OpenSSL/0.9.7b zlib/1.1.4
findlinks/0.87 (+http://wortschatz.uni-leipzig.de/findlinks/)
gamekitbot/1.0 (+http://www.uchoose.de/crawler/gamekitbot/)
iCab/2.9.8 (Macintosh; U; 68K)
iGetter/2 (Macintosh; U; PPC Mac OS X; en)
larbin_2.6.3 (larbin2.6.3@unspecified.mail)
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
pipeLiner/0.7 (PipeLine Spider; http://www.pipeline-search.com/webmaster.html; webmaster@pipeline-search.com)
psbot/0.1 (+http://www.picsearch.com/bot.html)
updated/0.1beta (updated.com; http://www.updated.com; crawler@updated.om)

When you see something odd or suspicious (like "LeechGet", gee I wonder what that does), Google around for it. If the name is too general, add "user agent" or "spider" or "search engine" to your query.