SAGE-AU 2002: TrafficWatch

Tim Bell, Peter Hawkins and Richard Wraith

Trinity College, The University of Melbourne

Abstract

Controlling or recovering the cost of internet traffic is very important in countries such as Australia, where traffic is charged by the byte. Until recently, Trinity College bore the cost of its students' downloading, charging only a flat fee for network access. In 2002, however, we started accounting for internet use and charging students for use above a set quota.

The system we have implemented, TrafficWatch, collects traffic data from a number of sources and aggregates it for display and billing purposes. Students register their computers' MAC addresses, and also authenticate for all web proxy access. The network is set up to prevent them from abusing the system by setting a network address from a staff network. The system is written mainly in Python, and uses a PostgreSQL backend.

In addition to the technical aspects of the TrafficWatch project, this paper also discusses the management issues involved, and reports on our experience using TrafficWatch over the last few months.

Background

Trinity College is an affiliated college of the University of Melbourne, with about 280 resident students and tutors, 100 theological students (online distance education and on-site), and 850 foundation studies students.

Along with the other colleges, we connected to the Internet via the university in 1996. Each college had a 10 Mbit/s half-duplex connection, which impressed us at the time but is less impressive now. The university passed on data charges to us at the rates it was itself charged, but under the terms of our agreement we were not permitted to on-charge students for data. We covered our costs with a flat "network-access" charge of $125 per semester per student.

Times change, circumstances change (the Australian Academic and Research Network, AARNet, is now a carrier, and the University of Melbourne has changed its view on allowing us to pass on charges to individual students), and download sizes change. To charge the big users for their excess downloads while still providing a cheap service for the more responsible students, we wanted to start watching and accounting for network traffic. And so the need for TrafficWatch emerged.

TrafficWatch, The Original

The idea for a TrafficWatch system originated with Jack Otto, formerly of Newman College. He implemented a system which was in operation at Newman, tracking the use of a squid web proxy. The system was written using PHP, awk, shell scripts, and a MySQL back end.

Further investigation suggested that the Newman system was not suitable for use at Trinity. It made assumptions about the network services which would be provided; it wasn't very flexible; and the implementation was not conducive to adding the new features we would need to meet our requirements. Since other colleges were also interested in adopting a traffic monitoring system, it was felt that a fresh start was needed so that all requirements could be met. The only component to remain was the name.

The Next Generation

I.T. staff from a number of colleges met to discuss what should be implemented in a new system, how it should be implemented, who would do the implementation, and how it would be paid for.

Requirements

The key requirements identified for the new system were that it should:

Working out the details of the requirements was by far the most painful part of the whole project. The colleges not only had different requirements; their philosophies of traffic management often differed widely (for example, on whether all web accesses should be required to go through a caching proxy). At one stage we wondered whether creeping featurism would make the project unviable. Compromises were made, and we were finally able to agree on a system which would be workable for all colleges and feasible to implement in the time available.

We considered whether an existing package (either open source or proprietary) might meet our needs. There are a number of packages which do traffic accounting (e.g. NeTraMet [8], Traff [11] and TraffAcct [12]), but none we were aware of at the time seemed to meet our exact requirements. (To be honest, we adopted the philosophy often held by research students: a few months' work in the lab can save several hours of research in the library.) In any case, we felt that a home-grown system would give us maximum flexibility and configurability for the varied requirements of the different colleges.

Theory of operation

TrafficWatch is set up differently for each college which uses it, according to their I.T. policies and other requirements. Each college has the ability to control students' internet access through firewall rules and the choice of policies for services such as web proxies. Depending on the college policy and each student's current status (over/under quota, (un)registered, (not) paid up), access is allowed or denied.

At the start of each academic year, student details are loaded into the database. At a minimum, this includes each student's username and email address. The authentication system to be used (e.g. LDAP, password file) is also set up with usernames and passwords, which will be required for students to register and when using authenticated access methods.

To register a MAC address, a student connects their computer to the college network and visits the registration web page, running on the router computer. (Actually, any attempted web accesses are redirected to the registration page until the student has registered.) There, a script identifies the requesting MAC address, and stores that along with the authenticated username in the database.

Once a MAC address is registered, rules are set up on the router to permit that address to access the internet (subject to any firewall restrictions). Details of the size and source IP address of all packets passed to that MAC address are logged on the router, and periodically uploaded to the database.

Services which provide data from the internet to students (such as web proxies) require authentication before use. Each access is then logged with the size of the data downloaded and the username. These logs are periodically uploaded to the database.

A number of web pages provide information on current and historic usage. Students can check to see how much of their quota they have used, or what their bill is so far if they are over quota. Administrators have access to the same information, as well as a breakdown of usage across the different data costings (e.g. AARNet2 traffic, Australian domestic, international).

Feedback on quota usage is also provided to students by email. The first time a student goes over quota, and thereafter once a month, an email is sent detailing how much is owed. These bills are also passed on to the accounting department.

The rate at which data is charged is determined by where the data originates. For example, data from the Victorian Regional Network (which includes all Victorian universities) is free, data from the AARNet mirror is (currently) charged at $15/GB, and international traffic is $68/GB. The rates are adjusted manually when necessary, while the tables which identify which class an IP address belongs to are automatically updated periodically from routing tables provided by the University of Melbourne.
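The charging rule described above can be sketched in Python as follows. This is a minimal illustration, not the TrafficWatch code: the network ranges are placeholder documentation prefixes rather than the real routing tables, the class names are invented, and treating any unmatched address as international is an assumption. Only the dollar rates come from the text.

```python
import ipaddress

# Hypothetical traffic classes. The prefixes are placeholders (RFC 5737
# documentation ranges), not the university's routing tables. The rates
# are those quoted in the text, in dollars per gigabyte.
CLASSES = [
    ("regional", ipaddress.ip_network("192.0.2.0/24"), 0.0),
    ("aarnet_mirror", ipaddress.ip_network("198.51.100.0/24"), 15.0),
]
# Assumption: anything not matched above is charged as international.
DEFAULT_CLASS = ("international", 68.0)

GB = 1000 ** 3  # assumption: decimal gigabytes


def classify(remote_ip):
    """Return (class name, rate in $/GB) for a remote IP address."""
    addr = ipaddress.ip_address(remote_ip)
    for name, network, rate in CLASSES:
        if addr in network:
            return name, rate
    return DEFAULT_CLASS


def cost(remote_ip, nbytes):
    """Return the charge in dollars for nbytes received from remote_ip."""
    _, rate = classify(remote_ip)
    return rate * nbytes / GB
```

Under these assumptions, `cost("198.51.100.7", GB)` charges one gigabyte at the mirror rate, while traffic from the regional range costs nothing.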

Choice of tools and implementation

The choice of which tools should be used was always going to be contentious. In the end, PostgreSQL [9] was chosen over MySQL (for its better transaction support), and Python [10] was chosen over Perl or PHP (for its cleaner code, which should hopefully improve maintainability). The final justification for these choices was that they were the preferred choice of the student who would be implementing the system.

Trinity College employed a summer student (P.H.) to do the implementation over the period December 2001 to February 2002. Over the same period, the I.T.&T. Manager was preoccupied with the installation of a new telephone system, and the System Administrator was on holiday in Europe, so the implementation was carried out with little direct supervision. Despite this (or perhaps because of it), the implementation was completed in time for deployment in February, before students returned for the start of first semester.

System description

For simplicity, the deployment of the system at Trinity College is described; the deployment at the other colleges where TrafficWatch is used is similar, with slight differences in quota policies, which services are tracked, etc. Except where otherwise noted, all software packages used are standard Debian GNU/Linux packages [2].

The system has components running on a number of servers, which communicate with the database which stores all traffic details. These servers and their roles are shown in Figure 1 and are further described below.

Figure 1. Network and servers

Database

The database uses PostgreSQL version 7.x or later. Data are stored in seven main tables:

The traffic log table stores one row per user per sourceid per day and is the largest table. As of this writing (late May), there are about 70,000 rows in this table, corresponding to about 280 students. The whole database is only a few megabytes in size.
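The one-row-per-user-per-sourceid-per-day granularity keeps the table small, because individual accesses are summed before they are stored. A rough sketch of that aggregation step is given below. The record layout and the use of UTC dates are assumptions made for illustration; the paper does not specify either.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate(records):
    """Sum bytes per (username, sourceid, day).

    `records` is an iterable of (timestamp, username, sourceid, nbytes)
    tuples. This layout is hypothetical; days are taken in UTC here.
    """
    totals = defaultdict(int)
    for timestamp, username, sourceid, nbytes in records:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
        totals[(username, sourceid, day)] += nbytes
    return dict(totals)
```

However many accesses a user makes in a day, each combination of user, source and day contributes a single total.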

The database server also hosts the web pages for students and administrators to check usage. These are served with Apache, mod_python [7] and Python, with graphs created using Gnuplot [3]. Usage can be displayed for any period, and the data are available up to the start of the day. An example of a data usage page is shown in Figure 2:

Figure 2. Data usage history

Router

Using a Linux [6] box as a router has the advantage of flexibility to support the requirements of a system such as TrafficWatch. The router has three primary roles relating to TrafficWatch: registration of MAC addresses, preventing access by unregistered MAC addresses, and logging access for each MAC address. In addition, the separation of the student network from the other networks in the college prevents students from using non-accounted IP addresses, since these are not routed to the student network. The router also runs DHCPD to automatically assign IP addresses.

MAC address registration

Registration is done by a Python script running under Apache. A student with an unregistered MAC address who attempts to access web pages is redirected to the MAC address registration form. The script looks up the MAC address corresponding to the requesting IP address in the ARP table, and then prompts the student for their username and password. It then authenticates the user against an LDAP server and inserts the MAC-address-to-username mapping into the database.
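The ARP lookup step might look something like the sketch below. It assumes the table is read in the column layout of the Linux `/proc/net/arp` file. The paper does not say how the script actually obtains the ARP table, so this is only one plausible approach, and the function name is invented.

```python
def mac_for_ip(arp_table_text, ip):
    """Return the MAC address recorded for `ip`, or None if not found.

    `arp_table_text` is text in the layout of Linux's /proc/net/arp:
    IP address, HW type, Flags, HW address, Mask, Device.
    """
    lines = arp_table_text.splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4 and fields[0] == ip:
            mac = fields[3].lower()
            # An all-zero hardware address marks an incomplete entry.
            if mac != "00:00:00:00:00:00":
                return mac
    return None


SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
203.28.240.189   0x1         0x2         00:30:ca:b8:d9:bc     *        eth1
203.28.240.190   0x1         0x0         00:00:00:00:00:00     *        eth1
"""
```

On a live router, the text could be read from `/proc/net/arp` and the IP address taken from the web request.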

Access control

Every five minutes, an updated list of registered MAC addresses is uploaded to the router. This is then processed into a list of iptables [4] rules. These rules match packets from the student network whose source MAC address is registered, and mark them. The desired access policy can then be implemented using static rules which permit services to marked packets.
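As an illustration of turning a MAC address list into marking rules, the sketch below emits iptables command lines using the standard `mac` match and `MARK` target. The chain name and mark value are hypothetical, and the chain is assumed to exist already in the mangle table and to be reached from PREROUTING. The rules actually generated by TrafficWatch may differ.

```python
def normalize_mac(mac):
    """Return a MAC address as six zero-padded, lower-case hex octets."""
    values = [int(octet, 16) for octet in mac.split(":")]
    if len(values) != 6 or any(not 0 <= v <= 0xFF for v in values):
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(f"{v:02x}" for v in values)


def mark_rules(macs, chain="TW_REGISTERED", mark=1):
    """Build iptables commands that mark packets from registered MACs.

    `chain` and `mark` are hypothetical names chosen for this sketch.
    """
    # Start from an empty chain so deregistered addresses are dropped.
    rules = [f"iptables -t mangle -F {chain}"]
    for mac in sorted({normalize_mac(m) for m in macs}):
        rules.append(
            f"iptables -t mangle -A {chain} "
            f"-m mac --mac-source {mac} -j MARK --set-mark {mark}"
        )
    return rules
```

Static policy rules could then test for the mark (for example with the `mark` match) without needing to know any individual MAC address.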

TrafficWatch places few restrictions on what policies can be implemented using it. For example, it is equally possible to allow students to set up internet-accessible servers, or to deny this. (Outbound traffic is not charged to the colleges, so TrafficWatch does not bother to account for such traffic.)

We were concerned that having a list of up to about 300 rules (one per MAC address) which must be traversed for each packet might be too slow; it certainly wouldn't scale well for a larger installation. Therefore, a new iptables extension (consisting of a Netfilter kernel module and an extension module for the iptables userspace program) was written which matches on a configurable number of the low-order bits of the MAC address. For example, if five bits are used, then the 300 rules can be distributed into 32 chains of roughly ten rules each, with 32 new rules to first select which chain each packet must traverse.
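The bucketing idea can be illustrated in Python. This sketch only shows how addresses would be spread across chains; it does not reproduce the kernel module or its iptables syntax. Treating the "low-order bits" as the least significant bits of the final octet is an assumption.

```python
from collections import defaultdict


def bucket_index(mac, bits=5):
    """Return the value of the `bits` low-order bits of a MAC address."""
    # Treat the six octets as one 48-bit integer, last octet least significant.
    value = 0
    for octet in mac.split(":"):
        value = (value << 8) | int(octet, 16)
    return value & ((1 << bits) - 1)


def distribute(macs, bits=5):
    """Group MAC addresses by bucket index (one bucket per chain)."""
    buckets = defaultdict(list)
    for mac in macs:
        buckets[bucket_index(mac, bits)].append(mac)
    return dict(buckets)
```

With five bits there are 32 possible buckets. If 300 addresses spread evenly, each chain holds about ten rules, so a packet is checked against at most the 32 selector rules plus about ten address rules, rather than up to 300.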

IP traffic accounting

To account for network traffic at the router, a custom daemon called ipcap is used to log all traffic. ipcap was written for this purpose since no free tools were found that suited the application well enough. (ipcap performs a role similar to that of Cisco NetFlow and comparable tools.) The ipcap daemon uses libpcap [5] to sniff packets, and then attempts to group packets into data flows, which it outputs to a logfile. These logs are inserted into the database periodically.
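To give a feel for what grouping packets into flows involves, here is a simplified Python sketch. It is not the ipcap implementation, whose flow definition is not given in the paper. The sketch assumes a flow is identified by a (local IP, remote IP) pair, and that packets arrive as already-decoded tuples rather than from libpcap.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    start: int            # timestamp of the first packet seen
    local_ip: str
    remote_ip: str
    bytes_received: int = 0
    bytes_sent: int = 0


def group_into_flows(packets, is_local):
    """Group packets into flows keyed by (local IP, remote IP).

    `packets` is an iterable of (timestamp, src_ip, dst_ip, length) and
    `is_local` is a predicate for student-network addresses. Both are
    hypothetical interfaces for this sketch.
    """
    flows = {}
    for timestamp, src, dst, length in packets:
        if is_local(dst) and not is_local(src):
            key, inbound = (dst, src), True
        elif is_local(src) and not is_local(dst):
            key, inbound = (src, dst), False
        else:
            continue  # local-to-local or transit traffic is ignored here
        flow = flows.setdefault(key, Flow(timestamp, key[0], key[1]))
        if inbound:
            flow.bytes_received += length
        else:
            flow.bytes_sent += length
    return list(flows.values())
```

A real daemon would also need to expire idle flows and write them out periodically; that is omitted here.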

Squid

Currently, the only authenticated traffic source Trinity uses is the squid web proxy. Squid has powerful access control capabilities, as well as configurable authentication. Since we have other users (apart from students) who need unfettered access to the web, squid is configured to require authentication only from the student network ranges. Authentication is done against an LDAP server, using the ldap_auth program which is part of the Debian squid package.

The logs from squid are processed daily after rotation, and inserted into the database.

Data formats

Traffic information is stored temporarily and transferred to the database server in a number of log files which have the same basic format. For example, a line from the router log looks like this:

  1026864622 - 0:30:ca:b8:d9:bc 203.28.240.189 64.4.13.147 41 0
      
The fields are space-separated, and are:
  1. timestamp (seconds since epoch)
  2. authenticated username, if known; "-" otherwise
  3. local client MAC address
  4. local client IP address
  5. remote IP address
  6. bytes received
  7. bytes sent (not accounted)

A line from the proxy server would omit the MAC address (substituting "-"), but have an authenticated username. This simple logfile format makes it relatively easy to add new sources of traffic information.
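A parser for this seven-field format is short. The sketch below follows the field list given above; the type and function names are invented for illustration.

```python
from typing import NamedTuple, Optional


class TrafficRecord(NamedTuple):
    timestamp: int              # seconds since epoch
    username: Optional[str]     # None when the log has "-"
    mac: Optional[str]          # None when the log has "-"
    local_ip: str
    remote_ip: str
    bytes_received: int
    bytes_sent: int


def parse_traffic_line(line):
    """Parse one space-separated traffic log line into a TrafficRecord."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    ts, user, mac, local_ip, remote_ip, received, sent = fields
    return TrafficRecord(
        int(ts),
        None if user == "-" else user,
        None if mac == "-" else mac,
        local_ip,
        remote_ip,
        int(received),
        int(sent),
    )
```

Applied to the router log line shown earlier, this yields a record with no username, the MAC address `0:30:ca:b8:d9:bc`, 41 bytes received and 0 bytes sent.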

Authentication information which relates MAC addresses, IP addresses and usernames is stored in logfiles in a second format:

  1026864600 - 0:30:ca:b8:d9:bc 203.28.240.189
      
These fields have the same interpretation as the first four fields above. The timestamp marks the start of the period for which the MAC/IP address mapping is valid; the mapping continues to be valid until a contradictory line is encountered in the log file.

When ARP information from the router is used to correlate MAC and IP addresses (as it is at Trinity), the second field will not contain a username. In this case, identifying the user to bill for traffic on a given IP address takes an extra step, namely looking up a MAC address in the database of registered MAC addresses.
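The lookup described in the last two paragraphs can be sketched as follows. The sketch interprets "valid until a contradictory line" as "the most recent line for that IP address at or before the traffic timestamp", which is a simplifying assumption. The `registered` dictionary stands in for the database of registered MAC addresses, and the usernames in the usage note are made up.

```python
import bisect


def build_history(lines):
    """Parse authentication log lines into {ip: [(timestamp, user, mac), ...]}."""
    history = {}
    for line in lines:
        ts, user, mac, ip = line.split()
        entry = (int(ts), None if user == "-" else user, None if mac == "-" else mac)
        history.setdefault(ip, []).append(entry)
    for entries in history.values():
        entries.sort(key=lambda e: e[0])  # order by timestamp only
    return history


def user_for(ip, timestamp, history, registered):
    """Return the username to bill for traffic on `ip` at `timestamp`."""
    entries = history.get(ip, [])
    times = [e[0] for e in entries]
    i = bisect.bisect_right(times, timestamp) - 1
    if i < 0:
        return None  # no mapping was in force yet
    _, user, mac = entries[i]
    if user is not None:
        return user
    # Extra step: no username in the log, so go via the registered MAC.
    return registered.get(mac)
```

For example, given the authentication line shown above and `registered = {"0:30:ca:b8:d9:bc": "alice"}`, traffic on 203.28.240.189 at time 1026864622 would be attributed to the hypothetical user "alice".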

Security

The security of TrafficWatch is obviously important, since students will be billed on the basis of the data collected. However, the environment in which TrafficWatch operates is challenging from a security perspective, and so our approach has been to minimise risks where feasible, and accept them elsewhere. This approach is backed up by various human techniques which are possible in a close-knit community such as a residential college.

All privileged communications happen behind the college's firewall, so no special security measures are taken to protect the system from potential outside attackers. In any case, the most likely threat comes from within. Attempts to compromise the system from inside the firewall are mitigated by a number of factors:

This last point illustrates the human approach which helps reassure us that the figures reported by TrafficWatch are not being significantly compromised. For example, our heaviest user of network traffic has a reputation for his voracious web surfing appetite. If another user suddenly increased their usage in an uncharacteristic way, and our heavy user reduced his usage, this would raise our suspicions, prompting us to investigate.

In operation

TrafficWatch was deployed in February, in time for the start of first semester. Documentation about the system was added to the college I.T. web pages, and returning students had few problems adapting to the new system.

There have been a few problems with TrafficWatch. In most cases, problems have been resolved with little difficulty. It is particularly helpful that the implementor is readily available, and uses the system like all the other students.

Some of the problems with TrafficWatch have prevented students from accessing the internet, sometimes for a few hours. While these outages would be better avoided, they do not present a critical problem in our application, and students accept the occasional outage with few complaints.

Here is a summary of problems encountered.

In addition to Trinity College, TrafficWatch has been deployed in two other colleges so far. We have not received much feedback on the success of those deployments (but we haven't received complaints either).

Future

There are a number of extensions and improvements to TrafficWatch which are contemplated.

TrafficWatch is Free Software released under the GPL. It is available from http://software.trinity.unimelb.edu.au/trafficwatch/.

References

  1. BPALogin homepage http://bpalogin.sourceforge.net/
  2. Debian homepage http://www.debian.org/
  3. Gnuplot homepage http://www.gnuplot.info/
  4. iptables homepage http://www.iptables.org/
  5. libpcap (tcpdump) homepage http://www.tcpdump.org/
  6. Linux kernel v2.4 http://www.kernel.org/
  7. mod_python homepage http://www.modpython.org/
  8. NeTraMet homepage http://www2.auckland.ac.nz/net//NeTraMet/
  9. PostgreSQL homepage http://www.au.postgresql.org/
  10. Python homepage http://www.python.org/
  11. Traff homepage http://sourceforge.net/projects/traff/
  12. TraffAcct homepage http://www.hughes.com.au/products/traffacct/

(URLs accessed July 2002.)

Author Biographies

Tim Bell is the System Administrator at Trinity College. He has a B.Sc. (Hons) from the University of Melbourne, and was a resident tutor at Trinity for four years while undertaking postgraduate study. For over five years, system administration has distracted him from working on his (sadly incomplete) Ph.D. in the field of information retrieval.

Peter Hawkins is a resident at Trinity College, studying Science/Engineering at the University of Melbourne. He enjoys running in circles and melismas.

Richard Wraith is the I.T.&T. manager at Trinity College. Also a former Trinity resident, he has a B.E. (Hons) and Ph.D. in Mechanical Engineering from the University of Melbourne. He and Tim were responsible for setting up the network at Trinity College in 1996/7. Richard has a bicycle, a dog, two children and a wife.