Section: Scientific Foundations
Keywords: TCP traces, passive measurements.
Measurements and Mathematical Modeling
Over the past few years, the characterization of Internet traffic has become one of the most challenging issues in telecommunications networks. Understanding the composition and the dynamics of Internet traffic is essential for network operators in order to offer quality of service and to supervise their networks. Since the celebrated paper by Leland et al. on the self-similar nature of Ethernet traffic in local area networks, a huge amount of work has been devoted to the characterization of Internet traffic. In particular, various hypotheses have been explored to explain why and how Internet traffic should be self-similar.
A common approach to describing traffic in a backbone network consists of observing the bit rate process evaluated over fixed-length intervals, say a few hundred milliseconds. Long-range dependence and self-similarity are two basic properties of the bit rate process that have been observed through measurements in many different situations. Different characterizations of the fractal nature of traffic have been proposed in the literature (see for instance Norros on the monofractal characterization of traffic). An exhaustive account of the fractal characterization of Internet traffic can be found in the book by Park and Willinger. Even though long-range dependence and self-similarity are very intriguing from a theoretical point of view, their significance for network design has recently been questioned.
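The bit rate process described above can be sketched as follows. This is a minimal illustration, not the measurement tool used by the authors: the function name bit_rate_process, the input format (pairs of timestamp in seconds and packet size in bytes) and the default interval of 0.1 s are assumptions made for the example.

```python
def bit_rate_process(packets, interval=0.1):
    """Return the bit rate (bits/s) over consecutive fixed-length intervals.

    packets  -- iterable of (timestamp_seconds, size_bytes) pairs (assumed format)
    interval -- interval length in seconds (hypothetical default)
    """
    packets = list(packets)
    if not packets:
        return []
    start = min(t for t, _ in packets)
    end = max(t for t, _ in packets)
    # One bin per interval between the first and the last packet.
    n_bins = int((end - start) / interval) + 1
    bits = [0] * n_bins
    for t, size in packets:
        bits[int((t - start) / interval)] += 8 * size
    # Convert the number of bits seen in each interval into a rate.
    return [b / interval for b in bits]


if __name__ == "__main__":
    trace = [(0.0, 100), (0.2, 100), (1.2, 50)]
    print(bit_rate_process(trace, interval=0.5))
```

Properties such as long-range dependence are then studied on the resulting sequence of rates, typically for several values of the interval length.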
While the self-similar models introduced so far in the literature aim at describing the global traffic on a link, it is now usual to distinguish short transfers (referred to as mice) from long transfers (referred to as elephants). This dichotomy was not totally clear until recently (see for instance network measurements from the MCI backbone network). Yet the distinction between mice and elephants has become more and more evident with the emergence of peer-to-peer (p2p) applications, which give rise to a large amount of traffic on a small number of TCP connections. This observation leads us to analyze ADSL traffic by adopting a flow-based approach and, more precisely, the mice/elephants dichotomy. Intuitively, a mouse is a flow comprising so few packets that it never leaves, or only barely leaves, the slow start regime. Thus, a mouse is not very sensitive to the bandwidth sharing imposed by TCP. Elephants, on the contrary, are sufficiently large that one can expect them to share the bandwidth of a bottleneck according to the flow control mechanism of TCP. As a consequence, mice and elephants behave in totally different ways from a modeling point of view.
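A simple way to make the mice/elephants dichotomy operational is to split flows by their packet count. The sketch below assumes that each packet has already been mapped to a flow identifier; the function name split_mice_elephants and the threshold of 20 packets are hypothetical choices for illustration, since the text does not specify a value.

```python
from collections import Counter


def split_mice_elephants(flow_ids, threshold=20):
    """Split flows into mice and elephants by packet count.

    flow_ids  -- iterable with one flow identifier per observed packet
    threshold -- largest packet count still treated as a mouse
                 (hypothetical value, not taken from the text)
    """
    counts = Counter(flow_ids)
    mice = {f: n for f, n in counts.items() if n <= threshold}
    elephants = {f: n for f, n in counts.items() if n > threshold}
    return mice, elephants


if __name__ == "__main__":
    # Three short flows and one long flow.
    packets = ["a"] * 3 + ["b"] * 5 + ["c"] * 2 + ["d"] * 200
    mice, elephants = split_mice_elephants(packets)
    total = sum(mice.values()) + sum(elephants.values())
    print(len(mice), len(elephants), sum(elephants.values()) / total)
```

In this toy trace a single elephant carries most of the packets, which mirrors the situation described above where a small number of TCP connections account for a large amount of traffic.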
In our view, describing the statistical properties of Internet traffic at the packet level is not appropriate, mainly because of the strong dependence properties noted above. It seems to us that, at this time scale, only signal processing techniques (wavelets, fractal analysis, ...) can lead to a better understanding of Internet traffic. It is widely believed that, at the level of users, independence properties can be assumed (as for telephone networks), simply because users behave quite independently. Unfortunately, there is at present no stochastic model of the activity of a typical user. Some models have been proposed, but they have too many parameters, most of which cannot easily be inferred from real measurements. We have chosen to look at the traffic of elephants and mice, which corresponds to an intermediate time scale. Some independence properties seem to hold at that level, which opens the possibility of a Markovian analysis. Note that, although they are sometimes criticized, Markovian techniques are basically the only tools that can give a sufficiently precise description of the evolution of various stochastic models (average behavior, distribution of the time to buffer overflow, ...).
Sampling the Internet Traffic
Traffic measurement is an issue of prime interest for network operators and networking researchers who need to know the nature and the characteristics of the traffic carried by IP networks. The exhaustive capture of traffic traces on high-speed backbone links, with rates larger than 1 Gbit/s, however, leads to the storage and the analysis of huge amounts of data, typically several terabytes per day. One way of overcoming this problem is to reduce the volume of data by sampling traffic. Several sampling techniques have been proposed in the literature (see for instance  ,  and references therein). In this paper, we consider deterministic 1/N sampling, which consists of capturing one packet out of every N packets. This sampling method has notably been implemented in Cisco routers under the name of NetFlow, which is now widely deployed in commercial IP networks.
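Deterministic 1/N sampling as described above can be sketched in a few lines. The function name sample_one_in_n is hypothetical, and the choice of keeping the first packet of each group of N is an assumption; a router may start the count at a different offset.

```python
def sample_one_in_n(packets, n):
    """Yield one packet out of every n, starting with the first one.

    packets -- any iterable of packets (their representation is irrelevant here)
    n       -- sampling period, i.e. the N of 1/N sampling
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    for index, packet in enumerate(packets):
        # Keep packets 0, n, 2n, ... and drop all the others.
        if index % n == 0:
            yield packet


if __name__ == "__main__":
    print(list(sample_one_in_n(range(10), 3)))
```

Because the function is a generator, it can process a long trace as a stream without holding it in memory, which matches the motivation of reducing the volume of stored data.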
The major issue with 1/N sampling is that the correlation structure of flows is severely degraded, so that digital signal processing techniques become very delicate to apply when trying to recover the characteristics of the original flows. An alternative approach consists of performing a statistical analysis of flows as in  ,  . The accuracy of such an analysis, however, greatly depends on the number of samples available for each type of flow, and may lead to quite inaccurate results. In fact, this approach proves efficient only for deriving mean values of some characteristics of interest, for instance the mean number of packets or bytes in a flow.
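To illustrate why aggregate values are easier to recover than per-flow characteristics, the sketch below applies a naive scaling by N to sampled data. This scaling is an assumption made for illustration and is not the estimator of the works cited above; the function name scale_up and the input format (pairs of flow identifier and packet size in bytes) are likewise hypothetical.

```python
from collections import Counter


def scale_up(sampled_packets, n):
    """Naively scale 1/N-sampled counts back to the original traffic.

    sampled_packets -- iterable of (flow_id, size_bytes) pairs kept by sampling
    n               -- sampling period N
    Returns (estimated total packets, estimated total bytes,
             estimated packets per sampled flow).
    """
    pkt_counts = Counter()
    byte_counts = Counter()
    for flow_id, size in sampled_packets:
        pkt_counts[flow_id] += 1
        byte_counts[flow_id] += size
    est_packets = n * sum(pkt_counts.values())
    est_bytes = n * sum(byte_counts.values())
    # Per-flow estimates rest on very few samples for small flows, and
    # flows with no sampled packet do not appear at all.
    per_flow = {f: n * k for f, k in pkt_counts.items()}
    return est_packets, est_bytes, per_flow


if __name__ == "__main__":
    sampled = [("a", 1500), ("a", 1500), ("b", 40)]
    print(scale_up(sampled, 10))
```

The totals aggregate many samples, whereas a per-flow estimate can only take the values N, 2N, 3N, and so on, which is coarse for flows shorter than a few multiples of N.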
Algorithms of Sampling
Deriving the general characteristics of the TCP traffic circulating at an edge router has potential applications for an ISP. One example is charging customers in proportion to their use of the network. Another is detecting what are now called "heavy users".
Another important application is detecting the propagation of worms and denial-of-service (DoS) attacks and, once an attack is detected, countering it with an appropriate algorithmic approach. Because of the natural variation of Internet traffic, such detection (through sampling!) is not straightforward. Robust algorithms have to be designed to achieve this goal. An ultimate (and ambitious!) goal would be an automatic procedure to counter this kind of attack.
Propose a fairly simple and accurate estimation of the traffic circulating in an ADSL network. A limited number of parameters should characterize the traffic to first order. Note that ADSL traffic differs significantly from the academic traffic usually analyzed so far (more than 80% of ADSL traffic comes from peer-to-peer networks).
Infer through sampling the parameters of the model proposed to describe the ADSL traffic.
Design and analyze algorithms to detect, in sampled traffic, attacks by worms or DoS and, more generally, unusual events.