Section: New Results
Mathematical Models of Traffic Measurements
Sampling ADSL traffic
The exhaustive capture of traces on high-speed backbone links requires storing and analyzing huge amounts of data. To limit memory consumption in routers, passive traffic measurements rely on sampling at the packet level. Such sampling techniques are implemented on Cisco routers under the name NetFlow: routers build flow statistics from the sampled substream of packets. Sampling entails a loss of information, so the first question is whether the characteristics of the original traffic can still be estimated from the sampled traffic.
The aim of the study is to estimate the parameters of the real ADSL traffic from the sampled traffic. We use a priori knowledge of the traffic, through the model developed in our previous work from the analysis of ADSL traces. Here the model is greatly simplified, because mice are not seen by sampling and p2p traffic is predominant. Roughly speaking, the traffic is mainly composed of p2p elephants. More precisely, the flows are chunks of elephants, due to the p2p algorithms. The analysis of traces leads us to model the traffic by an M/G/∞ queue in which the customers are flows whose duration has a Weibull distribution. Sampling consists in choosing a customer at random at every time step. The traffic is characterized by a few parameters which have to be estimated: the arrival rate and the two parameters of the Weibull distribution of the flow duration.
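The model described above can be explored numerically. The following is only a minimal sketch, not the estimation procedure of the study: it assumes an infinite-server queue (each flow stays in the system for exactly its duration), an empty system at time 0, and a uniform choice among the flows present at each sampling instant. The function name and the parameters `lam`, `shape`, `scale`, `delta` and `horizon` are hypothetical names introduced for the illustration.

```python
import random
from collections import Counter


def simulate_sampling(lam, shape, scale, delta, horizon, seed=0):
    """Simulate flows with Poisson arrivals and Weibull durations, sampled
    every `delta` time units. Returns (hits, n_flows), where hits maps a
    flow index to the number of times that flow was sampled."""
    rng = random.Random(seed)

    # Poisson arrivals of rate lam on [0, horizon), Weibull durations.
    # Note: random.weibullvariate takes (scale, shape) in that order.
    flows = []
    t = rng.expovariate(lam)
    while t < horizon:
        flows.append((t, t + rng.weibullvariate(scale, shape)))
        t += rng.expovariate(lam)

    hits = Counter()
    active = []      # indices of the flows currently in the system
    next_flow = 0    # index of the next flow that has not arrived yet
    n_steps = int(horizon / delta)
    for step in range(1, n_steps + 1):
        now = step * delta
        # Admit the flows that arrived since the previous sampling instant.
        while next_flow < len(flows) and flows[next_flow][0] <= now:
            active.append(next_flow)
            next_flow += 1
        # Drop the flows that have already finished.
        active = [i for i in active if flows[i][1] > now]
        # Sample one flow uniformly at random among those present.
        if active:
            hits[rng.choice(active)] += 1
    return hits, len(flows)
```

Applying `Counter(hits.values())` to the result gives the number of flows sampled once, twice, and so on, which is the kind of quantity an estimator has to work from, since flows that are never sampled are not observed at all.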
A first approach shows that, in heavy traffic, i.e. if the arrival rate tends to infinity and the sampling step tends to 0 while the two are scaled jointly so that their ratio tends to a constant c, the sampling times of a permanent flow are the instants of a Poisson process with intensity 1/c. This property is used to determine the arrival rate. If the duration of the flow is Weibull, then the duration of the sampled flow, given that it is sampled more than twice, is also Weibull, which gives a way to estimate the parameters of the Weibull distribution. In practice, this is not satisfactory, since the tail of the distribution is hard to estimate when the sampling step is large (one packet every thousand).
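The Poisson limit can be made plausible by a short heuristic computation. It is a sketch under assumptions that are not taken from the text: the number N of flows present is treated as fixed, one flow is drawn uniformly at each sampling instant, r denotes the number of sampling instants per unit of time, and c is taken to be the limit of N/r, which may differ from the exact normalization used in the study.

```latex
% S(t): number of times a given permanent flow is sampled during [0, t].
% Assumptions (illustrative only): N fixed, uniform draws, r samples per unit time.
\begin{gather*}
S(t) \sim \mathrm{Binomial}\!\left(\lfloor r t \rfloor,\ \tfrac{1}{N}\right),
\qquad
\mathbb{E}[S(t)] = \frac{\lfloor r t \rfloor}{N} \longrightarrow \frac{t}{c}
\quad \text{as } N, r \to \infty \text{ with } \frac{N}{r} \to c,
\\[4pt]
\mathbb{P}\bigl(S(t) = j\bigr) \longrightarrow e^{-t/c}\,\frac{(t/c)^{j}}{j!},
\qquad j = 0, 1, 2, \dots
\end{gather*}
```

Since draws at distinct sampling instants are independent, the counts over disjoint time intervals are independent as well, which is consistent with a Poisson process of intensity 1/c.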
An alternative approach is to use quantities whose mean can be expressed as a function of the key parameters, typically the number Wk of flows sampled less (resp. more) than k times in a given time interval. There exists a scaling of the parameters such that this mean tends to a constant. In this case, the Chen-Stein method is used to prove the convergence in distribution of Wk to a Poisson distribution when the total number of flows is large. This method is powerful enough to give precise estimates of the distance between the distributions. When the mean tends to infinity, a normal approximation can also be obtained as a consequence. Because the flows are not permanent in the time interval, the system reduces to a dynamical urn model.
In practice, the ratio is assumed to be constant and the elephants can then be considered as permanent. It has been proved that a normal approximation holds for the number of flows sampled more than k1 times when the ratio of the number of flows to the number of sampling times is small. Comparing experimental values obtained on traces with theoretical ones, we observed a discrepancy which is probably due to the bursty nature of the data elephants or to the presence of mice. This point is currently under investigation.
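When the flows are considered permanent, the counting problem becomes a classical urn scheme: each sampling time throws one ball into one of the urns (flows), and Wk counts the urns holding fewer than k balls. The sketch below compares the exact mean of Wk in this static model with a simulated average. It is an illustration under the assumption of uniform, independent draws, not the dynamical urn model of the study, and the names `expected_w`, `simulate_w`, `n_flows` and `n_samples` are hypothetical.

```python
import math
import random


def expected_w(n_flows, n_samples, k):
    """Exact mean number of flows sampled fewer than k times when each of
    the n_samples sampling times picks one of n_flows flows uniformly."""
    p = 1.0 / n_flows
    return n_flows * sum(
        math.comb(n_samples, j) * p ** j * (1.0 - p) ** (n_samples - j)
        for j in range(k)
    )


def simulate_w(n_flows, n_samples, k, rng):
    """One realization of the number of flows sampled fewer than k times."""
    counts = [0] * n_flows
    for _ in range(n_samples):
        counts[rng.randrange(n_flows)] += 1
    return sum(1 for c in counts if c < k)


if __name__ == "__main__":
    rng = random.Random(1)
    runs = [simulate_w(200, 1000, 2, rng) for _ in range(200)]
    print("simulated mean:", sum(runs) / len(runs))
    print("exact mean:    ", expected_w(200, 1000, 2))
```

With 200 flows, 1000 sampling times and k = 2, the mean of Wk stays of order one, the regime in which a Poisson approximation is the natural candidate. Letting the number of flows grow with the other parameters fixed in proportion makes the mean grow, which is the regime of the normal approximation.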
On Line Algorithms For Traffic Measurements
We are interested here in detecting the flows traversing a router in the network and in estimating their number. Characterizing flow statistics is useful for detecting attacks or anomalies; it can also be used to charge clients according to the traffic they generate, and for traffic engineering. Moreover, Internet providers can infer the clients' applications (peer-to-peer, voice over IP, web, FTP, ...) without looking at the packet contents.
We focus on big flows (those that exceed a certain number of packets T or occupy more than a certain percentage of the total available bandwidth). Indeed, big flows are known to account for most of the traffic volume: for example, fewer than 9% of the flows exchanged between ASes represent up to 70% of the total number of bytes exchanged between all AS pairs. Also, for many applications, knowledge of these big flows is sufficient to characterize the traffic.
To address this problem, we proposed an algorithm based on T parallel Bloom filters, each filter i having a counter Ci. Initially, all T filters are empty and all counters are set to 0. Upon reception of a packet of flow F, we look for the first parallel filter (determined by a hashing function) in which flow F is not yet present; we then increase the counter of this filter by 1 and set the bits corresponding to F to 1. When the size of the filters is well parametrized, all the flows of size bigger than i reach filter i, together with only a negligible proportion of flows of size smaller than i. Consequently, we use the value of the counter Ci as an estimator of the total number of flows of size larger than i. Since this algorithm must run in real time without interruption, all the filters become saturated after a while, especially because of the contribution of mice, and the estimation error becomes unacceptable. To deal with this problem, we have proposed an adaptive mechanism which cleans out the filters regularly and keeps their filling under a certain threshold (50% in our case). Indeed, as soon as this threshold is reached, we remove some packets by reinitializing the first parallel filters and moving them to the end of the T filters.
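The mechanism described above can be sketched as follows. This is an illustrative reading of the description, not the implementation used in the study, and several choices are assumptions: the hash functions (salted BLAKE2b), the fact that only the filling of the first filter is monitored, the fact that a counter travels with its filter when the filters are rotated, and all class, method and parameter names.

```python
import hashlib
from collections import deque


class BloomFilter:
    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for readability
        self.n_set = 0                 # number of bits currently set
        self.counter = 0               # the counter C_i attached to the filter

    def _positions(self, key):
        # Assumption: n_hashes hash functions obtained by salting BLAKE2b.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def contains(self, key):
        return all(self.bits[p] for p in self._positions(key))

    def add(self, key):
        for p in self._positions(key):
            if not self.bits[p]:
                self.bits[p] = 1
                self.n_set += 1
        self.counter += 1

    def fill_ratio(self):
        return self.n_set / self.n_bits


class ElephantCounter:
    def __init__(self, n_filters, n_bits, n_hashes=3, threshold=0.5):
        self.n_bits, self.n_hashes, self.threshold = n_bits, n_hashes, threshold
        self.filters = deque(
            BloomFilter(n_bits, n_hashes) for _ in range(n_filters)
        )

    def packet(self, flow_id):
        # Insert the flow into the first filter that does not contain it yet.
        for f in self.filters:
            if not f.contains(flow_id):
                f.add(flow_id)
                break
        # Adaptive cleaning (assumed variant): while the first filter is too
        # full, drop it and append a fresh empty filter at the end.
        while self.filters[0].fill_ratio() >= self.threshold:
            self.filters.popleft()
            self.filters.append(BloomFilter(self.n_bits, self.n_hashes))

    def estimate(self, i):
        """Estimated number of flows with at least i packets (1 <= i <= T)."""
        return self.filters[i - 1].counter
```

For example, after feeding `ElephantCounter(n_filters=5, n_bits=1 << 14)` one call to `packet` per observed packet, `estimate(5)` approximates the number of flows that sent at least five packets. Rotating one filter at a time amounts to removing one packet from every tracked flow, whereas resetting all T filters at once would discard all the information gathered on the elephants.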
The simulations we made show that the fewer packets we erase, the better the estimation (it is better to remove one packet than T packets). Indeed, totally cleaning out the T filters removes the contribution of all the mice but introduces many errors in the detection and statistics of the elephants, contrary to erasing only one packet. The simulations also show that the relative error on the total number of elephants stays low, around 3 to 4%, and is stable over a long period of time. The first moments of the elephant size (mean and variance) also agree satisfactorily with the real statistics of the elephants.
These algorithms have been successfully tested on ADSL traces corresponding to two hours of traffic.