megaklion.blogg.se - Use datathief with no axis

The clustering yields three different clusters (blue, red, green). One can see that most of the data points are located along the y-axis. Only a few points are scattered to the right side.
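To make the clustering step more concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python with k=3. It is not the implementation used in the project. The function name `k_means`, its parameters, and the sample points are assumptions made purely for illustration. The points stand for already scaled (bytes, service count) pairs.

```python
import random


def k_means(points, k, iterations=100, seed=0):
    """Cluster equal-length numeric tuples with Lloyd's algorithm.

    Hypothetical helper for illustration; returns (labels, centroids).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen input points
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Made-up, already scaled (bytes, service count) pairs, not the original data.
points = [
    (0.0, 0.1), (0.1, 0.0), (0.05, 0.2),   # low bytes, low service count
    (0.0, 2.0), (0.1, 2.2), (0.05, 1.9),   # low bytes, high service count
    (3.0, 0.5), (3.2, 0.4),                # few points far to the right
]
labels, centroids = k_means(points, k=3)
```

Because k-means starts from randomly chosen points, different seeds can produce different groupings. In practice one would typically run it several times and keep the result with the smallest within-cluster distance.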

Please note that all shown IP addresses are anonymized. We have some basic information about the connection, like IP addresses, ports, timestamps and the number of bytes and packets. But how can we decide whether this is a normal or abnormal connection?

Feature engineering and anomaly detection approach

First, we must extract the so-called features of the data sample. Features, also called attributes or variables, characterize the sample. Based on this information we want to classify a connection as either normal or abnormal. Some basic features in the IP network context are the duration or direction of a connection. But there are also some more advanced features which can be derived from a specified sliding time window, such as the number of connections to the same service as the current one; we call this feature the service count.

A very important fact is that all features, no matter if they are basic or advanced features, have to be scaled numerical values. Scaling is an important pre-processing step for the k-means algorithm used later on. First, the algorithm can only handle numerical values. Second, all values have to be in the same range in order to avoid biased results. Illustration 1 shows a simple example where two variables (packets and bytes) with totally different ranges are transformed into a comparable range with the help of z-score transformation.

In illustration 2 you can see a very simple two-dimensional example clustering of some sample firewall connections. The x-axis represents the total number of bytes and the y-axis the service count. You can see that k-means was parameterized with k=3.
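The service count feature and the z-score transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `service_count` and `z_score`, the `(timestamp, service)` tuple layout, and all sample values are hypothetical and not taken from the original firewall data.

```python
from statistics import mean, pstdev


def service_count(connections, index, window_seconds):
    """Count earlier connections to the same service inside the time window.

    `connections` is assumed to be a list of (timestamp_seconds, service)
    tuples sorted by timestamp.
    """
    ts, service = connections[index]
    return sum(
        1
        for other_ts, other_service in connections[:index]
        if other_service == service and ts - other_ts <= window_seconds
    )


def z_score(values):
    """Scale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up sample values with very different ranges.
packets = [10, 12, 8, 300]
bytes_total = [1400, 2100, 900, 52000]

scaled_packets = z_score(packets)
scaled_bytes = z_score(bytes_total)

# Made-up connection log: (timestamp in seconds, service name).
connections = [(0, "http"), (5, "ssh"), (8, "http"), (20, "http")]
count_for_last = service_count(connections, index=3, window_seconds=15)
```

After scaling, both variables have mean 0 and standard deviation 1, so neither dominates the distance computation in k-means just because its raw numbers are larger.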

Machine learning is a broad field and the term has a lot of different interpretations. By machine learning-based systems we mean a system or a program which changes automatically when exposed to new data, for example by automatically finding structures, patterns or rules implicitly given by the data. Such a program might give us the possibility to also detect altered attacks the program hasn’t seen before.

A problem one will encounter in practice is the lack of labeled network traffic data to train such a model. To overcome this problem, the idea of this project was to carry out a test using unsupervised algorithms. Unsupervised algorithms, of which clustering is a common example, don’t need any labels to construct a model. One of the most popular and simple algorithms of this category is the clustering algorithm k-means. It clusters the input data into a fixed number of groups. More details about the clustering can be found in section “Feature engineering and anomaly detection approach”.

But first, let’s have a closer look at the structure of the data. In listing 1 you see an exemplary connection of the firewall system.

There are popular open-source tools like SNORT which find anomalies in firewall traffic based on static rules, so-called signatures. These signatures are constructed with a lot of domain knowledge and years of experience. They are updated multiple times a day and detect known anomalies or unusual traffic patterns very well. If these tools work so well, why should we consider using complex machine learning algorithms?

A rule-based system can detect known anomalies. But attack patterns change: they evolve over time and new patterns appear. A rule-based system can’t detect such altered attacks. This is where intelligent machine learning-based systems come into play.
