Time and Count based Tumbling Windows for Network Packet Statistics
Aggregating and analyzing streaming data is one common way to build machine learning datasets. Data is ingested, and data points that are near each other are grouped into aggregations. Each aggregation has several attributes, or features. You can think of them as columns in a database or spreadsheet. A dataset is made up of many aggregations, each one representing some subset of the stream data. You can think of the aggregations as rows in a spreadsheet.
One of the challenges is picking the right windowing strategy for aggregating or analyzing streaming data. There are a variety of well-known windowing algorithms: Tumbling, Hopping, Sliding, etc. We are using a Tumbling Window algorithm because of its relative simplicity and low memory usage. Tumbling windows repeat without overlap. Tumbling windows are either size-limited or time-limited. They contain a maximum amount of data or extend for a maximum amount of time.
- Time-based windowing: The top row in the picture shows a series of time-based windows. Windows are time scoped. There can be any number of packets in each window. Every window covers the same length of time.
- The diagram above shows 3 5-second windows with varying numbers of events.
- Count-based windowing: The middle row in the picture shows a series of count-based windows. Windows are content scoped, filling when they reach their max size. The windows can extend for any amount of time. Every window contains the same number of data points.
- The diagram above shows 4 3-packet windows that span different amounts of time.
- Time-based and count-based windowing: The bottom row in the picture demonstrates a multi-variant approach. Windows are filled when they reach max size or exceed a time period. This means windows contain either the specified number of packets or span the specified time.
- The diagram above shows 3 3-packet windows and 1 5-second window where only two packets appeared before the window TTL expired.
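The three strategies above can be sketched with one small generator. This is a minimal illustration, not the code from the repo: the names `Window`, `tumbling_windows`, `max_count`, and `max_seconds` are hypothetical, and it assumes packet timestamps arrive in order as seconds. It also assumes a window that fills by count hands its last timestamp to the next window as the start time, and that a timed-out window is followed by one starting exactly one time limit later.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Window:
    start: float  # window start time in seconds
    packets: List[float] = field(default_factory=list)  # packet timestamps


def tumbling_windows(
    timestamps: Iterable[float],
    max_count: Optional[int] = None,
    max_seconds: Optional[float] = None,
) -> Iterator[Window]:
    """Yield non-overlapping windows closed by count, by time, or by
    whichever limit is reached first when both are given."""
    if max_count is None and max_seconds is None:
        raise ValueError("set max_count, max_seconds, or both")
    window: Optional[Window] = None
    for ts in timestamps:
        if window is None:
            # Assumption: the first window starts at the first packet.
            window = Window(start=ts)
        # Close every window (possibly empty) whose time limit
        # expired before this packet arrived.
        while max_seconds is not None and ts >= window.start + max_seconds:
            yield window
            window = Window(start=window.start + max_seconds)
        window.packets.append(ts)
        # Close the window as soon as it is full.
        if max_count is not None and len(window.packets) == max_count:
            yield window
            window = Window(start=ts)
    # Flush whatever is left when the stream ends.
    if window is not None and window.packets:
        yield window


if __name__ == "__main__":
    stream = [0.0, 1.0, 2.0, 3.0, 4.0, 20.0]
    for w in tumbling_windows(stream, max_count=3, max_seconds=5.0):
        print(w.start, len(w.packets))
```

Passing only `max_seconds` gives time-based windows, only `max_count` gives count-based windows, and both together give the combined behavior in the bottom row of the diagram.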
YouTube Video
to be created
Streaming Tumbling Windows in Dataset Creator
- Streaming app that accepts tshark output and generates windowed statistics for Machine Learning, with windows specified by time or packet count https://github.com/freemansoft/Network-intrusion-dataset-creator
- Tumbling windows vs other window types https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
Examples
These results were not based on the diagram above. They were
generated using `tshark` sample data available
from https://wiki.wireshark.org/SampleCaptures .
The sample data set has 38 packets that start at 11:31:42 and end at
11:32:41. The sample data has about a 1-minute time span.
5-Second Tumbling window
The capture buckets are sequential 5-second windows. We have about
60 seconds of data so we will end up with 12 5-second buckets. The first
and the last buckets always have data.
There are 5-second periods with no traffic, which result in empty
buckets.
~/Network-intrusion-dataset-creator$ python3 main.py --sourcefile smtp-ssl.pcapng -wt 5000
1 packetCount: 21 startTime: 11:31:42.005000 endTime: 11:31:42.450000
2 packetCount: 0 startTime: 11:31:47.005000 endTime: 11:31:47.005000
3 packetCount: 0 startTime: 11:31:52.005000 endTime: 11:31:52.005000
4 packetCount: 4 startTime: 11:31:57.005000 endTime: 11:31:58.335000
5 packetCount: 0 startTime: 11:32:02.005000 endTime: 11:32:02.005000
6 packetCount: 0 startTime: 11:32:07.005000 endTime: 11:32:07.005000
7 packetCount: 0 startTime: 11:32:12.005000 endTime: 11:32:12.005000
8 packetCount: 0 startTime: 11:32:17.005000 endTime: 11:32:17.005000
9 packetCount: 0 startTime: 11:32:22.005000 endTime: 11:32:22.005000
10 packetCount: 4 startTime: 11:32:27.005000 endTime: 11:32:29.517000
11 packetCount: 0 startTime: 11:32:32.005000 endTime: 11:32:32.005000
12 packetCount: 9 startTime: 11:32:37.005000 endTime: 11:32:41.025000
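The way empty buckets appear in a time-based run can be sketched as follows. This is an illustration under stated assumptions rather than the repo's implementation: the function name `count_per_time_window` is hypothetical, timestamps are sorted offsets in seconds, and windows are anchored at the first packet's timestamp, which is what the startTime column above suggests. The sample timestamps are made up for the example and are not taken from the capture.

```python
from typing import List, Sequence


def count_per_time_window(timestamps: Sequence[float], width: float) -> List[int]:
    """Return packet counts for consecutive fixed-width windows.

    Windows with no packets are kept as zero entries so the result
    has one entry per window between the first and last packet.
    """
    if not timestamps:
        return []
    start = timestamps[0]
    # The last packet's window index determines how many windows exist.
    n_windows = int((timestamps[-1] - start) // width) + 1
    counts = [0] * n_windows
    for ts in timestamps:
        counts[int((ts - start) // width)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical offsets in seconds: a burst, a gap, and a final packet.
    sample = [0.0, 0.4, 15.0, 16.3, 59.0]
    print(count_per_time_window(sample, 5.0))
```

Because every window has the same width, the first and last entries are never zero, while any quiet stretch longer than the window width shows up as one or more zero entries.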
10-Second Tumbling Window
The capture buckets are sequential 10-second windows. We have
about 60 seconds of data so there will be 6 10-second buckets. The
first and the last buckets always have data.
There are 10-second periods with no traffic, which result in
empty buckets.
~/Network-intrusion-dataset-creator$ python3 main.py --sourcefile smtp-ssl.pcapng -wt 10000
1 packetCount: 21 startTime: 11:31:42.005000 endTime: 11:31:42.450000
2 packetCount: 4 startTime: 11:31:52.005000 endTime: 11:31:58.335000
3 packetCount: 0 startTime: 11:32:02.005000 endTime: 11:32:02.005000
4 packetCount: 0 startTime: 11:32:12.005000 endTime: 11:32:12.005000
5 packetCount: 4 startTime: 11:32:22.005000 endTime: 11:32:29.517000
6 packetCount: 9 startTime: 11:32:32.005000 endTime: 11:32:41.025000
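The bucket counts for the two time-based runs can be checked with a little arithmetic. The sketch below uses the first startTime and last endTime printed in the output above. The helper name `expected_windows` is hypothetical, and it assumes windows are anchored at the first packet and that a packet landing exactly on a boundary belongs to the next window.

```python
import math
from datetime import datetime

FMT = "%H:%M:%S.%f"

# First and last packet times as printed in the output above.
first = datetime.strptime("11:31:42.005000", FMT)
last = datetime.strptime("11:32:41.025000", FMT)
span_seconds = (last - first).total_seconds()


def expected_windows(span: float, width: float) -> int:
    """Number of fixed-width windows needed to cover a span,
    counting the window that holds the last packet."""
    return math.floor(span / width) + 1


if __name__ == "__main__":
    print(span_seconds)
    print(expected_windows(span_seconds, 5.0))
    print(expected_windows(span_seconds, 10.0))
```

The span is just over 59 seconds, so the 5-second run needs 12 windows and the 10-second run needs 6, matching the output above.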
4-Count Tumbling with 10,000 msec guard
The capture buckets are allowed to contain up to 4 packets. A
bucket is closed early, even if it is empty, when 10 seconds pass before it
collects 4 packets. The first and the last buckets always have data.
The 4-packet window size means that the window length can vary in time. Some of the windows are very short, which changes the next window's start time. Only one window timed out before collecting its 4 packets.
~/Network-intrusion-dataset-creator$ python3 main.py --sourcefile smtp-ssl.pcapng -wp 4 -wt 10000
1 packetCount: 4 startTime: 11:31:42.005000 endTime: 11:31:42.089000
2 packetCount: 4 startTime: 11:31:42.089000 endTime: 11:31:42.132000
3 packetCount: 4 startTime: 11:31:42.132000 endTime: 11:31:42.212000
4 packetCount: 4 startTime: 11:31:42.212000 endTime: 11:31:42.309000
5 packetCount: 4 startTime: 11:31:42.309000 endTime: 11:31:42.450000
6 packetCount: 1 startTime: 11:31:42.450000 endTime: 11:31:42.450000
7 packetCount: 4 startTime: 11:31:52.450000 endTime: 11:31:58.335000
8 packetCount: 4 startTime: 11:32:29.474000 endTime: 11:32:29.517000
9 packetCount: 4 startTime: 11:32:40.938000 endTime: 11:32:41.025000
10 packetCount: 4 startTime: 11:32:41.025000 endTime: 11:32:41.025000
11 packetCount: 1 startTime: 11:32:41.025000 endTime: 11:32:41.025000
20-Count Tumbling with 10,000 msec guard
The capture buckets are allowed to contain up to 20 packets. A
bucket is closed early, even if it is empty, when 10 seconds pass before it
collects 20 packets. The first and the last buckets always have data.
The 20-packet window size means many of the intervals will time out, unlike the 4-packet window in the previous section. There is a lot of initial traffic, which puts most of the packets into the first bucket and leaves fewer than 20. This means there will be either a single window with the remaining packets or multiple windows that timed out because of the low traffic.
~/Network-intrusion-dataset-creator$ python3 main.py --sourcefile smtp-ssl.pcapng -wp 20 -wt 10000
1 packetCount: 20 startTime: 11:31:42.005000 endTime: 11:31:42.450000
2 packetCount: 1 startTime: 11:31:42.450000 endTime: 11:31:42.450000
3 packetCount: 4 startTime: 11:31:52.450000 endTime: 11:31:58.335000
4 packetCount: 0 startTime: 11:32:02.450000 endTime: 11:32:02.450000
5 packetCount: 0 startTime: 11:32:12.450000 endTime: 11:32:12.450000
6 packetCount: 4 startTime: 11:32:22.450000 endTime: 11:32:29.517000
7 packetCount: 9 startTime: 11:32:32.450000 endTime: 11:32:41.025000
Wrap Up
See the repo for code that demonstrates this.
Created 2022 10