Creating Features in Python using tumbling windows
This article originally discussed "Sliding Windows" but actually refers to a variant called "Tumbling Windows".
The first step in using ML for intrusion detection is creating features that can be used for training and detection. I wrote in another blog post about creating features from aggregates of packet streams bounded by tumbling windows. Inbound packets are analyzed and then grouped with other packets that arrive close to them in time.
This post walks through a GitHub repository containing Python code that creates features from Wireshark/tshark packet streams. The program accepts live tshark output or tshark streams generated from captured .pcap files.
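As a rough illustration of the two input modes, the sketch below builds a tshark command for either a live interface or a saved capture and reads its standard output line by line. The function names and the choice of the `-l` flag are assumptions for this example and are not taken from the repository; `-r`, `-i`, `-T ek`, and `-l` are standard tshark options.

```python
import subprocess


def build_tshark_command(pcap_path=None, interface=None):
    """Return a tshark argument list for replaying a file or capturing live.

    Hypothetical helper: the repository may assemble its command differently.
    """
    if (pcap_path is None) == (interface is None):
        raise ValueError("pass exactly one of pcap_path or interface")
    if pcap_path is not None:
        source = ["-r", pcap_path]   # replay packets from a capture file
    else:
        source = ["-i", interface]   # capture live from a network interface
    # -T ek: one JSON record per line; -l: flush stdout after each packet
    return ["tshark", *source, "-T", "ek", "-l"]


def stream_tshark_lines(command):
    """Launch tshark and yield its standard output one line at a time."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
```

Calling `stream_tshark_lines(build_tshark_command(pcap_path="capture.pcap"))` would replay a file named `capture.pcap` (a placeholder name) and requires tshark to be installed and on the path.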
Network Traffic into Tumbling Windows
The example program requires Python and Wireshark/tshark. The Python code uses 4 multiprocessing tasks, making this essentially a 5-core process once tshark is counted. It is 100% CPU bound on a 4-core machine, so I suspect it will run faster on a hex-core machine or above.
There was a tshark + 3 task version that ran 15% faster while consuming 85% of a 4-core machine.
The Python modules/processes communicate via Multiprocessing Queues.
- tshark captures live data or replays data from a pcap file. It emits each packet as a line of text in its EK format. I chose that format because each record is on a single line, so no multi-line JSON assembly is required. The Python processes launch tshark and listen to its standard output.
- PacketCapture is a Python process that reads the tshark output and transforms the data to make it more consumable. It converts the EK output to true JSON and massages some of the label styles to match JSON conventions. The final text is pushed into a message queue.
- PacketAnalyze accepts the dictionary from the Queue. It creates a node pair identifier, identifies the protocol, and forwards the original data, the ID, and the protocol to the next stage via a Queue. PacketAnalyze also captures aggregated statistics across the run. Nothing is done with those at this time, and they are lost when the program exits.
- ServiceIdentity reads the ID, protocol, and packet data structure. It analyzes the packet to identify the higher-level service type of the message. Examples include DNS, SMTP, FTP, TLS, HTTP, SMB, and SMB2. The service list is added to the incoming data set and sent to a topic.
- TimesAndCounts manages the time windows, calculates the statistics for each time bucket/window, and writes them to output. It reads from the inbound topic and aggregates statistics across a set of incoming packets. The statistics are retained for a single time window and are written to a CSV file, with one record per time window.
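The stages above can be sketched roughly as follows. This is a simplified illustration, not the repository's code: the packet keys (`src`, `dst`, `protocol`, `time_ms`), the `SENTINEL` end-of-stream marker, the 5000 default window length, and the choice to emit empty windows are all assumptions. It also assumes packets arrive in time order. The stage functions only call `get()` and `put()`, so they work with `multiprocessing.Queue` between processes or with `queue.Queue` in a single process.

```python
from collections import Counter

SENTINEL = None  # hypothetical end-of-stream marker passed between stages


def analyze_stage(in_q, out_q):
    """Tag each packet with a direction-independent node-pair id."""
    while (packet := in_q.get()) is not SENTINEL:
        pair_id = "|".join(sorted((packet["src"], packet["dst"])))
        out_q.put({**packet, "pair_id": pair_id})
    out_q.put(SENTINEL)  # tell the next stage the stream has ended


def window_stage(in_q, window_ms=5000):
    """Aggregate packets into tumbling windows; return one row per window."""
    rows = []
    window_end = None            # set when the first packet arrives
    counts, pairs = Counter(), set()

    def flush():
        # Snapshot the statistics for the window that just closed.
        rows.append({
            "num_tcp": counts["tcp"],
            "num_udp": counts["udp"],
            "pairs": len(pairs),
            "num_packets": sum(counts.values()),
            "window_end_time": window_end,
        })

    while (packet := in_q.get()) is not SENTINEL:
        if window_end is None:
            window_end = packet["time_ms"] + window_ms
        # Close every window that ended before this packet arrived.
        while packet["time_ms"] >= window_end:
            flush()
            counts.clear()
            pairs.clear()
            window_end += window_ms
        counts[packet["protocol"]] += 1
        pairs.add(packet["pair_id"])
    if window_end is not None:
        flush()                  # write the final, partially filled window
    return rows
```

Because statistics are cleared as soon as a window closes, each packet contributes to exactly one output row, which is the defining property of a tumbling window.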
Sample output
The final output is a CSV file with more than 20 feature columns.
tcp_frame_ln | tcp_ip_ln | tcp_ln | udp_frame_ln | udp_ip_ln | udp_ln | arp_frame_ln | num_tls | num_http | num_ftp | num_ssh | num_smtp | num_dhcp | num_dns | num_nbns | num_smb | num_smb2 | num_pnrp | num_wsdd | num_ssdp | num_tcp | num_udp | num_arp | num_igmp | pairs | num_ports | num_packets | window_end_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 2006 | 1084 | 1118 | 210 | 0 | 2 | 0 | 0 | 0 | 0 | 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 22 | 5 | 18 | 8 | 14 | 46 | 14806 |
0 | 0 | 0 | 3479 | 2699 | 2487 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 15 | 2 | 0 | 0 | 0 | 0 | 0 | 28 | 0 | 6 | 4 | 8 | 34 | 19806 |
0 | 0 | 0 | 16524 | 2781 | 14822 | 0 | 0 | 17 | 0 | 0 | 0 | 3 | 4 | 0 | 1 | 0 | 0 | 6 | 0 | 0 | 33 | 0 | 9 | 5 | 13 | 42 | 24806 |
0 | 0 | 0 | 9798 | 1810 | 8636 | 84 | 0 | 18 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23 | 2 | 2 | 5 | 7 | 27 | 29806 |
0 | 0 | 0 | 16843 | 5915 | 15239 | 420 | 0 | 10 | 0 | 0 | 0 | 0 | 12 | 4 | 0 | 0 | 0 | 6 | 0 | 0 | 36 | 10 | 20 | 10 | 14 | 66 | 34806 |
0 | 0 | 0 | 14842 | 7344 | 12918 | 168 | 0 | 33 | 0 | 0 | 0 | 1 | 10 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 46 | 4 | 6 | 8 | 12 | 56 | 39806 |
0 | 0 | 0 | 8476 | 4324 | 7168 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 2 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 32 | 0 | 0 | 4 | 7 | 32 | 44806 |
0 | 0 | 0 | 5126 | 2956 | 4244 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 6 | 2 | 0 | 0 | 2 | 0 | 0 | 23 | 0 | 0 | 4 | 11 | 23 | 49806 |
0 | 0 | 0 | 2602 | 1535 | 1924 | 210 | 0 | 6 | 0 | 0 | 0 | 1 | 2 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 17 | 5 | 0 | 6 | 10 | 22 | 54806 |
0 | 0 | 0 | 4914 | 2800 | 4168 | 84 | 0 | 6 | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 0 | 0 | 2 | 0 | 0 | 19 | 2 | 0 | 5 | 12 | 21 | 59806 |
6857 | 6615 | 6111 | 18677 | 9873 | 16171 | 504 | 0 | 16 | 0 | 0 | 0 | 7 | 21 | 2 | 4 | 0 | 2 | 4 | 0 | 13 | 59 | 12 | 21 | 16 | 33 | 105 | 64806 |
6929 | 6747 | 6203 | 34439 | 17134 | 29359 | 420 | 0 | 31 | 0 | 0 | 0 | 5 | 23 | 30 | 2 | 0 | 15 | 6 | 0 | 13 | 120 | 10 | 24 | 15 | 36 | 167 | 69806 |
29150 | 14857 | 26074 | 15555 | 8969 | 12973 | 0 | 0 | 17 | 0 | 0 | 0 | 0 | 13 | 17 | 4 | 0 | 5 | 2 | 0 | 46 | 63 | 0 | 4 | 11 | 24 | 113 | 74806 |
0 | 0 | 0 | 2771 | 1020 | 1843 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 10 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 22 | 0 | 0 | 8 | 14 | 22 | 79806 |
0 | 0 | 0 | 8781 | 7696 | 9281 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 19 | 0 | 0 | 6 | 11 | 19 | 84806 |
Updating the Program
The program is extensible. Obvious options include:
- Add more service detectors
- Output the PacketAnalyze summary data to a file
.pcap Notes
Wireshark changed the labels in its JSON output at some point in the last 3-5 years. You may have problems reading some older pcap files.
Alternative Window Strategies
We described the use of non-overlapping time windows (tumbling windows). Other strategies exist, including time windows that overlap by various amounts, which means a given network packet or event contributes to more than one set of features. Overlapping windows might reduce boundary effects, where a timed attack could hide by straddling the edges of non-overlapping windows.
- Tumbling Time Window: Fixed-length, non-overlapping windows that advance at a fixed rate. There may or may not be events in a given window. Data can be in only one window.
- Hopping Window: The window advances at a specific rate with a specific width. The window advances irrespective of events received. This is an overlapping version of the Tumbling Window. Data can be in multiple windows.
- Time-Based Sliding Window: The events that happened in the last N seconds. These windows are data triggered: one is evaluated every time a new event is received. There is at least one event in each window, the trigger event. Data can be in multiple windows.
- Eviction-Based Sliding Window: The window contains the last N elements. Window length (time) varies based on the event rate.
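The four strategies can be contrasted with a small sketch. The function names are made up for this example, windows are assumed to be aligned to time 0 and half-open (`[start, start + width)`), and the timestamps are plain numbers in any consistent unit.

```python
from collections import deque


def tumbling_window(t, width):
    """Return the start of the single tumbling window containing time t."""
    return (t // width) * width


def hopping_windows(t, width, hop):
    """Return the starts of every hopping window that contains time t."""
    starts = []
    start = (t // hop) * hop      # latest window start at or before t
    while start > t - width and start >= 0:
        starts.append(start)
        start -= hop              # step back to the previous window
    return sorted(starts)


def sliding_window_at(timestamps, trigger_time, length):
    """Return the events in the last `length` time units, ending at the trigger."""
    return [ts for ts in timestamps if trigger_time - length < ts <= trigger_time]


def last_n_window(events, n):
    """Return the last n events; older ones are evicted regardless of time."""
    return list(deque(events, maxlen=n))
```

With a width of 10 and a hop of 5, an event at time 12 falls into one tumbling window (starting at 10) but two hopping windows (starting at 5 and 10). Setting the hop equal to the width makes the hopping window behave like a tumbling window.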
References
- Repository:
- Python source code https://github.com/freemansoft/Network-intrusion-dataset-creator This code is 8x faster than the original.
- Other Blogs and Videos:
- Blog: https://joe.blog.freemansoft.com/2021/04/network-intrusion-features-via-sliding.html
- Video: https://youtu.be/b3MaxbAAdDw
- Blog: https://joe.blog.freemansoft.com/2021/04/creating-features-in-python-using.html
- Video: https://youtu.be/jKgGh5a5gFA
- Originating Research
- Research paper the original source code was based on. https://www.researchgate.net/profile/Nadun-Rajasinghe/project/A-customizable-Network-Intrusion-Detection-dataset-creating-framework/attachment/5aff08f8b53d2f63c3ccae32/AS:627686015766528@1526663416701/download/1570426776.pdf?context=ProjectUpdatesLog
- Original Python source repository https://github.com/nrajasin/Network-intrusion-dataset-creator
- Other
- https://www.mikulskibartosz.name/difference-between-tumbling-and-sliding-window/
- https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
- https://docs.lenses.io/3.2/sql/streaming/windowing.html
Created 4/2021
Updated 10/2022