Creating Features in Python using sliding windows
The first step in using ML for intrusion detection is creating features that can be used in training and detection. In another blog post I describe creating features from sliding-window statistics of packet streams. Here we walk through a GitHub repository containing Python code that creates features from Wireshark/tshark packet streams. The program accepts live tshark output or tshark streams created from .pcap files.
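As a rough illustration of the two input modes, the sketch below builds a tshark command line for either a live interface or a .pcap file and reads its output one line at a time. The helper names (`build_tshark_command`, `parse_ek_line`, `read_packets`) are made up for this example and are not taken from the repository. The flags `-i`, `-r`, `-T ek`, and `-l` are standard tshark options. The sketch assumes each output line parses with `json.loads`; the extra conversion and label clean-up the program performs is not reproduced here.

```python
import json
import subprocess


def build_tshark_command(source, live=False):
    """Return a tshark argument list that emits one EK record per line.

    `source` is an interface name when live=True, otherwise a .pcap path.
    """
    if live:
        # -l flushes stdout after each packet so a reader sees packets promptly
        return ["tshark", "-i", source, "-T", "ek", "-l"]
    return ["tshark", "-r", source, "-T", "ek"]


def parse_ek_line(line):
    """Parse one line of EK output; return None for blank or index lines."""
    line = line.strip()
    if not line:
        return None
    record = json.loads(line)
    # EK output interleaves Elasticsearch bulk "index" lines with packet lines
    if "index" in record and "layers" not in record:
        return None
    return record


def read_packets(command):
    """Launch tshark (it must be on PATH) and yield parsed packet records."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            record = parse_ek_line(line)
            if record is not None:
                yield record
```

For example, `read_packets(build_tshark_command("capture.pcap"))` would replay a file, while `build_tshark_command("eth0", live=True)` targets a live interface; both names are placeholders.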
Network Traffic into Sliding Windows
The example program requires Python and Wireshark/tshark. The Python code uses 4 multiprocessing tasks, which together with tshark makes this essentially a 5-core process. It is 100% CPU bound on a 4-core machine, so I suspect it will run faster on a hex-core machine or above. There was a version with tshark plus 3 tasks that ran 15% faster while consuming 85% of a 4-core machine.
The Python modules/processes communicate via Multiprocessing Queues.
- tshark captures live data or replays data from a pcap file. It emits each packet as a line of text in its ek format. I chose that format because each record is on a single line, so no multi-line JSON assembly is required. The Python processes launch tshark and listen to its standard output.
- PacketCapture is a Python process that reads the tshark output and then transforms the data to make it more consumable. It converts the EK to true JSON and adjusts some of the label styles to match JSON conventions. The final text is pushed onto a message queue.
- PacketAnalyze accepts the dictionary from the queue. It creates a node pair identifier, identifies the protocol, and forwards the original data, the ID, and the protocol to the next stage via a queue. PacketAnalyze also captures aggregated statistics across the run. Nothing is done with those at this time, and they are lost when the program exits.
- ServiceIdentity reads an ID, protocol, and packet data structure. It analyzes the packet to identify the higher-level service type of the message. Examples include DNS, SMTP, FTP, TLS, HTTP, SMB, SMB2, etc. The service list is added to the incoming data set and sent to a topic.
- TimesAndCounts manages the time windows, calculates the time bucket/window statistics, and writes them to output. It reads from the inbound topic and aggregates statistics across a set of incoming packets. The statistics are retained for a single time window and are written to a CSV file, one record for each time window.
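The stage layout above can be sketched as a chain of worker loops joined by queues. This is a simplified illustration, not the repository's code: the function names, the `None` end-of-stream sentinel, the packet keys (`src`, `dst`, `layers`), the direction-independent pair ID, and the idea of detecting protocols and services by layer name are all assumptions made for the example.

```python
import multiprocessing as mp


def run_stage(in_queue, out_queue, transform):
    """Apply `transform` to each item from in_queue until a None sentinel."""
    while True:
        item = in_queue.get()
        if item is None:
            # Sentinel: pass it downstream so later stages stop too
            out_queue.put(None)
            return
        out_queue.put(transform(item))


def analyze(packet):
    """Tag a packet with a node pair ID and its transport-level protocol."""
    # Assumption: A->B and B->A traffic share one pair ID
    pair_id = "|".join(sorted([packet["src"], packet["dst"]]))
    layers = packet["layers"]
    protocol = next(
        (p for p in ("tcp", "udp", "arp", "igmp") if p in layers), "other"
    )
    return {"pair_id": pair_id, "protocol": protocol, "packet": packet}


def identify_services(tagged):
    """Attach the higher-level services whose layer names appear in the packet."""
    known = ("dns", "http", "tls", "smb", "smb2", "ftp", "smtp")
    services = [s for s in known if s in tagged["packet"]["layers"]]
    return {**tagged, "services": services}


def start_pipeline(source_queue):
    """Start one process per stage; return (processes, output queue)."""
    analyzed, identified = mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=run_stage, args=(source_queue, analyzed, analyze)),
        mp.Process(
            target=run_stage, args=(analyzed, identified, identify_services)
        ),
    ]
    for worker in workers:
        worker.start()
    return workers, identified
```

To try it, call `start_pipeline` from under an `if __name__ == "__main__":` guard with an `mp.Queue()`, put packet dictionaries followed by `None` on that queue, read from the returned queue until `None` arrives, and then `join()` the workers. Because `run_stage` only needs `get` and `put`, the same functions also work with a plain `queue.Queue` in a single process, which is handy for testing.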
Sample output
The final output is a CSV file with 28 columns (features), one row per time window.
tcp_frame_ln | tcp_ip_ln | tcp_ln | udp_frame_ln | udp_ip_ln | udp_ln | arp_frame_ln | num_tls | num_http | num_ftp | num_ssh | num_smtp | num_dhcp | num_dns | num_nbns | num_smb | num_smb2 | num_pnrp | num_wsdd | num_ssdp | num_tcp | num_udp | num_arp | num_igmp | pairs | num_ports | num_packets | window_end_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 2006 | 1084 | 1118 | 210 | 0 | 2 | 0 | 0 | 0 | 0 | 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 22 | 5 | 18 | 8 | 14 | 46 | 14806 |
0 | 0 | 0 | 3479 | 2699 | 2487 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 15 | 2 | 0 | 0 | 0 | 0 | 0 | 28 | 0 | 6 | 4 | 8 | 34 | 19806 |
0 | 0 | 0 | 16524 | 2781 | 14822 | 0 | 0 | 17 | 0 | 0 | 0 | 3 | 4 | 0 | 1 | 0 | 0 | 6 | 0 | 0 | 33 | 0 | 9 | 5 | 13 | 42 | 24806 |
0 | 0 | 0 | 9798 | 1810 | 8636 | 84 | 0 | 18 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 23 | 2 | 2 | 5 | 7 | 27 | 29806 |
0 | 0 | 0 | 16843 | 5915 | 15239 | 420 | 0 | 10 | 0 | 0 | 0 | 0 | 12 | 4 | 0 | 0 | 0 | 6 | 0 | 0 | 36 | 10 | 20 | 10 | 14 | 66 | 34806 |
0 | 0 | 0 | 14842 | 7344 | 12918 | 168 | 0 | 33 | 0 | 0 | 0 | 1 | 10 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 46 | 4 | 6 | 8 | 12 | 56 | 39806 |
0 | 0 | 0 | 8476 | 4324 | 7168 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 2 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 32 | 0 | 0 | 4 | 7 | 32 | 44806 |
0 | 0 | 0 | 5126 | 2956 | 4244 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 6 | 2 | 0 | 0 | 2 | 0 | 0 | 23 | 0 | 0 | 4 | 11 | 23 | 49806 |
0 | 0 | 0 | 2602 | 1535 | 1924 | 210 | 0 | 6 | 0 | 0 | 0 | 1 | 2 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | 17 | 5 | 0 | 6 | 10 | 22 | 54806 |
0 | 0 | 0 | 4914 | 2800 | 4168 | 84 | 0 | 6 | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 0 | 0 | 2 | 0 | 0 | 19 | 2 | 0 | 5 | 12 | 21 | 59806 |
6857 | 6615 | 6111 | 18677 | 9873 | 16171 | 504 | 0 | 16 | 0 | 0 | 0 | 7 | 21 | 2 | 4 | 0 | 2 | 4 | 0 | 13 | 59 | 12 | 21 | 16 | 33 | 105 | 64806 |
6929 | 6747 | 6203 | 34439 | 17134 | 29359 | 420 | 0 | 31 | 0 | 0 | 0 | 5 | 23 | 30 | 2 | 0 | 15 | 6 | 0 | 13 | 120 | 10 | 24 | 15 | 36 | 167 | 69806 |
29150 | 14857 | 26074 | 15555 | 8969 | 12973 | 0 | 0 | 17 | 0 | 0 | 0 | 0 | 13 | 17 | 4 | 0 | 5 | 2 | 0 | 46 | 63 | 0 | 4 | 11 | 24 | 113 | 74806 |
0 | 0 | 0 | 2771 | 1020 | 1843 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 10 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 22 | 0 | 0 | 8 | 14 | 22 | 79806 |
0 | 0 | 0 | 8781 | 7696 | 9281 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 19 | 0 | 0 | 6 | 11 | 19 | 84806 |
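
Each row above summarizes the packets seen in one time window. The sketch below shows one way such rows could be computed. It is an illustration under stated assumptions, not the repository's implementation: the input keys (`time_ms`, `protocol`, `frame_len`, `src`, `dst`), the function names, and the default window length and step are hypothetical. The output keys reuse a handful of the column names from the table.

```python
from collections import Counter, deque


def summarize(window_packets, window_end):
    """Build one feature row from the packets inside a single window."""
    protocols = Counter(p["protocol"] for p in window_packets)
    frame_bytes = Counter()
    for p in window_packets:
        frame_bytes[p["protocol"]] += p["frame_len"]
    # Assumption: a node pair is counted once regardless of direction
    pairs = {tuple(sorted((p["src"], p["dst"]))) for p in window_packets}
    return {
        "tcp_frame_ln": frame_bytes["tcp"],
        "udp_frame_ln": frame_bytes["udp"],
        "num_tcp": protocols["tcp"],
        "num_udp": protocols["udp"],
        "pairs": len(pairs),
        "num_packets": len(window_packets),
        "window_end_time": window_end,
    }


def sliding_window_rows(packets, window_ms=5000, step_ms=5000):
    """Return one row per window; a window covers [end - window_ms, end)."""
    packets = sorted(packets, key=lambda p: p["time_ms"])
    if not packets:
        return []
    rows, active, next_index = [], deque(), 0
    window_end = packets[0]["time_ms"] + window_ms
    last_time = packets[-1]["time_ms"]
    while True:
        # Pull in packets that arrived before this window closes
        while (next_index < len(packets)
               and packets[next_index]["time_ms"] < window_end):
            active.append(packets[next_index])
            next_index += 1
        # Drop packets that fell out of the back of the window
        while active and active[0]["time_ms"] < window_end - window_ms:
            active.popleft()
        rows.append(summarize(active, window_end))
        if window_end > last_time:
            return rows
        window_end += step_ms
```

When `step_ms` equals `window_ms` the windows tile the stream without overlap; a smaller `step_ms` makes consecutive windows overlap, so a packet can contribute to more than one row.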
Updating the Program
The program is extensible. Obvious options include
- Add additional service detectors
- Output the PacketAnalyze summary data to a file
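For the second option, one possible shape is a small accumulator that is updated per packet and written out once at shutdown. Everything here is hypothetical: `RunSummary`, its fields, and the JSON layout are invented for illustration and do not reflect what PacketAnalyze currently tracks.

```python
import json
from collections import Counter


class RunSummary:
    """Hypothetical holder for run-wide totals (not a class from the repository)."""

    def __init__(self):
        self.protocol_counts = Counter()
        self.pair_counts = Counter()

    def record(self, pair_id, protocol):
        """Count one packet against its protocol and node pair."""
        self.protocol_counts[protocol] += 1
        self.pair_counts[pair_id] += 1

    def write(self, path):
        """Save the totals as JSON so they survive after the program exits."""
        totals = {
            "protocols": dict(self.protocol_counts),
            "pairs": dict(self.pair_counts),
            "num_packets": sum(self.protocol_counts.values()),
        }
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(totals, handle, indent=2, sort_keys=True)
```

A stage would call `record(...)` for every packet it handles and `write(...)` once when it sees the end of the stream.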
.pcap Notes
Wireshark changed the labeling in its JSON output at some point in the last 3-5 years. You may have problems reading some older pcap files.
References
- Repository:
  - Python source code: https://github.com/freemansoft/Network-intrusion-dataset-creator This code is 8x faster than the original.
- Other Blogs and Videos:
  - Blog: https://joe.blog.freemansoft.com/2021/04/network-intrusion-features-via-sliding.html
  - Video: https://youtu.be/b3MaxbAAdDw
  - Blog: https://joe.blog.freemansoft.com/2021/04/creating-features-in-python-using.html
  - Video: https://youtu.be/jKgGh5a5gFA
- Originating Research:
  - Research paper the original source code was based on: https://www.researchgate.net/profile/Nadun-Rajasinghe/project/A-customizable-Network-Intrusion-Detection-dataset-creating-framework/attachment/5aff08f8b53d2f63c3ccae32/AS:627686015766528@1526663416701/download/1570426776.pdf?context=ProjectUpdatesLog
  - Original Python source repository: https://github.com/nrajasin/Network-intrusion-dataset-creator