Creating Features in Python using tumbling windows

This article originally discussed "Sliding Windows", but the technique it describes is a variant called "Tumbling Windows".

The first step in using ML for intrusion analysis and detection is creating features that can be used in training and detection.  I wrote in another blog post about creating features from tumbling-window aggregates of packet streams. Inbound packets are analyzed and then grouped with other packets that arrive close to each other in time.

We can walk through the steps of a GitHub repository that contains Python code for creating features from Wireshark/tshark packet streams. The program accepts live tshark output or tshark streams generated from captured .pcap files.
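As a rough illustration of the two input modes, here is a minimal sketch of launching tshark and reading its line-per-packet ek output. The function names are hypothetical and are not taken from the repository. The tshark flags used (-T ek, -l, -r, -i) are standard tshark options, but whether the repository uses exactly this combination is an assumption.

```python
import json
import subprocess


def build_tshark_command(pcap_path=None, interface=None):
    """Build a tshark command line that emits one EK JSON record per line.

    Hypothetical helper: the real program may assemble its command differently.
    With neither argument set, tshark captures on its default interface.
    """
    command = ["tshark", "-T", "ek", "-l"]  # -T ek: one JSON record per line, -l: flush after each packet
    if pcap_path is not None:
        command += ["-r", pcap_path]  # replay a captured file
    elif interface is not None:
        command += ["-i", interface]  # capture live traffic
    return command


def parse_ek_line(line):
    """Return the packet dict for a data line, or None for blank and index lines."""
    line = line.strip()
    if not line:
        return None
    record = json.loads(line)
    if "index" in record and "layers" not in record:
        return None  # bulk-format index line that accompanies each packet
    return record


def read_packets(command):
    """Launch tshark and yield parsed packet records from its standard output."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as process:
        for line in process.stdout:
            packet = parse_ek_line(line)
            if packet is not None:
                yield packet
```

Something like read_packets(build_tshark_command(pcap_path="capture.pcap")) would then stream packets from a file, where capture.pcap is a placeholder name.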

Network Traffic into Tumbling Windows

The example program requires Python and Wireshark/tshark.  The Python code uses 4 multiprocessing tasks, which together with tshark makes this essentially a 5-core process.  It is 100% CPU bound on a 4-core machine, so I suspect it will run faster on a hex-core machine or above.

There was a tshark+3 task version that ran 15% faster while consuming 85% of a 4-core machine.

The Python modules/processes communicate via Multiprocessing Queues.
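A minimal sketch of that pattern, assuming each stage is a worker process that reads from one queue and writes to the next. The names stage_worker, start_pipeline, and SENTINEL are illustrative only, and the shutdown-by-sentinel convention is an assumption rather than something taken from the repository.

```python
from multiprocessing import Process, Queue

SENTINEL = None  # assumed end-of-stream marker so each stage can shut down


def stage_worker(inbound, outbound, transform):
    """Read items from inbound, apply transform, and forward results to outbound."""
    while True:
        item = inbound.get()
        if item is SENTINEL:
            outbound.put(SENTINEL)  # pass the shutdown signal downstream
            break
        outbound.put(transform(item))


def start_pipeline(transforms):
    """Start one process per transform, chained together by queues.

    Returns the first queue, the last queue, and the started processes.
    """
    queues = [Queue() for _ in range(len(transforms) + 1)]
    processes = []
    for index, transform in enumerate(transforms):
        process = Process(
            target=stage_worker,
            args=(queues[index], queues[index + 1], transform),
        )
        process.start()
        processes.append(process)
    return queues[0], queues[-1], processes
```

The caller puts items and then SENTINEL on the first queue and reads from the last queue until SENTINEL arrives. On platforms that spawn rather than fork worker processes, the transforms need to be module-level functions and start_pipeline should be called under an if __name__ == "__main__": guard.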

Flow of application located on GitHub

  1. tshark captures live data or replays data from a pcap file. It emits each packet as a line of text in its ek format. I chose ek because each record is on a single line, so no multi-line JSON assembly is required. The Python processes launch tshark and listen to its standard output.
  2. PacketCapture is a Python process that reads the tshark output and transforms the data to make it more consumable.  It converts the EK to true JSON and massages some of the label styles to the JSON standard.  The final text is pushed into a message queue.
  3. PacketAnalyze accepts the dictionary from the Queue.  It creates a node pair identifier, identifies the protocol, and forwards the original data, the ID, and the protocol to the next stage via a Queue.  PacketAnalyze also captures aggregated statistics across the run. Nothing is done with those at this time, and they are lost when the program exits.
  4. ServiceIdentity reads an ID, protocol, and packet data structure.  It analyzes the packet to identify the higher-level service type of the message.  Examples include DNS, SMTP, FTP, TLS, HTTP, SMB, and SMB2.  The service list is added to the incoming data set and sent to a topic.
  5. TimesAndCounts manages the time windows, calculates the time bucket/window statistics, and writes them to output.  It reads from the inbound topic and aggregates statistics across a set of incoming packets.  The statistics are retained for a single time window and are written to a CSV file, with one record for each time window.
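To make the last stage more concrete, here is a simplified sketch of tumbling-window aggregation combined with a direction-independent node pair identifier. It is not the repository's implementation. The input keys (time_ms, src, dst, protocol) are hypothetical, only a handful of the output columns are computed, and starting the first window at the first packet and skipping empty windows are assumptions.

```python
from collections import Counter


def pair_id(src, dst):
    """Direction-independent identifier for two endpoints."""
    return "|".join(sorted((src, dst)))


def summarize(window_end, packets_in_window):
    """Collapse the packets of one window into a single record of counts."""
    protocols = Counter(p["protocol"] for p in packets_in_window)
    pairs = {pair_id(p["src"], p["dst"]) for p in packets_in_window}
    return {
        "num_tcp": protocols["tcp"],
        "num_udp": protocols["udp"],
        "pairs": len(pairs),
        "num_packets": len(packets_in_window),
        "window_end_time": window_end,
    }


def tumbling_windows(packets, window_ms):
    """Yield one record per non-overlapping window, given packets in time order."""
    window_end = None
    current = []
    for packet in packets:
        if window_end is None:
            # Assumption: the first window starts at the first packet.
            window_end = packet["time_ms"] + window_ms
        while packet["time_ms"] >= window_end:
            if current:  # assumption: windows without packets are skipped
                yield summarize(window_end, current)
                current = []
            window_end += window_ms
        current.append(packet)
    if current:
        yield summarize(window_end, current)
```

Each packet lands in exactly one window, and the statistics are discarded as soon as the window's record has been emitted, which matches the single-window retention described in step 5.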

Sample output

The final output is a CSV file with over 20 columns, or features. The sample below has 28 columns and one row per time window.

tcp_frame_ln tcp_ip_ln tcp_ln udp_frame_ln udp_ip_ln udp_ln arp_frame_ln num_tls num_http num_ftp num_ssh num_smtp num_dhcp num_dns num_nbns num_smb num_smb2 num_pnrp num_wsdd num_ssdp num_tcp num_udp num_arp num_igmp pairs num_ports num_packets window_end_time
0 0 0 2006 1084 1118 210 0 2 0 0 0 0 16 4 0 0 0 0 0 0 22 5 18 8 14 46 14806
0 0 0 3479 2699 2487 0 0 5 0 0 0 0 6 15 2 0 0 0 0 0 28 0 6 4 8 34 19806
0 0 0 16524 2781 14822 0 0 17 0 0 0 3 4 0 1 0 0 6 0 0 33 0 9 5 13 42 24806
0 0 0 9798 1810 8636 84 0 18 0 0 0 5 0 0 0 0 0 0 0 0 23 2 2 5 7 27 29806
0 0 0 16843 5915 15239 420 0 10 0 0 0 0 12 4 0 0 0 6 0 0 36 10 20 10 14 66 34806
0 0 0 14842 7344 12918 168 0 33 0 0 0 1 10 2 0 0 0 0 0 0 46 4 6 8 12 56 39806
0 0 0 8476 4324 7168 0 0 22 0 0 0 0 2 8 0 0 0 0 0 0 32 0 0 4 7 32 44806
0 0 0 5126 2956 4244 0 0 5 0 0 0 0 6 6 2 0 0 2 0 0 23 0 0 4 11 23 49806
0 0 0 2602 1535 1924 210 0 6 0 0 0 1 2 4 4 0 0 0 0 0 17 5 0 6 10 22 54806
0 0 0 4914 2800 4168 84 0 6 0 0 0 0 3 3 3 0 0 2 0 0 19 2 0 5 12 21 59806
6857 6615 6111 18677 9873 16171 504 0 16 0 0 0 7 21 2 4 0 2 4 0 13 59 12 21 16 33 105 64806
6929 6747 6203 34439 17134 29359 420 0 31 0 0 0 5 23 30 2 0 15 6 0 13 120 10 24 15 36 167 69806
29150 14857 26074 15555 8969 12973 0 0 17 0 0 0 0 13 17 4 0 5 2 0 46 63 0 4 11 24 113 74806
0 0 0 2771 1020 1843 0 0 3 0 0 0 1 10 6 1 0 0 0 0 0 22 0 0 8 14 22 79806
0 0 0 8781 7696 9281 0 0 3 0 0 0 1 2 0 0 0 5 0 0 0 19 0 0 6 11 19 84806
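For downstream training code, the output can be read back into per-window records. The sketch below assumes the file on disk is comma-delimited with a header row (the sample above is displayed with whitespace separators) and that every value is an integer. load_windows is a hypothetical helper name. The embedded data is the header and the first two sample rows from above, rewritten with commas.

```python
import csv
import io


def load_windows(stream, delimiter=","):
    """Read window records into a list of dicts with integer values."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    return [{name: int(value) for name, value in row.items()} for row in reader]


# Header and first two sample windows from above, with comma separators.
sample = io.StringIO(
    "tcp_frame_ln,tcp_ip_ln,tcp_ln,udp_frame_ln,udp_ip_ln,udp_ln,arp_frame_ln,"
    "num_tls,num_http,num_ftp,num_ssh,num_smtp,num_dhcp,num_dns,num_nbns,num_smb,num_smb2,"
    "num_pnrp,num_wsdd,num_ssdp,num_tcp,num_udp,num_arp,num_igmp,"
    "pairs,num_ports,num_packets,window_end_time\n"
    "0,0,0,2006,1084,1118,210,"
    "0,2,0,0,0,0,16,4,0,0,"
    "0,0,0,0,22,5,18,"
    "8,14,46,14806\n"
    "0,0,0,3479,2699,2487,0,"
    "0,5,0,0,0,0,6,15,2,0,"
    "0,0,0,0,28,0,6,"
    "4,8,34,19806\n"
)
windows = load_windows(sample)
print(windows[0]["num_packets"], windows[0]["window_end_time"])
```

Each dict is one feature vector. In the sample, window_end_time advances by 5000 from one row to the next.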

Updating the Program

The program is extensible.  Obvious options include:
  1. Add additional service detectors
  2. Output the PacketAnalyze summary data to a file

.pcap Notes

Wireshark changed the field labels in its JSON output at some point in the last 3-5 years. You may have problems reading some older pcap files.

Alternative Window Strategies

We described the use of non-overlapping time windows (Tumbling Windows). There are also other strategies, including time windows that overlap by various amounts.  With overlapping windows, a given network packet or event contributes to more than one set of features. This approach can reduce boundary effects, where a timed attack could hide by straddling the edge between two windows.
  • Tumbling Time Window: Fixed time non-overlapping windows that advance at a fixed rate.  Windows have a fixed length. There may or may not be events in a given window. Data can be in only one window.
  • Hopping Window:  The window advances at a specific rate with a specific width.  The window advances irrespective of events received.  This is an overlapping version of the Tumbling Window. Data can be in multiple windows.
  • Time-Based Sliding Window: The events that happened in the last N seconds.  They are triggered every time a new event is received. They are data triggered.  There is at least one event in each window, the trigger event. Data can be in multiple windows.
  • Eviction-Based Sliding Window: The window contains the last N elements. Window length (time) varies based on the event rate.
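The difference between these strategies comes down to which windows an event belongs to. The sketch below illustrates that under simplifying assumptions: tumbling and hopping windows are aligned to time zero, times are in milliseconds, and all function and class names are hypothetical.

```python
from collections import deque


def tumbling_window_start(time_ms, size_ms):
    """Start of the single tumbling window that contains time_ms."""
    return (time_ms // size_ms) * size_ms


def hopping_window_starts(time_ms, size_ms, hop_ms):
    """Starts of every hopping window [start, start + size_ms) that contains time_ms."""
    start = (time_ms // hop_ms) * hop_ms  # latest window that has already opened
    starts = []
    while start > time_ms - size_ms and start >= 0:
        starts.append(start)
        start -= hop_ms
    return sorted(starts)


class TimeSlidingWindow:
    """Events from the last length_ms, re-evaluated each time an event arrives."""

    def __init__(self, length_ms):
        self.length_ms = length_ms
        self.events = deque()

    def add(self, time_ms, event):
        self.events.append((time_ms, event))
        # Drop events that have aged out; the new event itself always remains.
        while self.events[0][0] <= time_ms - self.length_ms:
            self.events.popleft()
        return [item for _, item in self.events]


class LastNWindow:
    """Eviction-based window that holds the most recent N events."""

    def __init__(self, size):
        self.events = deque(maxlen=size)  # the oldest event is evicted automatically

    def add(self, event):
        self.events.append(event)
        return list(self.events)
```

With a 5000 ms window and a 1000 ms hop, an event at 7000 ms falls into five hopping windows but only one tumbling window. Setting the hop equal to the window size reduces the hopping window to the tumbling case.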

    References

    • Repository: 
      • Python source code https://github.com/freemansoft/Network-intrusion-dataset-creator This code is 8x faster than the original.
    • Other Blogs and Videos: 
      • Blog: https://joe.blog.freemansoft.com/2021/04/network-intrusion-features-via-sliding.html
      • Blog: https://joe.blog.freemansoft.com/2021/04/creating-features-in-python-using.html
        • Video: https://youtu.be/jKgGh5a5gFA
    • Originating Research 
      • Research paper the original source code was based on. https://www.researchgate.net/profile/Nadun-Rajasinghe/project/A-customizable-Network-Intrusion-Detection-dataset-creating-framework/attachment/5aff08f8b53d2f63c3ccae32/AS:627686015766528@1526663416701/download/1570426776.pdf?context=ProjectUpdatesLog
      • Original Python source repository https://github.com/nrajasin/Network-intrusion-dataset-creator
    • Other
      • https://www.mikulskibartosz.name/difference-between-tumbling-and-sliding-window/
      • https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
      • https://docs.lenses.io/3.2/sql/streaming/windowing.html

    Created 4/2021
    Updated 10/2022
