Creating Features in Python using sliding windows

The first step in using ML for intrusion detection is creating features that can be used for training and detection.  I talk in another blog post about creating features from sliding-window statistics of packet streams.  This post walks through a GitHub repository containing Python code that creates features from Wireshark/tshark packet streams. The program accepts live tshark output or tshark streams created from .pcap files.
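
Both input modes come down to how tshark is launched. Here is a minimal sketch, assuming tshark is on the PATH. The function names build_tshark_command and read_tshark_lines are illustrative and are not taken from the repository. The flags are standard tshark options: -T ek selects the newline-delimited EK output, -l flushes output after each packet, -r reads a capture file and -i selects a live interface.

```python
import subprocess
from typing import Iterator, List, Optional


def build_tshark_command(pcap_path: Optional[str] = None,
                         interface: Optional[str] = None) -> List[str]:
    """Build a tshark argument list that emits one EK record per line."""
    # -T ek: newline-delimited JSON output, -l: flush stdout after each packet
    command = ["tshark", "-T", "ek", "-l"]
    if pcap_path is not None:
        command += ["-r", pcap_path]   # replay packets from a capture file
    elif interface is not None:
        command += ["-i", interface]   # capture live from the named interface
    return command


def read_tshark_lines(command: List[str]) -> Iterator[str]:
    """Launch the command and yield each non-empty line it writes to stdout."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as process:
        for line in process.stdout:
            line = line.strip()
            if line:
                yield line
```

With neither argument set, tshark falls back to its default capture interface. A caller would iterate over read_tshark_lines(build_tshark_command(pcap_path="capture.pcap")), where the file name is a placeholder.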

Network Traffic into Sliding Windows

The example program requires Python and Wireshark/tshark.  The Python code uses 4 multiprocessing tasks which, together with tshark, make this essentially a 5-core process.  It is 100% CPU bound on a 4-core machine, so I suspect it will run faster on a hex-core machine or above.

There was a tshark+3 task version that ran 15% faster consuming 85% of a 4 core machine. 

The Python modules/processes communicate via Multiprocessing Queues.
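
The queue hand-off between processes can be sketched as below. This is an illustration of the pattern and not the repository's code: the stage functions add_length and add_parity, the STOP sentinel and run_pipeline are all hypothetical names. Each stage reads from an inbound queue, transforms the item and writes to an outbound queue. A sentinel value travels down the chain to shut the stages down in order.

```python
import multiprocessing as mp

STOP = None  # sentinel: tells the downstream stage that no more items are coming


def run_stage(transform, inbound, outbound):
    """Read items from inbound, apply transform, push results to outbound."""
    while True:
        item = inbound.get()
        if item is STOP:
            outbound.put(STOP)  # pass the shutdown signal along the chain
            break
        outbound.put(transform(item))


def add_length(text):
    """Hypothetical first stage: wrap a raw line with its length."""
    return {"text": text, "length": len(text)}


def add_parity(record):
    """Hypothetical second stage: tag the record as even or odd length."""
    record["parity"] = "even" if record["length"] % 2 == 0 else "odd"
    return record


def run_pipeline(lines):
    """Run the two stages in separate processes and collect their output."""
    first, second, results = mp.Queue(), mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=run_stage, args=(add_length, first, second)),
        mp.Process(target=run_stage, args=(add_parity, second, results)),
    ]
    for worker in workers:
        worker.start()
    for line in lines:
        first.put(line)
    first.put(STOP)
    # Drain the results before joining so the workers are never blocked on a full pipe.
    collected = list(iter(results.get, STOP))
    for worker in workers:
        worker.join()
    return collected
```

When trying it, call run_pipeline(["ab", "abc"]) from under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that spawn new interpreters instead of forking.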

Flow of application located on GitHub

  1. tshark captures live data or replays data from a pcap file. It emits each packet as a line of text in its ek format. I chose that format because each record is on a single line, so no multi-line JSON assembly is required. The Python processes launch tshark and listen to its standard out.
  2. PacketCapture is a Python process that reads the tshark output and transforms the data to make it more consumable.  It converts the EK records to true JSON and adjusts some of the label styles to follow JSON conventions.  The final text is pushed onto a message queue.
  3. PacketAnalyze accepts the dictionary from the Queue.  It creates a node pair identifier, identifies the protocol, and forwards the original data, the id, and the protocol to the next stage via a Queue.  PacketAnalyze also captures aggregated statistics across the run. Nothing is done with those at this time and they are lost when the program exits.
  4. ServiceIdentity reads an ID, protocol, and packet data structure.  It analyzes the packet to identify the higher-level service type of the message.  Examples include DNS, SMTP, FTP, TLS, HTTP, SMB, SMB2, etc.  The service list is added to the incoming data set and sent to the next queue.
  5. TimesAndCounts manages the time windows, calculates the statistics for each time bucket/window, and writes them to output.  It reads from its inbound queue and aggregates statistics across a set of incoming packets.  The statistics are retained for a single time window and are written to a CSV file, one record for each time window.
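
Steps 3 and 5 above can be sketched in a few lines. This is an approximation under stated assumptions and not the repository's implementation. The packet keys time_ms, src, dst, protocol and length are hypothetical, the 5000 unit window length is a guess, and the windows here are fixed-length and non-overlapping. The output keys borrow the naming style of the CSV columns.

```python
from collections import Counter


def pair_id(src, dst):
    """Direction-independent identifier for two communicating nodes."""
    low, high = sorted((src, dst))
    return f"{low}<->{high}"


def window_rows(packets, window_ms=5000):
    """Yield one statistics dict per fixed-length time window.

    packets: iterable of dicts ordered by time, each with the hypothetical
    keys time_ms, src, dst, protocol and length.
    """
    window_end = None
    counts, pairs = Counter(), set()
    for packet in packets:
        if window_end is None:
            # The first packet anchors the first window.
            window_end = packet["time_ms"] + window_ms
        while packet["time_ms"] >= window_end:
            # Close the current window (possibly empty) and start the next one.
            yield {"window_end_time": window_end, "pairs": len(pairs), **counts}
            counts, pairs = Counter(), set()
            window_end += window_ms
        counts["num_packets"] += 1
        counts[f"num_{packet['protocol']}"] += 1
        counts[f"{packet['protocol']}_frame_ln"] += packet["length"]
        pairs.add(pair_id(packet["src"], packet["dst"]))
    if window_end is not None:
        # Flush the final, partially filled window.
        yield {"window_end_time": window_end, "pairs": len(pairs), **counts}
```

Sorting the two addresses before joining them means traffic in either direction between the same two nodes counts as one pair.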

Sample output

The final output is a CSV with over 20 columns, or features (28 in the sample below), and one row per time window.

tcp_frame_ln tcp_ip_ln tcp_ln udp_frame_ln udp_ip_ln udp_ln arp_frame_ln num_tls num_http num_ftp num_ssh num_smtp num_dhcp num_dns num_nbns num_smb num_smb2 num_pnrp num_wsdd num_ssdp num_tcp num_udp num_arp num_igmp pairs num_ports num_packets window_end_time
0 0 0 2006 1084 1118 210 0 2 0 0 0 0 16 4 0 0 0 0 0 0 22 5 18 8 14 46 14806
0 0 0 3479 2699 2487 0 0 5 0 0 0 0 6 15 2 0 0 0 0 0 28 0 6 4 8 34 19806
0 0 0 16524 2781 14822 0 0 17 0 0 0 3 4 0 1 0 0 6 0 0 33 0 9 5 13 42 24806
0 0 0 9798 1810 8636 84 0 18 0 0 0 5 0 0 0 0 0 0 0 0 23 2 2 5 7 27 29806
0 0 0 16843 5915 15239 420 0 10 0 0 0 0 12 4 0 0 0 6 0 0 36 10 20 10 14 66 34806
0 0 0 14842 7344 12918 168 0 33 0 0 0 1 10 2 0 0 0 0 0 0 46 4 6 8 12 56 39806
0 0 0 8476 4324 7168 0 0 22 0 0 0 0 2 8 0 0 0 0 0 0 32 0 0 4 7 32 44806
0 0 0 5126 2956 4244 0 0 5 0 0 0 0 6 6 2 0 0 2 0 0 23 0 0 4 11 23 49806
0 0 0 2602 1535 1924 210 0 6 0 0 0 1 2 4 4 0 0 0 0 0 17 5 0 6 10 22 54806
0 0 0 4914 2800 4168 84 0 6 0 0 0 0 3 3 3 0 0 2 0 0 19 2 0 5 12 21 59806
6857 6615 6111 18677 9873 16171 504 0 16 0 0 0 7 21 2 4 0 2 4 0 13 59 12 21 16 33 105 64806
6929 6747 6203 34439 17134 29359 420 0 31 0 0 0 5 23 30 2 0 15 6 0 13 120 10 24 15 36 167 69806
29150 14857 26074 15555 8969 12973 0 0 17 0 0 0 0 13 17 4 0 5 2 0 46 63 0 4 11 24 113 74806
0 0 0 2771 1020 1843 0 0 3 0 0 0 1 10 6 1 0 0 0 0 0 22 0 0 8 14 22 79806
0 0 0 8781 7696 9281 0 0 3 0 0 0 1 2 0 0 0 5 0 0 0 19 0 0 6 11 19 84806
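
For a downstream training step, the rows could be loaded with the standard csv module. This sketch assumes a comma-delimited file with a header row and integer values. The load_feature_rows name is illustrative, and the delimiter parameter is there in case the actual file is separated differently. The in-memory sample uses three columns and two rows taken from the output above.

```python
import csv
import io


def load_feature_rows(handle, delimiter=","):
    """Read window rows from an open text file and convert every value to int."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [{name: int(value) for name, value in row.items()} for row in reader]


# Three columns from the first two sample rows, held in memory for the demo.
sample = io.StringIO(
    "num_udp,num_packets,window_end_time\n"
    "22,46,14806\n"
    "28,34,19806\n"
)
rows = load_feature_rows(sample)
print(rows[0]["num_packets"])  # prints 46
```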

Updating the Program

The program is extensible.  Obvious options include:
  1. Add additional service detectors
  2. Output the PacketAnalyze summary data to a file
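
One way to make the first option cheap is a registry of detector functions. The sketch below is hypothetical and does not reflect how ServiceIdentity is actually structured. It assumes each packet arrives as a dict of decoded layers keyed by protocol name, and the srcport and dstport keys are invented for the example.

```python
def detect_dns(layers):
    """A packet with a decoded dns layer is treated as DNS traffic."""
    return "dns" in layers


def detect_http(layers):
    """A packet with a decoded http layer is treated as HTTP traffic."""
    return "http" in layers


# Registry: service name -> predicate over the packet's decoded layers.
DETECTORS = {"dns": detect_dns, "http": detect_http}


def identify_services(layers, detectors=DETECTORS):
    """Return the names of all services whose detector matches the packet."""
    return [name for name, detect in detectors.items() if detect(layers)]


def detect_ntp(layers):
    """Port-based check: NTP uses UDP port 123 (key names are hypothetical)."""
    udp = layers.get("udp", {})
    return "123" in (udp.get("srcport"), udp.get("dstport"))


# Adding a service detector is one more registry entry.
DETECTORS["ntp"] = detect_ntp
```

With this shape, identify_services needs no changes when a detector is added, and each new service gets its own small function that can be tested in isolation.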

.pcap Notes

Wireshark changed the labeling in its JSON output at some point in the last 3-5 years. You may have problems reading some older pcap files.
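
A common way to tolerate renamed labels is to look a field up under several candidate names. The helper below is a generic sketch. The two key names in IP_SOURCE_KEYS are hypothetical stand-ins for an older and a newer label. They are not confirmed tshark field names.

```python
def first_present(record, candidate_keys, default=None):
    """Return the value of the first candidate key found in record."""
    for key in candidate_keys:
        if key in record:
            return record[key]
    return default


# Hypothetical newer and older labels for the same field, newest first.
IP_SOURCE_KEYS = ("ip_src", "ip_ip_src")
```

For example, first_present(packet_fields, IP_SOURCE_KEYS, "unknown") returns whichever label is present and falls back to "unknown" when neither is, where packet_fields is a placeholder for one decoded packet.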

References

  • Repository:
    • Python source code: https://github.com/freemansoft/Network-intrusion-dataset-creator This code is 8x faster than the original.
  • Other Blogs and Videos:
    • Blog: https://joe.blog.freemansoft.com/2021/04/network-intrusion-features-via-sliding.html
    • Blog: https://joe.blog.freemansoft.com/2021/04/creating-features-in-python-using.html
      • Video: https://youtu.be/jKgGh5a5gFA
  • Originating Research:
    • Research paper the original source code was based on: https://www.researchgate.net/profile/Nadun-Rajasinghe/project/A-customizable-Network-Intrusion-Detection-dataset-creating-framework/attachment/5aff08f8b53d2f63c3ccae32/AS:627686015766528@1526663416701/download/1570426776.pdf?context=ProjectUpdatesLog
    • Original Python source repository: https://github.com/nrajasin/Network-intrusion-dataset-creator
