Posts

Loading both Cloud Data Lake and Warehouse

Let's map out how data can flow from the originating Operational Store to both the Data Lake and the Data Warehouse. We have to decide whether the Data Warehouse and Data Lake are peers, or whether one is the gold source and the other is a copy. Web and connected applications have created a data explosion. Cheap storage and nearly unlimited computing power are enabling new use cases like ML and revolutionizing old ones like CRM and CDP. Hive and Hadoop ushered in the age of big data. Data used to live in two places, operational and reporting databases; now data of all types can be collected into a single multi-petabyte Data Lake without expensive custom hardware. Business requirements and regulatory needs should drive your design. The top diagram shows the originating systems loading the Data Lake and the Data Warehouse in parallel as peers. The second diagram shows the Data Lake as the landing place for all data, with some of that data replicated into the Data Warehouse.
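
A minimal sketch of the two topologies, with in-memory lists standing in for the real lake and warehouse services; every name and field here is illustrative, not a specific product API or the diagrams' actual components.

```python
# Sketch only: compare "peers" loading vs "lake as gold source" loading.
operational_batch = [
    {"id": 1, "event": "order_created", "amount": 42.00},
    {"id": 2, "event": "order_shipped", "amount": 42.00},
]

data_lake, data_warehouse = [], []

def conform(rows):
    """Apply the reporting-friendly shape the warehouse expects."""
    return [{"id": r["id"], "event": r["event"].upper(), "amount": round(r["amount"], 2)}
            for r in rows]

def load_as_peers(batch):
    """Pattern 1: the operational store feeds lake and warehouse in parallel."""
    data_lake.extend(batch)                # raw, untyped copy for ML / bulk work
    data_warehouse.extend(conform(batch))  # curated copy for reporting

def load_lake_first(batch):
    """Pattern 2: the lake is the gold source; the warehouse holds a replica."""
    data_lake.extend(batch)
    data_warehouse.extend(conform(data_lake[-len(batch):]))

if __name__ == "__main__":
    load_as_peers(operational_batch)
    print(len(data_lake), "lake rows,", len(data_warehouse), "warehouse rows")
```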

Call Recordings and other Binary Data and Metadata in the Data Lake

Data lakes hold data in any format. This includes structured data, semi-structured text data, documents, and binary data. Organizing that binary data and its metadata can be done in several ways. We're talking about binary data and its associated descriptive metadata. The diagram shows some of the metadata that could be associated with each call recording. The recording itself is highly sensitive because we don't know exactly what was said. The extracted text is also highly sensitive because it is a full text copy with the same risk. Media and binary files add up. We could have millions of call recordings and all of their associated metadata; it is a large data problem. We have to pick the format for the binary, non-rectangular data and its associated metadata. We can keep the native formats and link to them, or embed the binary data inside another format. Here are two of the major options.
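
Here is a small illustration of those two layouts using only the Python standard library; the bucket path, field names, and sample bytes are hypothetical, not the post's actual schema.

```python
import base64
import json

# Stand-in for a real call recording pulled from the telephony system.
recording_bytes = b"RIFF....WAVEfmt "

# Option 1: keep the binary in its native format and store a metadata
# record that links to it by path/URI.
linked_record = {
    "call_id": "c-1001",
    "agent_id": "a-17",
    "duration_seconds": 212,
    "sensitivity": "high",
    "recording_uri": "s3://call-recordings/2021/05/c-1001.wav",  # hypothetical bucket
}

# Option 2: embed the binary payload inside the record itself
# (e.g. a bytes column in Parquet/Avro; base64 shown here for JSON).
embedded_record = {
    "call_id": "c-1001",
    "agent_id": "a-17",
    "duration_seconds": 212,
    "sensitivity": "high",
    "recording": base64.b64encode(recording_bytes).decode("ascii"),
}

print(json.dumps(linked_record, indent=2))
```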

Cloud Data Lake vs Warehouse - fit for purpose

Data Lakes and Data Warehouses each have their own strengths and weaknesses. You may need one or the other, or both, depending on your needs. Look at your use cases to determine whether it makes sense to have one, the other, or both. This post offers some things to think about when making that decision. In my general experience, Data Lakes tend to be the choice when feeding operational systems and when storing binary data. They are often used for massive data transformations or ML feature creation. Sometimes security concerns and partitioning drive highly sensitive data into protected lakes. Data Warehouses tend to be the choice when humans need big data for reporting, data exploration, and collaborative environments. Use cases that put either one in the middle of data flows for operational systems should be evaluated for uptime and latency. Different companies will prioritize differently. I've seen companies that were lake only and companies that had both.

Streaming Ecosystems Still Need Extract and Load

Enterprises move from batch to streaming data ingestion in order to make data available in a more near-real-time manner. This does not remove the need for extract and load capabilities. Streaming systems only operate on data that is in the stream right now. There is no data available from outside the retention window or from before the system was implemented. A whole other set of lifecycle operations still require some type of bulk operation. Examples include (see the sketch after this list):
- Initial data loads where data was collected prior to, or outside of, streaming processing.
- Original event streams that need to be re-ingested because they were mis-processed or because you wish to extract the data differently.
- Original event streams that are fixed or modified and re-ingested in order to correct errors or add information in the operational store.
- Privacy and retention rules that require the generation of synthetic events to make data changes.
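
As a rough illustration of that bulk side, here is a hedged Python sketch that replays archived events back into a stream. The file name, event fields, and the `publish` callable are placeholders for whatever producer (Kafka, Kinesis, etc.) you actually use.

```python
import json
from pathlib import Path

def fix_event(event: dict) -> dict:
    """Example correction applied during replay, e.g. backfilling a field."""
    event.setdefault("source", "bulk-replay")
    return event

def replay_archive(archive_path, publish) -> int:
    """Read archived events and push them back into the stream via `publish`."""
    count = 0
    for line in Path(archive_path).read_text().splitlines():
        if line.strip():
            publish(fix_event(json.loads(line)))
            count += 1
    return count

if __name__ == "__main__":
    # Create a tiny stand-in archive so the sketch runs on its own.
    archive = Path("events-archive.jsonl")
    archive.write_text('{"id": 1, "type": "order_created"}\n'
                       '{"id": 2, "type": "order_shipped"}\n')

    replayed = []  # stand-in for a real producer, e.g. a Kafka send()
    print("re-ingested", replay_archive(archive, replayed.append), "events")
```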

What do we want out of load or performance test?

We use performance tests to verify the raw throughput of individual subsystems and to verify the overall impact a subsystem has on the entire ecosystem. Load tests act as documentation for performance indicators and reinforce performance expectations. They are vital in identifying performance regressions. Load and performance tests are an often overlooked part of the software release lifecycle. Load tests, at their most basic level, stress a system by dropping a lot of work onto it. Sometimes it is a percentage of the expected load, other times it is the expected load, and other times it is a future expected level of load. A failure to test expected near-term load can lead to spectacular public failures. Measurements: your business requirements determine your requirements for throughput and latency, while your financial requirements impact the choices you make towards achieving those goals.
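
To make the throughput and latency measurements concrete, here is a toy harness in Python. Real load tools (JMeter, Gatling, Locust, etc.) do far more; `system_under_test` is just a stand-in for an HTTP call or query.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def system_under_test():
    time.sleep(0.01)  # pretend this is a network call or database query

def run_load(requests: int, concurrency: int) -> dict:
    latencies = []

    def one_call():
        start = time.perf_counter()
        system_under_test()
        latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(requests):
            pool.submit(one_call)
    wall = time.perf_counter() - wall_start

    return {
        "throughput_rps": requests / wall,
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": statistics.quantiles(latencies, n=20)[18] * 1000,
    }

if __name__ == "__main__":
    print(run_load(requests=500, concurrency=20))
```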

Creating Features in Python using sliding windows

The first step to using ML for intrusion detection is the creation of features that can be used in training and detection. I talk in another blog about creating features from statistics bound to sliding windows over packet streams. Here we walk through the steps. The GitHub repository contains Python code that creates features from Wireshark/tshark packet streams. The program accepts live tshark output or tshark streams created from .pcap files. Network Traffic into Sliding Windows: the example program requires Python and Wireshark/tshark. The Python code uses 4 multiprocess tasks, making this essentially a 5-core process. It is 100% CPU bound on a 4-core machine, so I suspect it will run faster on a hex-core machine or above. There was a tshark+3 task version that ran 15% faster while consuming 85% of a 4-core machine. The Python modules/processes communicate via multiprocessing Queues.
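
The repository has more stages than this, but a stripped-down sketch of the pipeline shape, two processes joined by multiprocessing Queues, might look like the following. The packet fields and window size are illustrative, not the repository's exact code.

```python
import multiprocessing as mp

SENTINEL = None

def reader(out_q):
    """Stand-in for the tshark reader; the real program parses tshark output."""
    for i in range(100):
        out_q.put({"ts": i * 0.01, "length": 60 + i % 40})
    out_q.put(SENTINEL)

def windower(in_q, out_q, window_seconds=0.1):
    """Group packets into time buckets and emit per-window statistics."""
    bucket, bucket_start = [], 0.0
    while (pkt := in_q.get()) is not SENTINEL:
        if pkt["ts"] - bucket_start >= window_seconds and bucket:
            out_q.put({"start": bucket_start,
                       "packets": len(bucket),
                       "bytes": sum(p["length"] for p in bucket)})
            bucket, bucket_start = [], pkt["ts"]
        bucket.append(pkt)
    out_q.put(SENTINEL)  # note: the trailing partial window is dropped here

if __name__ == "__main__":
    raw_q, feature_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=reader, args=(raw_q,)),
             mp.Process(target=windower, args=(raw_q, feature_q))]
    for p in procs:
        p.start()
    while (feature := feature_q.get()) is not SENTINEL:
        print(feature)
    for p in procs:
        p.join()
```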

Network Intrusion Features via Sliding Time Windows

Feature creation is one of the first steps towards building machine learning models that apply to network monitoring or other stream-oriented data processes. We massage independent variables into a form that can be used by ML models or other statistical tools. This often involves transforming source data through numerical conversion, bucketing, aggregation, and other techniques. For this project, we'd like to try to train a machine model to detect intrusion events by having it look at network traffic. People sometimes try to directly consume events as inputs, but an individual network packet does not contain enough context to be useful on its own. A sliding time window makes it possible to create features with more context than you would get from a single message. This GitHub repository contains Python code that creates features from Wireshark/tshark packet streams. The program accepts live tshark output or tshark streams created from .pcap files.
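
As a small, self-contained illustration of the sliding-window idea (not the repository's exact feature set), the following uses pandas to give each packet aggregate context from the previous second of traffic.

```python
import pandas as pd

# Toy packet stream: timestamp, source address, and packet length.
packets = pd.DataFrame(
    {
        "ts": pd.to_datetime([0.00, 0.01, 0.02, 0.50, 0.51, 1.20], unit="s"),
        "src": ["10.0.0.5", "10.0.0.5", "10.0.0.9", "10.0.0.5", "10.0.0.9", "10.0.0.5"],
        "length": [60, 1500, 60, 40, 1500, 60],
    }
).set_index("ts")

# A 1-second sliding window gives each packet context about recent traffic.
windowed = packets.rolling("1s")
features = pd.DataFrame(
    {
        "pkt_count_1s": windowed["length"].count(),
        "bytes_1s": windowed["length"].sum(),
        "mean_len_1s": windowed["length"].mean(),
    }
)
print(features)
```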

Visualizing Covid Vaccinations - Python data prep and steps

We want to plot the Covid vaccination rates across different countries world-wide or different states in the USA. We need to create a standardized dataset that is accurate enough for our graphing purposes. The folks at Our World in Data (OWiD) gather that information to create composite datasets. Each independent entity reports data on its own schedule. The composite dataset can be missing entire days of data for some entities, or individual data attributes on some of the days that are actually reported. Let's look at the steps required to create reasonable comparisons and progress graphics. Source Data and Code: the dataset is courtesy of Our World in Data, and the Python code and scripts described here are available on GitHub; video links are at the bottom of this article. Data Consistency: we want time-series data that lets us exactly line up the data for each reporter. The table shows two different countries, C1 and C2, which each report data on their own schedules.
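
A condensed version of that alignment step might look like the following; the numbers are made up and the real scripts work against the full Our World in Data dataset.

```python
import pandas as pd

# Two reporters, each with gaps in their daily reporting.
raw = pd.DataFrame(
    {
        "location": ["C1", "C1", "C1", "C2", "C2"],
        "date": ["2021-03-01", "2021-03-02", "2021-03-04", "2021-03-01", "2021-03-04"],
        "people_vaccinated": [100, 150, 240, 50, 90],
    }
)
raw["date"] = pd.to_datetime(raw["date"])

# Put every reporter on the same daily calendar, then forward-fill the
# cumulative counts so missing reporting days carry the last known value.
all_days = pd.date_range(raw["date"].min(), raw["date"].max(), freq="D")
aligned = (
    raw.pivot(index="date", columns="location", values="people_vaccinated")
    .reindex(all_days)
    .ffill()
)
print(aligned)
```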

Monitor Internet Broadband service with a Raspberry Pi 4 and some Python

You can easily automate capturing broadband connection statistics with some Python code running on a Raspberry Pi, a Mac, or a PC. I used a Raspberry Pi 4 as my test appliance because it is cheap and supports 1 Gb/s Ethernet connections. That means it is fast enough to service most residential or low-end commercial connections. I'm lazy and wanted the data to end up in a secure public cloud that could be populated and viewed from anywhere. We can send our broadband statistics from one or more locations and graph the different locations against each other. Any tool could be used. Monitoring One or Comparing Two: we wanted to compare two different internet providers' service levels. One provider is a FIOS 1 Gb down / 1 Gb up service. The other is a cable service with 1 Gb down / 50 Mb up. The providers and the technology were different, and we wanted to know whether the complaints about one of the providers were valid. Relies on Speedtest.net Infrastructure: we're going to leverage the popular Speedtest.net infrastructure.
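
A minimal capture loop might look like the sketch below. It assumes the speedtest-cli package (pip install speedtest-cli) and only prints results locally; the project described here forwards them to the cloud instead.

```python
import json
import time

import speedtest  # provided by the speedtest-cli package

def measure_once() -> dict:
    st = speedtest.Speedtest()
    st.get_best_server()      # pick the nearest / lowest-latency test server
    st.download()             # measured in bits per second
    st.upload()
    return st.results.dict()  # includes download, upload, ping, server info

if __name__ == "__main__":
    while True:
        result = measure_once()
        print(json.dumps({k: result[k] for k in ("download", "upload", "ping")}))
        time.sleep(15 * 60)   # sample every 15 minutes
```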

Querying Python logs in Azure Application Insights

You can send your Python logs to Azure Application Insights from anywhere and then leverage the Application Insights query and dashboard capabilities to do log analysis. Getting access to the logs is trivial. I wanted to plot basic internet performance information from data generated on two different machines in two different locations. The source code is on GitHub at freemansoft/speedtest-app-insights. That project runs speedtest.net measurements and then posts them to Azure Application Insights. It logs the raw data when the --verbose switch is set, and that verbose output is sent to Azure App Insights. Execution prerequisites:
- You have an Azure login.
- You have created an Azure Application Insights application key: https://docs.microsoft.com/en-us/azure/azure-monitor/app/create-new-resource
- You have pushed data to Application Insights. I used https://github.com/freemansoft/speedtest-app-insights with the --verbose switch.
A video walkthrough is not yet available.
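
One common way to get Python log records into Application Insights is the opencensus-ext-azure log handler; the project likely wires up something along these lines, though the connection string below is a placeholder, not a real key.

```python
import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("speedtest")
logger.setLevel(logging.INFO)
logger.addHandler(
    AzureLogHandler(
        connection_string="InstrumentationKey=00000000-0000-0000-0000-000000000000"
    )
)

# Each record lands in the Application Insights `traces` table.
logger.info("download=%s upload=%s ping=%s", 940_000_000, 880_000_000, 4.2)

# In the "Logs" blade you can then query with Kusto, for example:
#   traces
#   | where message startswith "download="
#   | order by timestamp desc
```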