Showing posts from October, 2022

Time and Count based Tumbling Windows for Network Packet Statistics

Aggregating and analyzing streaming data is one of the ways people build machine learning datasets. Data is ingested, and records near each other are combined into aggregations. Aggregations have several attributes, or features; you can think of features as columns in a database or spreadsheet, and of aggregations as rows. A dataset is made up of many aggregations, each one representing some subset of the stream data. One of the challenges is picking the right windowing strategy for aggregating or analyzing streaming data. There are a variety of well-known windowing algorithms: tumbling, hopping, sliding, etc. We are using a tumbling window algorithm because of its relative simplicity and low memory usage. Tumbling windows repeat without overlap, and they are either size-limited or time-limited: they contain a maximum amount of data or extend for a maximum amount of time. Time-based windowing: The top row in the pi
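The two tumbling variants can be sketched as follows. This is only a minimal illustration of the non-overlapping window behavior, not the code from the post; the item shapes and window sizes are hypothetical.

```python
from collections import defaultdict

def count_tumbling(items, max_count):
    """Group a stream into non-overlapping windows of at most max_count items."""
    window = []
    for item in items:
        window.append(item)
        if len(window) == max_count:
            yield window
            window = []  # windows never overlap; start fresh
    if window:
        yield window  # emit the final, possibly partial, window

def time_tumbling(timestamped_items, window_seconds):
    """Group (timestamp, item) pairs into fixed, non-overlapping time buckets."""
    buckets = defaultdict(list)
    for ts, item in timestamped_items:
        # Integer division assigns each timestamp to exactly one bucket.
        buckets[int(ts // window_seconds)].append(item)
    return [buckets[k] for k in sorted(buckets)]

# Count-based: windows of 2 packets each.
print(list(count_tumbling(range(5), 2)))          # [[0, 1], [2, 3], [4]]
# Time-based: 1-second windows over packet arrival times.
print(time_tumbling([(0.5, "a"), (1.2, "b"), (2.7, "c")], 1.0))
```

Because each item lands in exactly one window, memory usage stays bounded by the current window rather than by overlapping history, which is the simplicity advantage mentioned above.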

Plan ahead for internal correlation and tracing needs

Inter-system and intra-system tracing capabilities are a must in modern distributed architectures, and in systems where dashboards and triage must work without cracking open the production environment for on-box work. Teams need to understand and verify the lineage of inbound requests that end up in data stores and in outbound calls or notifications to other systems. Lineage and observability NFRs are the requirements that create the need for inter- and intra-team correlation capabilities. Everyone should have tracing and monitoring non-functional requirements (NFRs) that describe their observability needs. Those NFRs should describe how a system must support tracking work from the time it enters until the time it transitions to at-rest or is communicated to other systems. Teams without these NFRs often end up scrambling to provide production metrics and debugging tools during production events. YouTube Video
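One common way to support this kind of lineage tracking is to attach a correlation ID at the system boundary and carry it through every downstream call, log line, and stored record. The sketch below is a generic illustration under that assumption; the header name and function shapes are hypothetical, not the design from the video.

```python
import uuid

def receive_request(headers):
    """Reuse the caller's correlation ID, or mint one at the system boundary."""
    return headers.get("X-Correlation-Id") or str(uuid.uuid4())

def call_downstream(correlation_id, payload):
    """Propagate the same ID on every outbound call or notification."""
    outbound_headers = {"X-Correlation-Id": correlation_id}
    return outbound_headers, payload

# Every log line and persisted record carries the same ID, so one request
# can be traced across systems without on-box production debugging.
cid = receive_request({})  # no inbound ID: a new one is minted here
headers, _ = call_downstream(cid, {"event": "order-created"})
print(headers["X-Correlation-Id"] == cid)  # True
```

The key design point is that the ID is assigned exactly once, at the edge, and then only ever propagated, which is what makes inter-team correlation possible later.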

When the first work item is gigantic and unpredictable - break it down

Quit bundling common or platform work with your user features and use cases. Get better flow management and predictability by breaking out unrelated work. Recognize "I'm in there anyway" as an anti-pattern. Video on YouTube. Speaker's notes to be added.

CSV to Markdown is trivial with Python Pandas

Python Pandas is targeted at data science applications, but it is useful for everyday data conversion. I needed to convert a wide-column sheet of NFRs in a TSV to Markdown for display. It was easy with just 4 lines of code. This code opens the delimited file, fills in all the empty cells with empty strings, and writes out the .md file:

    df = pd.read_csv(args.csvFile, engine="python", sep=args.sep, header=args.header)
    with open(args.mdFile, "w") as md:
        df.fillna("", inplace=True)
        df.to_markdown(buf=md, index=False)

Usage Example. Complete Program with Command Line Arguments:

    __doc__ = """
    This converts a delimited csv file to a markdown table using pandas.
    Run with the -h option to see arguments
    """
    import pandas as pd
    import argparse

    csvFileDefault = "NFRs.tsv"
    mdFileDefault = ""
    headerRowDefau

Append Only Data Patterns - Cloud Key Value Stores

Cloud key/value databases have some interesting features and limitations that can change the way we model our databases. There is a class of key/value stores with native change-feed support in a form that is easy to connect to and operate against. In CQRS, we capture an event stream in a primary store and then materialize it in a query store. An alternative to the CQRS pattern is to create an updated version of a document and then append that updated version to the database. We're going to look at the drivers and patterns for the latter approach. Video Presentation Content. Speaker's notes to be added later. Published 2022/10
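The append-only approach can be sketched against a generic key/value store. The store interface and versioned-key scheme below are hypothetical, meant only to show how each update becomes a new document rather than an in-place write, which is what makes the change feed a complete history.

```python
class AppendOnlyStore:
    """Toy key/value store where every update appends a new document version."""

    def __init__(self):
        self._kv = {}  # stand-in for a cloud key/value store

    def append(self, doc_id, doc, version):
        # The key includes the version, so earlier documents are never
        # overwritten; a change feed would see every append in order.
        self._kv[(doc_id, version)] = doc

    def latest(self, doc_id):
        """Read the current state: the highest version for this document."""
        versions = [v for (d, v) in self._kv if d == doc_id]
        return self._kv[(doc_id, max(versions))]

    def history(self, doc_id):
        """Full lineage of the document, oldest version first."""
        return [self._kv[(doc_id, v)]
                for v in sorted(v for (d, v) in self._kv if d == doc_id)]

store = AppendOnlyStore()
store.append("order-1", {"status": "created"}, version=1)
store.append("order-1", {"status": "shipped"}, version=2)
print(store.latest("order-1"))   # {'status': 'shipped'}
```

Compared with CQRS, there is no separate materialization step: readers query the latest version directly, and the appended versions themselves serve as the event history.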