Posts

Showing posts from 2019

What do you know and why do you know it - Lineage for ML

Process and decision repeatability and accountability are requirements for large enterprises and for entities operating in regulated industries. Machine learning decision justification and auditability, and privacy-related data tracking, are two areas pushing organizations to improve the way they track data movement, transformation and usage. This drives the need for Data Lineage tracking and reporting. Organizations have to trade off the ease of creating and capturing data lineage, the amount of data captured and the ease of reporting and auditability. Data lineage includes the data origin, what happens to it and where it moves over time. [1] Data lineage information includes technical metadata involving data transformations. [2] This diagram shows a simple data movement where data originates in one system, is transformed, stored in a database, then transformed again and used by a machine learning model. The resulting calculation is then stored again. The small circles call out what
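As a minimal sketch of the idea, each hop in a dataset's history can be captured as a small lineage record. The dataset names and transformation descriptions below are illustrative, not from the diagram:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageEvent:
    """One hop in a dataset's history: where it came from and what was done."""
    dataset: str
    source: str
    transformation: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Record two hops for a flow like the one above: a cleanup step,
# then an aggregation that feeds a model.
log: list[LineageEvent] = []
log.append(LineageEvent("trades_clean", "trades_raw", "deduplicate + normalize currency"))
log.append(LineageEvent("model_input", "trades_clean", "aggregate to daily totals"))

# Walking the log backwards answers "what do you know and why do you know it".
for event in reversed(log):
    print(f"{event.dataset} <- {event.source}: {event.transformation}")
```

A real system would persist these records in a metadata store rather than an in-memory list, but the audit question stays the same: trace any output back through its transformations to its origin.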

Run applications on a Mac using ASP.NET Core and with MongoDB on Docker

It is easy to develop ASP.NET Core applications on a Mac with VSCode and run them while they rely on Docker-based data services. The first part is easy because of ASP.NET Core's cross-platform compatibility. The second part is very simple even though it sounds impressive. Services like MongoDB expose the same ports whether they are deployed in Docker or installed directly on the operating system. Our application is an ASP.NET Core micro-service example put out by Microsoft. The bookstore REST service API implements basic CRUD operations. The net result is an ASP.NET web service that uses MongoDB as its persistent store. I built this on a Mac because .NET Core is cross-platform and MongoDB runs in a Linux Docker container isolated from the underlying system. You should be able to run this on a PC without modification. We are using Visual Studio Code because it is cross-platform and runs on Mac, PC and Linux. I have run it in the Linux Subsystem on Ch
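A minimal docker-compose sketch shows why the application code doesn't care where MongoDB lives: the container publishes the same port 27017 a native install would use. The service and volume names here are illustrative, not taken from the post:

```yaml
# docker-compose.yml - illustrative sketch
version: "3.8"
services:
  mongo:
    image: mongo:4.2          # any recent MongoDB image behaves the same way here
    ports:
      - "27017:27017"         # same port as a native install, so the app connection string is unchanged
    volumes:
      - mongo-data:/data/db   # named volume keeps data across container restarts
volumes:
  mongo-data:
```

With this running, the ASP.NET Core service connects to `mongodb://localhost:27017` exactly as it would against a locally installed MongoDB.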

Have the team tell you who is important - with an incentive

We wanted our 25-person team to tell us who delivered value to them. The one-time exercise was totally open to being gamed and manipulated, as are all systems. We attempted to limit risk by keeping the stakes low. The gift card experiment Proposal: We had a pile of $10 gift cards targeted to be used for incentives. We gave everyone two cards. They were to keep one as a reward and give away the other as a thank-you to someone else for their help. Process: Every person was given two gift cards. Each person kept one. This meant no one walked away empty-handed. Each person had 7 days to give one away as a thank-you for that person's help during the year. Supervisors and team leads were excluded. I did an informal survey to find out who people gave their thank-you card to. No records were kept. Results: A couple of people kept their give-away cards. This was disappointing but not a surprise. About 1/3 of the earmarked cards went to one of our team'

Quit worrying and love VMs and Containers

Did you ever wake up, look at your development box and wonder "when did that happen?" I've started using Docker for deployed services like databases, message brokers, etc. At the same time, I've been trying to use Kali Linux for hackathons and general security work. Windows gaming, my pathetic mobile efforts and Windows Docker development are done using Windows 10. Windows must run under a hypervisor or as a dual boot. That is how you end up with three hypervisors, three operating systems and two Docker environments on the same machine. The following diagram shows the underlying complexity of all this. Hypervisors in action HyperKit on OS/X: Docker for Mac desktop runs Docker containers inside a HyperKit virtual machine that leverages the Mac OS/X Hypervisor.framework. Docker named drives live inside this virtual machine. VMWare Fusion on OS/X: VMWare Fusion can host Windows and Linux virtual machines. Fusion supports nested hypervisors whic

Data Lake - getting data into the zone

Data lakes exist to store and expose data in its native format without size or format constraints. Cloud data storage makes it possible to store large amounts of data without worrying about costs or data loss. Corporate lakes often store the same data multiple times in transformed or enriched formats, making it easier to use. My last two employers each had over 20 petabytes of data in their lakes. A well-managed lake organizes data based on usage, data quality, data trust levels, governance policies, data sensitivity and information lifecycles. Lake architects can spread their data across horizontal zones for purpose and/or vertical organization zones. The actual zones for purpose vary by industry or company. Zone Based Data Organization This diagram demonstrates a zone structure that might fit a financial services company. It assumes that company generates its own data and receives data from external organizations. Data exists in unstructured, semi-structu
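One common way to make zones concrete is to encode the zone as the first path segment of every object key, so governance and lifecycle policies can be applied per zone. A small sketch, with zone names that are illustrative rather than taken from the diagram:

```python
# Hypothetical zone layout for a lake; the zone names are illustrative.
ZONES = ["raw", "curated", "analytics", "sensitive"]

def lake_path(zone: str, source: str, dataset: str, date: str) -> str:
    """Build a lake object key prefix: zone first, so per-zone policies
    (retention, access control, encryption) attach to a single prefix."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{source}/{dataset}/{date}/"

print(lake_path("raw", "vendor-feeds", "fx-rates", "2019-06-01"))
# raw/vendor-feeds/fx-rates/2019-06-01/
```

The same dataset can then appear under both `raw/` and `curated/` prefixes, matching the multiple-copy pattern described above.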

Data Lakes are not just for squares

Columnar-only lakes are just another warehouse Data Lakes are intended to be a source of truth for their varied data. Some organizations restrict their lake to columnar data, violating one of the main precepts behind Data Lakes. They limit the data lake to large data set transformations or automated analytics. This limiting definition leaves those companies without anywhere to store a significant subset of their total data pool. Data Lakes are not restricted Data lakes hold data in its original format to retain data fidelity. All data sets retain their original structure, data types and raw data format. Some enterprise data lakes make the data more usable by storing the same data in multiple formats: the original format and a more queryable, accessible format. This approach exactly preserves the original data while making it more accessible. Examples of multiple-copy same-data storage include: CSV and other data that is also stored in directly
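A toy sketch of the multiple-copy pattern: the original CSV bytes stay untouched as the source of truth, while a queryable copy is loaded into a database. This example uses SQLite as a stand-in for whatever queryable store a real lake would use, and the column names and data are hypothetical:

```python
import csv
import io
import sqlite3

# The raw CSV is the source of truth and is never modified.
original_csv = "symbol,price\nAAPL,200.1\nMSFT,130.5\n"

# Load a queryable copy into SQLite (a stand-in for a columnar/query store).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (symbol TEXT, price REAL)")
for row in csv.DictReader(io.StringIO(original_csv)):
    conn.execute("INSERT INTO prices VALUES (?, ?)", (row["symbol"], float(row["price"])))

# Queries run against the copy; the original bytes are still intact.
(count,) = conn.execute("SELECT COUNT(*) FROM prices").fetchone()
print(count)  # 2
```

The original string (file, in practice) preserves exact fidelity for audits, while analysts query the loaded copy.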

Machine Intelligence Feature Flow

What is a Feature? A feature is data that has been prepared to be used as input to a machine learning model. The feature can be a data set, a scalar value or an aggregation. It is created by transforming, categorizing or aggregating original source data. Features can be created and used in almost any type of application, and can be calculated a priori or as part of model execution. What is an Enterprise Feature? ML/AI model usage in regulated industries often requires proof of the data lineage used in training the model and in feeding the model in production. The models themselves must often be registered as they are trained and retained for audit purposes. The retained features and retained models can be used later for bias or fraud investigations as part of the normal regulated-industry audit process. An enterprise feature is a feature that meets the regulatory, legal and compliance requirements of regulated industries. Data, and transformation registrations and an
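A tiny sketch of a feature computed a priori: raw transactions are aggregated into a per-customer average and stored for the model to read at inference time. The customer IDs and amounts are hypothetical:

```python
from statistics import mean

# Raw source data: transaction amounts per customer (illustrative values).
transactions = {
    "cust-1": [120.0, 80.0, 100.0],
    "cust-2": [15.0, 25.0],
}

# The feature: average transaction amount per customer, aggregated from
# source data ahead of time rather than during model execution.
avg_txn_amount = {cust: mean(amounts) for cust, amounts in transactions.items()}

print(avg_txn_amount["cust-1"])  # 100.0
```

Making this an "enterprise feature" would additionally mean registering the source data and the aggregation step, so an auditor can later reproduce the value.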

Can federal programs really be Agile when multiple firms are involved?

Transparency is one of the core pillars of the Agile mindset. Transparency exposes issues earlier, making it possible to address them in a move-left fashion. Transparency is critical to the success of Agile and is one of the Agile tenets that is hardest to implement in large enterprises and federal projects. The federal contract/project cycle is designed to use competition to reduce cost and fraud risk. One of the unintended consequences is that the competitive process punishes transparency and rewards those that let their partners fail. Federal projects don't die; they just move to the next phase as part of another bid process. This means contracting companies work on the project for the government while working on securing the next bid round by working for themselves. Federal contracts involving multiple partners and sub-partners punish transparency and encourage companies to let their partners fail to secure better positions in future phases o

No hack required for Linux on Chromebooks with the Termina VM and containers or Virtualbox

Chromebooks have a security model that traditional laptop OS makers are still struggling to broadly implement. Chrome OS (Chromebook) is one of the more secure platforms for web browsing and web applications, restricting users to a limited set of high-level APIs. Power users and developers often belittle the system because they are unable to install and run arbitrary applications. The ChromeOS and Chromium teams have resisted unbridled execution of Linux programs because that would weaken the security profile of Chrome-based devices. ChromeOS/Chromium has now addressed this issue by providing secure sandboxed environments that execute Linux programs highly isolated from the Chrome operating system. Chromebooks, including CloudReady devices, now support isolated Crostini Linux containers with only a single preference setting. Crostini Linux runs in a sandboxed Linux container inside a Linux VM. Programs running inside Crostini are heavily isolated from the core C