Differences between ML Build and Train vs Production

August 02, 2020

Machine Learning Build and Train and Production Execution often use different controls, management, runtime platforms, and languages. Model invocation and feature conversion techniques differ across the exploration, training, and production execution phases.

The Machine Learning process is often referred to as Build and Train. This is where data scientists and data analysts attempt to understand the true inputs to their decision-making process. They manually manipulate data into forms that can be fed to Machine Learning models for training. Those models are then analyzed for predictive behavior, and the whole process repeats until a target model is created.

Production inputs (Features) must be transformed into the same form they had during Build and Train. This means that the production system needs to run the same data transformations that were done during Build and Train. Production feature generation and model invocation are created and executed more rigorously than they are in the Build and Train environment. Similar operations often use different tooling that is better adapted to the controls and governance of that phase.
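
One way to keep the two phases in step is to route both through a single conversion function. The sketch below is only an illustration of that idea: the field names (balance, age) and the transforms themselves are hypothetical, and real environments often cannot share code this directly because the tooling differs.

```python
import math
from typing import Dict, List


def to_features(raw: Dict[str, float]) -> List[float]:
    """Single source of truth for the raw-record to feature-vector conversion."""
    # Hypothetical transforms: log-scale the balance, bucket the age by decade.
    return [
        math.log1p(raw["balance"]),
        float(int(raw["age"]) // 10),
    ]


def build_training_matrix(rows: List[Dict[str, float]]) -> List[List[float]]:
    """Build and Train: convert many historical records at once."""
    return [to_features(row) for row in rows]


def features_for_request(request: Dict[str, float]) -> List[float]:
    """Production: convert one incoming record just before model invocation."""
    return to_features(request)
```

Because both paths call to_features, a record seen in production is converted exactly as it would have been during training.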

Governance

Transforms and models are often governance artifacts. They need to be saved and reviewed and bound to lineage tracking if used within regulated environments.
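
As a rough illustration of what binding an artifact to lineage tracking can look like, the sketch below fingerprints a serialized artifact and records which upstream artifacts it depends on. The record layout, the artifact names, and the lineage_record helper are all assumptions made for this example, not a standard or a specific product's format.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List


def lineage_record(artifact: bytes, name: str, version: str, parents: List[str]) -> dict:
    """Describe a saved transform or model so a reviewer can trace it later."""
    return {
        "name": name,
        "version": version,
        # The content hash ties the reviewed artifact to the exact bytes deployed.
        "sha256": hashlib.sha256(artifact).hexdigest(),
        # Hypothetical identifiers of the artifacts this one was built from.
        "parents": list(parents),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = lineage_record(
    b"serialized-model-bytes",  # stand-in for a real serialized model
    name="example_model",
    version="1.0.0",
    parents=["example_transform:1.0.0"],
)
print(json.dumps(record, indent=2))
```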

Video

Video version of this blog. It is included here in the middle of the flow to make it easier to find :-)

The Same but Not The Same

The two pipelines look the same but they have very different controls and tooling.

The exploratory environment may use Jupyter Notebooks, direct SQL, and a user-accessible compute environment. Model training may execute dozens or thousands of operations against the same Features. Models are tested and results gathered.
The production environment has code quality standards, specific transformation platforms, and deployable-only computing environments. The model executes once for each set of inputs.

Batch or Activity/Event Driven

Models are initially trained in batch mode. Model execution happens as part of Model Testing. That is generally also done in batch mode as a post-training step. Features are calculated en masse and then applied to the model as part of a training cycle.

Production models can be executed in batch, near-time, or real-time. This means they may be deployed as batch tasks or API endpoints or other methods.

Production features may be generated via batch or as needed. Features used only in batch model execution are often created via batch. Features used in API model invocation are often a mix of batch and real-time generated. Reference, account, demographic, and other slow-changing data may be converted to Features via batch. Clickstream, user action, event-driven Features may be created as part of the model invocation.
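
The mix of batch and real-time Features described above might be assembled at invocation time roughly as follows. This is a sketch under assumptions: the in-memory BATCH_FEATURES dictionary stands in for whatever store a batch job would populate, and the feature names and event shape are hypothetical.

```python
from typing import Dict, List

# Hypothetical store of slow-changing Features, refreshed by a batch job.
BATCH_FEATURES: Dict[str, Dict[str, float]] = {
    "cust-1": {"account_age_days": 730.0, "avg_monthly_spend": 212.5},
}


def realtime_features(events: List[dict]) -> Dict[str, float]:
    """Features derived from the events that arrive with the request."""
    clicks = sum(1 for event in events if event.get("type") == "click")
    return {
        "session_events": float(len(events)),
        "session_clicks": float(clicks),
    }


def assemble_features(customer_id: str, events: List[dict]) -> Dict[str, float]:
    """Merge precomputed batch Features with Features computed at invocation."""
    batch = BATCH_FEATURES.get(customer_id)
    if batch is None:
        raise KeyError(f"no batch features for {customer_id}")
    return {**batch, **realtime_features(events)}
```

A call such as assemble_features("cust-1", [{"type": "click"}, {"type": "view"}]) returns one dictionary holding both kinds of Features, ready to pass to the model.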

Build and Train vs Traditional SDLC

Build and Train and Production execution with retraining are the same but different. Their forcing functions are different.

Data Exploration needs freewheeling access to data and the ability to store intermediate results and work on them later. It needs flexible, secure compute that can change without extensive planning.
Software engineering is about control and repeatability. It runs in constrained environments where compliance can be as important as results.

The open question is how we take something built by hand and approved by Model Risk and convert it into a repeatable process when the tools are completely different.

Exploratory work often involves Notebook technology and many iterative passes at the data. Production automation needs the transforms in its own format and needs to know all the transforms required, from beginning to end, in order to re-create the feature without human intervention.
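
One way to capture "all the transforms, from beginning to end" is to declare them as an explicit, ordered list of named steps rather than leaving them spread across notebook cells. The sketch below assumes two hypothetical steps (fill_missing_income and add_debt_ratio) purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, float]
Step = Tuple[str, Callable[[Record], Record]]


def fill_missing_income(record: Record) -> Record:
    # Hypothetical step: default a missing income to 0.0.
    return {**record, "income": record.get("income", 0.0)}


def add_debt_ratio(record: Record) -> Record:
    # Hypothetical step: derive a ratio, guarding against division by zero.
    income = record["income"]
    ratio = record["debt"] / income if income else 0.0
    return {**record, "debt_ratio": ratio}


# The complete, ordered list of transforms needed to re-create the feature.
PIPELINE: List[Step] = [
    ("fill_missing_income", fill_missing_income),
    ("add_debt_ratio", add_debt_ratio),
]


def run_pipeline(record: Record, steps: List[Step] = PIPELINE) -> Record:
    """Apply every step in order, with no human intervention."""
    for _name, step in steps:
        record = step(record)
    return record
```

Keeping the step names alongside the functions also gives reviewers a readable list of what is applied and in what order.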

Training Data - Production and Incidental PII

Models can only be trained with production or very production-like data. In most companies or organizations this means the data scientists and the build system must have access to production data, which impacts data governance and access controls.

Incremental and automated model retraining systems must also have access to current production data. This means automated ML training runs within the production data scope.

PII is not generally required in training data. There are instances, though, where PII may accidentally or incidentally be present. Customer call recordings are a good example, where arbitrary PII may exist in the call logs. This means Machine Learning model training must often operate within the most sensitive data zones.

Production or Off-line Model Retraining

Models have to be retrained as data changes. This can be done manually or via automation. There is a certain amount of complexity and governance in this.

Should this be data-triggered or time-triggered?
If data-triggered, how does that call back into non-production Build/Train or CI/CD?
etc.
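
The difference between the two trigger styles can be sketched as two small checks. This is a simplified illustration, not a recommended policy: the 30-day maximum age, the 20 percent tolerance, and the use of a shift in the mean as the data signal are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


def time_triggered(last_trained: datetime, now: datetime,
                   max_age: timedelta = timedelta(days=30)) -> bool:
    """Retrain once the model is older than max_age (assumed to be 30 days)."""
    return now - last_trained >= max_age


def data_triggered(training_values: Sequence[float],
                   recent_values: Sequence[float],
                   tolerance: float = 0.2) -> bool:
    """Retrain when the recent mean drifts from the training mean by more
    than the relative tolerance (assumed to be 20 percent)."""
    baseline = mean(training_values)
    recent = mean(recent_values)
    if baseline == 0:
        return recent != 0
    return abs(recent - baseline) / abs(baseline) > tolerance


def should_retrain(last_trained: datetime, now: datetime,
                   training_values: Sequence[float],
                   recent_values: Sequence[float]) -> bool:
    """Either trigger is enough to request a retraining run."""
    return (time_triggered(last_trained, now)
            or data_triggered(training_values, recent_values))
```

The sketch only decides whether to retrain. How that decision reaches a Build/Train or CI/CD system is the open question raised above.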

Joe Freeman's Blog