Sunday, September 24, 2017

Black Swan IT Projects: The Loan Servicing mainframe replacement


This post briefly discusses the "Mainframe Servicing System Migration", a project that should be considered a Black Swan.

A Black Swan Event  
The black swan theory or theory of black swan events is a metaphor that describes an event that comes as a surprise, has a major effect, and is often inappropriately rationalized after the fact with the benefit of hindsight. The term is based on an ancient saying which presumed black swans did not exist, but the saying was rewritten after black swans were discovered in the wild. 
The Fannie Mae loan processing servicing system replacement was
  1. Initially budgeted for 18 months and $75M. 
  2. Eventually took about 72 months and cost > $800M.
The project turned out to be a black swan that could have bankrupted other less stable companies.
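
To put those two figures in perspective, here is a minimal sketch of the overrun arithmetic. It uses only the numbers above; the function name `overrun_pct` is my own, and since the final cost is only known to be "more than" $800M, the cost result is a lower bound.

```python
def overrun_pct(planned: float, actual: float) -> float:
    """Return how far actual exceeds planned, as a percentage of planned."""
    return (actual - planned) / planned * 100.0


# Figures from the post: 18 months / $75M planned, ~72 months / > $800M actual.
planned_months, actual_months = 18, 72
planned_cost_m, actual_cost_m = 75, 800  # actual cost is a lower bound (> $800M)

schedule_overrun = overrun_pct(planned_months, actual_months)
cost_overrun = overrun_pct(planned_cost_m, actual_cost_m)

print(f"Schedule overrun: {schedule_overrun:.0f}%")
print(f"Cost overrun: at least {cost_overrun:.0f}%")
```

In other words, the project ran roughly 300% over its schedule and more than 900% over its budget.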

In the Mid 2000s

Fannie Mae closed out either Q3 or Q4 of one of those years with a recorded profit of $1B. This was a "peak profit" period that stood out. They decided to push some of this excess cash into IT modernization, targeting the Mainframe Mortgage Servicing system, which handled all inbound Fannie Mae mortgage payments from servicers and outbound servicing payments to servicers and Collateralized Mortgage Securities.

Fannie Mae announced that they were going to rewrite all mainframe applications and eliminate all mainframe positions in 18 months through a total replacement of all mainframe systems. Half of the staff were told they would be allowed to re-interview for new positions at the end of the 18 months. The other half were told they would be laid off at that time and receive termination packages. All of those positions and the systems they supported still existed 6 years later.

Software Project Complexity

There is a lot of data that shows the complexity and difficulty of estimation and construction in software projects. Success rates drop dramatically with increases in size and scope, until a point at which almost all projects of a certain size are considered failures when measured against cost, time and the needs of the business. The best known source is probably the Standish Group's CHAOS Report. You can find publicly accessible versions in various places. Gartner has its own versions of these types of reports that back up the notion of high software project failure rates. You can also find competing claims that there is no software crisis. My personal experience suggests the opposite: the Standish Group may actually be optimists.

Mortgage Servicing System

Fannie Mae and Freddie Mac owned approximately 40% of the US Home Mortgage market around this time. The Fannie Mae mainframe systems handled loan acquisition from the sales side, loan servicing on the payment side and other financial operational processes. The Fannie Mae servicing system was essentially run and managed by a core team of about 6-10 people. The two senior team members were near or past retirement age. Fannie Mae was essentially living on borrowed time.

The Project Announcement

Fannie Mae executives announced that they were going to replace the mainframe systems in 18 months at a cost of $75M, with work starting within a couple of months. They declared that they were moving folks from other projects and bidding out the core work to a major integrator.

Phases

Project Initiation

A large (3 letter) computer company/integrator won the project. Their winning bid included a quick staff ramp-up that targeted bringing 400 people "on site" to work on this project. Note that this project was targeted at 18 months including the spin-up phase.

SOA custom transactional application

The first attempt was an individual-transaction-oriented system based on an asynchronous Service Oriented Architecture. It was a completely "modern" approach for that time period.

The initial system was started with partial requirements.  It had up to 400 new contractors, many of whom had no enterprise experience in the tool suite.  The team picked an architecture that no one on the team had ever built to that scale before.  A data center was populated based on the assumption that the system would go live within the project window. The system was complicated by the fact that it was replacing the heart of the entire servicing operation while not impacting current operations.  

Phase 1 ran for 2+ years and was a complete failure when judged on cost, features and meeting the business needs. Many of the hardware leases expired before any production software was ever installed. The $100M+ spent was not a complete waste in that the time was used to shake out the requirements and get a better feel for the operational needs. I'd guess Fannie Mae had already spent over $200M at this point on a $75M project.

ETL 

The second attempt was basically an ETL-based system, with large data import/export operations bookending batch-type processing. Think of this as a slightly more mainframe-like approach. Staff was reorganized. Management was shuffled. New technology was brought in.

Phase 2 ran for 2+ years and was pretty much a failure based on the criteria of cost, schedule and features.  The project was > 100% over time and > 400% over budget at this point.

Shrink the Project.  Shove it all in the Database

The teams realized, by the third attempt, that moving massive amounts of data in and out of relational databases was too slow for their operations.  Mainframes were optimized for this type of work.  The project went with an approach where all the data stayed in the database.  They also reduced the scope of the project to certain transaction types,  reducing the amount of data to less than half of the previous system attempts.

Phase 3 ran for several years and "went to production".

Final Analysis

Is a project successful if it makes it to production? Let us standardize on a PMI definition for success.
"Achieving project objectives within schedule and within budget, to satisfy the stakeholder and deliver business value"
Some will adjust the definition so that a project counts as successful simply because it made it to production. This is a false definition that makes it impossible to measure progress and compare projects.

I would normally say that managing scope on a project to keep it under 18 months would be a way of increasing the odds of success (cost, time and business value). This is only true if you adjust one of the axes, cost, time or features, to make the project fit. Projects fail when they are too large or when they are force-fitted into political realities.

This particular project was like many others.  It would eventually "go live" making it a success at some level.  It was a failure in the sense that the entire team turned over several times, that it was over budget, over time and had a smaller feature set than envisioned.


Disclaimer: This represents my personal observations and does not reflect the opinions of anyone working for Fannie Mae or any of its contracted employees or companies during that time.
