Could Modern Systems Engineering Have Averted the Starliner Failure?
"Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations." - Conway's Law
Imagine the following: you've been selected to go to the International Space Station for an 8-day mission, and the capsule you're taking is the new Boeing Starliner. Along the way you run into several issues, from helium leaks to thrusters not firing, but you are able to arrive at the ISS. You expect a delay of several weeks, but that turns into a likely 8 months. That's the unfortunate reality for Sunita Williams and Barry Wilmore.
Now, the delay is not entirely Boeing's fault, as some of it is related to the logistics, scheduling, and cost of sending spacecraft up. At a high level, these are the reasons for the delay:
There are a limited number of parking spots, half of them taken by the Russians, who use a different docking port. Of the 4 available to the US, 2 are older ports used only by cargo ships.
Flights need to be scheduled and cannot be done ad hoc, as the cost is significant.
Either Boeing addresses the issues, or the next SpaceX flight, which would otherwise carry cargo and other astronauts up, will need to be used.
If you're thinking this is unique to the aerospace industry, you would be wrong. Systems engineering spans critical industries such as nuclear power plants, healthcare systems, transportation, and automotive, to name a few. These systems are inherently complex, and their design and management are equally challenging. They often require extensive integration of hardware and software, which is a key aspect of systems engineering. Given all that, I think it's worth exploring what happened in this case, how industry trends are affecting businesses, and whether there is a better engineering approach for organizations in this line of work.
What happened?
Systems engineering, aerospace included, is complex because numerous subsystems run at the same time and need to work in conjunction for a successful flight. Given the harsh environment of space, these systems prioritize stringent safety, reliability, and performance requirements.
Helium is used to apply pressure to move fuel in space; however, its atoms are very small, so it leaks easily. While Boeing had 10x the amount of helium needed for the flight, the amount that leaked was significant enough to cause concern. Additionally, 5 of the 28 thrusters had issues during the flight. Given these concerns, NASA put the return flight on hold until further notice. Boeing maintains that the capsule is safe enough to make it back to Earth. To make matters worse, the autopilot function, which could have been used to demonstrate the viability of the capsule without a crew, isn't working.
The SR-71 Blackbird was developed within 6 years; similarly, the Saturn V rocket, which was crucial to the Apollo program, was developed in approximately 6 years. These were skunkworks projects that had the same underlying requirements and were successful. The Boeing program, on the other hand, has had years of setbacks. Why were those programs successful back then, and what has changed?
Industry Trends
If we look at modern systems engineering, the complexity of systems has increased. This is driven by:
Changes in regulations
Systems integration complexities
Need for cross-disciplinary expertise
Modernization of materials manufacturing
Unmanned and autonomous vehicles
Increased cybersecurity risks
Supply chain coordination
Innovation pressures
Market changes have necessitated shifts that simply weren't needed in the past. For instance, regulations such as the National Environmental Policy Act (NEPA) mean NASA needs to lower its environmental impact, which can take the form of anything from fuel efficiency to composite materials and procurement standards. Safety regulations also changed after the Challenger explosion.
Additionally, such systems often involve procurement from multiple vendors for reasons of speed and cost. However, this introduces integration overhead and makes it harder to change off-the-shelf components, as the software IP may reside with the manufacturer. Ford's CEO, Jim Farley, gave an interview about how difficult it is for legacy automakers to implement over-the-air (OTA) updates because of outsourced components. For the Mach-E, approximately 150 modules were outsourced due to cost; the difficult part was integration. Tesla, on the other hand, has the majority of its software insourced, which made it easier to roll out OTA updates. (Source: The Drive)
As times have changed, so has the way teams collaborate. In the past, you might have had project planning, quality assurance, mechanical, electrical, and systems engineering gathered around a desk to deliver. This meant everyone could align from planning and requirements through delivery, which is precisely why skunkworks teams tend to do well with such projects. In the modern era, software has replaced paper for planning, but teams are limited by licenses, and keeping license costs down inadvertently creates silos between teams.
Now you might be wondering: these issues apply to SpaceX as well, so why is it succeeding while Boeing is failing? Let's explore that next.
Modern Engineering for success
It comes down to culture. Boeing is a traditional organization with significant hierarchy, whereas SpaceX is a startup. Boeing has existed for over 100 years, and that translates to bureaucracy adding up over the decades. A startup such as SpaceX doesn't have the luxury of continued failure and needs to get things done quickly. Below we explore 5 aspects of the cultural differences and how they translate to results.
Agile approach to hardware - Traditional aerospace engineering dictates that there is no room for failure, which can put significant pressure on the teams developing a product. It means extensive testing requirements and designs that must anticipate every scenario, which can add a lot of time in these phases alone. An agile approach is more iterative: it's okay to make mistakes earlier in the process so that teams can learn from them and adapt. The complexity introduced by modern requirements means teams need to collaborate, and an agile approach forces them to collaborate towards a demoable product; it's about progress rather than perfection.
The result: Boeing has completed only 2 uncrewed flights and 1 crewed flight, while SpaceX has had 13 crewed flights since 2020.
Feedback loops - Closely linked to agile systems engineering is the need to create feedback systems so the team can adapt quickly. Telemetry data is necessary to provide insights and enable quick adjustments. Looking at testing practices alone, we see stark differences between the two companies.
John Mulholland, vice president and manager of Boeing’s CST-100 Starliner program, stated that the company tested Starliner's software in chunks. For a business that outsources a significant share of its components, integration testing becomes imperative; testing in chunks is simply insufficient. SpaceX, on the other hand, runs full hardware-in-the-loop tests with actual flight hardware, which means it can simulate missions and test various scenarios without leaving the lab. (Source - said in February)
As a result, a NASA investigation team found 3 incidents that caused a mission abort; 2 were critical software issues with Starliner that could have destroyed the spacecraft without Mission Control intervention.
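To illustrate why testing in chunks can miss integration defects, here is a minimal Python sketch. Everything in it is hypothetical (the two "vendor" modules, the function names, and the 1800-second burn time are assumptions made up for illustration, not Starliner or SpaceX code). Each module passes its own isolated tests, yet the system fails once the modules are wired together, because one reports milliseconds and the other expects seconds.

```python
# Hypothetical components for illustration only; not real flight software.


def elapsed_time_ms(launch_ms: int, now_ms: int) -> int:
    """Vendor A's clock module: mission elapsed time in milliseconds."""
    return now_ms - launch_ms


def should_start_burn(elapsed_s: float, burn_at_s: float = 1800.0) -> bool:
    """Vendor B's scheduler: expects mission elapsed time in seconds."""
    return elapsed_s >= burn_at_s


def chunk_tests_pass() -> bool:
    """Test each module in isolation against its own specification."""
    clock_ok = elapsed_time_ms(1_000, 61_000) == 60_000
    scheduler_ok = should_start_burn(1800.0) and not should_start_burn(60.0)
    return clock_ok and scheduler_ok


def naive_integration(launch_ms: int, now_ms: int) -> bool:
    """Wires the modules together without converting units (the defect)."""
    return should_start_burn(elapsed_time_ms(launch_ms, now_ms))


def fixed_integration(launch_ms: int, now_ms: int) -> bool:
    """Same wiring, with milliseconds converted to seconds."""
    return should_start_burn(elapsed_time_ms(launch_ms, now_ms) / 1000.0)


def end_to_end_test(integration) -> bool:
    """Simulated mission: no burn 60 s after launch, burn at 1800 s."""
    launch_ms = 0
    burns_too_early = integration(launch_ms, 60_000)
    burns_on_time = integration(launch_ms, 1_800_000)
    return (not burns_too_early) and burns_on_time


if __name__ == "__main__":
    print("chunk tests pass:", chunk_tests_pass())
    print("end-to-end, naive wiring:", end_to_end_test(naive_integration))
    print("end-to-end, fixed wiring:", end_to_end_test(fixed_integration))
```

Only the end-to-end test, which exercises the modules together across a simulated timeline, exposes the unit mismatch. Hardware-in-the-loop testing applies the same idea with actual flight hardware in place of these stand-in functions.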
Ownership of outsourced delivery - Knowing what to outsource is just as important as managing outsourced vendors. Critical components and key IP should remain insourced so the knowledge is retained. Additionally, outsourcing components doesn't absolve the primary organization of responsibility: you need to ensure the quality of what is delivered, that KPIs are met, and that the teams delivering, both internal and external, are appropriately trained.
A report from Safety and Mission Assurance at NASA states: "The lack of a trained and qualified workforce increases the risk that the contractor will continue to manufacture parts and components that do not adhere to NASA requirements and industry standards."
Hypothesis Driven Development - When hardware failures are critical, it's best to apply a hypothesis-driven development approach, in which organizations apply the scientific method to development, building from testable assumptions and iterating quickly. Requirements can be highlighted across the program and tied to decisions, building institutional knowledge. Data from experiments (rather than traditional requirements) allows teams to validate and shift approaches quickly. In practice, this means continuous integration of code so that it's tested regularly.
As one member of the SpaceX team put it in a Reddit AMA: "The cases are set up such that if we violate any key performance indicators, the case 'fails' and an engineer takes a look."
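A KPI-gated test case of the kind described in that quote could be sketched roughly as follows. This is an assumption-laden illustration, not SpaceX's tooling: the `Kpi` class, the `evaluate_case` function, the telemetry channel names, and the limits are all hypothetical. Each KPI encodes a hypothesis ("this metric stays below this limit"), and a simulated case fails as soon as the recorded telemetry contradicts it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    """A hypothetical key performance indicator with an upper limit."""
    name: str
    limit: float  # maximum allowed value, in the channel's own units


def evaluate_case(telemetry: dict[str, list[float]], kpis: list[Kpi]) -> list[str]:
    """Return a list of KPI violations; an empty list means the case passes."""
    violations = []
    for kpi in kpis:
        samples = telemetry.get(kpi.name)
        if not samples:
            # Missing data is treated as a failure so gaps get reviewed.
            violations.append(f"{kpi.name}: no telemetry recorded")
            continue
        peak = max(samples)
        if peak > kpi.limit:
            violations.append(f"{kpi.name}: peak {peak} exceeds limit {kpi.limit}")
    return violations


if __name__ == "__main__":
    # Made-up channels and limits, for illustration only.
    kpis = [Kpi("helium_leak_rate", 0.5), Kpi("attitude_error_deg", 2.0)]
    telemetry = {
        "helium_leak_rate": [0.1, 0.7],
        "attitude_error_deg": [0.5, 1.2],
    }
    for violation in evaluate_case(telemetry, kpis):
        print("CASE FAILED ->", violation)
```

Running a check like this on every code change is one way to connect continuous integration to the experiment data: a failing case is a falsified hypothesis that an engineer then investigates.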
Process over tooling - Far too often, organizations change their processes to fit the complex software systems they have onboarded; enterprise resource planning tools are a great example. While adjusting business processes to fit the tooling might be tempting (typically because of cost), it can be a major impediment to how teams work. How teams collaborate is more important than conforming to a commercial off-the-shelf product, and getting it wrong can cost significant productivity.
In SpaceX's case, they built a custom ERP, WarpDrive, because other systems weren't meeting the business's needs. While costly in terms of internal engineering hours, it helped the organization get to launch faster.


