Step outside the Happy Path
In software engineering, the term happy path refers to the ideal scenario where everything works as planned. It’s natural to begin system design by focusing on these positive scenarios - they’re simple, straightforward, and align with how most people envision their product functioning.
However, stopping at the happy path is a mistake. It is often enough for business stakeholders who don’t need the technical detail, but it leaves out most of a system’s complexity. Robust systems are defined not by how they behave in perfect conditions but by how well they handle problems and recover from failures.
An architect, technical lead, or principal engineer plays a crucial role in ensuring that all scenarios - not just the happy path - are considered and explained to non-technical stakeholders. Their leadership helps foster a team culture that encourages proactive thinking and the design of resilient systems.
Seeing the Bigger Picture
Experienced engineers - whether architects or senior developers - stand out by having a bird’s-eye view of the systems they build. This means recognizing areas without clear ownership, identifying potential issues, and ensuring product integrity across all teams and components.
In complex systems, particularly distributed ones, overall availability is the product of the availabilities of all components a request depends on (assuming they fail independently), so it is never higher than that of the weakest component. Components like microservices, databases, and messaging systems interact with infrastructure elements such as virtual machines, networks, and storage. A failure in one part, such as a data inconsistency or an unhandled error, can cascade through the rest. Senior engineers are responsible for identifying these risks early and implementing solutions that ensure the system works as a cohesive whole.
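To make the availability arithmetic concrete, here is a minimal sketch. The component names and availability figures are invented for illustration, and the calculation assumes that every component is required and that failures are independent, which real systems only approximate.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def serial_availability(availabilities):
    """Availability of a chain where every component must be up.

    Assumes independent failures, so the individual values multiply.
    """
    return prod(availabilities)


# Hypothetical components and availabilities, for illustration only.
components = {
    "api gateway": 0.999,
    "order service": 0.999,
    "message broker": 0.999,
    "database": 0.9995,
}

overall = serial_availability(components.values())
downtime_hours = (1 - overall) * HOURS_PER_YEAR

print(f"overall availability: {overall:.4%}")
print(f"expected downtime per year: {downtime_hours:.1f} hours")
```

Even though each component in this sketch offers "three nines" or better, the chain as a whole falls below three nines, which is why availability has to be considered across the whole system rather than per component.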
Mapping Out Scenarios
A complete understanding of the system is essential to account for every potential scenario. This is especially critical in brownfield projects, where undocumented processes or unaddressed questions are common.
Start by formalizing the architecture. You don’t need heavy enterprise methodologies like TOGAF; use formats that fit the team and product. Frameworks like arc42 are suitable for complex systems, while simpler projects may only need C4 diagrams supplemented with a few additional visualizations.
Identify all data flows within the system. This will help you uncover gaps and overlooked details. At each step, consider what could go wrong. Create a list of possible events, ensuring it is MECE (mutually exclusive and collectively exhaustive): no two events overlap, and together they cover every outcome. This practice minimizes blind spots and helps ensure that edge cases are not missed.
Prioritizing Issues
Not all potential problems are equally likely or impactful. For instance, network errors occur more frequently than hard drive failures, and both are more common than data corruption caused by solar flares.
Similarly, not all problems are equally critical. Losing a log entry is often acceptable, but errors in customer billing are not. Combining the likelihood and importance of each issue helps prioritize which scenarios to address immediately and which can wait. A good system balances proactive preparation with realistic limitations, as it’s impossible to account for every scenario.
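One simple way to combine likelihood and importance is a score that multiplies the two. The sketch below assumes a 1 to 5 scale for both, and the ratings assigned to each scenario are illustrative guesses rather than measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent); hypothetical scale
    impact: int      # 1 (negligible) to 5 (critical); hypothetical scale

    @property
    def score(self) -> int:
        # Higher score means the scenario should be addressed sooner.
        return self.likelihood * self.impact


def prioritize(scenarios):
    """Return scenarios ordered from highest to lowest score."""
    return sorted(scenarios, key=lambda s: s.score, reverse=True)


# Illustrative ratings only.
scenarios = [
    Scenario("network error drops a log entry", likelihood=5, impact=1),
    Scenario("network error during customer billing", likelihood=5, impact=5),
    Scenario("hard drive failure on a database node", likelihood=3, impact=4),
    Scenario("data corruption from a solar flare", likelihood=1, impact=4),
]

for s in prioritize(scenarios):
    print(f"{s.score:>2}  {s.name}")
```

The exact numbers matter less than the ordering they produce: the list gives the team a shared, explicit basis for deciding what to handle now and what to defer.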
Dealing with the Unknown
Even the most thorough analysis won’t uncover every issue. Some problems only become apparent after they occur. For example, throttling in Kubernetes can disrupt Kafka’s heartbeat mechanism, triggering endless rebalancing. Or a misconfigured high-availability setup for virtual machines might cause PostgreSQL transactions to freeze, leading to application crashes. As you can guess, I only discovered these problems when they occurred.
These “unknown unknowns” require systems to be designed for self-healing. For example, managed cloud Kubernetes platforms (e.g., GKE, AKS, EKS) monitor cluster nodes and automatically replace them if they become unresponsive. This ensures the system recovers, regardless of whether the issue was caused by software bugs, network failures, or even server rack fires.
A system capable of self-healing must have a clear understanding of its current and desired states, along with mechanisms to return to the desired state automatically.
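The idea of comparing current and desired state can be sketched as a small reconciliation function. This is a simplified illustration, not how any particular platform implements it: the service names are made up, state is reduced to replica counts, and the "actions" are just strings. Control loops in Kubernetes controllers follow the same general pattern, running such a comparison repeatedly.

```python
def reconcile(desired, observed):
    """Return the actions needed to move the observed state to the desired one.

    Both arguments map a service name to a replica count.
    """
    actions = []
    for service in sorted(desired.keys() | observed.keys()):
        want = desired.get(service, 0)
        have = observed.get(service, 0)
        if have < want:
            actions.append(f"start {want - have} x {service}")
        elif have > want:
            actions.append(f"stop {have - want} x {service}")
    return actions


# Hypothetical state: two "api" replicas have died, and a leftover job is running.
desired = {"api": 3, "worker": 2}
observed = {"api": 1, "worker": 2, "legacy-job": 1}

for action in reconcile(desired, observed):
    print(action)
```

The function does not need to know why the replicas disappeared. It only looks at the difference between the two states, which is what allows the same mechanism to recover from causes nobody anticipated.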
Communicating with Stakeholders
Building resilient systems is essential but often expensive. Architects and engineers must communicate with decision-makers - especially those controlling budgets - about the value of investing in solutions for potential problems.
CFOs and other executives may argue, “Our customers only pay for the happy path”. While true to an extent, the happy path alone cannot guarantee long-term reliability or customer satisfaction. Effective communication with non-technical stakeholders requires speaking their language: the language of business.
Explaining the Business Impact
Every technical decision has financial implications. Calculate and present these clearly:
- How much will it cost if the system fails to handle non-happy-path scenarios?
- How likely are these scenarios?
- What are the potential losses in revenue, reputation, or customer trust?
- How much will the solution cost to develop and maintain?
- What additional benefits could the solution provide?
For example, a happy path might assume that a request in a distributed system will complete within 30 seconds. However, as the number of services grows or their workload increases, some requests may exceed the timeout, triggering cascading failures. These might require rollback of distributed transactions, and retrying requests could overload the system, potentially causing hours-long downtime.
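To see why retries can overload a system, consider a worst case in which a request passes through several services in a chain, every call times out, and each service retries independently. The sketch below assumes exactly that; the attempt count and chain depth are hypothetical.

```python
def worst_case_calls(attempts_per_layer: int, depth: int) -> int:
    """Calls reaching the deepest service for one user request.

    Worst case: every call fails, and each of the `depth` layers makes
    `attempts_per_layer` attempts (the first try plus retries).
    """
    return attempts_per_layer ** depth


# Hypothetical chain: 4 services deep, each making up to 3 attempts.
amplification = worst_case_calls(attempts_per_layer=3, depth=4)
print(f"one user request can become {amplification} calls to the deepest service")
```

The load multiplies at every layer, so a service that is already too slow receives far more traffic precisely when it can least handle it.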
One solution could be to shift to asynchronous communication, at least for certain processes. This approach requires implementing and maintaining a message broker, adapting applications, and training engineers to work with a new communication paradigm. While time-consuming and resource-intensive, it would eliminate timeout issues, allow virtually unlimited scaling for critical processes, and enhance fault tolerance through delivery guarantees.
Ultimately, the decision is not solely technical. If a company is under financial pressure, immediate profit from user-facing features might take priority. On the other hand, if the company’s strategy is focused on growth, scalability might align well with its goals.
By speaking the language of money, you can make technical proposals resonate with business decision-makers. Conversely, if you can’t justify a solution financially, it might not be beneficial for the company.
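The questions listed above can be turned into a rough back-of-the-envelope calculation. Every figure in this sketch is invented for illustration; in practice the incident frequency and cost per incident are estimates that should be presented with their uncertainty.

```python
def expected_annual_loss(incidents_per_year: float, cost_per_incident: float) -> float:
    """Expected yearly cost of a failure scenario."""
    return incidents_per_year * cost_per_incident


def net_benefit(loss_before: float, loss_after: float,
                build_cost: float, yearly_maintenance: float, years: int) -> float:
    """Losses avoided over the period minus what the solution costs."""
    avoided = (loss_before - loss_after) * years
    spent = build_cost + yearly_maintenance * years
    return avoided - spent


# Hypothetical numbers: timeout-related outages before and after a redesign.
loss_before = expected_annual_loss(incidents_per_year=4, cost_per_incident=25_000)
loss_after = expected_annual_loss(incidents_per_year=0.5, cost_per_incident=25_000)

net = net_benefit(loss_before, loss_after,
                  build_cost=120_000, yearly_maintenance=20_000, years=3)
print(f"net benefit over 3 years: {net:,.0f}")
```

A positive result supports the investment, and a negative one is a signal that the proposal may not be worth it, or that benefits such as scalability need to be quantified and added to the picture.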
Bring Solutions, Not Just Problems
People tend to dislike hearing about problems without solutions, even if they claim otherwise. When discussing issues with non-technical stakeholders, avoid leaving problems “hanging”. Busy individuals already have enough challenges to address without unresolved technical issues added to the mix. Always present a few potential solutions to provide a starting point.
Problem-solving is a highly valued skill in engineering. Often, exploring solutions independently eliminates the need to escalate issues, allowing you to bring fully formed answers to your manager or client. This initiative demonstrates reliability and builds trust, as long as the solutions are well thought out.
The Sooner, the Better
Architectural decisions are costly and difficult to change later. The same applies to “problematic” scenarios - delays in addressing them can make implementation significantly more expensive.
That’s why it’s critical to discuss and resolve such scenarios as early as possible. Of course, this can only happen once you become aware of them. How can you identify these scenarios sooner?
Building an Engineering Culture
In traditional waterfall methodologies, architects spend significant time analyzing potential risks and solutions before writing any code. While thorough, this approach is often inefficient for projects in dynamic contexts.
With agile methodologies, teams begin building solutions quickly, addressing problems as they arise. While this reduces time spent on unnecessary scenarios, it risks overlooking critical details and under-planning architecture.
Regardless of methodology, early identification of problems depends not only on analytical tools but also on fostering a strong team culture.
Open Discussions About Problems
Unlike business communication, internal team discussions should be open and direct about potential issues. Encourage brainstorming and collaborative problem-solving to explore various options.
Team members must feel safe raising concerns without fear of blame. Unfortunately, in some workplaces, the “no good deed goes unpunished” culture prevails, where the person who identifies an issue is burdened with solving it alone. This discourages initiative and erodes both team morale and system quality over time.
Building a Problem-Solving Culture
Identifying problematic scenarios early is only part of the work; the team also needs a culture that values initiative in resolving them. Collaborative brainstorming often leads to solutions that no single engineer would have reached alone.
Avoid siloed thinking where teams focus solely on their responsibilities. A sense of ownership over the broader product helps engineers identify risks in gray areas that might otherwise be overlooked. This approach prevents costly issues and fosters a more adaptable and resilient organization.
Conclusion
While identifying and addressing the full range of potential events is a key responsibility of architects and senior engineers, it is ultimately a team effort. Early detection and proactive solutions benefit the product and the company. Communicating these issues effectively, especially to non-technical stakeholders, requires a focus on financial impact and actionable solutions.
No system can predict every failure, but by addressing the most critical and likely scenarios, you ensure resilience. For everything else, self-healing mechanisms provide a safety net, allowing the system to recover independently. By fostering a culture of open communication and initiative, you enable teams to build systems that thrive in an unpredictable world.