Kubernetes-as-a-Service with High Availability

Kubernetes is the obvious solution for building a cloud-native infrastructure. The client, a cloud provider, wanted a managed Kubernetes service that would give its customers a high level of quality and compete with other providers' offerings. An MVP of the service already existed, but it had been built in a hurry and lacked the characteristics required of a production service.

To provide a competitive level of service, a provider needs to develop features and fix bugs quickly, deliver superior availability and reliability of both the service and its infrastructure, and make product and customer support fast and efficient.

The existing MVP did not meet the goals set for the product and did not allow for fast development. Therefore, Ilya developed a plan for a gradual transition to a new cloud-native product architecture based on Domain-Driven Design. The applications were adapted to run in containers as multiple simultaneous replicas, which solved the scaling problem that arose when the number of clients increased dramatically, e.g. during exhibitions and conferences.

Cloud services give customers guarantees of high availability, which often distinguishes them from on-premise services. The SLA target was 99.99% availability: the service had to be available and working correctly 99.99% of the time. This is a complex technical task, and solving it required several complementary approaches.
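To make the 99.99% target concrete, it helps to convert it into a downtime budget. The short calculation below is only an illustration: the function name and the 30-day and 365-day periods are assumptions chosen for the example, not terms taken from the actual SLA.

```python
def downtime_budget_minutes(availability: float, period_days: float) -> float:
    """Return the maximum downtime, in minutes, allowed over the period."""
    minutes_in_period = period_days * 24 * 60
    return (1.0 - availability) * minutes_in_period


# Illustrative periods; the real SLA measurement window is not stated.
monthly_budget = downtime_budget_minutes(0.9999, 30)
yearly_budget = downtime_budget_minutes(0.9999, 365)

print(f"30 days:  {monthly_budget:.2f} minutes of downtime allowed")
print(f"365 days: {yearly_budget:.2f} minutes of downtime allowed")
```

Under these assumptions the budget is roughly 4.3 minutes per 30 days, or about 52.6 minutes per year, which leaves very little room for manual recovery and explains why the failure handling described in the following paragraphs had to be automatic.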

Kubernetes cluster management operations are long-running: a single operation can take up to 15 minutes and involve a dozen services at different infrastructure levels. Any one of these services can crash or lose network connectivity, thereby interrupting the operation. To address this, communication between services was redesigned around CQRS, with asynchronous commands passed between services via a message broker. The system can now withstand a crash or temporary unavailability of any of the services, and new versions can be deployed even while a client cluster is being created, without disrupting the operation.
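The resilience described above can be sketched in a few lines. This is a minimal illustration and not the product's actual code: the standard-library `queue.Queue` stands in for a real message broker, and `CreateCluster`, `ClusterService`, and `run_consumer` are hypothetical names. It assumes the broker redelivers a command whose handler failed, and that handlers are idempotent, so a redelivered or duplicated command is applied only once.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateCluster:
    """Hypothetical command; command_id lets a handler detect redelivery."""
    command_id: str
    cluster_name: str


class ClusterService:
    """Hypothetical service that consumes commands from the broker."""

    def __init__(self):
        self.processed_ids = set()  # ids of commands already applied
        self.clusters = []          # stand-in for durable state
        self.crash_next = True      # simulate exactly one crash

    def handle(self, command):
        if command.command_id in self.processed_ids:
            return  # duplicate delivery: already applied, nothing to do
        if self.crash_next:
            self.crash_next = False
            raise RuntimeError("simulated crash before the work was done")
        self.clusters.append(command.cluster_name)
        self.processed_ids.add(command.command_id)


def run_consumer(broker, service):
    """Drain the queue; a failed command goes back for redelivery."""
    while not broker.empty():
        command = broker.get()
        try:
            service.handle(command)
        except RuntimeError:
            broker.put(command)  # not acknowledged, so it is redelivered


broker = queue.Queue()  # in-memory stand-in for a message broker
command = CreateCluster(command_id="cmd-1", cluster_name="demo")
broker.put(command)
broker.put(command)  # the same command delivered twice

service = ClusterService()
run_consumer(broker, service)
print(service.clusters)
```

Because the command stays in the queue until a handler finishes with it, the service can crash or be redeployed in the middle of an operation, and the work resumes once a healthy instance picks the command up again.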

Any system can fail, and achieving high availability means being prepared for failures. For this purpose, the services were shaped as cloud-native: they run in containers with guaranteed startup and shutdown times, can restart quickly without data loss, and behave properly in case of sudden crashes. This made it possible to guarantee the execution of client operations with minimal risk of data loss.

The provider has many customers and is growing rapidly, so effective resource management is key to the product's financial success. For the product and its entire environment, an infrastructure was developed that can survive major disasters, including data center outages. It includes a Kubernetes cluster to run the services, providing efficient resource utilization and extensive scalability. Databases and message brokers run as clusters capable of surviving a major disruption. Data is replicated to an independent data center for disaster recovery.
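One ingredient of the bounded shutdown time mentioned above is reacting to the termination signal. Kubernetes sends SIGTERM to a container before stopping it and kills the process only after a grace period, so a service can finish the task in hand and leave the rest for another replica. The sketch below is an assumption about how such a worker might look; `Worker` and `install_signal_handlers` are hypothetical names, not the product's code.

```python
import signal
import threading


class Worker:
    """Hypothetical worker that stops between tasks, never mid-task."""

    def __init__(self, tasks):
        self.pending = list(tasks)
        self.done = []
        self._stop = threading.Event()

    def request_stop(self, signum=None, frame=None):
        """Usable as a signal handler: only sets a flag, does no real work."""
        self._stop.set()

    def run(self, process):
        while self.pending and not self._stop.is_set():
            task = self.pending[0]
            process(task)        # finish the task in hand
            self.pending.pop(0)  # remove it only after it has completed
            self.done.append(task)


def install_signal_handlers(worker):
    """Call from the main thread of a real service at startup."""
    signal.signal(signal.SIGTERM, worker.request_stop)
    signal.signal(signal.SIGINT, worker.request_stop)
```

A task is removed from `pending` only after `process` returns, so a stop request that arrives in the middle of a task neither loses it nor leaves it half-recorded; the remaining tasks stay available to whichever instance starts next.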

A successful product requires not only giving the customer access, but also ensuring consistent quality throughout the use of the product. At the MVP stage, the product offered no way to find and fix bugs quickly, and troubleshooting could take hours. Ilya implemented an end-to-end observability system linking distributed tracing, logging, and error monitoring, which cut the time needed to locate errors by 80%, bringing it down to minutes. A smart administration system was also implemented, allowing the help desk to solve in a few clicks problems that previously required hours of manual work.
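Linking logs to traces usually comes down to stamping every log line with the identifier of the request being processed, so that one search returns the whole story of a failed operation. The sketch below shows the idea with Python's standard `logging` and `contextvars` modules; the logger name, log messages, and the `abc123` identifier are invented for the example, and the tooling actually used in the product is not stated in this text.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace id of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace id to every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("cluster-api-sketch")  # hypothetical name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, trace_id=None):
    """Every line logged inside this call carries the same trace id."""
    token = trace_id_var.set(trace_id or uuid.uuid4().hex)
    try:
        logger.info("create cluster requested")
        logger.error("node pool provisioning failed")
    finally:
        trace_id_var.reset(token)


stream = io.StringIO()
handle_request(build_logger(stream), trace_id="abc123")
log_lines = stream.getvalue().splitlines()
print(log_lines)
```

With the identifier present on every line, an engineer who sees an error in monitoring can pull up all related log entries with a single query instead of correlating timestamps across services by hand.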

These decisions, together with Ilya's competent management of the development and integration process, allowed the product to enter the market in a short time and gain popularity among customers. This quickly put the product on par with its competitors and attracted resources for the further development of the company.