Geo-distributed Cloud Provider Infrastructure
The main assets of a cloud provider are managed services (IaaS, PaaS, FaaS, etc.) and a robust infrastructure that is resilient to both short-term problems and large-scale disasters.
While service development follows common processes, designing and implementing cloud infrastructure requires expertise in many related areas: data center equipment, resilient network infrastructure, redundancy to handle large-scale failures, multi-layered security, and compliance with standards and government regulations.
Building a Robust Platform
The customer has an existing network of data centers in several regions, so Ilya designed a cloud provider architecture on top of it that meets these goals. The main achievement is the fault tolerance and disaster tolerance of the geo-distributed platform. To support rapid growth of the customer base, resource scaling was implemented in each region. A fast-growing provider also needs to quickly establish access to services in new regions, which was achieved by making the regional infrastructure repeatable.
“A disaster-tolerant infrastructure must continue to work even if one of the data centers is completely destroyed.”
The infrastructure for disaster tolerance is deployed across three or more data centers. The computing load is distributed evenly among them via two Kubernetes clusters: one for applications and one for instrumentation. A resource capacity monitoring system blocks the creation of new services when a usage threshold is reached, so that enough resources remain available to reallocate processes and data from an inaccessible data center during disaster recovery. Software-defined networking allows network policies to be configured easily across the consolidated network of multiple data centers. For security, a set of network addressing rules and practices was implemented that ensures complete isolation between clients and follows current attack-mitigation practices. This allowed the provider to pass several security audits and fend off a serious attack.
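The capacity gate described above can be sketched in a few lines. This is a minimal illustration with hypothetical function names; the source does not specify the actual threshold logic, only that new services are blocked once usage crosses a limit that preserves failover headroom.

```python
# Hypothetical sketch of a capacity-threshold gate: new services are
# blocked once regional resource usage crosses a threshold, reserving
# headroom to absorb the load of a failed data center.
#
# With N data centers sharing load evenly, losing one means the
# surviving N-1 must absorb its share, so usage must stay below (N-1)/N.

def max_safe_usage(num_datacenters: int) -> float:
    """Usage ceiling that leaves room to absorb one data center's load."""
    if num_datacenters < 2:
        raise ValueError("need at least two data centers for failover")
    return (num_datacenters - 1) / num_datacenters

def allow_new_service(current_usage: float, num_datacenters: int) -> bool:
    """Return True when provisioning a new service is still safe."""
    return current_usage < max_safe_usage(num_datacenters)

# With three data centers the ceiling is roughly 66.7% utilization:
print(allow_new_service(0.5, 3))  # True  (headroom remains)
print(allow_new_service(0.7, 3))  # False (failover headroom exhausted)
```

This also explains why three or more data centers are used: with only two, half of all capacity would have to sit idle as failover reserve.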
Designing a Flexible Infrastructure
The pay-as-you-go model of most services requires the platform to be prepared for uneven load and a sudden increase in the number of clients. This is difficult if commissioning new servers takes a long time. Ilya's design allows new servers to be introduced rapidly and the overall resource pool to be scaled easily. Services run in a distributed Kubernetes cluster, and stateful applications run either in distributed replica mode (PostgreSQL) or as part of a distributed cluster (MongoDB, Kafka). As a result, a region's resource pool can grow quickly: connecting new servers takes no more than a day, and setting up a data center takes less than five days.
“Fast entry into new markets is essential in the highly competitive cloud services business.”
Entering new markets requires allocating new resources and keeping network latency between the data center and the customer low. However, the capacity of existing regions is limited, and the network path between a server and a remote client is often unstable and slow. To cope with this, the provider launches data centers in the new region. But commissioning a new data center is a complex task that can take months of work.
To speed up this process, the infrastructure is described as code using Terraform. This reduced the time needed to configure and commission a new data center to one week.
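The source does not show the actual Terraform configuration. A minimal hypothetical sketch of the "repeatable region" idea is a reusable module instantiated once per region; the module path, name, and variables below are illustrative only:

```hcl
# Hypothetical sketch: each region is a parameterized instance of the
# same Terraform module, so commissioning a new data center means
# writing one new module block rather than months of manual setup.

module "region_eu_west" {
  source = "./modules/region"   # illustrative module path

  region_name      = "eu-west"
  datacenter_count = 3          # disaster tolerance needs three or more
  node_pool_size   = 48         # illustrative initial server pool
}
```

The design choice here is that the module, not the calling code, encodes the regional topology, so every region stays structurally identical and new ones inherit the tested layout automatically.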
These actions put in place a geo-distributed infrastructure that meets all security and scalability requirements and provides 99.99% availability within each region.
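The 99.99% figure implies a concrete downtime budget, which a quick calculation makes tangible:

```python
# Convert an availability target into an annual downtime budget.

def downtime_budget_minutes(availability: float, days: int = 365) -> float:
    """Minutes of allowed downtime per period at a given availability."""
    return (1 - availability) * days * 24 * 60

print(downtime_budget_minutes(0.9999))  # about 52.6 minutes per year
```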