TeamCity Cloud is a continuous integration solution offered as a service (SaaS). One of its main goals is to provide the virtual machines that execute CI builds; we call them JetBrains-hosted agents. They form a large fleet of machines with different operating systems and hardware, built on AWS EC2. We are fully responsible for provisioning and maintaining JetBrains-hosted agents and the underlying infrastructure.
One key challenge is the provisioning time of a machine. Typically, a user expects a machine to be ready in less than 30 seconds. However, it takes around 2 minutes to boot a Linux machine from scratch and around 5 minutes to boot a Windows one. To keep provisioning times short, we maintain a pool of pre-provisioned (pre-warmed) machines. When a client requests a machine, we hand over a fully bootstrapped instance from this pool, so the client can start executing their task in around 5 seconds.
The size of the pre-warmed pool is not static: our autoscaler adjusts it continuously. On the one hand, it has to keep enough machines in the pool to meet demand. On the other hand, it has to minimize the time machines sit idle, because Amazon bills us for that time as well. To balance the two, the autoscaler uses a set of simple heuristics. Currently, it lets us serve around 85 percent of machine requests from the pre-warmed pool at an affordable price.
However, this ratio is not ideal, and we’re looking for ways to improve it!
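To make the trade-off concrete, here is a minimal sketch of what one such heuristic could look like. It is not our actual autoscaler: the class name, method names, and the sizing rule (cover the peak request rate for the duration of one boot, plus some headroom) are assumptions made purely for illustration.

```java
import java.util.List;

// Illustrative sketch only; all names and the sizing rule are hypothetical.
public class PoolSizingSketch {

    /**
     * Hypothetical heuristic: keep enough warm machines to absorb the peak
     * request rate seen recently for as long as a cold boot takes, plus a
     * relative headroom (0.5 means 50% extra).
     */
    static int targetPoolSize(List<Integer> recentRequestsPerMinute,
                              double bootMinutes,
                              double headroom) {
        int peak = 0;
        for (int requests : recentRequestsPerMinute) {
            peak = Math.max(peak, requests);
        }
        return (int) Math.ceil(peak * bootMinutes * (1.0 + headroom));
    }

    /** Share of machine requests that were served from the pre-warmed pool. */
    static double hitRate(long servedFromPool, long totalRequests) {
        if (totalRequests == 0) {
            return 1.0; // no requests means nobody had to wait
        }
        return (double) servedFromPool / totalRequests;
    }

    public static void main(String[] args) {
        // Assumed inputs: 3, 5 and 4 requests in the last three minutes,
        // a 2-minute boot time (the Linux figure above), 50% headroom.
        int target = targetPoolSize(List.of(3, 5, 4), 2.0, 0.5);
        System.out.println("target pool size: " + target);
        System.out.println("hit rate: " + hitRate(85, 100));
    }
}
```

A larger headroom raises the hit rate but also increases the idle time we pay for. A better strategy would move that balance in both directions at once.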
During this internship project, you will be involved in:
Implementing different strategies to improve the autoscaling performance (heuristics-based, ML-based or leveraging AWS features)
Collecting the relevant metrics to analyze autoscaling performance and customer satisfaction
Improving the evaluation framework
Improving the UX of administering the autoscaler
And, hopefully, a lot of fun!
Required:
knowledge of Kotlin and Java
basic understanding of algorithms and data structures
basic understanding of cloud native applications and cloud services
Would be a plus:
experience with data analytics
understanding of Machine Learning algorithms
familiarity with Site Reliability approaches