
Predictive alerting for cloud metrics

Description

Cloud service metrics are challenging to model reliably: correlations between metrics and incidents are often weak and short-lived, system behavior can shift abruptly from stable operation to turbulent regimes, and long historical windows do not necessarily improve predictive power. In addition, services adapt to a changing load, causing metric patterns to evolve over time, while metric distributions are often heavy-tailed rather than well-behaved. As a result, selecting an appropriate modeling approach is a central part of this project rather than a purely technical detail.
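The heavy-tailed point has a practical consequence for any anomaly score: statistics such as the mean and standard deviation are themselves inflated by rare spikes, which mutes exactly the anomalies one wants to catch. A minimal sketch (the simulated metric and the specific robust score are illustrative choices, not part of the project specification):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated latency-like metric: a lognormal body with rare large spikes.
metric = rng.lognormal(mean=2.0, sigma=0.4, size=1000)
metric[::100] *= 20  # inject occasional heavy-tailed spikes

# Naive z-score: mean and std are both dragged up by the tail.
naive_z = (metric - metric.mean()) / metric.std()

# Robust z-score: median and MAD barely move under heavy tails.
median = np.median(metric)
mad = np.median(np.abs(metric - median))
robust_z = (metric - median) / (1.4826 * mad)

# The same spike looks far more anomalous under the robust score.
print(naive_z[0], robust_z[0])
```

This is one reason model and preprocessing choices matter here more than in textbook settings: the "well-behaved" assumptions baked into common scores quietly fail on real cloud metrics.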

The goal of this project is to design and implement a predictive alerting system that anticipates incidents in cloud services based on historical metric data. The task combines machine learning and basic DevOps aspects, with emphasis on informed model selection, training, and evaluation.

We will provide historical CloudWatch metrics and existing alert conditions for a set of AWS-based projects. The core objective is to train a model that predicts short-term future behavior of metrics (or the probability of an incident) and to use these predictions to raise alerts before incidents occur.
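As a concrete (and deliberately naive) illustration of "predict short-term behavior, then alert before the threshold is crossed," the sketch below extrapolates a recent linear trend and flags an impending breach. The window values, horizon, and threshold are hypothetical; the project would replace this baseline with a learned time-series model:

```python
import numpy as np

def forecast_next(window: np.ndarray, horizon: int = 5) -> np.ndarray:
    """Naive drift forecast: extrapolate the recent linear trend.

    A stand-in baseline only; a learned forecaster would go here.
    """
    t = np.arange(len(window))
    slope, intercept = np.polyfit(t, window, deg=1)
    future_t = np.arange(len(window), len(window) + horizon)
    return intercept + slope * future_t

def should_alert(window: np.ndarray, threshold: float, horizon: int = 5) -> bool:
    """Alert if any forecast point crosses the alert threshold."""
    return bool(np.any(forecast_next(window, horizon) >= threshold))

# Example: CPU utilisation climbing toward a 90% alert threshold.
window = np.array([60, 63, 67, 70, 74, 78, 81, 85], dtype=float)
print(should_alert(window, threshold=90.0))
```

The point is the shape of the task, not the model: given a window of recent metric values, produce a short-horizon prediction and turn it into a binary alert decision before the existing alert condition would fire.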

On the ML side, this includes selecting an appropriate model architecture for time-series forecasting or incident prediction, preparing training data from raw metrics, training and validating the model, and defining evaluation criteria. On the systems side, a reference implementation may include two AWS Lambda functions: one that periodically (e.g. daily) retrains or updates the model using recent data and stores artifacts in S3, and another that runs frequently (e.g. every minute) to generate predictions and trigger alerts when the predicted risk exceeds a threshold.
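The two-Lambda split above can be sketched as a pair of small functions with a JSON-serialisable model artifact between them. Everything AWS-specific (S3 reads/writes, CloudWatch queries, EventBridge schedules) is stubbed out here, and the trivial mean-plus-band "model" is a placeholder assumption:

```python
import json
from statistics import mean, stdev

def train_model(history: list[float]) -> dict:
    """Daily retraining step: fit a trivial model (mean + k·std band).

    A placeholder for a real forecaster; the artifact is a
    JSON-serialisable dict the inference Lambda can load from S3.
    """
    mu, sigma = mean(history), stdev(history)
    return {"mu": mu, "sigma": sigma, "k": 3.0}

def predict_risk(model: dict, recent: list[float]) -> bool:
    """Per-minute inference step: alert if recent values leave the band."""
    upper = model["mu"] + model["k"] * model["sigma"]
    return any(v > upper for v in recent)

# In AWS, train_model would run in a daily-scheduled Lambda that writes
# json.dumps(model) to S3, and predict_risk in a per-minute Lambda that
# loads the artifact and triggers an alert; both on EventBridge schedules.
artifact = json.dumps(train_model([10.0, 11.0, 9.5, 10.5, 10.0, 9.8, 10.2]))
model = json.loads(artifact)
print(predict_risk(model, [10.1, 14.9]))
```

Keeping the artifact as plain serialisable data decouples the two functions: the inference Lambda never needs the training dependencies, which keeps its cold-start footprint small.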

A successful outcome is a working end-to-end prototype of a predictive alerting system, together with an empirical evaluation demonstrating that the system anticipates a significant fraction of real incidents. Concretely, this may be expressed as achieving, on a held-out evaluation period, a recall of approximately 80% with respect to existing incident-triggering alerts (i.e. the model raises at least one alert before the start of an incident for roughly 80% of incident intervals), while keeping the false-positive rate at a reasonable level. Detection lead time (how early the alert is raised before the incident) and precision-recall trade-offs should be reported and discussed.
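Event-based recall and lead time can be computed with a few lines once alert timestamps and incident intervals are aligned on a common clock. In this sketch the matching rule and the `max_lead` window are assumptions to be refined during the project:

```python
def evaluate_alerts(alert_times: list[float],
                    incidents: list[tuple[float, float]],
                    max_lead: float = 30.0) -> tuple[float, list[float]]:
    """Event-based evaluation sketch (matching window is an assumption).

    An incident (start, end) counts as detected if some alert fires in
    [start - max_lead, start); lead time is start minus the latest such
    alert. Returns (recall, lead_times).
    """
    lead_times = []
    detected = 0
    for start, end in incidents:
        early = [t for t in alert_times if start - max_lead <= t < start]
        if early:
            detected += 1
            lead_times.append(start - max(early))
    recall = detected / len(incidents) if incidents else 0.0
    return recall, lead_times

# Three incidents; alerts precede two of them (times in minutes).
alerts = [95.0, 190.0, 400.0]
incidents = [(100.0, 120.0), (200.0, 210.0), (300.0, 310.0)]
recall, leads = evaluate_alerts(alerts, incidents)
print(recall, leads)
```

Counting per incident interval rather than per metric sample is what makes this metric match the 80% recall target stated above: one early alert per incident suffices, and redundant alerts within the same interval are neither rewarded nor penalised.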

Insights into model choice, failure cases, and possible improvements are an important part of the result.

Requirements

  • Basic proficiency in Python and experience running machine learning experiments.

  • General familiarity with machine learning concepts, in particular supervised learning and time-series data.

  • Introductory understanding of modeling and evaluation for noisy or non-stationary data.

  • Willingness to read technical material and implement proposed approaches in code.

  • Basic familiarity with cloud-based workflows (e.g. AWS concepts such as CloudWatch metrics, Lambda functions, or scheduled jobs), or readiness to learn them during the project.

  • Interest in building ML models that operate under real-world constraints, such as periodic retraining, streaming inference, and alert triggering.

Admission

Internship Projects Summer/Fall 2026

Contact details

internship@jetbrains.com

Preferred internship location

Armenia
Cyprus
Czechia
Germany
Netherlands
Poland
Serbia
Spain
UK

Technologies

Deep learning
Python

Area

DevOps
Machine Learning

Internship timing preferences

Part-time acceptable
Applications by 16.03.2026
Interview by 17.04.2026
Feedback and final results by 22.04.2026