Great models are built on great evaluation. At JetBrains AI, evaluation is a first-class research problem: we need to understand not only whether a model performs well on standard coding benchmarks, but also whether it succeeds in realistic, multi-step, agentic developer workflows.
In this internship project, you will help design and build the evaluation stack for coding LLMs across the full pipeline — from pretraining and post-training evaluation to multi-turn, agentic assessments. This includes implementing existing benchmarks, creating new ones where current evaluation is insufficient, and building infrastructure to measure model behavior in settings that look much more like real software development.
The scope is intentionally ambitious. There is a lot to do: better coding benchmarks, better regression tracking, more realistic multi-turn tasks, infrastructure for automated evaluation, and new ways of measuring capabilities that matter for developers. You will work at the intersection of LLM research, benchmarking, and systems, helping define how we decide whether a new model is actually better.
This project is ideal for someone interested in one of the most important open problems in modern ML: how to evaluate frontier models in a way that is rigorous, scalable, and aligned with real developer workflows.
## What you will work on
- Implement and improve evaluation pipelines for coding LLMs.
- Work on both pretraining and post-training evaluation.
- Integrate existing benchmarks and help design new ones.
- Build infrastructure for multi-turn and agentic evaluations.
- Develop metrics, regression tracking, and automated evaluation workflows (a minimal metric sketch follows this list).
- Study model behavior in realistic coding and developer-assistant scenarios.
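To make the metric side of this work concrete, here is a minimal, purely illustrative sketch (not project code) of the unbiased pass@k estimator that many functional-correctness coding benchmarks report; the helper names `pass_at_k` and `benchmark_score` and the `results` layout are hypothetical.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    completions, drawn without replacement from n generated samples of
    which c pass the unit tests, is correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def benchmark_score(results: dict[str, tuple[int, int]], k: int = 1) -> float:
    """Average pass@k across tasks; `results` maps a task id to
    (samples generated, samples that passed the tests)."""
    return sum(pass_at_k(n, c, k) for n, c in results.values()) / len(results)

# Hypothetical usage: two tasks, 20 samples each, 3 and 0 passing.
print(benchmark_score({"task_001": (20, 3), "task_002": (20, 0)}, k=1))
```

In a real pipeline, scores like this are tracked per benchmark and per model revision so regressions surface automatically; building and extending that kind of infrastructure is part of the project.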
## Why this project is exciting
- You will work on high-impact evaluation problems that directly shape model development.
- You will go beyond standard benchmarks into agentic, multi-turn, and more realistic coding evaluations.
- You will help define new benchmarks and metrics where the field still has open problems.
- You will work with a team that has the infrastructure, compute, and research freedom to turn ideas into practice.
## Requirements
We’ll be happy to have you on this project if you have:
- A solid background in machine learning, NLP, or a related technical field.
- Good programming skills in Python.
- Interest in LLM evaluation, benchmarking, and experimental methodology.
- Ability to work with datasets, metrics, and automated pipelines.
- Curiosity about how coding models behave in realistic and agentic settings.
- Ability to read technical papers and implement or adapt benchmark ideas with support from the team.
- Attention to detail and strong communication skills.
## Nice to have
- Familiarity with LLM benchmarks or evaluation frameworks.
- Experience with code execution environments, testing pipelines, or experiment tracking tools.
- Interest in agentic systems, tool use, or multi-turn evaluation.
- Previous coursework, research, or side projects in ML, NLP, or software engineering.