New versions of coding agents like Junie or Claude Code are often evaluated on SWE-bench-style benchmarks. A central metric for comparison is the percentage of successfully resolved tasks. However, this alone may not be enough, because it ignores the actual path to the result, i.e., the agent’s trajectory. Such trajectories are available for many coding agents because they are key to debugging and understanding the agent’s behaviour.
The research goal of this internship is to measure and compare the quality of different input instructions given to an agent and the quality of different agents. To achieve this, the intern will develop, implement, and evaluate quality metrics for comparing various coding agents, different versions of the same agent, or different prompts for the same task and agent.
For human-centric software delivery, there are various metric frameworks, such as the DORA metrics. However, agentic trajectories differ fundamentally in cadence and granularity. Delivery processes move between milestones, dominated by peer-reviewed PRs and stakeholder-visible releases. In contrast, trajectories are high-frequency "scratchpads" — narrow-scope sequences of edits and test executions heading toward a PR.
DORA metrics like Change Failure Rate and Lead Time for Changes are macro-level barometers; they quantify integrated system stability and organisational friction rather than the transient, incomplete states of a scratchpad. Consequently, they are blind to the "internal physics" of an agent’s journey. They cannot distinguish between efficient reasoning and high-cost trial-and-error because they rely on the finality of a release. Just as refinery output cannot measure drill-bit efficiency, DORA is too distant from the source to evaluate agentic logic. This gap necessitates a new class of trajectory-specific metrics, which this internship explores.
The intern will develop a systematic framework for assessing agentic trajectory quality, moving beyond binary pass/fail outcomes to quantify the efficiency and "inference waste" of autonomous software agents. By evaluating metrics such as Agentic Edit Churn, which distinguishes between productive code authorship and redundant "thrashing", the intern will benchmark how different models navigate complex search spaces. A primary objective is to correlate these trajectory-based signals with human-judged PR quality and operational cost, providing a rigorous foundation for optimising the reliability and value of automated software delivery.
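To make the idea concrete, here is a minimal sketch of what an "Agentic Edit Churn" metric could look like. The metric is not formally defined above, so this is an illustrative assumption, not the project's actual definition: churn is taken to be the fraction of lines an agent writes that a later step in the same trajectory rewrites or deletes. The `EditStep` schema is hypothetical, and tracking edits by raw line number ignores the line-shifting effects of real diffs; a proper implementation would need diff-aware tracking.

```python
from dataclasses import dataclass, field


@dataclass
class EditStep:
    """One file edit in an agent trajectory (hypothetical schema)."""
    path: str                           # file the agent edited
    added: set[int] = field(default_factory=set)    # lines written in this step
    removed: set[int] = field(default_factory=set)  # lines deleted in this step


def edit_churn(trajectory: list[EditStep]) -> float:
    """Fraction of written lines that a LATER step re-edits or deletes.

    0.0 means every edit survived to the end of the trajectory;
    values near 1.0 suggest the agent repeatedly rewrote its own work
    ("thrashing") rather than making productive forward progress.
    """
    total_written = 0
    churned = 0
    for i, step in enumerate(trajectory):
        total_written += len(step.added)
        later = trajectory[i + 1:]
        for line in step.added:
            # A line counts as churned if any later step on the same
            # file touches it again, either rewriting or deleting it.
            if any(s.path == step.path and (line in s.added or line in s.removed)
                   for s in later):
                churned += 1
    return churned / total_written if total_written else 0.0


# Example: the agent writes lines 1-3, then rewrites lines 2-3.
# 2 of the 5 written lines were churned: churn = 0.4.
trajectory = [
    EditStep("solver.py", added={1, 2, 3}),
    EditStep("solver.py", added={2, 3}),
]
print(edit_churn(trajectory))  # → 0.4
```

Even this toy version shows why such a signal is trajectory-specific: it is computed from intermediate scratchpad states that never appear in the final PR, which is exactly the information DORA-style release metrics discard.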
The project should provide answers to the following research questions:
What is the state of the art in white and grey literature on such metrics?
Which metrics (from a set we will provide) best assess the quality of agentic trajectories?
How do these metrics behave when evaluated on various coding agents, prompts, tasks, and projects?
An applicant to this project should have strong research skills to familiarise themselves with the existing literature and state of the art, as well as to develop new metrics, which is one of the project's goals. Additionally, we require proficiency in Kotlin, Java, or Python to implement and evaluate the proposed solution.