The goal of this project is to study whether reinforcement learning at test time can improve model performance when supervision is limited or partially synthetic, and to understand under which conditions such improvements are most effective.
The project is structured in two stages. First, a synthetic supervision pipeline is constructed for an internal entity classification dataset. A small language model (approximately 3–8B parameters) with sufficient baseline quality is selected, and a synthetic-label generation tool is implemented. The quality of the generated labels is evaluated by comparing them against available gold annotations, with the aim of understanding their reliability and biases.
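The label-quality evaluation in the first stage could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the helper names (`label_agreement`, `label_bias`) and the toy entity classes are made up for the example, and a real evaluation would run over the internal dataset rather than hard-coded lists.

```python
# Hypothetical sketch: compare synthetic labels against gold annotations
# to estimate reliability (agreement rate) and bias (per-class
# over- or under-prediction by the synthetic labeler).
from collections import Counter

def label_agreement(gold, synthetic):
    """Fraction of examples where the synthetic label matches gold."""
    assert len(gold) == len(synthetic)
    return sum(g == s for g, s in zip(gold, synthetic)) / len(gold)

def label_bias(gold, synthetic):
    """Per-class count difference: a positive value means the synthetic
    labeler over-predicts that class relative to the gold annotations."""
    gold_counts, synth_counts = Counter(gold), Counter(synthetic)
    classes = sorted(set(gold) | set(synthetic))
    return {c: synth_counts[c] - gold_counts[c] for c in classes}

# Toy example with invented entity classes (illustrative only).
gold      = ["PER", "ORG", "ORG", "LOC", "PER", "LOC"]
synthetic = ["PER", "ORG", "PER", "LOC", "PER", "ORG"]

print(label_agreement(gold, synthetic))  # 4/6 ≈ 0.667
print(label_bias(gold, synthetic))       # {'LOC': -1, 'ORG': 0, 'PER': 1}
```

Per-class statistics like these are what would reveal systematic labeler biases (e.g. a tendency to over-predict frequent classes), which matters later when the synthetic labels are reused as a reward signal.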
Second, reinforcement learning experiments are conducted using a policy-optimization (PO) framework of choice. Two training settings are compared: (1) adapting the model using only synthetic data, with evaluation before and after adaptation; and (2) training on real labeled data followed by additional adaptation using synthetic data, with evaluation both during training and at test time.
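To make the reward structure of setting (1) concrete, here is a deliberately tiny REINFORCE-style sketch in pure Python. It is an assumption-laden toy, not the actual experiment: the real study would use a small language model and an existing PO framework, whereas this version uses a tabular softmax policy, a made-up two-class task, and a synthetic labeler that happens to be perfect, purely to show how a reward derived from synthetic labels drives the policy update.

```python
# Toy REINFORCE loop: reward = 1 when the sampled action agrees with the
# synthetic label. Illustrates synthetic-only adaptation (setting 1).
import math, random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy "dataset": feature x in {0, 1}; the synthetic label equals x
# (i.e. the synthetic labeler is assumed perfect on this toy task).
data = [(x, x) for x in (0, 1) for _ in range(50)]

# Policy: one logit vector per feature value (a tabular stand-in for a model).
logits = {0: [0.0, 0.0], 1: [0.0, 0.0]}
lr = 0.5

for epoch in range(20):
    random.shuffle(data)
    for x, synth_label in data:
        probs = softmax(logits[x])
        action = random.choices([0, 1], weights=probs)[0]
        reward = 1.0 if action == synth_label else 0.0
        # REINFORCE: grad of log pi(action|x) w.r.t. logit a is
        # one_hot(action)[a] - probs[a]; scale by the reward.
        for a in (0, 1):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[x][a] += lr * reward * grad

# After adaptation, the greedy policy should prefer the synthetic label.
acc = sum(max((0, 1), key=lambda a: logits[x][a]) == y
          for x, y in data) / len(data)
print(acc)  # 1.0 on this toy task
```

The interesting empirical question in the project is precisely what this toy hides: when the synthetic labeler is noisy or biased, the same reward signal can reinforce its errors, which is why both settings evaluate against gold annotations before and after adaptation.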
The expected outcome is a clear empirical assessment of whether synthetic supervision is beneficial for test-time reinforcement learning and how its impact depends on the model’s initial training state.
Basic proficiency in Python and experience running machine learning experiments.
General familiarity with machine learning and deep learning concepts.
Introductory understanding of language models and common fine-tuning or adaptation techniques.
Willingness to read research papers and implement existing methods or frameworks.
Readiness to work with synthetic and real datasets, run training and evaluation loops, and analyze results.
Interest in experimental study of training strategies and their impact on model performance.