The goal of this project is to study how small and medium-sized language models can acquire reasoning capabilities under strict computational budgets, and how architectural choices shape the trade-off between accuracy and compute.
The project consists of two parts. First, a masked language model is trained for general text understanding on datasets such as WikiText-103 or PubMed abstracts. Two model families are compared (e.g. BERT-Medium and a Transformer-based recurrent model, TRM), with special attention to matching model capacity and compute budget across the families. The main outcome of this stage is a systematic comparison of accuracy versus compute budget.
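To make the first stage concrete, the core of masked language modelling is the input-corruption step. Below is a minimal sketch in plain Python of the BERT-style masking procedure (select ~15% of positions as prediction targets; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged). The constants `MASK_ID` and `VOCAB_SIZE` and the function name are illustrative assumptions; a real implementation would operate on batched tensors from a tokenizer.

```python
import random

MASK_ID = 103        # assumed [MASK] token id (BERT convention; illustrative)
VOCAB_SIZE = 30522   # assumed vocabulary size (BERT-base convention; illustrative)

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """BERT-style masking sketch.

    Returns (corrupted_ids, labels): labels hold the original token at
    positions selected for prediction and -100 elsewhere (the common
    "ignore this position in the loss" convention)."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok            # this position contributes to the loss
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_ID             # 80%: mask token
            elif r < 0.9:
                corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10% keep the original token unchanged
    return corrupted, labels
```

The 10% random / 10% unchanged split keeps the corrupted distribution closer to real inputs, so the encoder cannot rely on the mask token alone to locate prediction targets.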
Second, the focus shifts to explicit recurrent reasoning on a structured task (Sudoku-Extreme). Smaller models (e.g. BERT-Small and TRM) are evaluated in a recurrent setting, where the same network is applied repeatedly to refine its prediction. This includes designing a way to embed a BERT-like model into a recurrent loop and defining how the internal state is passed between steps. The goal is to analyze how effective compute depth (the total number of sequential layer applications) influences reasoning accuracy under fixed budgets.
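One possible interface for the recurrent loop is sketched below: the encoder is abstracted as a step function mapping (previous internal state, fixed input embedding) to a new state, and the same weights are reapplied for a configurable number of steps, so effective depth grows with the step count. The class name `RecurrentReasoner`, the `step_fn` interface, and the state-as-vector representation are hypothetical, not a prescribed design; a toy step function stands in for a BERT-like encoder pass.

```python
from typing import Callable, List

Vector = List[float]

class RecurrentReasoner:
    """Wraps a weight-tied encoder step into a recurrent loop.

    step_fn plays the role of one full encoder pass (e.g. a BERT-like
    stack): it receives the previous internal state and the fixed input
    embedding, and returns the next state. Interface is illustrative."""

    def __init__(self, step_fn: Callable[[Vector, Vector], Vector], n_steps: int):
        self.step_fn = step_fn
        self.n_steps = n_steps

    def __call__(self, x: Vector, init_state: Vector) -> Vector:
        state = init_state
        for _ in range(self.n_steps):       # same weights reused each step:
            state = self.step_fn(state, x)  # effective depth scales with n_steps
        return state

# Toy step function standing in for an encoder pass: blend state and input.
def toy_step(state: Vector, x: Vector) -> Vector:
    return [0.5 * s + 0.5 * xi for s, xi in zip(state, x)]

reasoner = RecurrentReasoner(toy_step, n_steps=3)
out = reasoner([1.0, 1.0], init_state=[0.0, 0.0])  # state contracts toward the input
```

A key design decision this interface exposes is what the state actually is for a BERT-like model: the full sequence of hidden vectors, a pooled summary, or the model's own token predictions fed back as input.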
The expected outcome is a working experimental setup, reproducible implementations, and empirical insights into how recurrence and architectural constraints enable or limit reasoning in small models, supported by accuracy-vs-budget and depth-vs-accuracy analyses. Potential architectural or training improvements are welcome.
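For the budget-matched comparisons above, the accounting can be sketched as follows: with tied weights, effective depth is layers times recurrent steps, and a fixed per-forward-pass FLOPs budget determines how many steps each model family affords. All function names and numbers are illustrative assumptions, not measured costs.

```python
def effective_depth(n_layers: int, n_steps: int) -> int:
    """Sequential layer applications when an n_layers encoder is
    reapplied n_steps times with tied weights."""
    return n_layers * n_steps

def steps_for_budget(budget_flops: float, flops_per_layer: float, n_layers: int) -> int:
    """Recurrent steps affordable in one forward pass under a fixed
    FLOPs budget (uniform per-layer cost assumed for illustration)."""
    per_step = flops_per_layer * n_layers
    return int(budget_flops // per_step)

# Under the same budget, a 4-layer model affords 3x the steps of a
# 12-layer model, reaching the same effective depth (made-up numbers):
assert steps_for_budget(2.4e12, 1.0e11, 12) == 2
assert steps_for_budget(2.4e12, 1.0e11, 4) == 6
assert effective_depth(12, 2) == effective_depth(4, 6) == 24
```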
Requirements:
Basic proficiency in Python and some experience using it for machine learning experiments.
General familiarity with machine learning and deep learning concepts.
Introductory understanding of language models and Transformer-based architectures.
Willingness to read research papers and translate ideas into working code.
Readiness to train models, run controlled experiments, and analyze results.
Interest in exploring how architectural choices and compute constraints affect model behaviour.