Antoine Moulin

Hi! I am a PhD student in reinforcement learning (RL). In my thesis, I aim to gain a deeper understanding of the challenges posed by large-scale RL by identifying and exploiting structural properties of Markov decision processes (MDPs) that make learning statistically and computationally feasible.

My research interests also include online learning, imitation learning, and language modeling.

I am co-advised by Gergely Neu and Arthur Gretton. You can reach out to me at: firstname [dot] lastname [at] upf [dot] edu.


News

Publications & Preprints


2026

A benchmark of expert-level academic questions to assess AI capabilities (Humanity's Last Exam)

Center for AI Safety, Scale AI, HLE Contributors Consortium
Nature
arxiv

tl;dr: a multi-modal benchmark of multiple-choice and short-answer questions on a wide range of topics. my contribution was small.

2025

Outcome-Aware Spectral Feature Learning for Instrumental Variable Regression

Dimitri Meunier, Jakub Wornbard, Vladimir R. Kostic, Antoine Moulin, Alek Frölich, Karim Lounici, Massimiliano Pontil, Arthur Gretton
preprint
arxiv

tl;dr: since standard spectral features are agnostic to the outcome, they fail under misalignment; we remedy this by regularizing them towards the outcome via an augmented operator and a contrastive loss.

Inverse Q-Learning Done Right: Offline Imitation Learning in Qπ-Realizable MDPs

Antoine Moulin, Gergely Neu, Luca Viano
(NeurIPS 2025) 39th Annual Conference on Neural Information Processing Systems
arxiv

tl;dr: we propose a primal-dual method that provably matches the return of the expert in linear and general Qπ-realizable MDPs, providing an alternative to maximum likelihood estimation (also known as behavior cloning, or next-token prediction), which can fail dramatically when expert realizability does not hold.

Demystifying Spectral Feature Learning for Instrumental Variable Regression

Dimitri Meunier, Antoine Moulin, Jakub Wornbard, Vladimir R. Kostic, Arthur Gretton
(NeurIPS 2025) 39th Annual Conference on Neural Information Processing Systems
arxiv

tl;dr: we study why and when spectral methods work for IV regression, and show their performance depends on the alignment of the target function with the top singular directions of a conditional expectation operator (dubbed "spectral alignment") and on the decay of the singular values (which captures instrument strength).

When Lower-Order Terms Dominate: Improved Loss-Range Adaptivity for Experts Algorithms

Antoine Moulin, Emmanuel Esposito, Dirk van der Hoeven
(NeurIPS 2025) 39th Annual Conference on Neural Information Processing Systems
arxiv

tl;dr: we develop algorithms for the experts problem with possibly heavy-tailed losses that are adaptive to the second moment, and show they achieve best-of-both-worlds guarantees.

Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning

Antoine Moulin, Gergely Neu, Luca Viano
(COLT 2025) 38th Annual Conference on Learning Theory
Contributed talk at EWRL 2025
arxiv

tl;dr: we provide the first computationally efficient algorithm achieving rate-optimal regret in discounted linear MDPs; it combines optimistic exploration with artificial transitions to an absorbing state of maximal return. we also apply it to interactive imitation learning, where we obtain state-of-the-art guarantees.

Spectral Representation for Causal Estimation with Hidden Confounders

Haotian Sun, Antoine Moulin, Tongzheng Ren, Arthur Gretton, Bo Dai
(AISTATS 2025) 28th International Conference on Artificial Intelligence and Statistics
arxiv

tl;dr: under a low-rank assumption on suitable conditional distributions (analogous to the low-rank MDP assumption made in RL), we propose a primal-dual method that performs well on IV and PCL problems.

2023

Optimistic Planning by Regularized Dynamic Programming

Antoine Moulin, Gergely Neu
(ICML 2023) 40th International Conference on Machine Learning
arxiv

tl;dr: we analyze optimistic value iteration in discounted MDPs and show that regularization can be used to avoid contraction and monotonicity arguments, which typically do not hold under function approximation.

Talks

Tutorial on Imitation Learning

  • 01/2026: University of Oxford. Oxford, UK.

Efficient Exploration in Linear Markov Decision Processes

  • 01/2026: University of Oxford. Oxford, UK.
  • 11/2025: Isaac Newton Institute. Cambridge, UK.

Inverse Q-Learning for Offline Imitation Learning

  • 09/2025: Università degli Studi di Milano. Milan, Italy.

Learning in Adversarial Linear MDPs

  • 04/2024: University of Tokyo. Tokyo, Japan.

Optimistic Planning by Regularized Dynamic Programming

  • 08/2023: Princeton University. Princeton, NJ.
  • 07/2023: Stanford University. Stanford, CA.

Infinite Horizon MDPs under Function Approximation

  • 03/2023: Universitat Pompeu Fabra. Barcelona, Spain.

Primal-Dual Methods for Reinforcement Learning

  • 09/2022: Gatsby Unit, UCL. London, UK.

Introduction to JAX

  • 09/2021: ELLIS Doctoral Symposium 2021. Tübingen, Germany.

Virtual Sculpture

  • 06/2018: Journée de l'innovation (finalist). Paris, France.