University of Wisconsin-Madison
Last Updated: 02/09/2026
(Page adapted from Nan Jiang's CS 542/598 project page and constantly updated.)
Approximate Dynamic Programming (ADP) concerns obtaining approximate solutions to large planning problems, often with the help of sampling and function approximation. Many ADP methods can be considered as prototype algorithms for popular value-based RL algorithms used today, especially in the offline setting, so it is important to understand their behaviors and guarantees.
Offline RL without exploratory data
Online + Offline (Hybrid)
Lower bounds
How to estimate the performance of a policy using data collected from a different policy? This question has important implications in safety and real-world applications of RL.
OPE in Bandits
Importance Sampling for RL
Marginalized Importance Sampling and Beyond
These methods directly minimize the projected Bellman error for linear value-function approximation, and have some different characteristics compared to other TD methods that learn from bootstrapped targets.
Online
Offline
When Markov assumption does not hold and the world is partially observable, how can you recover state from observations and plan based on the belief?
Planning
Learning
Computing near-optimal policy in large MDPs by sampling in a generative model of the environment.
Structural Assumptions (Bellman rank, Eluder dimension, etc.)
Latent State Discovery
Low-rank/Linear MDPs
Thompson sampling
Lower bounds
Lower bounds
Exploiting side information or structures in easy problems
Concurrent/"Low-Switch"/"Deployment-efficient" PAC-RL
Multi-task & Life-long Learning and Learning across Population
PAC Hierarchical RL
Apart from value/policy iteration, Linear Programming (LP) is another standard method for solving MDPs. The LP formulation also reveals many interesting properties of MDPs (e.g., the dual formulation has occupancy measure as its decision variables).
In batch RL we often assume that logged actions are function of states, which are not necessarily true in some application scenarios. If actions in the data were determined based on hidden factors, standard algorithms may suffer from the confounding effect, and we may seek tools from causal inference to mitigate it.
Conservative approaches to MDPs. Compute a policy that is robust to parameter uncertainty of MDPs. Or when the MDP is fully known, compute a policy whose return is worst-case optimal w.r.t. the randomness of trajectory.
How to design an RL algorithm that keeps improving the policy monotonically? This motivation inspired the original CPI algorithm, and later its variants such as TRPO.
How to mimic a policy from expert demonstrations? A pure supervised learning approach ("behavior cloning") may suffer from distribution shift. Interactive methods, such as DAgger, are more robust against it.
In control theory, the most basic model of a dynamic environment is a linear system. How to do RL in continuous state space and what guarantees can be obtained?
State abstraction is a simplest form of approximation that collapses a high-dimensional state representation to a more manageable one. Understanding of abstractions forms the basis for understanding more complex function approximation schemes. However, designing a good abstraction is a very difficult engineering task; can we have the algorithm automatically select among candidate representations in learning?
Factored MDP is a tractable formulation of large scale MDPs: state is represented by factors, and the future value of each factor only depends on the values of a small number of parent factors now.
Convergence analyses of Q-learning, TD, etc.