Dense Dynamics-Aware Reward Synthesis: Integrating Prior Experience with Demonstrations
Many continuous control problems can be formulated as sparse-reward reinforcement learning tasks. In principle, online reinforcement learning methods can automatically explore the state space to solve each new task. However, discovering sequences of actions that lead to a non-zero reward becomes exponentially more difficult as the task horizon increases. Manually shaping rewards can accelerate learning for a fixed task, but it is an arduous process that must be repeated for each new environment. This work introduces a systematic reward-shaping framework that distills the information contained in 1) a task-agnostic prior data set and 2) a small number of task-specific expert demonstrations, and then uses these priors to synthesize dense dynamics-aware rewards for the given task. This supervision substantially accelerates learning in our experiments, and we provide analysis demonstrating how the approach can effectively guide online learning agents to faraway goals.
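To illustrate the general idea of converting sparse task rewards into dense, dynamics-aware ones, the following is a minimal sketch on a toy chain environment. It is not the paper's method; it assumes a classic potential-based shaping scheme in which the "prior data" is a set of task-agnostic transitions, the demonstration identifies the goal state, and the potential phi(s) is the negative shortest-path distance to the goal under the empirically observed dynamics. All names (`prior_transitions`, `potential`, `shaped_reward`) are hypothetical.

```python
from collections import deque

# Hypothetical sketch: dense reward synthesis via potential-based shaping.
# Task-agnostic prior data gives an empirical dynamics graph; a demonstration
# identifies the goal. phi(s) = -(steps-to-goal under observed dynamics), and
# the shaped reward r + gamma*phi(s') - phi(s) is dense at every step while
# preserving optimal policies (Ng et al., 1999).

N, GOAL, GAMMA = 10, 9, 0.99

# Task-agnostic prior data: (s, s') transitions from random walks on a chain.
prior_transitions = [(s, min(s + 1, N - 1)) for s in range(N)] + \
                    [(s, max(s - 1, 0)) for s in range(N)]

# Build the empirical dynamics graph from the prior data set.
graph = {}
for s, s2 in prior_transitions:
    graph.setdefault(s, set()).add(s2)

def potential(goal):
    """phi(s) = -(shortest #steps from s to goal) via BFS on the prior graph."""
    rev = {}  # reverse edges so BFS expands outward from the goal
    for s, succs in graph.items():
        for s2 in succs:
            rev.setdefault(s2, set()).add(s)
    dist = {goal: 0}
    frontier = deque([goal])
    while frontier:
        s = frontier.popleft()
        for p in rev.get(s, ()):
            if p not in dist:
                dist[p] = dist[s] + 1
                frontier.append(p)
    return {s: -d for s, d in dist.items()}

phi = potential(GOAL)

def shaped_reward(s, s2):
    sparse = 1.0 if s2 == GOAL else 0.0       # original sparse task reward
    return sparse + GAMMA * phi[s2] - phi[s]  # dense shaping term

# Every step of the demonstrated path 0 -> 1 -> ... -> GOAL earns a positive
# shaped reward, while moving away from the goal is penalized.
demo = list(range(N))
rewards = [shaped_reward(s, s2) for s, s2 in zip(demo, demo[1:])]
```

Because the shaping term is a potential difference, the dense signal guides exploration toward distant goals without changing which policies are optimal; the dynamics-awareness enters through `phi`, which reflects reachability under the transitions actually observed in the prior data rather than raw state-space distance.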