where $\gamma \in (0, 1]$ is a discount factor that prioritizes earlier rewards over later ones. The goal of reinforcement learning is to train an agent with policy $\pi$ to maximize the expected return, defined as $R = \mathbb{E}_{r_{i \geq 1},\, x_{i \geq 1} \sim E,\, u_{i \geq 1} \sim \pi}[R_1]$. To optimize the expected return $R$, various model-free and model-based algorithms have been proposed. In the next subsection, we will review the most recent deep Q-learning algorithm for a continuous action space, which is the basis of the learning method in our experiments. Now, we demonstrate how to apply reinforcement learning to hyperparameter optimization for tracking.
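For concreteness, the quantities above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in our experiments. It assumes the standard discounted return $R_1 = \sum_{i \geq 1} \gamma^{i-1} r_i$ over a finite episode, and it approximates the expectation by averaging over sampled episodes. The names `discounted_return`, `estimate_expected_return`, and `sample_episode` are hypothetical and introduced only for this example.

```python
import random


def discounted_return(rewards, gamma):
    """Compute sum_{i>=1} gamma**(i-1) * r_i for a finite reward sequence.

    Assumes the standard discounted-return definition; `rewards` holds
    r_1, r_2, ... in the order they were received.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    total = 0.0
    # Accumulate backwards using the recursion R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_expected_return(sample_episode, gamma, n_episodes=1000, seed=0):
    """Monte Carlo estimate of R = E[R_1].

    `sample_episode` is a hypothetical callable that takes a random number
    generator and returns the reward sequence of one episode obtained by
    running the policy in the environment.
    """
    rng = random.Random(seed)
    returns = [
        discounted_return(sample_episode(rng), gamma) for _ in range(n_episodes)
    ]
    return sum(returns) / n_episodes
```

With $\gamma < 1$, a reward received at step $i$ is weighted by $\gamma^{i-1}$, so the same reward contributes less the later it arrives; with $\gamma = 1$ the return reduces to the plain sum of rewards.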