The foundation of imitation learning is a Markov Decision Process, which satisfies the Markov property. The Markov property states that in a sequence of states x₁, x₂, …, xₜ the next state xₜ depends only on the current state xₜ₋₁, not on the earlier history. A Markov Decision Process contains a set of possible world states S, a set of possible actions A, a reward function R(s, a), and a description T of each action's effects in each state (Givan and Parr, 2001). In the deterministic case, each state and action specify a single new state
T: S × A → S
while in the stochastic case each state and action specify a probability distribution over successor states,
T(s, a, s') = P(s' | s, a).
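As a concrete illustration, the following Python sketch encodes a small finite MDP with states S, actions A, a reward function R(s, a), and a stochastic transition model T giving P(s' | s, a). The particular states, actions, rewards, and probabilities are illustrative assumptions, not taken from the text.

```python
import random

# A toy finite MDP (illustrative values): states S, actions A,
# reward R(s, a), and transition probabilities T(s, a, s') = P(s' | s, a).
S = ["s0", "s1", "s2"]
A = ["left", "right"]

R = {("s0", "right"): 1.0, ("s1", "right"): 2.0}   # unspecified pairs give reward 0

T = {
    ("s0", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s0", "right"): {"s1": 0.8, "s2": 0.2},
    ("s1", "left"):  {"s0": 1.0},
    ("s1", "right"): {"s2": 1.0},
    ("s2", "left"):  {"s1": 1.0},
    ("s2", "right"): {"s2": 1.0},
}

def step(s, a):
    """Sample the next state s' from P(. | s, a) and return it with the reward R(s, a)."""
    dist = T[(s, a)]
    s_next = random.choices(list(dist.keys()), weights=list(dist.values()))[0]
    return s_next, R.get((s, a), 0.0)

print(step("s0", "right"))   # e.g. ('s1', 1.0)
```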
In imitation learning the machine takes as input expert demonstrations, or trajectories,
𝜏 = (s₀, a₀, s₁, a₁, …)
where the state-action pairs are generated by the expert's optimal policy π*. To follow a policy π, the machine determines the current state s and executes the action π(s). This process is then repeated.
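Continuing the toy MDP above, the sketch below follows a policy in exactly this way: observe the current state s, execute π(s), record the pair, and repeat. The expert_policy function is a hypothetical stand-in for the expert's π*, and rollout reuses the step function from the previous sketch.

```python
def expert_policy(s):
    """Hypothetical expert pi*: always move right toward the terminal state s2."""
    return "right"

def rollout(policy, s0="s0", horizon=5):
    """Follow a policy: observe state s, execute a = policy(s), repeat for a fixed horizon."""
    tau = []                      # trajectory tau = (s0, a0, s1, a1, ...)
    s = s0
    for _ in range(horizon):
        a = policy(s)
        tau += [s, a]
        s, _ = step(s, a)         # step() from the MDP sketch above
    return tau

# A set of expert demonstrations used as input to imitation learning.
demonstrations = [rollout(expert_policy) for _ in range(10)]
print(demonstrations[0])          # e.g. ['s0', 'right', 's1', 'right', 's2', 'right', ...]
```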
Behavioral cloning is the simplest form of imitation learning: it directly maps from states/contexts to trajectories/actions without recovering the reward function (Osa et al., 2018). These methods treat the demonstrated state-action pairs as independent and identically distributed samples and learn the policy π through supervised learning by minimizing a loss function.
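A minimal behavioral-cloning sketch is given below, under the assumption that states are low-dimensional feature vectors and actions are discrete labels (both illustrative): the demonstrated state-action pairs are pooled, treated as i.i.d. samples, and a classifier is fit by minimizing a standard supervised loss. The fitted model then serves as the cloned policy π.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical demonstration data: each row of `states` is a state feature vector,
# and `actions` holds the expert's action recorded in that state.
states = np.array([[0.1, 0.0], [0.9, 0.2], [0.5, 0.7], [0.2, 0.9]])
actions = np.array([0, 1, 1, 0])

# Supervised learning over i.i.d. (state, action) pairs: fitting the classifier
# minimizes the log loss on the demonstrations.
model = LogisticRegression().fit(states, actions)

def pi(s):
    """Cloned policy: predict the expert's action for a new state s."""
    return model.predict(np.asarray(s).reshape(1, -1))[0]

print(pi([0.4, 0.6]))   # predicted expert action for an unseen state
```

Any supervised learner could take the place of logistic regression here; the essential point is only that π is obtained by minimizing a loss over the demonstrated state-action pairs.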