policy: maps observations to actions, π_θ(a_t | o_t)

observation: analogous to the frame rendered on screen; state: analogous to the computer's underlying memory state

Markov property: given s_t, the future s_{t+1} is conditionally independent of the past s_{t-1}. the state carries all information about the present needed to predict the future
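As a formula (a standard statement of the Markov property, written here with the state/action symbols used in these notes):

```latex
p(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t) = p(s_{t+1} \mid s_t, a_t)
```

i.e. conditioning on the full history adds nothing beyond the current state and action.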

goal: given observations, learn a policy π_θ(a_t | o_t)

behavior cloning: treat expert demonstrations as labeled (observation, action) pairs and fit the policy with supervised learning

this can accumulate large error (which doesn't happen in standard supervised learning, where samples are i.i.d.): a small change in the action at one timestep shifts the next observation off the training distribution, which changes the next action, so errors compound over time
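The supervised-learning step of behavior cloning can be sketched as plain regression on expert (observation, action) pairs. This is a minimal illustration with a synthetic linear "expert"; the data, the linear policy class, and the least-squares fit are all assumptions for the example, not part of the original notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# fake "expert" data: actions are a fixed linear function of observations
true_W = np.array([[1.0, -2.0], [0.5, 3.0]])  # expert's (unknown) mapping
obs = rng.normal(size=(500, 2))               # observations o_t
acts = obs @ true_W.T                         # expert actions a_t

# behavior cloning step: fit a policy pi(o) = W_hat @ o by least squares,
# exactly as in ordinary supervised regression
X, *_ = np.linalg.lstsq(obs, acts, rcond=None)
W_hat = X.T

# the cloned policy matches the expert on held-out observations...
test_obs = rng.normal(size=(10, 2))
err = np.max(np.abs(test_obs @ W_hat.T - test_obs @ true_W.T))
print(err < 1e-8)
```

Note what this sketch does not capture: the held-out observations here are drawn i.i.d. from the same distribution as training, whereas at rollout time the policy's own earlier mistakes shift the observation distribution.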

why does behavior cloning fail? — the math
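A sketch of the standard compounding-error argument (the bound from Ross & Bagnell's analysis of imitation learning; the symbols ε, T, c_t are introduced here, not taken from the notes above). Suppose the learned policy makes a mistake with probability at most ε on states from the expert's distribution, over a horizon of T steps, with c_t = 1 on a mistake at step t. The first mistake can occur at any step t, and once the policy is off-distribution it may keep erring for the remaining T − t steps, so:

```latex
\mathbb{E}\left[\sum_{t=1}^{T} c_t\right]
\;\le\; \varepsilon T + \varepsilon (T-1) + \dots + \varepsilon
\;\in\; O(\varepsilon T^2)
```

whereas i.i.d. supervised learning would give only O(εT). The quadratic dependence on the horizon is the mathematical statement of "errors compound."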