Consider a reinforcement learning agent with two actions (a1, a2) and three states (S1, S2, S3).
After a period interacting with the environment we have the following values of the Q function:
Q1(S1,a1) = -2
Q2(S1,a2) = -6
Q3(S2,a1) = -4
Q4(S2,a2) = -2
Q5(S3,a1) = -4
Q6(S3,a2) = -2
Now the agent is in state S2 and chooses action a1, receiving a reward of -1.
Assuming it stays in S2, what is the probability that action a1 will be chosen again?
ε = δ = 0.1 and discount factor γ = 0.9
I think we have to use temporal-difference learning.
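
To check this idea, here is a minimal sketch of one temporal-difference step followed by the action-selection probability. It rests on assumptions that are not stated in the problem: that the update rule is Q-learning, that δ is the learning rate, and that ε-greedy exploration picks uniformly among all actions (including the greedy one). The function names q_learning_update and epsilon_greedy_prob are made up for illustration.

```python
# Sketch only. Assumptions (not given in the problem statement):
#   - the TD update is Q-learning (bootstraps from the max over next actions)
#   - delta from the question is the learning rate (called ALPHA here)
#   - epsilon-greedy explores uniformly over ALL actions
EPSILON = 0.1  # exploration rate
ALPHA = 0.1    # learning rate, assumed to be the question's delta
GAMMA = 0.9    # discount factor

ACTIONS = ("a1", "a2")

# Q-values from the question
Q = {
    ("S1", "a1"): -2.0, ("S1", "a2"): -6.0,
    ("S2", "a1"): -4.0, ("S2", "a2"): -2.0,
    ("S3", "a1"): -4.0, ("S3", "a2"): -2.0,
}


def q_learning_update(q, state, action, reward, next_state):
    """Apply one Q-learning step in place and return the new Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_target = reward + GAMMA * best_next
    q[(state, action)] += ALPHA * (td_target - q[(state, action)])
    return q[(state, action)]


def epsilon_greedy_prob(q, state, action):
    """Probability that epsilon-greedy selects `action` in `state`."""
    best = max(q[(state, a)] for a in ACTIONS)
    greedy = [a for a in ACTIONS if q[(state, a)] == best]
    prob = EPSILON / len(ACTIONS)  # share from uniform exploration
    if action in greedy:
        prob += (1 - EPSILON) / len(greedy)  # share from exploitation
    return prob


# Agent is in S2, takes a1, gets reward -1, and stays in S2.
new_q = q_learning_update(Q, "S2", "a1", reward=-1.0, next_state="S2")
p_a1 = epsilon_greedy_prob(Q, "S2", "a1")
print(round(new_q, 4), round(p_a1, 4))
```

Under these assumptions the updated Q(S2,a1) is about -3.88, which is still below Q(S2,a2) = -2, so a1 would only be picked through exploration, with probability ε/2 = 0.05. If exploration is instead restricted to the non-greedy action, the answer would be ε = 0.1, so the result depends on which ε-greedy convention is intended.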