RL-04-03 Tabular Algorithm Implementation


A tabular implementation is the shortest path to verifying the Bellman TD update. Below are complete Q-Learning and SARSA examples that run directly on gymnasium (4×4 FrozenLake-v1).


1. Environment and Hyperparameters

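import gymnasium as gym
import numpy as np

# is_slippery=False disables random slipping, so every move is deterministic
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
n_states = env.observation_space.n   # number of discrete states (16 on the 4x4 map)
n_actions = env.action_space.n       # number of actions: left, down, right, up

alpha = 0.8         # learning rate
gamma = 0.99        # discount factor
eps = 1.0           # initial exploration rate
eps_min = 0.05      # lower bound on the exploration rate
eps_decay = 0.999   # multiplicative decay, applied once per episode
n_episodes = 8000   # number of training episodes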

With is_slippery=False the transitions are deterministic, which makes it easier to watch the Q-table converge.
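
To make the determinism concrete, here is a minimal sketch of the movement rule on the 4×4 grid in plain Python. It assumes the row-major state index (s = row * 4 + col) and the action order 0=left, 1=down, 2=right, 3=up used by FrozenLake. The helper `det_step` is a hypothetical name, not part of gymnasium, and it models movement only: holes, the goal, and rewards are ignored.

```python
N_ROW, N_COL = 4, 4
# (d_row, d_col) for each action: 0=left, 1=down, 2=right, 3=up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def det_step(s, a):
    """Return the next state index for a deterministic move (hypothetical helper)."""
    row, col = divmod(s, N_COL)
    d_row, d_col = MOVES[a]
    # Moving into a wall leaves the agent where it is
    row = min(max(row + d_row, 0), N_ROW - 1)
    col = min(max(col + d_col, 0), N_COL - 1)
    return row * N_COL + col

print(det_step(0, 2), det_step(0, 1), det_step(0, 0))
```

Because each (s, a) pair always leads to the same next state, a single successful update is never contradicted by a later sample, which is why a learning rate as high as 0.8 is workable here.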


2. Complete Q-Learning Implementation

Q = np.zeros((n_states, n_actions))

def epsilon_greedy(q_row, epsilon):
    # Explore with probability epsilon, otherwise act greedily
    if np.random.random() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(q_row))

returns = []
for ep in range(n_episodes):
    s, _ = env.reset()
    done = False
    ep_ret = 0.0
    while not done:
        a = epsilon_greedy(Q[s], eps)
        s_next, r, term, trunc, _ = env.step(a)
        done = term or trunc
        # Off-policy TD target: bootstrap from the greedy value of s_next.
        # Rows of terminal states are never updated, so they stay at 0
        # and no explicit terminal mask is needed here.
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
        ep_ret += r
    eps = max(eps_min, eps * eps_decay)  # decay once per episode
    returns.append(ep_ret)

print("Q-Learning mean return over the last 100 episodes:", np.mean(returns[-100:]))
print("Greedy policy:", np.argmax(Q, axis=1))
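
As a worked example of the update rule, the sketch below applies it by hand to the last two steps of a successful episode, starting from an all-zero table and using the same alpha and gamma as above. The function `q_learning_update` is a hypothetical scalar version of the one-line array update, written out only to show the arithmetic.

```python
alpha, gamma = 0.8, 0.99

def q_learning_update(q_sa, r, max_q_next, terminal):
    """Scalar form of Q[s, a] += alpha * (td_target - Q[s, a])."""
    target = r + (0.0 if terminal else gamma * max_q_next)
    return q_sa + alpha * (target - q_sa)

# Step into the goal: reward 1, nothing to bootstrap from
q_goal_step = q_learning_update(0.0, 1.0, 0.0, terminal=True)
# One step earlier (in a later episode): reward 0, bootstrap from the value above
q_prev_step = q_learning_update(0.0, 0.0, q_goal_step, terminal=False)
print(q_goal_step, q_prev_step)
```

The two values come out to roughly 0.8 and 0.6336: the reward propagates backwards one state at a time, shrunk by alpha and gamma at each hop.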

3. Complete SARSA Implementation

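Q_sarsa = np.zeros((n_states, n_actions))
eps2 = 1.0  # separate exploration rate, independent of the Q-Learning run

for ep in range(n_episodes):
    s, _ = env.reset()
    a = epsilon_greedy(Q_sarsa[s], eps2)
    done = False
    while not done:
        s_next, r, term, trunc, _ = env.step(a)
        done = term or trunc
        # On-policy: choose the action that would actually be taken in s_next
        a_next = epsilon_greedy(Q_sarsa[s_next], eps2)
        # Bootstrap unless s_next is terminal. On a time-limit truncation
        # s_next is an ordinary state, so its value is still used.
        td_target = r + gamma * Q_sarsa[s_next, a_next] * (not term)
        Q_sarsa[s, a] += alpha * (td_target - Q_sarsa[s, a])
        s, a = s_next, a_next
    eps2 = max(eps_min, eps2 * eps_decay)  # decay once per episode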
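
The only difference between the two algorithms is the TD target. The sketch below computes both targets for the same transition, using a made-up row of next-state values, so the numbers are illustrative only. The function names are hypothetical.

```python
gamma = 0.99

def q_learning_target(r, q_next_row):
    # Off-policy: assume the greedy action will be taken in s_next
    return r + gamma * max(q_next_row)

def sarsa_target(r, q_next_row, a_next):
    # On-policy: use the action that was actually sampled in s_next
    return r + gamma * q_next_row[a_next]

q_next_row = [0.1, 0.5, 0.2, 0.0]  # illustrative values, not from a real run
r = 0.0
a_next = 2  # an exploratory, non-greedy action

print(q_learning_target(r, q_next_row), sarsa_target(r, q_next_row, a_next))
```

When the sampled action happens to be the greedy one (index 1 here), the two targets coincide. They differ only on exploratory steps, which is why SARSA's values reflect the cost of exploration and Q-Learning's do not.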

4. Visualizing the Q-Table (Optional)

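def show_policy(Q, nrow=4, ncol=4):
    # Arrow order matches FrozenLake actions: 0=left, 1=down, 2=right, 3=up
    arrows = ["←", "↓", "→", "↑"]
    for s in range(nrow * ncol):
        if s == nrow * ncol - 1:
            print("G", end=" ")  # the goal is the last cell
        else:
            print(arrows[np.argmax(Q[s])], end=" ")
        if (s + 1) % ncol == 0:
            print()  # end of a grid row

show_policy(Q)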
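
The function above prints an arrow for hole cells too, even though no action is ever taken there. A possible variant, sketched below, marks holes explicitly. It assumes the default 4×4 FrozenLake layout (S=start, F=frozen, H=hole, G=goal) and returns a string instead of printing. `render_policy` is a hypothetical helper; it accepts either a NumPy array or a nested list.

```python
FROZEN_LAKE_4X4 = ["SFFF", "FHFH", "FFFH", "HFFG"]  # default 4x4 layout
ARROWS = ["←", "↓", "→", "↑"]  # 0=left, 1=down, 2=right, 3=up

def render_policy(Q, desc=FROZEN_LAKE_4X4):
    """Return the greedy policy as text, with holes (H) and the goal (G) marked."""
    ncol = len(desc[0])
    lines = []
    for r, row in enumerate(desc):
        cells = []
        for c, tile in enumerate(row):
            if tile in "HG":
                cells.append(tile)  # terminal cells have no meaningful action
            else:
                q_row = Q[r * ncol + c]
                # First maximal index, the same tie-breaking as np.argmax
                best = max(range(len(q_row)), key=lambda a: q_row[a])
                cells.append(ARROWS[best])
        lines.append(" ".join(cells))
    return "\n".join(lines)

demo_Q = [[0.0] * 4 for _ in range(16)]
demo_Q[0][2] = 1.0  # make "right" the greedy action in the start state
print(render_policy(demo_Q))
```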

5. Engineering Notes

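| Topic | Notes |
| --- | --- |
| Discrete states | The env already exposes a Discrete observation space; continuous observations must be discretized first |
| Decay timing | Decay $\varepsilon$ once per episode |
| Convergence check | Moving average of returns |
| Deterministic environment | With slipping disabled on FrozenLake, Q should stabilize at the optimum |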
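
Two of the notes above can be sketched in a few lines of plain Python: binning a continuous scalar observation into a discrete index, and smoothing the per-episode returns for a convergence check. Both helpers are hypothetical and the bin edges are arbitrary. A multi-dimensional observation would need one bin index per dimension, combined into a single state index.

```python
from bisect import bisect_right

def discretize(x, edges):
    """Map a scalar to a bin index in [0, len(edges)] given sorted bin edges."""
    return bisect_right(edges, x)

def moving_average(values, window):
    """Return the mean of each full sliding window over values."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the element leaving the window
        if i >= window - 1:
            out.append(running / window)
    return out

print(discretize(0.5, [0.0, 1.0]))
print(moving_average([0.0, 0.0, 1.0, 1.0], window=2))
```

On the deterministic map, a smoothed return curve that climbs and then stays flat near its maximum suggests the greedy policy has stopped changing.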

6. Bridging to the Deep Version

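| Tabular | Deep |
| --- | --- |
| Q[s,a] | Q_net(obs)[a] |
| Whole table initialized up front | Network randomly initialized |
| Direct TD update | Replay + target net |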
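
To give a feel for the "Replay" entry in the table, here is a minimal sketch of an experience replay buffer using only the standard library. It illustrates the idea and is not the implementation used in the DQN note: the class name, method names, and the transition tuple layout are all assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up temporal correlation
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(2)
print(len(buf), len(batch))
```

The tabular loop updates from each transition once, as it arrives. A deep version would typically store transitions like this and train on sampled mini-batches instead.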

See also: RL-04-04 DQN Implementation


7. Summary

  • A tabular implementation ≈ a 2-D NumPy array + ε-greedy + a one-line TD update
  • Q-Learning bootstraps from the max over next actions; SARSA uses Q[s', a']
  • Next: DQN implementation