SARSA algorithm


SARSA is a reinforcement learning algorithm in the field of machine learning. Its name is an acronym of State–Action–Reward–State–Action.

SARSA was first proposed by G. A. Rummery and M. Niranjan in 1994 under the name "Modified Connectionist Q-Learning".^[1] The alternative name SARSA was proposed by Richard S. Sutton.^[2]

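SARSA differs from Q-learning mainly in how the Q-value, the estimate of expected reward, is updated. SARSA performs each update with the quintuple $(s_{t}, a_{t}, r_{t}, s_{t+1}, a_{t+1})$, where $s$, $a$ and $r$ are the state, action and reward of a Markov decision process (MDP), and $t$ and $t+1$ index the current step and the next step.^[3]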

Algorithm

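Initialize  $s_{t}$ ; based on the current  $Q$ , choose  $a_{t}$  using a given policy (e.g. ε-greedy)
repeat for each step of the episode
 Take action  $a_{t}$ , observe reward  $r_{t}$  and next state  $s_{t+1}$ 
 Based on the current  $Q$  and  $s_{t+1}$ , choose  $a_{t+1}$  using a given policy (e.g. ε-greedy)
  $Q^{new}(s_{t},a_{t})\leftarrow Q(s_{t},a_{t})+\alpha \,[r_{t}+\gamma \,Q(s_{t+1},a_{t+1})-Q(s_{t},a_{t})]$ 
  $s_{t}\leftarrow s_{t+1}$ ;  $a_{t}\leftarrow a_{t+1}$ 
until  $s_{t}$  is terminal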
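
The loop above can be sketched in Python as a tabular implementation. This is a minimal sketch, not a reference implementation: the five-state chain environment (`step`), the helper `choose_action`, and all default parameter values are hypothetical choices made for illustration. It also assumes the usual convention that a terminal state contributes no future value.

```python
import random

def choose_action(Q, state, epsilon, rng):
    """Illustrative ε-greedy selection over the Q-values of one state."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[state]))
    best = max(Q[state])
    # Break ties at random so an all-zero table does not always pick action 0.
    return rng.choice([a for a, q in enumerate(Q[state]) if q == best])

def step(state, action, n_states):
    """Hypothetical chain environment: action 1 moves right, action 0 moves left.
    Reaching the last state gives reward 1 and ends the episode."""
    if action == 1:
        next_state = min(state + 1, n_states - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == n_states - 1
    return (1.0 if done else 0.0), next_state, done

def sarsa(n_states=5, n_actions=2, episodes=500,
          alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        a = choose_action(Q, s, epsilon, rng)           # initial a_t
        done = False
        while not done:
            r, s_next, done = step(s, a, n_states)      # observe r_t, s_{t+1}
            a_next = choose_action(Q, s_next, epsilon, rng)  # choose a_{t+1}
            # Assumption: no bootstrapping from a terminal state.
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])       # SARSA update
            s, a = s_next, a_next                       # shift the quintuple
    return Q

if __name__ == "__main__":
    for state, row in enumerate(sarsa()):
        print(state, [round(q, 3) for q in row])
```

The action `a_next` chosen for the update is the same action that is executed on the next step, which is what makes the update depend on the full quintuple.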

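When the next action $a_{t+1}$ is chosen with the ε-greedy policy:

With probability ε, the next action is chosen at random
With probability 1-ε, the next action is the one that maximizes $Q(s_{t+1},a_{t+1})$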
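
The two cases above can be written as a small selection function. This is a sketch under assumptions: the name `epsilon_greedy`, the list-of-values input, and the random tie-breaking among equally good actions are illustrative choices, not something the source specifies.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index for one state.

    q_values: list of Q(s, a) for every action a in the state (assumed layout).
    epsilon:  exploration probability in [0, 1].
    rng:      the random module or a random.Random instance.
    """
    if rng.random() < epsilon:
        # Explore: any action, chosen uniformly.
        return rng.randrange(len(q_values))
    # Exploit: an action with the largest Q-value, ties broken at random.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

if __name__ == "__main__":
    rng = random.Random(0)
    picks = [epsilon_greedy([0.1, 0.7, 0.3], 0.2, rng) for _ in range(1000)]
    # Most picks should be action 1, the greedy action.
    print({a: picks.count(a) for a in range(3)})
```

With `epsilon=0` the function is purely greedy, and with `epsilon=1` it is purely random; values in between trade exploration against exploitation.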

In this algorithm, the hyperparameter $\alpha$ is the learning rate and $\gamma$ is the discount factor.

When updating $Q$, Q-learning uses $\max_{a}Q(s_{t+1},a)$ as its estimate of the next state's value, whereas SARSA uses $Q(s_{t+1},a_{t+1})$.^[4] Some optimizations proposed for Q-learning can also be applied to SARSA.^[5]
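
To make the difference concrete, the two update targets can be placed side by side. The function names, the dictionary-based Q-table, and the numbers below are made up for illustration only.

```python
def sarsa_target(Q, r, s_next, a_next, gamma):
    """SARSA bootstraps from the action actually chosen in the next state."""
    return r + gamma * Q[s_next][a_next]

def q_learning_target(Q, r, s_next, gamma):
    """Q-learning bootstraps from the best action in the next state."""
    return r + gamma * max(Q[s_next])

if __name__ == "__main__":
    # Hypothetical Q-values for a single next state with two actions.
    Q = {"s1": [0.2, 0.8]}
    r, gamma = 1.0, 0.9
    # Suppose exploration picked the non-greedy action 0 as a_{t+1}.
    print(sarsa_target(Q, r, "s1", 0, gamma))    # 1.0 + 0.9 * 0.2, about 1.18
    print(q_learning_target(Q, r, "s1", gamma))  # 1.0 + 0.9 * 0.8, about 1.72
```

The two targets coincide whenever the chosen $a_{t+1}$ is a greedy action, and differ only on exploratory steps.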

References

^ Rummery & Niranjan. "Online Q-Learning using Connectionist Systems" (1994). Retrieved 2022-07-14. Archived from the original on 2013-06-08.
^ Jeevanandam, Nivash. Underrated But Fascinating ML Concepts #5 – CST, PBWM, SARSA, & Sammon Mapping. Analytics India Magazine. 2021-09-13. Retrieved 2021-12-05. Archived from the original on 2021-12-05.
^ Richard S. Sutton and Andrew G. Barto. Sarsa: On-Policy TD Control. Reinforcement Learning: An Introduction. Retrieved 2022-07-14. Archived from the original on 2020-07-05.
^ Tingwu Wang. Tutorial of Reinforcement: A Special Focus on Q-Learning (PDF). cs.toronto. Retrieved 2022-07-14. Archived (PDF) from the original on 2022-07-14.
^ Wiering, Marco; Schmidhuber, Jürgen. Fast Online Q(λ) (PDF). Machine Learning. 1998-10-01, 33 (1): 105–115. Retrieved 2022-07-14. ISSN 0885-6125. S2CID 8358530. doi:10.1023/A:1007562800292. Archived (PDF) from the original on 2018-10-30.
