# Notation

### General rules

- Upper-case letters denote random events or random numbers, while lower-case letters denote deterministic events or deterministic variables.
- The serif typeface, such as $X$, denotes numerical values. The sans typeface, such as $\mathsfit{X}$, denotes events in general, which may be numerical or non-numerical.
- Bold letters denote vectors (such as $\mathbf{w}$) or matrices (such as $\mathbf{F}$). Matrices are always upper-case, even if they are deterministic.
- Calligraphic letters, such as $\mathcal{X}$, denote sets.
- Fraktur letters, such as $\mathfrak{f}$, denote mappings.
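
As a minimal illustration of these rules, consider one way to write the action value of a policy using the symbols from the table below. The time index $t$ and the subscript $\pi$ on the expectation are assumptions of this example and are not introduced in this section.

$$q_ \pi(\mathsfit{s},\mathsfit{a}) \stackrel{\text{def}}{=} \text{E}_ \pi\left[G_ t \middle\vert \mathsfit{S}_ t=\mathsfit{s}, \mathsfit{A}_ t=\mathsfit{a}\right]$$

Here $G_ t$ is upper-case serif because the return is a random number. $\mathsfit{S}_ t$ and $\mathsfit{A}_ t$ are upper-case sans because the state and the action are random events that need not be numerical. $\mathsfit{s}$, $\mathsfit{a}$, and $q_ \pi(\mathsfit{s},\mathsfit{a})$ are lower-case because they are deterministic.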

### Table

The following table lists the notation used throughout the book. Other notation is occasionally defined locally where it is used.

| English Letters | Description |
| :---: | --- |
| $A$, $a$ | advantage |
| $\mathsfit{A}$, $\mathsfit{a}$ | action |
| $\mathcal{A}$ | action space |
| $B$, $b$ | baseline in policy gradient; numerical belief in partially observable tasks; (lower case only) bonus; behavior policy in off-policy learning |
| $\mathsfit{B}$, $\mathsfit{b}$ | belief in partially observable tasks |
| $\mathfrak{B}_ \pi$, $\mathfrak{b}_ \pi$ | Bellman expectation operator of policy $\pi$ (upper case only used in distributional RL) |
| $\mathfrak{B}_ \ast$, $\mathfrak{b}_ \ast$ | Bellman optimality operator (upper case only used in distributional RL) |
| $\mathcal{B}$ | a batch of transitions generated by experience replay; belief space in partially observable tasks |
| $\mathcal{B}^+$ | belief space with terminal belief in partially observable tasks |
| $c$ | counting; coefficients in linear programming |
| $\text{Cov}$ | covariance |
| $d$, $d_ \infty$ | metrics |
| $d_ f$ | $f$-divergence |
| $d_ \text{KL}$ | KL divergence |
| $d_ \text{JS}$ | JS divergence |
| $d_ \text{TV}$ | total variation |
| $D_ t$ | indicator of episode end |
| $\mathcal{D}$ | set of experiences |
| $\mathrm{e}$ | the constant $\mathrm{e}$ ( $\approx2.72$ ) |
| $e$ | eligibility trace |
| $\text{E}$ | expectation |
| $\mathfrak{f}$ | a mapping |
| $\mathbf{F}$ | Fisher information matrix |
| $G$, $g$ | return |
| $\mathbf{g}$ | gradient vector |
| $h$ | action preference |
| $\text{H}$ | entropy |
| $\mathbf{I}$ | identity matrix |
| $k$ | index of iteration |
| $\ell$ | loss |
| $\mathbb{N}$ | set of natural numbers |
| $o$ | observation probability in partially observable tasks; little-o in asymptotic notation |
| $O$, $\tilde{O}$ | big-O and soft-O in asymptotic notation |
| $\mathsfit{O}$, $\mathsfit{o}$ | observation |
| $\mathcal{O}$ | observation space |
| $\mathbf{O}$ | zero matrix |
| $p$ | probability, dynamics |
| $\mathbf{P}$ | transition matrix |
| $\Pr$ | probability |
| $Q$, $q$ | action value |
| $Q_ \pi$, $q_ \pi$ | action value of policy $\pi$ (upper case only used in distributional RL) |
| $Q_ \ast$, $q_ \ast$ | optimal action values (upper case only used in distributional RL) |
| $\mathbf{q}$ | vector representation of action values |
| $R$, $r$ | reward |
| $\mathcal{R}$ | reward space |
| $\mathbb{R}$ | set of real numbers |
| $\mathsfit{S}$, $\mathsfit{s}$ | state |
| $\mathcal{S}$ | state space |
| $\mathcal{S}^+$ | state space with terminal state |
| $T$ | number of steps in an episode |
| $\mathsfit{T}$, $\Tiny\mathsfit{T}$ | trajectory |
| $\mathcal{T}$ | time index set |
| $\mathfrak{u}$ | belief update operator in partially observable tasks |
| $U$, $u$ | TD target; (lower case only) upper bound |
| $V$, $v$ | state value |
| $V_ \pi$, $v_ \pi$ | state value of the policy $\pi$ (upper case only used in distributional RL) |
| $V_ \ast$, $v_ \ast$ | optimal state values (upper case only used in distributional RL) |
| $\mathbf{v}$ | vector representation of state values |
| $\text{Var}$ | variance |
| $\mathbf{w}$ | parameters of value function estimate |
| $\mathsfit{X}$, $\mathsfit{x}$ | an event |
| $\mathcal{X}$ | event space |
| $\mathbf{z}$ | parameters for eligibility trace |
| **Greek Letters** | **Description** |
| $\alpha$ | learning rate |
| $\beta$ | reinforcement strength in eligibility trace; distortion function in distributional RL |
| $\gamma$ | discount factor |
| $\mathit\Delta$, $\delta$ | TD error |
| $\varepsilon$ | parameters for exploration |
| $\eta$ | state visitation frequency |
| $\boldsymbol\upeta$ | vector representation of state visitation frequency |
| $\lambda$ | decay strength of eligibility trace |
| $\boldsymbol\uptheta$ | parameters for policy function estimates |
| $\vartheta$ | threshold for numerical iteration |
| $\uppi$ | the constant $\uppi$ ( $\approx3.14$ ) |
| $\mathit\Pi$, $\pi$ | policy |
| $\pi_ \ast$ | optimal policy |
| $\pi_ \text{E}$ | expert policy in imitation learning |
| $\rho$ | state–action visitation frequency; importance sampling ratio in off-policy learning |
| $\boldsymbol\uprho$ | vector representation of state–action visitation frequency |
| $\huge\tau$, $\tau$ | sojourn time of SMDP |
| $\phi$ | quantile in distributional RL |
| $\mathit\Psi$ | Generalized Advantage Estimate (GAE) |
| $\mathit\Omega$, $\omega$ | cumulative probability in distributional RL; (lower case only) conditional probability for partially observable tasks |
| **Other Notations** | **Description** |
| $\mathbf{0}$ | zero vector |
| $\mathbf{1}$ | a vector all of whose entries are one |
| $\stackrel{\text{a.e.}}{=}$ | equal almost everywhere |
| $\stackrel{\text{d}}{=}$ | share the same distribution |
| $\stackrel{\text{def}}{=}$ | defined as |
| $\leftarrow$ | assign |
| $\lt$, $\le$, $\ge$, $\gt$ | compare numbers; element-wise comparison |
| $\prec$, $\preccurlyeq$, $\succcurlyeq$, $\succ$ | partial order comparison |
| $\ll$ | absolutely continuous |
| $\varnothing$ | empty set |
| $\nabla$ | gradient |
| $\sim$ | obey a distribution; utility equivalence in distributional RL |
| $\left\|\quad\right\|$ | absolute value of a real number; element-wise absolute values of a vector or a matrix; the number of elements in a set |
