[{"slug":"state-space-model","title":"State Space Model","tags":["Statistics","TimeSeriesAnalysis"],"content":"State space models (SSM) describe the evolution of unobserved internal states via a state equation, and link those states to observed outputs via an observation equation. They form a core mathematical framework for forecasting and filtering in dynamic systems. Background A dynamical system can usually be described by three components: Observable system output $y$ Latent (unobservable) system state $x$ Optional system input $u$ Two common examples: View a thermometer as a system: heat is the input, air temperature is the state, and the liquid level is the output. View a micrometer as a system: the true length of the object is the state, and the micrometer reading is the output. Real systems are subject to errors, typically split into: System (process) error, related to model accuracy. For example, the thermal expansion coefficient of the liquid in a thermometer can only be approximated with finite precision. Measurement (observation) error, related to measurement precision. For example, a micrometer can only approximate the actual length of an object. To obtain accurate measurements, we often repeat observations and average them to get a more precise estimate. Assume that at time $t$ the input is the measurement $u_t$, and that the running mean $x_t$ serves as both the system state and the output $y_t$. Estimation of the true value can then be written as an iterative procedure: $$ y_t = x_t = x_{t-1} + \\frac{1}{t}(u_t - x_{t-1}) $$ Each iteration refines the state estimate. This process has a very useful property: the evolution of the state $x_t$ depends only on the previous state $x_{t-1}$, and is independent of the earlier states $x_{t-2},\\dots,x_0$ (the Markov property). 
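As a quick sanity check, the averaging recursion can be implemented in a few lines (a minimal sketch; the function name is illustrative):

```python
# Recursive estimate of a constant true value from noisy inputs u_t:
# x_t = x_{t-1} + (1/t) * (u_t - x_{t-1}), so x_t is the running mean.
def running_mean(inputs):
    x = 0.0
    estimates = []
    for t, u in enumerate(inputs, start=1):
        x = x + (u - x) / t   # depends only on the previous state x_{t-1}
        estimates.append(x)
    return estimates

# The final estimate equals the ordinary sample mean.
print(running_mean([1.0, 2.0, 3.0, 4.0])[-1])  # 2.5
```

Each update needs only the previous estimate, not the full history, which is exactly the Markov structure exploited by state space models.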
The above process can be written as a linear discrete state space system: $$ \\begin{array}{ll} x_t = A_tx_{t-1} + B_tu_{t-1}+ \\varepsilon_t, &amp; \\varepsilon_t\\sim \\mathcal N(0,\\sigma_\\varepsilon^2) \\\\ y_t = C_tx_{t} + \\eta_t, &amp; \\eta_t\\sim \\mathcal N(0,\\sigma_\\eta^2)\\end{array} $$ State equation (system dynamics): Transition matrix $A_t$ maps the current state $x_{t-1}$ to the next state $x_t$. Input matrix $B_t$ determines how the input $u_{t-1}$ affects state transitions. Process noise $\\varepsilon_t$ represents the discrepancy between the estimated and true states. Observation equation (measurement): Measurement matrix $C_t$ maps the state $x_t$ to the observable output $y_t$. Measurement noise $\\eta_t$ represents the error introduced in this mapping. With a state space model we can treat three typical problems: Forecasting: given observations $y_1,\\dots,y_t$, predict future states $x_{t+n\\mid t}$. Filtering: given $y_1,\\dots,y_t$, estimate the current state $x_{t\\mid t}$ using all information up to $t$. Smoothing: given $y_1,\\dots,y_t$, reconstruct past states $x_{t-n\\mid t}$ using both past and future information. Previously we introduced ARMA models. They can be written in state space form in (at least) three equivalent ways. 
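The scalar version of this system is easy to simulate (a minimal NumPy sketch; the function name and parameters are illustrative, not part of any library):

```python
import numpy as np

def simulate_ssm(A, B, C, u, sigma_eps, sigma_eta, x0=0.0, seed=0):
    # x_t = A x_{t-1} + B u_{t-1} + eps_t,  y_t = C x_t + eta_t  (scalar case)
    rng = np.random.default_rng(seed)
    x = x0
    xs, ys = [], []
    for u_prev in u:
        x = A * x + B * u_prev + rng.normal(0.0, sigma_eps)  # state equation
        y = C * x + rng.normal(0.0, sigma_eta)               # observation equation
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

states, obs = simulate_ssm(A=0.9, B=0.1, C=1.0, u=np.ones(100),
                           sigma_eps=0.1, sigma_eta=0.3)
```

Setting both noise scales to zero recovers the deterministic dynamics, which is a convenient way to test the transition and measurement matrices.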
Hamilton form State dimension: $r = \\max(p,q+1)$ State vector: $\\boldsymbol x_t=\\begin{bmatrix}x_t&amp;x_{t-1}&amp;\\cdots&amp;x_{t-r+1}\\end{bmatrix}$ State equation: $$ \\boldsymbol x_t=\\begin{bmatrix} \\phi_1&amp;\\phi_2&amp;\\cdots&amp;\\phi_{r-1}&amp;\\phi_{r}\\\\ 1&amp;0&amp;\\cdots&amp;0&amp;0\\\\ 0&amp;1&amp;\\cdots&amp;0&amp;0\\\\ \\vdots&amp;\\vdots&amp;\\ddots&amp;\\vdots&amp;\\vdots\\\\ 0&amp;0&amp;\\cdots&amp;1&amp;0 \\end{bmatrix}\\boldsymbol x_{t-1} + \\begin{bmatrix}\\varepsilon_{t}\\\\0\\\\0\\\\\\vdots\\\\0\\end{bmatrix} $$ Observation equation: $$ y_t = \\begin{bmatrix}1&amp;\\theta_1&amp;\\theta_2&amp;\\cdots&amp;\\theta_{r-1}\\end{bmatrix}\\boldsymbol x_t $$ The state equation is an AR process: $\\phi(B)x_t=\\varepsilon_t$. The observation equation is an ARMA process: $$y_t=\\theta(B)x_t\\ \\Rightarrow\\ \\phi(B)y_t=\\theta(B)\\varepsilon_t$$ Harvey form State dimension: $r = \\max(p,q+1)$ State vector: $\\boldsymbol x_t=\\begin{bmatrix}x_{t,1}&amp;x_{t,2}&amp;\\cdots&amp;x_{t,r}\\end{bmatrix}$ State equation: $$ \\boldsymbol x_t=\\begin{bmatrix} \\phi_1&amp;1&amp;0&amp;\\cdots&amp;0\\\\ \\phi_2&amp;0&amp;1&amp;\\cdots&amp;0\\\\ \\vdots&amp;\\vdots&amp;\\vdots&amp;\\ddots&amp;\\vdots\\\\ \\phi_{r-1}&amp;0&amp;0&amp;\\cdots&amp;1\\\\ \\phi_r&amp;0&amp;0&amp;\\cdots&amp;0 \\end{bmatrix}\\boldsymbol x_{t-1} + \\varepsilon_{t}\\begin{bmatrix}1\\\\\\theta_1\\\\\\theta_2\\\\\\vdots\\\\\\theta_{r-1}\\end{bmatrix} $$ Observation equation: $$ y_t = \\begin{bmatrix}1&amp;0&amp;0&amp;\\cdots&amp;0\\end{bmatrix}\\boldsymbol x_t $$ Expanding the state equation recovers the ARMA representation, for example: $x_{t,r}=\\phi_rx_{t-1,1}+\\theta_{r-1}\\varepsilon_t$ $x_{t,r-1}=\\phi_{r-1}x_{t-1,1}+x_{t-1,r}+\\theta_{r-2}\\varepsilon_t = \\phi_{r-1}x_{t-1,1}+(\\phi_rx_{t-2,1}+\\theta_{r-1}\\varepsilon_{t-1})+\\theta_{r-2}\\varepsilon_t$ ... 
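The Harvey-form system matrices can be assembled mechanically from the ARMA coefficients (a hypothetical helper, shown only to make the block structure concrete):

```python
import numpy as np

def harvey_matrices(phi, theta):
    # Build T (transition), R (selection) and Z (design) for an ARMA(p, q)
    # model in Harvey form, with state dimension r = max(p, q + 1).
    p, q = len(phi), len(theta)
    r = max(p, q + 1)
    T = np.zeros((r, r))
    T[:p, 0] = phi                       # first column holds phi_1..phi_p
    T[:-1, 1:] = np.eye(r - 1)           # superdiagonal identity block
    R = np.zeros((r, 1))
    R[0, 0] = 1.0
    R[1:q + 1, 0] = theta                # shocks enter weighted by theta
    Z = np.zeros((1, r))
    Z[0, 0] = 1.0                        # observation picks the first state
    return T, R, Z

T, R, Z = harvey_matrices([0.5, -0.2], [0.3])
```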
Canonical form An ARMA model can be expressed via its Green&#39;s function representation: $$ y_t = \\sum_{i=0}^\\infty\\psi_i\\varepsilon_{t-i} $$ which describes the response of the system to white-noise shocks. Here $$ \\psi_i= \\begin{cases} 1,&amp;i=0\\\\ \\theta_i+\\sum_{j=1}^{\\min(p,i)}\\phi_j\\psi_{i-j},&amp;i\\ge1 \\end{cases} $$ are the impulse response coefficients, i.e. how a one‑time shock at $t-i$ affects future observations. The complementary function characterizes the intrinsic behaviour of the ARMA model with external noise removed (long‑run trend, seasonality, etc.): $$ C_t(l)=y_{t+l}-\\sum\\nolimits_{j=0}^{l-1}\\psi_j\\varepsilon_{t+l-j}=C_{t-1}(l+1)+\\psi_l\\varepsilon_t $$ We can use $C_t(l)$ as the initial value for forecasting. At time $t-1$, the $n$‑step ahead forecast is $\\hat y_{t-1+n\\mid t-1} = C_{t-1}(n)$. This leads to the following state space system: State dimension: $r = \\max(p,q)$ State vector: $\\boldsymbol x_t=\\boldsymbol{\\hat y_t} =\\begin{bmatrix}C_t(0)&amp;C_t(1)&amp;\\cdots&amp;C_t(r-1)\\end{bmatrix}$ State equation: $$ \\boldsymbol x_t=\\begin{bmatrix} 0&amp;1&amp;0&amp;\\cdots&amp;0\\\\ 0&amp;0&amp;1&amp;\\cdots&amp;0\\\\ \\vdots&amp;\\vdots&amp;\\vdots&amp;\\ddots&amp;\\vdots\\\\ 0&amp;0&amp;0&amp;\\cdots&amp;1\\\\ \\phi_r&amp;\\phi_{r-1}&amp;\\phi_{r-2}&amp;\\cdots&amp;\\phi_{1} \\end{bmatrix}\\boldsymbol x_{t-1} + \\varepsilon_{t}\\begin{bmatrix}1\\\\\\psi_1\\\\\\psi_2\\\\\\vdots\\\\\\psi_{r-1}\\end{bmatrix} $$ Observation equation: $$ y_t = \\begin{bmatrix}1&amp;0&amp;0&amp;\\cdots&amp;0\\end{bmatrix}\\boldsymbol x_t $$ This form gives a compact expression for forecasting: $\\hat x_{t\\mid t-1} = \\hat y_t = y_t - \\varepsilon_t = \\sum_{i=0}^\\infty\\psi_i\\varepsilon_{t-i} - \\varepsilon_t=\\sum\\nolimits_{i=1}^\\infty\\psi_i\\varepsilon_{t-i}$ $x_{t+1} = \\varepsilon_{t+1} + \\psi_1\\varepsilon_{t} + \\psi_2\\varepsilon_{t-1} + \\psi_3\\varepsilon_{t-2} + \\cdots$ which can be rearranged as 
$$x_{t+1}=\\varepsilon_{t+1}+\\psi_1\\varepsilon_t+\\phi\\hat x_{t\\mid t-1}$$ $\\hat x_{t+1\\mid t}=\\phi\\hat x_{t\\mid t-1}+\\psi_1\\varepsilon_t$ $x_t = \\hat x_{t\\mid t-1} + \\varepsilon_t$ In practice, Hamilton and Harvey forms are more commonly used for implementation. The main differences are: In Harvey form, intercepts appear in the state equation; in Hamilton form, intercepts appear in the observation equation. Harvey form naturally supports integration/differencing; Hamilton form is typically applied to already differenced (stationary) series. Most modern references present ARMA state space models in Harvey form; we follow that convention here. Kalman Filter Linear projection theorem In probability and statistics, the linear projection of one random vector onto the span of another can be understood as its “shadow” in that space: the best linear approximation of $x$ using $y$. Given random vectors $x$ and $y$ (possibly living in different vector spaces), we look for the best linear approximation $$E(x\\mid y) \\approx \\beta + \\gamma y$$ in some squared‑error sense. This projection retains as much information about $x$ as possible within the linear span of $y$. Linear projections underpin: Regression analysis: projecting observations onto regressors to build prediction models. Feature extraction: projecting high‑dimensional features into lower‑dimensional subspaces. Signal processing: denoising or reconstructing signals. For a jointly normal vector $$ E\\begin{pmatrix}x\\\\y\\end{pmatrix}=\\begin{pmatrix}\\mu_x\\\\\\mu_y\\end{pmatrix}, \\quad Var\\begin{pmatrix}x\\\\y\\end{pmatrix}=\\begin{pmatrix}\\Sigma_{xx}&amp;\\Sigma_{xy}\\\\\\Sigma_{yx}&amp;\\Sigma_{yy}\\end{pmatrix} $$ the conditional distribution is still normal with $E(x\\mid y) = \\mu_x+ \\Sigma_{xy}\\Sigma_{yy}^{-1}(y-\\mu_y)$ $Var(x\\mid y) = \\Sigma_{xx} -\\Sigma_{xy}\\Sigma_{yy}^{-1}\\Sigma_{xy}&#39;$ This is the key identity used in Kalman filtering. 
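The projection identity translates directly into a few lines of NumPy (a small sketch; the function name is illustrative, and a linear solve is used in place of an explicit inverse):

```python
import numpy as np

def normal_projection(mu_x, mu_y, Sxx, Sxy, Syy, y):
    # E(x|y) and Var(x|y) for jointly normal (x, y) -- the identity behind
    # the Kalman update step.
    gain = np.linalg.solve(Syy.T, Sxy.T).T        # equals Sxy @ inv(Syy)
    mean = mu_x + gain @ (y - mu_y)
    cov = Sxx - gain @ Sxy.T
    return mean, cov

mean, cov = normal_projection(
    mu_x=np.array([0.0]), mu_y=np.array([0.0]),
    Sxx=np.array([[1.0]]), Sxy=np.array([[0.8]]), Syy=np.array([[1.0]]),
    y=np.array([2.0]))
# With unit variances and correlation 0.8: E(x|y=2) = 1.6, Var(x|y) = 0.36
```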
State space formulation For time series we typically use the following linear Gaussian state space model: $$ \\begin{array}{llll} y_t &amp; = Z_t \\alpha_t + d_t + \\varepsilon_t, &amp; \\varepsilon_t\\sim \\mathcal N(0,H_t), \\\\[4pt] \\alpha_t &amp; = T_t \\alpha_{t-1} + c_t + R_t \\eta_t, &amp; \\eta_t\\sim \\mathcal N(0,Q_t), \\\\[4pt] &amp; &amp; \\alpha_1\\sim \\mathcal N(a_1,P_1). \\end{array} $$ Measurement equation $y_t$: observed vector, $p\\times 1$ $d_t$: measurement intercept, $p\\times 1$ $\\varepsilon_t$: measurement disturbance, $p\\times 1$ $Z_t$: design matrix, $p\\times m$ $H_t$: observation covariance, $p\\times p$ State equation $\\alpha_t$: state vector, $m\\times 1$ $c_t$: state intercept, $m\\times 1$ $\\eta_t$: state disturbance, $r\\times 1$ $T_t$: transition matrix, $m\\times m$ $R_t$: selection matrix, $m\\times r$ $Q_t$: state disturbance covariance, $r\\times r$ Initialization $a_1$: prior mean of the state, $m\\times 1$ $P_1$: prior covariance of the state, $m\\times m$ The additional matrix $R_t$ selects which components of the disturbance vector $\\eta_t$ actually enter the state equation: if we regard $\\eta_t$ as the full set of potential shocks, then $R_t\\eta_t$ is the subset that drives the state. To ensure a valid covariance $Var(R_t\\eta_t) = R_tQ_tR_t&#39;$, we require $Q_t$ to be positive semidefinite and $R_t$ to have full column rank. We typically treat the initial state mean $a_1$ and covariance $P_1$ as given (or estimated via separate initialization schemes described later). If $y_t$ is a linear function of $\\alpha_1$, $\\{\\varepsilon_t\\}$ and $\\{\\eta_t\\}$, then the system is linear. To keep the model linear at all times, the system matrices $Z_t,d_t,H_t,T_t,c_t,R_t,Q_t$ must be deterministic; they may vary with $t$, but only in a pre‑specified way. If these matrices are constant over time, the system is time‑invariant. A stationary AR process is a special case of a time‑invariant state space model. 
Kalman filter The Kalman filter is an iterative algorithm that provides the optimal (minimum mean‑square error) estimate of the state vector given all information up to time $t$. Let $Y_{t-1} = \\{y_{t-1},\\dots,y_1\\}$ denote the information set available before time $t$ (for $t\\ge2$); at $t=1$, $Y_0=\\emptyset$. The filter constructively builds the conditional distributions of $\\alpha_t$ and $y_t$ via forward recursion, relying on two model assumptions: Conditional independence of the observation: $p(y_t\\mid \\alpha_t) = p(y_t\\mid \\alpha_1,\\dots,\\alpha_t, Y_{t-1})$. Markov state transition: $p(\\alpha_{t+1}\\mid \\alpha_t) = p(\\alpha_{t+1}\\mid \\alpha_1,\\dots,\\alpha_t, Y_t)$. We aim to compute, given $Y_t$: Conditional mean of the state: $a_{t\\mid t} = E(\\alpha_t\\mid Y_t)$ Conditional covariance: $P_{t\\mid t} = Var(\\alpha_{t}\\mid Y_t)$ One‑step ahead state prediction: $a_{t+1} = E(\\alpha_{t+1}\\mid Y_t)$ Its covariance: $P_{t+1} = Var(\\alpha_{t+1}\\mid Y_t)$ When $y_t$ is not yet observed, the optimal prediction of the observation is $$ E(y_t\\mid Y_{t-1})=E(Z_t\\alpha_t+d_t+\\varepsilon_t\\mid Y_{t-1})=Z_ta_t+d_t. $$ Once $y_t$ is observed, the prediction error (innovation) is $$ v_t = y_t - E(y_t\\mid Y_{t-1})=y_t-Z_ta_t-d_t. $$ This is the one‑step‑ahead forecast error of $y_t$ given $Y_{t-1}$. It satisfies $E(v_t\\mid Y_{t-1})=0$ $Cov(y_j,v_t)=0$ for $j\\le t-1$ Unconditionally, $E(v_t)=0$. By the linear projection theorem, we have $$ \\begin{aligned} a_{t\\mid t} &amp;=E(\\alpha_t\\mid Y_t) \\\\ &amp;= E(\\alpha_t\\mid v_t,Y_{t-1}) \\\\ &amp;= E(\\alpha_t\\mid Y_{t-1})+Cov(\\alpha_t,v_t\\mid Y_{t-1})Var(v_t\\mid Y_{t-1})^{-1}(v_t-E(v_t\\mid Y_{t-1})). \\end{aligned} $$ Similarly, $$ \\begin{aligned} P_{t\\mid t} &amp;=Var(\\alpha_t\\mid Y_t) \\\\ &amp;=Var(\\alpha_t\\mid Y_{t-1})-Cov(\\alpha_t,v_t\\mid Y_{t-1})Var(v_t\\mid Y_{t-1})^{-1}Cov(\\alpha_t,v_t\\mid Y_{t-1})&#39;. 
\\end{aligned} $$ Define $Cov(\\alpha_t,v_t\\mid Y_{t-1})=E[\\alpha_t(Z_t\\alpha_t + d_t + \\varepsilon_t-Z_ta_t-d_t)&#39;\\mid Y_{t-1}] = E[\\alpha_t(\\alpha_t-a_t)&#39;Z_t&#39;\\mid Y_{t-1}]=P_tZ_t&#39;$ $F_t = Var(v_t\\mid Y_{t-1}) = Z_tP_tZ_t&#39;+H_t$ Then $a_{t\\mid t}=a_t + P_tZ_t&#39;F_t^{-1}v_t$ $P_{t\\mid t} = P_t-P_tZ_t&#39;F_t^{-1}Z_tP_t$ For the prediction step, $a_{t+1} = E(T_t\\alpha_t + c_t +R_t\\eta_t\\mid Y_t)=T_tE(\\alpha_t\\mid Y_t)+ c_t=T_ta_{t\\mid t}+c_t$ $P_{t+1} = Var(T_t\\alpha_t + c_t + R_t\\eta_t\\mid Y_t)=T_tP_{t\\mid t}T_t&#39;+R_tQ_tR_t&#39;$ It is common to denote the Kalman gain as $$K_t=P_tZ_t&#39;F_t^{-1}.$$ Then $a_{t\\mid t}=a_t+K_tv_t$ and $a_{t+1}=T_ta_{t\\mid t}+c_t$. Collecting the main recursions: Innovation and its variance $v_t=y_t-Z_ta_t-d_t$ $F_t=Z_tP_tZ_t&#39;+H_t$ Update step $a_{t\\mid t}=a_t + K_tv_t$ $P_{t\\mid t} = P_t-P_tZ_t&#39;F_t^{-1}Z_tP_t$ Prediction step $a_{t+1}=T_ta_{t\\mid t}+c_t $ $P_{t+1}=T_tP_{t\\mid t}T_t&#39; + R_tQ_tR_t&#39;$ To improve numerical stability it is common to: Use the Joseph form for the covariance update: $$P_{t\\mid t}=(I-K_tZ_t)P_t(I-K_tZ_t)&#39;+K_tH_tK_t&#39;.$$ Enforce symmetry of $P_{t+1}$ by replacing it with $(P_{t+1} + P_{t+1}&#39;)/2$. The prediction error $v_t$ is central: the larger the discrepancy between $y_t$ and its forecast $E(y_t\\mid Y_{t-1})$, the larger the correction to the state estimate. Computationally, the Kalman filter is often much cheaper than direct least squares on stacked data. For a regression with $t$ observations and observation dimension $p$, OLS in one shot inverts a $pt\\times pt$ matrix, whereas the Kalman filter only inverts $t$ matrices of size $p\\times p$ (often with $p=1$). If the system matrices $Z_t,H_t,T_t,R_t,Q_t$ are constant, the sequence $P_t$ converges to a steady‑state covariance $\\bar P$ that solves $$ \\bar P = T\\bar PT&#39; -T\\bar PZ&#39;\\bar F^{-1}Z\\bar PT&#39; + RQR&#39;,\\quad \\bar F = Z\\bar PZ&#39;+H . 
$$ Using $\\bar P$ directly can greatly reduce computation in long series. Define the state prediction error $x_t = \\alpha_t - a_t$; its covariance is $Var(x_t)=P_t$. From $$ v_t = Z_t\\alpha_t+d_t+\\varepsilon_t-(Z_ta_t+d_t) = Z_tx_t+\\varepsilon_t $$ and the prediction recursion we obtain $$ x_{t+1} = \\alpha_{t+1}-a_{t+1}=L_tx_t+R_t\\eta_t-T_tK_t\\varepsilon_t,\\quad L_t=T_t(I-K_tZ_t). $$ The pair $(x_t,v_t)$ is sometimes called the innovation analogue of the state space model, and we can show $$ P_{t+1}=T_tP_tL_t&#39;+R_tQ_tR_t&#39;. $$ Because the innovations $v_t$ are Gaussian and serially independent, the joint density of the observations factorizes as $$ p(y_1,\\dots,y_n)=p(y_1)\\prod\\nolimits_{t=2}^np(y_t\\mid Y_{t-1}) $$ or, equivalently, $$ p(v_1,\\dots,v_n)=\\prod\\nolimits_{t=1}^np(v_t), $$ with $v_t$ independent of $Y_{t-1}$. State smoothing Smoothing problems come in three flavours: Fixed‑interval: with the sample $y_1,\\dots,y_n$ fixed, estimate all states $E(\\alpha_t\\mid Y_n)$, $t=1,\\dots,n$. Fixed‑point: keep $t$ fixed and update the estimate $\\hat\\alpha_{t\\mid n}=E(\\alpha_t\\mid Y_n)$ as more data arrive ($n&gt;t$). Fixed‑lag: re‑estimate the state a fixed lag behind the newest observation, $\\hat\\alpha_{n-j\\mid n}=E(\\alpha_{n-j}\\mid Y_n)$ for $n&gt;j$. Given $Y_n$ (with $n\\ge t$), we want Smoothed mean: $\\hat \\alpha_t = E(\\alpha_t\\mid Y_n)$ Smoothed covariance: $V_t = Var(\\alpha_{t}\\mid Y_n)$ Define $r_t$ as a weighted sum of future innovations $v_{t+1},\\dots,v_n$: $$ r_t=Z_{t+1}&#39;F_{t+1}^{-1}v_{t+1}+L_{t+1}&#39;Z_{t+2}&#39;F_{t+2}^{-1}v_{t+2}+\\cdots+L_{t+1}&#39;\\cdots L_{n-1}&#39;Z_n&#39;F_n^{-1}v_n. $$ Conditioning on $Y_n$, the past information $Y_{t-1}$ is fixed, and we can write $$ \\hat \\alpha_t = a_t+P_tZ_t&#39;F_t^{-1}v_t+P_tL_t&#39;Z_{t+1}&#39;F_{t+1}^{-1}v_{t+1}+\\cdots+P_tL_t&#39;\\cdots L_{n-1}&#39;Z_n&#39;F_n^{-1}v_n. 
$$ This leads to the backward recursions (with $r_n=0$): $r_{t-1}=Z_t&#39;F_t^{-1}v_t+L_t&#39;r_t$ $\\hat \\alpha_t = a_t+P_tr_{t-1}$ Similarly, let $N_t$ collect second‑order terms (with $N_n=0$): $N_{t-1}=Z_t&#39;F_t^{-1}Z_t+L_t&#39;N_tL_t$ $V_t = P_t-P_tN_{t-1}P_t$ These four equations are the classic state smoothing recursions. They run backwards in time and complement the forward Kalman filter. In implementation we usually cache $v_t, F_t, K_t, a_t, P_t$ during filtering, then reuse them in smoothing. One can also recompute $v_t, F_t, K_t$ from $a_t, P_t$ if memory is a concern. Disturbance smoothing Given $Y_n$ we may also want smoothed estimates of the disturbances $\\varepsilon_t$ and $\\eta_t$: $\\hat \\varepsilon_t = E(\\varepsilon_t\\mid Y_{t-1},v_t,\\dots,v_n)$ $Var(\\varepsilon_t\\mid Y_n)$ $\\hat\\eta_t = E(\\eta_t\\mid Y_{t-1},v_t,\\dots,v_n)$ $Var(\\eta_t\\mid Y_n)$ Using linear projection identities, these can be expressed in terms of the smoothed quantities. A convenient form uses the auxiliary sequences (written with the prediction‑form gain $\\bar K_t=T_tK_t$) $$u_t = F_t^{-1}v_t-\\bar K_t&#39;r_t,\\qquad D_t = F_t^{-1}+\\bar K_t&#39;N_t\\bar K_t.$$ Then, for $t=n,\\dots,1$: Disturbance in the observation equation $\\hat \\varepsilon_t = H_tu_t$ $Var(\\varepsilon_t\\mid Y_n) = H_t-H_tD_tH_t$ and with the additional backward recursions $$ \\begin{aligned} r_{t-1}&amp;=Z_t&#39;u_t+T_t&#39;r_t,\\\\ N_{t-1}&amp;=Z_t&#39;D_tZ_t+T_t&#39;N_tT_t-Z_t&#39;\\bar K_t&#39;N_tT_t-T_t&#39;N_t\\bar K_tZ_t, \\end{aligned} $$ we obtain, again for $t=n,\\dots,1$: Disturbance in the state equation $\\hat \\eta_t = Q_tR_t&#39;r_t$ $Var(\\eta_t\\mid Y_n) = Q_t-Q_tR_t&#39;N_tR_tQ_t$ Disturbance smoothing is especially useful for diagnostic checking and for computing exact likelihoods in diffuse initialization. 
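Collecting the filtering and state-smoothing recursions, a minimal time-invariant implementation might look as follows (an illustrative NumPy sketch using the contemporaneous gain $K_t=P_tZ_t&#39;F_t^{-1}$ and $L_t=T(I-K_tZ_t)$; intercepts are omitted for brevity):

```python
import numpy as np

def kalman_filter_smoother(y, Z, H, T, R, Q, a1, P1):
    # Forward Kalman filter and backward state smoother for the model
    #   y_t = Z a_t + eps_t,   a_{t+1} = T a_t + R eta_t   (time invariant).
    n = y.shape[0]
    m = a1.shape[0]
    a, P = a1.copy(), P1.copy()
    av, Pv, vv, Fv, Lv = [], [], [], [], []
    for t in range(n):
        v = y[t] - Z @ a                    # innovation v_t
        F = Z @ P @ Z.T + H                 # innovation variance F_t
        K = P @ Z.T @ np.linalg.inv(F)      # contemporaneous gain
        av.append(a); Pv.append(P); vv.append(v); Fv.append(F)
        L = T @ (np.eye(m) - K @ Z)         # L_t = T (I - K_t Z)
        Lv.append(L)
        a = T @ (a + K @ v)                 # update, then predict a_{t+1}
        P = T @ (P - K @ F @ K.T) @ T.T + R @ Q @ R.T
    r, N = np.zeros(m), np.zeros((m, m))    # r_n = 0, N_n = 0
    alpha_hat = np.zeros((n, m))
    V = np.zeros((n, m, m))
    for t in reversed(range(n)):
        Finv = np.linalg.inv(Fv[t])
        r = Z.T @ Finv @ vv[t] + Lv[t].T @ r      # r_{t-1}
        N = Z.T @ Finv @ Z + Lv[t].T @ N @ Lv[t]  # N_{t-1}
        alpha_hat[t] = av[t] + Pv[t] @ r
        V[t] = Pv[t] - Pv[t] @ N @ Pv[t]
    return alpha_hat, V
```

The forward pass caches $v_t, F_t, L_t, a_t, P_t$ exactly so the backward pass can reuse them, as discussed above.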
Initialization Reference implementation (for example): https://github.com/statsmodels/statsmodels/blob/589f167fed77ebf6031d01ad3de1aa7b0040ced3/statsmodels/tsa/statespace/initialization.py We now consider how to set the initial mean $a_1$ and covariance $P_1$. A general parameterization is $$ \\alpha_1 = a + A\\delta+R_0\\eta_0,\\qquad \\eta_0\\sim\\mathcal N(0,Q_0), $$ where $a$ is a deterministic constant part. $A\\delta$ represents a nonstationary part: $\\delta$ is a $q\\times1$ vector. $A$ is an $m\\times q$ selection matrix. $R_0\\eta_0$ represents a stationary part: $\\eta_0$ is an $(m-q)\\times1$ vector. $R_0$ is an $m\\times(m-q)$ selection matrix. This covers four common initialization schemes. Known (fixed) initialization If we have strong prior knowledge about certain state components, we can encode it in $a$ directly. If we have no such information, we can simply set $a=0$. Stationary initialization If $\\alpha_t$ is stationary, all state components share constant unconditional means and variances. In that case we can use the known $R_0, Q_0$ to construct the unconditional mean and covariance of $\\alpha_1$. Diffuse initialization When we have no prior information at all, we can represent our ignorance by assigning a very large variance to some state components and letting the filter learn them from the data. Treat the unknown fixed vector $\\delta$ as $\\delta \\sim \\mathcal N(0, \\kappa I_q)$ with $\\kappa$ large. Then $$ a_1 = E(\\alpha_1)=a,\\qquad P_1 = Var(\\alpha_1) = Var(A\\delta) + Var(R_0\\eta_0) = \\kappa P_{\\infty,1}+P_{*,1}, $$ with $P_{\\infty,1}=AA&#39;$ and $P_{*,1}=R_0Q_0R_0&#39;$. As $\\kappa\\to\\infty$, the variance splits into a diffuse part $P_{\\infty,1}$ and a proper part $P_{*,1}$. After a finite number of steps $d$ (depending on the number of diffuse components), the diffuse part is fully absorbed and $P_{\\infty,d}=0$; from then on the filter behaves like a standard Kalman filter. 
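For example, with one diffuse and one stationary state component, an approximate diffuse prior can be assembled directly (illustrative values; exact diffuse initialization avoids having to choose $\\kappa$):

```python
import numpy as np

# Approximate diffuse initialization: the nonstationary component gets a huge
# prior variance kappa, the stationary component its unconditional variance.
kappa = 1e6
A = np.array([[1.0], [0.0]])          # selects the diffuse component
R0 = np.array([[0.0], [1.0]])         # selects the stationary component
Q0 = np.array([[0.5]])                # its unconditional variance

a1 = np.zeros(2)                      # prior mean a (no prior knowledge)
P_inf = A @ A.T
P_star = R0 @ Q0 @ R0.T
P1 = kappa * P_inf + P_star           # P_1 = kappa * P_inf + P_star
```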
Diffuse initialization is simple but can suffer from numerical rounding error, and if the sample is too short (so diffusion is incomplete) the model can become degenerate. Mixed initialization To reduce the uncertainty introduced by fully diffuse priors, we often split $\\alpha_1$ into blocks: Some components are initialized with fixed constants. Some components are initialized from a stationary distribution. The remaining unknown components are given diffuse priors. This also highlights a key practical difference between Harvey and Hamilton representations: In Harvey form, once differences are included the state vector is non‑stationary and usually requires diffuse or mixed initialization. In Hamilton form the implied process may be stationary and can often be initialized from its unconditional distribution. Exact diffuse initialization We use $O(\\kappa^{-j})$ to denote terms that vanish at rate $\\kappa^{-j}$ as $\\kappa\\to\\infty$. In that regime we can expand $$ P_t =\\kappa P_{\\infty,t}+P_{*,t}+O(\\kappa^{-1}),\\quad t=1,\\dots,n. $$ For a non‑degenerate model there exists $d$ such that $$ P_{\\infty,t}\\ne 0\\ (t\\le d),\\qquad P_{\\infty,t}=0\\ (t&gt;d), $$ and from then on $P_t=P_{*,t}$. Let $\\delta$ denote the $q$‑dimensional diffuse part of $\\alpha_1$, with prior density $$ \\log p(\\delta) = -\\frac q2\\log2\\pi-\\frac q2\\log\\kappa-\\frac1{2\\kappa}\\delta&#39;\\delta. $$ The joint density with the data is $\\log p(\\delta,Y_t)$; conditioning on $Y_t$ and differentiating with respect to $\\delta$ yields a quadratic form whose maximizer $\\tilde\\delta$ is the conditional mean. Its (negative) Hessian gives the conditional covariance of $\\delta$ once diffusion is complete ($t&gt;d$). This provides an exact, non‑asymptotic estimate of the diffuse variance. If the sample does not contain enough information to complete the diffusion phase, the model is effectively non‑identifiable (degenerate). 
In practice we run the filter on the first $d$ observations keeping track of the diffuse part $P_{\\infty,t}$; once $P_{\\infty,t}$ hits zero we treat $a_{d+1}$ and $P_{d+1}=P_{*,d+1}$ as the starting values of a standard Kalman filter. This is called exact diffuse initialization. Stationary initialization from model parameters If the state follows a stationary, time‑invariant transition, $$ \\alpha_t = T\\alpha_{t-1} + c + R\\eta_t, $$ then the unconditional mean and covariance are determined solely by $c,T,R,Q$ and can be used as $a_1,P_1$ in the filter. The unconditional mean solves $$ a_1=Ta_1+c\\ \\Rightarrow\\ a_1 = (I-T)^{-1}c. $$ The unconditional covariance solves the discrete Lyapunov equation $$ P_1=TP_1T&#39;+RQR&#39;. $$ Using vectorization and the Kronecker product, $$ \\text{vec}(TP_1T&#39;) = (T\\otimes T) \\text{vec}(P_1), $$ so $$ \\text{vec}(P_1)=[I-T\\otimes T]^{-1}\\text{vec}(RQR&#39;). $$ Both equations can be solved efficiently using standard numerical routines. Practical aspects Regression in state space form We can incorporate exogenous regressors into the measurement equation, $$ y_t = Z_t\\alpha_t + X_t\\beta + \\varepsilon_t, $$ where $\\beta$ is $k\\times1$ and $X_t$ is $p\\times k$. A common construction is to treat $\\beta$ as part of the state: $$ \\begin{aligned} y_t &amp;= \\begin{bmatrix}Z_t &amp;X_t\\end{bmatrix}\\begin{pmatrix}\\alpha_t\\\\ \\beta_t\\end{pmatrix} + \\varepsilon_t, \\\\ \\begin{pmatrix}\\alpha_{t+1}\\\\ \\beta_{t+1}\\end{pmatrix} &amp;= \\begin{bmatrix}T_t &amp; 0\\\\0 &amp;I_k\\end{bmatrix}\\begin{pmatrix}\\alpha_t\\\\ \\beta_t\\end{pmatrix}+\\begin{bmatrix}R_t \\\\0\\end{bmatrix}\\eta_t. \\end{aligned} $$ With diffuse initialization the joint prior is $$ \\begin{pmatrix}\\alpha_1\\\\ \\beta_1\\end{pmatrix} \\sim \\mathcal N \\Bigg( \\begin{pmatrix}a\\\\0\\end{pmatrix},\\kappa \\begin{bmatrix}P_\\infty &amp; 0\\\\0 &amp;I_k\\end{bmatrix}+ \\begin{bmatrix}P_* &amp; 0\\\\0 &amp;0\\end{bmatrix} \\Bigg). 
$$ We can then define two types of residuals: Recursive residuals: $v_t=y_t-Z_ta_t-X_t\\hat\\beta_{t-1}$ ($t=d+1,\\dots,n$), where $\\hat\\beta_{t-1}$ comes from the state estimate. OLS residuals: $v_t^+=y_t-Z_ta_t-X_t\\hat\\beta$ ($t=d+1,\\dots,n$), where $\\hat\\beta$ is estimated using all data. Recursive residuals are true innovations (serially uncorrelated), while OLS residuals incorporate all information at once. Both are useful for diagnostics. Sequential processing (univariate treatment of multivariate series) In the basic model the observation dimension $p$ is fixed and $H_t$ is a full covariance matrix. For high‑dimensional series this can be expensive. A common trick is to process multivariate observations one component at a time. Assume now that The observation dimension $p_t$ may vary with $t$. $H_t$ is diagonal. The forecast variance $F_t$ may be singular. Write $$ \\begin{aligned} y_t &amp;= \\begin{pmatrix}y_{t,1}\\\\\\vdots\\\\ y_{t,p_t}\\end{pmatrix}, &amp; \\varepsilon_t &amp;= \\begin{pmatrix}\\varepsilon_{t,1}\\\\\\vdots\\\\ \\varepsilon_{t,p_t} \\end{pmatrix}, &amp; Z_t &amp;= \\begin{pmatrix}Z_{t,1}\\\\\\vdots\\\\ Z_{t,p_t} \\end{pmatrix}, &amp; H_t &amp;= \\begin{pmatrix}\\sigma^2_{t,1}&amp;0&amp;0\\\\0&amp;\\ddots&amp;0\\\\ 0&amp;0&amp;\\sigma^2_{t,p_t} \\end{pmatrix}. \\end{aligned} $$ We can then re‑express the model as a sequence of scalar measurement updates: $$ \\begin{aligned} y_{t,i} &amp;= Z_{t,i} \\alpha_{t,i} + \\varepsilon_{t,i}, &amp;&amp; i=1,\\dots,p_t,\\\\ \\alpha_{t,i+1} &amp;=\\alpha_{t,i}, &amp;&amp; i=1,\\dots,p_t-1,\\\\ \\alpha_{t+1,1} &amp;= T_t \\alpha_{t,p_t} + R_t \\eta_t, &amp;&amp; t=1,\\dots,n,\\\\ \\alpha_{1,1}&amp;\\sim\\mathcal N(a_1,P_1). \\end{aligned} $$ Define $$ \\begin{aligned} a_{t,1}&amp;=E(\\alpha_{t,1}\\mid Y_{t-1}), &amp; P_{t,1}&amp;=Var(\\alpha_{t,1}\\mid Y_{t-1}),\\\\ a_{t,i}&amp;=E(\\alpha_{t,i}\\mid Y_{t-1},y_{t,1},\\dots,y_{t,i-1}), &amp; P_{t,i}&amp;=Var(\\alpha_{t,i}\\mid Y_{t-1},y_{t,1},\\dots,y_{t,i-1}). 
\\end{aligned} $$ The forward recursions become, for $i=1,\\dots,p_t$: Innovations $v_{t,i}=y_{t,i}-Z_{t,i}a_{t,i}$ $F_{t,i}=Z_{t,i}P_{t,i}Z_{t,i}&#39;+\\sigma^2_{t,i}$ $K_{t,i}=P_{t,i}Z_{t,i}&#39;F_{t,i}^{-1}$ Update $a_{t,i+1}=a_{t,i} + K_{t,i}v_{t,i}$ $P_{t,i+1} = P_{t,i}-K_{t,i}F_{t,i}K_{t,i}&#39;$ Prediction $a_{t+1,1}=T_ta_{t,p_t+1}+c_t$ $P_{t+1,1}=T_tP_{t,p_t+1}T_t&#39; + R_tQ_tR_t&#39;$ Note that the vector innovation $v_t$ in the standard multivariate filter is not simply $(v_{t,1},\\dots,v_{t,p_t})&#39;$; only the first components coincide. Likewise for $F_t$ versus $\\{F_{t,i}\\}$. In the scalar treatment, $F_{t,i}$ is allowed to be zero; when that happens, the corresponding observation is linearly redundant given the information set and can be skipped. Backward recursions can be written similarly (with $r_{n,p_n}=0$, $N_{n,p_n}=0$ and $L_{t,i}=I_m-K_{t,i}Z_{t,i}$): $r_{t,i-1}=Z_{t,i}&#39;F_{t,i}^{-1}v_{t,i}+L_{t,i}&#39;r_{t,i}$ $r_{t-1,p_{t-1}}=T_{t-1}&#39;r_{t,0}$ $N_{t,i-1}=Z_{t,i}&#39;F_{t,i}^{-1}Z_{t,i}+L_{t,i}&#39;N_{t,i}L_{t,i}$ $N_{t-1,p_{t-1}}=T_{t-1}&#39;N_{t,0}T_{t-1}$ With $a_t=a_{t,1}, P_t=P_{t,1}, r_{t-1}=r_{t,0}, N_{t-1}=N_{t,0}$ we recover the usual smoothing formulas $\\hat \\alpha_t = a_t+P_tr_{t-1}$ $V_t = P_t-P_tN_{t-1}P_t$ For observation disturbance smoothing we have $\\hat \\varepsilon_{t,i}=\\sigma_{t,i}^2F_{t,i}^{-1}(v_{t,i}-K_{t,i}&#39;r_{t,i})$ $Var(\\hat \\varepsilon_{t,i})=\\sigma_{t,i}^4F_{t,i}^{-2}(F_{t,i}-K_{t,i}&#39;N_{t,i}K_{t,i})$ If $H_t$ is not diagonal (observation disturbances correlated across components), this univariate formulation does not apply directly. In that case we can apply a Cholesky transformation $$H_t=C_tH_t^*C_t&#39;$$ with diagonal $H_t^*$ and lower‑triangular $C_t$ with ones on the diagonal, and transform the system via $y_t^*=C_t^{-1}y_t$, $Z_t^*=C_t^{-1}Z_t$, $\\varepsilon_t^*=C_t^{-1}\\varepsilon_t\\sim\\mathcal N(0,H_t^*)$. 
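A single time step of the univariate treatment can be sketched as follows (assuming a diagonal $H_t$; the function name is illustrative). It matches the usual multivariate update exactly:

```python
import numpy as np

def sequential_update(a, P, y_t, Z, h_diag):
    # Process one multivariate observation y_t component by component,
    # assuming a diagonal observation covariance diag(h_diag).
    for i in range(len(y_t)):
        Zi = Z[i]                         # (m,) row of the design matrix
        Fi = Zi @ P @ Zi + h_diag[i]      # scalar innovation variance
        if Fi == 0.0:                     # redundant component: skip it
            continue
        Ki = P @ Zi / Fi                  # (m,) gain
        v = y_t[i] - Zi @ a
        a = a + Ki * v
        P = P - np.outer(Ki, Zi @ P)      # P - K F K.T for scalar F
    return a, P
```

Only scalar divisions appear, so no matrix inversion is needed at all.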
Observation collapsing When the observation dimension $p$ is very large, inverting $F_t$ can be very expensive. If $H_t$ is nonsingular and diagonal, $P_t$ is nonsingular, and $m\\ll p$, we can use the matrix identity $$ F_t^{-1}=(Z_tP_tZ_t&#39;+H_t)^{-1}=H_t^{-1}-H_t^{-1}Z_t(P_t^{-1}+Z_t&#39;H_t^{-1}Z_t)^{-1}Z_t&#39;H_t^{-1}. $$ Intuitively, we can split the $p\\times 1$ observation vector into two parts: $y_t^*$, an $m\\times 1$ vector that is informative about $\\alpha_t$ and will form a new observation equation. $y_t^+$, a $(p-m)\\times 1$ vector that is redundant for the state and can be absorbed into the noise. Let $A_t^*=(Z_t&#39;H_t^{-1}Z_t)^{-1}Z_t&#39;H_t^{-1}$ and set $y_t^*=A_t^*y_t$. This can be interpreted as the generalized least‑squares estimate of $\\alpha_t$ given $y_t$. Choose $B_t$ so that $A_t^+=B_t(I-Z_tA_t^*)$ has full row rank; by construction $A_t^*Z_t=I_m$ and $A_t^+Z_t=0$. Define $y_t^+=A_t^+y_t$, $\\varepsilon_t^*=A_t^*\\varepsilon_t$, $\\varepsilon_t^+=A_t^+\\varepsilon_t$. Then $$ \\begin{pmatrix}y_t^*\\\\y_t^+\\end{pmatrix} = \\begin{bmatrix}A_t^*\\\\A_t^+\\end{bmatrix}y_t=\\begin{pmatrix}\\alpha_t \\\\0\\end{pmatrix}+ \\begin{pmatrix}\\varepsilon_t^* \\\\ \\varepsilon_t^+\\end{pmatrix}. $$ Because $Cov(\\varepsilon_t^+,\\varepsilon_t^*)=0$, the second equation does not involve the state and can be dropped. We obtain the collapsed state space system $$ \\begin{aligned} y_t^*&amp;=\\alpha_t+\\varepsilon_t^*,&amp;\\varepsilon_t^*&amp;\\sim\\mathcal N(0,H_t^*),\\\\ \\alpha_{t+1}&amp;=T_t\\alpha_t+R_t\\eta_t,&amp;\\eta_t&amp;\\sim\\mathcal N(0,Q_t), \\end{aligned} $$ where $H_t^*=A_t^*H_tA_t^{*&#39;}$. If $m\\ll p$, this can yield large computational savings, especially in time‑invariant models where $A_t^*,H_t^*$ can be precomputed. 
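The inversion identity itself is easy to verify numerically (a small NumPy check with random matrices; only an $m\\times m$ inverse appears on the right-hand side):

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 6, 2                       # many observations, few states
Z = rng.normal(size=(p, m))
P = np.eye(m)                     # state covariance (must be nonsingular)
H = np.diag(rng.uniform(0.5, 1.5, size=p))   # diagonal observation covariance
Hinv = np.diag(1.0 / np.diag(H))

F = Z @ P @ Z.T + H               # p x p matrix: expensive to invert directly
# Woodbury-style form: only an m x m inverse is required.
inner = np.linalg.inv(np.linalg.inv(P) + Z.T @ Hinv @ Z)
F_inv = Hinv - Hinv @ Z @ inner @ Z.T @ Hinv

err = np.max(np.abs(F_inv @ F - np.eye(p)))
```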
With an additional whitening matrix $C_t$ such that $C_t&#39;C_t=(Z_t&#39;H_t^{-1}Z_t)^{-1}$, we can write an equivalent system with identity observation variance: $$ \\bar y_t^*=Z_t^*\\alpha_t+\\bar\\varepsilon_t^*,\\quad \\bar\\varepsilon_t^*\\sim\\mathcal N(0,I_m), $$ where $\\bar A_t^* = C_tZ_t&#39;H_t^{-1}$ $Z_t^* = \\bar A_t^* Z_t = C_tZ_t&#39;H_t^{-1}Z_t=C_t&#39;^{-1}$ Likelihood Without diffuse initialization, the likelihood of $Y_n=(y_1,\\dots,y_n)$ can be written in terms of the innovations: $L(Y_n) = p(y_1,\\dots,y_n)=p(y_1)\\prod_{t=2}^np(y_t\\mid Y_{t-1})$ $\\log L(Y_n) = \\sum_{t=1}^n\\log p(y_t\\mid Y_{t-1})=-\\frac{np}2\\log2\\pi-\\frac12\\sum_{t=1}^n(\\log|F_t|+v_t&#39;F_t^{-1}v_t)$ With diffuse initialization we must account for the contribution of the diffuse components. Write $$ F_t =\\kappa F_{\\infty,t}+F_{*,t}+O(\\kappa^{-1}),\\qquad F_{\\infty,t}=Z_tP_{\\infty,t}Z_t&#39;. $$ Then the diffuse log‑likelihood is $$ \\begin{aligned} \\log L_d(Y_n) &amp;= \\lim_{\\kappa\\to\\infty} \\Big[\\log L(Y_n)+\\tfrac q2\\log\\kappa\\Big]\\\\ &amp;= -\\tfrac{np}2\\log2\\pi-\\tfrac12\\sum_{t=1}^dw_t-\\tfrac12\\sum_{t=d+1}^n(\\log|F_t|+v_t&#39;F_t^{-1}v_t), \\end{aligned} $$ with $$ w_t=\\begin{cases} \\log|F_{\\infty,t}|, &amp; F_{\\infty,t} \\text{ positive definite},\\\\ \\log|F_{*,t}|+v_t^{(0)&#39;}F_{*,t}^{-1}v_t^{(0)}, &amp; F_{\\infty,t} = 0. \\end{cases} $$ If we treat $\\delta$ as an unknown fixed vector rather than a random variable, we can integrate it out and work with a concentrated likelihood $\\log L_c(Y_n)$ involving only the proper part $F_{\\delta,t}$. Parameter estimation and variance Once we have the likelihood we can estimate model parameters by maximum likelihood. The main practical issues for ARIMA‑type models are Enforcing parameter constraints (e.g. stationarity, invertibility) during optimization. Efficiently computing gradients or using derivative‑free optimizers. 
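The prediction-error decomposition of the log-likelihood translates directly into code (a minimal sketch; the function name is illustrative, and the innovations and their variances are assumed to come from a filter pass):

```python
import numpy as np

def innovations_loglik(vs, Fs):
    # Gaussian log-likelihood from the prediction-error decomposition:
    # sum over t of  -p/2 log 2 pi - 1/2 (log|F_t| + v_t.T F_t^{-1} v_t).
    loglik = 0.0
    for v, F in zip(vs, Fs):
        p = len(v)
        sign, logdet = np.linalg.slogdet(F)
        loglik += -0.5 * p * np.log(2 * np.pi) - 0.5 * (
            logdet + v @ np.linalg.solve(F, v))
    return loglik
```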
For a parameter vector $\\boldsymbol\\theta$ with log‑likelihood $\\log L(\\boldsymbol\\theta)$, the Fisher information matrix is $$ \\mathcal{I}(\\boldsymbol\\theta) = -E\\Big[\\frac{\\partial^2\\log L(\\boldsymbol\\theta)}{\\partial\\boldsymbol\\theta\\partial\\boldsymbol\\theta^T}\\Big]. $$ For large samples the covariance matrix of the MLE satisfies $$ Var(\\hat{\\boldsymbol\\theta}) \\approx \\mathcal{I}(\\hat{\\boldsymbol\\theta})^{-1}. $$ In practice we approximate $\\mathcal{I}$ using either the negative Hessian of the log‑likelihood $$ \\mathcal{I}(\\hat{\\boldsymbol\\theta}) \\approx -\\frac{\\partial^2\\log L(\\hat{\\boldsymbol\\theta})}{\\partial\\boldsymbol\\theta\\partial\\boldsymbol\\theta^T} \\approx -\\frac1n\\sum_{i=1}^n \\frac{\\partial^2\\log f(x_i;\\hat{\\boldsymbol\\theta})}{\\partial\\boldsymbol\\theta\\partial\\boldsymbol\\theta^T}, $$ or the outer product of gradients $$ \\mathcal{I}(\\hat{\\boldsymbol\\theta}) \\approx \\frac1n\\sum_{i=1}^n \\bigg(\\frac{\\partial\\log f(x_i;\\boldsymbol\\theta)}{\\partial\\boldsymbol\\theta}\\bigg)\\bigg(\\frac{\\partial\\log f(x_i;\\boldsymbol\\theta)}{\\partial\\boldsymbol\\theta}\\bigg)^T. $$ If analytic derivatives are unavailable we can approximate second derivatives via finite differences, for example $$ \\frac{\\partial^2\\log L(\\hat{\\boldsymbol\\theta})}{\\partial\\theta_i\\partial\\theta_j} \\approx \\frac{ \\frac{\\partial\\log L(\\boldsymbol\\theta +he_i)}{\\partial\\theta_j} - \\frac{\\partial\\log L(\\boldsymbol\\theta -he_i)}{\\partial\\theta_j} }{2h}, $$ where $h$ is a small step size and $e_i$ is the $i$‑th unit vector. A common reparameterization for univariate models separates scale and shape. Let $\\psi=(\\psi_*&#39;,\\sigma_*^2)&#39;$, where $\\psi_*$ contains the $n-1$ structural parameters and $\\sigma_*^2$ scales the disturbance variance. 
The measurement equation becomes $$ y_t = z_t&#39;x_t+d_t+\\varepsilon_t,\\quad Var(\\varepsilon_t)=\\sigma_*^2h_t,\\quad t=1,\\dots,T, $$ with $z_t$ an $m\\times1$ vector and $h_t$ a scalar. The transition equation is unchanged except that $Var(\\eta_t)=\\sigma_*^2Q_t$, and the initial covariance is scaled as $Var(\\alpha_0)=\\sigma_*^2P_0$. Running the filter does not require knowing $\\sigma_*^2$; only the scalar innovation variances change, $F_t=\\sigma_*^2f_t$. The log‑likelihood is then $$ \\log L(\\psi_*,\\sigma_*^2) = -\\tfrac T2\\log2\\pi-\\tfrac T2\\log\\sigma_*^2-\\tfrac12\\sum_{t=1}^T\\log f_t-\\frac1{2\\sigma_*^2}\\sum_{t=1}^T\\frac{v_t^2}{f_t}. $$ From this we get the MLE of $\\sigma_*^2$ as $$ \\hat\\sigma_*^2=\\frac1T\\sum_{t=1}^T\\frac{v_t^2}{f_t}, $$ and plugging back yields the concentrated log‑likelihood $$ \\log L_*(\\psi_*)=-\\tfrac T2(\\log2\\pi+1)-\\tfrac T2\\log \\hat\\sigma_*^2-\\tfrac12\\sum_{t=1}^T\\log f_t. $$ Maximizing this with respect to $\\psi_*$ is equivalent to minimizing a weighted sum of squared innovations. In Gaussian models the standardized innovations $F_t^{-1/2}v_t$ are i.i.d. $\\mathcal N(0,I)$ and can be used for diagnostics. For univariate models we can also work with $\\tilde v_t=v_t/\\sqrt{f_t} \\sim \\text{NID}(0,\\sigma_*^2)$. Finally, state space models naturally accommodate missing observations: we simply skip the update step for missing $y_t$ and perform only the prediction step. Common state space examples Static and time‑varying regression For a regression $$ y_t=X_t\\beta+\\varepsilon_t,\\quad \\varepsilon_t\\sim\\mathcal N(0,H_t) $$ with $\\beta$ a $k\\times1$ coefficient vector and $X_t$ an $n\\times k$ regressor matrix, and possibly time‑varying observation variance $H_t$: If $\\beta$ is constant, a state space representation is Measurement: $y_t = Z_t\\alpha_t + \\varepsilon_t$ with $Z_t = X_t$. State: $\\alpha_{t+1} = \\alpha_t = \\beta$ (i.e. $T_t=I_k$, $R_t=0$, $Q_t=0$).
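As an illustration, the constant-coefficient regression above can be run through a scalar-observation Kalman filter; the innovations $v_t$ and scaled variances $f_t$ then give the MLE of the scale $\sigma_*^2$ as described. This is a sketch on assumed simulated data, not a production filter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y_t = x_t' beta + eps_t with constant beta.
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Kalman filter for the constant state alpha_t = beta:
# T = I, Q = 0, Z_t = x_t', h_t = 1; variances are relative to sigma*^2.
# A large P0 stands in for an uninformative (near-diffuse) prior.
a = np.zeros(k)                      # filtered state mean
P = 1e6 * np.eye(k)                  # filtered state covariance / sigma*^2
v = np.zeros(n)                      # innovations v_t
f = np.zeros(n)                      # scaled innovation variances f_t
for t in range(n):
    z = X[t]
    f[t] = z @ P @ z + 1.0
    v[t] = y[t] - z @ a
    K = P @ z / f[t]                 # Kalman gain
    a = a + K * v[t]
    P = P - np.outer(K, z @ P)

# MLE of the scale (the near-diffuse first steps contribute ~0 here).
sigma2_hat = np.mean(v ** 2 / f)
```

After the last update `a` essentially reproduces the OLS coefficients, and `sigma2_hat` estimates the residual variance, which is the recursive-least-squares interpretation of this model.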
If $\\beta$ varies over time, we let Measurement: $y_t = X_t\\beta_t + \\varepsilon_t$ ($Z_t=X_t$). State: $\\beta_{t+1} = \\beta_t + \\eta_t$ with $T_t=I_k$, $R_t=I_k$, $Q_t=\\sigma_\\eta^2I_k$. ARMA as a state space model For an ARMA$(p,q)$ model $$ y_t = \\phi_1y_{t-1}+\\cdots+\\phi_py_{t-p} + \\zeta_t+\\theta_1\\zeta_{t-1}+\\cdots+\\theta_q\\zeta_{t-q}, $$ let $r=\\max(p,q+1)$ and define the state vector $$ \\alpha_t=\\begin{pmatrix}y_t\\ \\phi_2y_{t-1}+\\cdots+\\phi_ry_{t-r+1}+\\theta_1\\zeta_t+\\cdots+\\theta_{r-1}\\zeta_{t-r+2}\\ \\vdots\\ \\phi_ry_{t-1}+\\theta_{r-1}\\zeta_t\\ \\end{pmatrix}. $$ With $d_t=c_t=0$, $\\varepsilon_t=0$, $H_t=0$, we obtain the state space form $$ \\alpha_{t}=T_t\\alpha_{t-1}+R_t\\eta_t, $$ where $$ T_t= \\begin{bmatrix} \\phi_1&amp;1&amp;0&amp;\\cdots&amp;0\\ \\phi_2&amp;0&amp;1&amp;\\cdots&amp;0\\ \\vdots&amp;\\vdots&amp;\\vdots&amp;\\ddots&amp;\\vdots\\ \\phi_{r-1}&amp;0&amp;0&amp;\\cdots&amp;1\\ \\phi_r&amp;0&amp;0&amp;\\cdots&amp;0\\ \\end{bmatrix},\\quad R_t=\\begin{pmatrix}1\\\\theta_1\\\\theta_2\\\\vdots\\\\theta_{r-1}\\end{pmatrix}, $$ and $$ y_t = Z_t\\alpha_t =\\begin{bmatrix}1&amp;0&amp;0&amp;\\cdots&amp;0\\end{bmatrix}\\alpha_t. $$ Special cases like ARMA$(1,1)$ or ARMA$(2,1)$ follow directly from this construction. ARIMA in state space form Any $d$‑th order difference can be expressed recursively in terms of $(d-1)$‑th order differences: $$ \\begin{aligned} \\Delta y_t &amp;= y_t-y_{t-1},\\ \\Delta^2 y_t &amp;=\\Delta y_t-\\Delta y_{t-1} \\Rightarrow \\Delta y_t = \\Delta^2 y_t +\\Delta y_{t-1},\\ \\Delta^d y_t &amp;= \\Delta^{d-1} y_t-\\Delta^{d-1} y_{t-1} \\Rightarrow \\Delta^{d-1} y_t = \\Delta^d y_t +\\Delta^{d-1} y_{t-1}. \\end{aligned} $$ So we can rewrite $y_t$ in terms of $\\Delta^d y_t$ and past differences: $$ y_t = \\Delta^d y_t +\\Delta^{d-1} y_{t-1} + \\cdots + \\Delta y_{t-1}+ y_{t-1}. 
$$ The measurement equation thus consists of one state component for the differenced process $\\Delta^d y_t$; $d$ components to reconstruct $y_t$ from past differences. This can be implemented by augmenting the ARMA state vector with a $d$‑dimensional buffer. $$ y_t = \\begin{bmatrix}1_d&#39;&amp;1&amp;0&amp;\\cdots&amp;0\\end{bmatrix}\\alpha_t $$ where $1_d$ is a $d$‑vector of ones. Let $y_t^* = \\Delta^dy_t$; for an ARIMA$(p,d,q)$ model one convenient state is $$ \\alpha_t=\\begin{pmatrix} y_{t-1}\\ \\Delta y_{t-1}\\ \\vdots\\ \\Delta^{d-1}y_{t-1}\\ y^*_t\\ \\phi_2y^*_{t-1}+\\cdots+\\phi_ry^*_{t-r+1}+\\theta_1\\zeta_t+\\cdots+\\theta_{r-1}\\zeta_{t-r+2}\\ \\vdots\\ \\end{pmatrix} $$ with a transition matrix $T_t$ of block form and a state equation $$ \\alpha_{t}=T_t\\alpha_{t-1}+R_t\\eta_t= \\left[ \\begin{array}{c|c} \\begin{matrix}1&amp;1&amp;\\cdots&amp;1&amp;1\\ 0&amp;1&amp;\\cdots&amp;1&amp;1\\ \\vdots&amp;\\vdots&amp;\\ddots&amp;\\vdots&amp;\\vdots\\ 0&amp;0&amp;\\cdots&amp;1&amp;1\\ 0&amp;0&amp;\\cdots&amp;0&amp;1\\ \\end{matrix} &amp; \\begin{matrix}1&amp;0&amp;0&amp;\\cdots&amp;0&amp;0\\ 1&amp;0&amp;0&amp;\\cdots&amp;0&amp;0\\ \\vdots&amp;\\vdots&amp;\\vdots&amp;\\ddots&amp;\\vdots&amp;\\vdots\\ 1&amp;0&amp;0&amp;\\cdots&amp;0&amp;0\\ 1&amp;0&amp;0&amp;\\cdots&amp;0&amp;0\\ \\end{matrix} \\ \\hline 0_{r\\times d}&amp; \\begin{matrix}\\phi_1&amp;1&amp;0&amp;\\cdots&amp;0\\ \\phi_2&amp;0&amp;1&amp;\\cdots&amp;0\\ \\vdots&amp;\\vdots&amp;\\vdots&amp;\\ddots&amp;\\vdots\\ \\phi_{r-1}&amp;0&amp;0&amp;\\cdots&amp;1\\ \\phi_r&amp;0&amp;0&amp;\\cdots&amp;0\\ \\end{matrix} \\end{array} \\right]\\alpha_{t-1} + \\begin{pmatrix}0_d\\ 1\\ \\theta_1\\ \\theta_2\\ \\vdots\\ \\theta_{r-1}\\end{pmatrix}\\zeta_{t} $$ This yields a linear state space representation for any ARIMA$(p,d,q)$ model.
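A small numpy sketch that assembles $Z$, $T$, $R$ for the ARMA block of these matrices (the helper name and padding conventions are mine; the $d$-dimensional difference buffer extends it following the block layout above):

```python
import numpy as np

def arma_state_space(phi, theta):
    """Z, T, R of the ARMA(p, q) form above, with r = max(p, q + 1)."""
    p, q = len(phi), len(theta)
    r = max(p, q + 1)
    phi_pad = np.concatenate([phi, np.zeros(r - p)])
    theta_pad = np.concatenate([theta, np.zeros(r - 1 - q)])
    T = np.zeros((r, r))
    T[:, 0] = phi_pad                        # first column: phi_1, ..., phi_r
    T[:-1, 1:] = np.eye(r - 1)               # shifted identity block
    R = np.concatenate([[1.0], theta_pad])   # (1, theta_1, ..., theta_{r-1})'
    Z = np.zeros(r)
    Z[0] = 1.0                               # y_t = first state component
    return Z, T, R
```

Iterating `alpha = T @ alpha + R * zeta_t` and reading off `Z @ alpha` reproduces the ARMA recursion exactly, which is an easy way to sanity-check the construction.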
SARIMA and SARIMAX For a seasonal ARIMA model $$ \\phi_p (L) \\tilde \\phi_P (L^s) \\Delta^d \\Delta_s^D y_t = \\theta_q (L) \\tilde \\theta_Q (L^s) \\zeta_t, $$ we can combine the non‑seasonal and seasonal polynomials into $$ \\Phi (L) \\Delta^d \\Delta_s^D y_t = \\Theta (L) \\zeta_t, $$ where $\\Phi$ is degree $p+sP$ and $\\Theta$ is degree $q+sQ$. This is equivalent to an ARMA$(p+sP,q+sQ)$ model for the differenced series $\\Delta^d \\Delta_s^D y_t$, and can be mapped to state space exactly as in the ARIMA case. The state equation of SARIMA is $$ \\alpha_{t}=T_t\\alpha_{t-1}+R_t\\eta_t= \\left[ \\begin{array}{c|c} \\begin{matrix}1&amp;1&amp;\\cdots&amp;1&amp;1\\ 0&amp;1&amp;\\cdots&amp;1&amp;1\\ \\vdots&amp;\\vdots&amp;\\ddots&amp;\\vdots&amp;\\vdots\\ 0&amp;0&amp;\\cdots&amp;1&amp;1\\ 0&amp;0&amp;\\cdots&amp;0&amp;1\\ \\end{matrix} &amp; \\begin{matrix}1&amp;0&amp;0&amp;\\cdots&amp;0&amp;0\\ 1&amp;0&amp;0&amp;\\cdots&amp;0&amp;0\\ \\vdots&amp;\\vdots&amp;\\vdots&amp;\\ddots&amp;\\vdots&amp;\\vdots\\ 1&amp;0&amp;0&amp;\\cdots&amp;0&amp;0\\ 1&amp;0&amp;0&amp;\\cdots&amp;0&amp;0\\ \\end{matrix} \\ \\hline 0_{r\\times d}&amp; \\begin{matrix}\\phi_1&amp;1&amp;0&amp;\\cdots&amp;0\\ \\phi_2&amp;0&amp;1&amp;\\cdots&amp;0\\ \\vdots&amp;\\vdots&amp;\\vdots&amp;\\ddots&amp;\\vdots\\ \\phi_{r-1}&amp;0&amp;0&amp;\\cdots&amp;1\\ \\phi_r&amp;0&amp;0&amp;\\cdots&amp;0\\ \\end{matrix} \\end{array} \\right]\\alpha_{t-1} + \\begin{pmatrix}0_d\\ 1\\ \\theta_1\\ \\theta_2\\ \\vdots\\ \\theta_{r-1}\\end{pmatrix}\\zeta_{t} $$ Adding exogenous regressors leads to SARIMAX models $(p, d, q) \\times (P, D, Q)_s$ $$ \\begin{array}{llll} \\phi_p (L) \\tilde \\phi_P (L^s) \\Delta^d \\Delta_s^D y_t = A(t) + \\theta_q (L) \\tilde \\theta_Q (L^s) \\zeta_t\\nonumber \\end{array} $$ where the regression part is handled as in the regression‑in‑state‑space section and the residuals follow a SARIMA process $$ \\begin{array}{llll} y_t &amp; = \\beta_t x_t + u_t \\nonumber\\ \\phi_p (L)
\\tilde \\phi_P (L^s) \\Delta^d \\Delta_s^D u_t &amp; = A(t) + \\theta_q (L) \\tilde \\theta_Q (L^s) \\zeta_t\\nonumber \\end{array} $$ Regression with ARMA errors Consider $$ y_t=X_t\\beta+\\xi_t,\\quad \\xi_t\\sim\\text{ARMA}(p,q). $$ We can represent this as a state space system Measurement $$ y_t = Z_t\\alpha_t, $$ with $$ Z_t = \\begin{bmatrix}X_t&amp;1&amp;0&amp;\\cdots&amp;0\\end{bmatrix},\\quad \\alpha_t = \\begin{bmatrix}\\beta_t\\ \\alpha_{t,\\text{ARMA}}\\end{bmatrix}. $$ State $$ \\alpha_{t+1} = T_t\\alpha_t + R_t\\eta_t, $$ where typically $$ T_t =\\begin{bmatrix}I_k &amp; 0 \\ 0 &amp; T_{\\text{ARMA}}\\end{bmatrix},\\quad R_t =\\begin{bmatrix}0 \\ R_{\\text{ARMA}}\\end{bmatrix}. $$ This formulation allows simultaneous estimation of regression coefficients and ARMA error structure using the Kalman filter and smoother. "},{"slug":"arima-model","title":"ARIMA Model","tags":["Statistics","TimeSeriesAnalysis"],"content":"ARIMA (Autoregressive Integrated Moving Average) is a classic statistical model that combines autoregression, differencing (to achieve stationarity), and moving average, and is used to forecast stationary or non‑stationary time series. Basic Concepts Stationarity If we regard a time series ${y_t, y_{t-1}, ..., y_0}$ as observations of a sequence of random variables, where each time $t$ corresponds to a random variable $y_t$: Mean $\\mu_t = E(y_t)$ Variance $\\sigma_t^2 = E(y_t - \\mu_t)^2$ Autocovariance $\\gamma(t,k) = Cov(y_t, y_k) = E[(y_t - \\mu_t)(y_k - \\mu_k)]$ Autocorrelation $\\rho(t,k) = \\frac{\\gamma(t,k)}{\\sqrt{\\sigma_t^2 \\times \\sigma_k^2}} = \\frac{\\gamma(t,k)}{\\sigma_t \\times \\sigma_k}$ For forecasting models to be valid, the probability distribution of future data must be consistent with that of historical samples. In time series analysis, this property is called stationarity. Stationarity is usually divided into two types: Strong stationarity All statistical properties of the sequence do not change over time $t$. 
The random variables at any time all come from exactly the same probability distribution. Weak stationarity It is sufficient that the low‑order moments of the sequence are constant (independent of time): The mean and variance do not change over time $t$: $E(y_t) = E(y_{t-j}) = \\mu$ $Var(y_t) = Var(y_{t-j}) = \\sigma^2$ The autocovariance depends only on the time lag $s$, not on the starting point $t$: $Cov(y_t, y_{t-s}) = Cov(y_{t-j}, y_{t-j-s}) = \\gamma_s$ As long as a time series satisfies weak stationarity, we can estimate its statistics from historical observations and use them for forecasting: Mean estimate $\\hat{\\mu} = \\bar{y} = \\frac{1}{T}\\sum_{t=1}^{T} y_t$ Variance estimate $\\hat{\\sigma}^2 = \\frac{1}{T-1}\\sum_{t=1}^{T} (y_t - \\bar{y})^2$ Autocovariance estimate $\\hat{\\gamma_s} = \\hat{\\gamma_{-s}} = \\frac{1}{T-s}\\sum_{t=1}^{T-s} (y_t - \\bar{y})(y_{t+s} - \\bar{y})$ Autocorrelation estimate $\\hat{\\rho_s} = \\hat{\\rho_{-s}} = \\frac{\\hat{\\gamma_s}}{\\hat{\\gamma_0}}$ A special case of a stationary series is the white noise (WN) process $\\{\\varepsilon_t\\} \\sim \\text{WN}(0, \\sigma_{\\varepsilon}^2)$, whose statistics satisfy: Zero mean: $E(\\varepsilon_t) = 0$ Homoscedasticity: $Var(\\varepsilon_t) = \\sigma_{\\varepsilon}^2$ No autocorrelation: $Cov(\\varepsilon_t, \\varepsilon_{t-s}) = 0$ In particular, if $\\{\\varepsilon_t\\}$ follows a normal distribution, it is called a Gaussian white noise process. Although a white noise process is stationary, it is completely uncorrelated across time, so past values carry no information about future ones and there is nothing left to model beyond its mean and variance. Random Walk Random walk is a typical non‑stationary process, widely used in finance and economics. It is often used to describe market behavior: The asset price at the next time $y_t$ depends only on the previous price $y_{t-1}$, and the price change $\\varepsilon_t = y_t - y_{t-1}$ is determined by uncertain factors in the market.
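The moment estimators above translate directly into code; this sketch uses the $1/(T-s)$ normalization shown here (a $1/T$ normalization is also common and guarantees a positive semidefinite autocovariance sequence):

```python
import numpy as np

def acovf(y, nlags):
    """Sample autocovariances gamma_hat_0, ..., gamma_hat_nlags."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    yd = y - y.mean()
    # 1/(T - s) normalization, matching the estimator in the text
    return np.array([yd[:T - s] @ yd[s:] / (T - s) for s in range(nlags + 1)])

def acf(y, nlags):
    """Sample autocorrelations rho_hat_s = gamma_hat_s / gamma_hat_0."""
    g = acovf(y, nlags)
    return g / g[0]
```

For white noise the sample autocorrelations at positive lags should hover near zero (roughly within $\pm 2/\sqrt{T}$).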
The non‑stationary series generated by a random walk contain two main types of trend components: Stochastic trend: sudden and unpredictable changes in direction Deterministic trend: a clear long‑term upward or downward trend Consider the following random walk models: Zero‑mean random walk (RW) $$ y_t = y_{t-1} + \\varepsilon_t = (y_{t-2} + \\varepsilon_{t-1}) + \\varepsilon_t = ... = y_0 + \\sum_{i=1}^t \\varepsilon_i $$ - Stochastic trend: the effect of the initial value $y_0$ and the past shocks $\\varepsilon_i$ on $y_t$ never decays. - Deterministic trend: none. Random walk with drift (RWD) $$ y_t = c + y_{t-1} + \\varepsilon_t = c + (c + y_{t-2} + \\varepsilon_{t-1}) + \\varepsilon_t = ... = y_0 + c \\times t + \\sum_{i=1}^t \\varepsilon_i $$ - Stochastic trend: same as above. - Deterministic trend: as time increases, there is a linear drift at rate $c$. Random walk with drift and deterministic trend (RWD+DT) $$ y_t = c_1 + c_2 t + y_{t-1} + \\varepsilon_t = y_0 + c_1 t + c_2 \\frac{t(t+1)}{2} + \\sum_{i=1}^t \\varepsilon_i = y_0 + (c_1 + \\frac{c_2}{2})t + \\frac{c_2}{2} t^2 + \\sum_{i=1}^t \\varepsilon_i $$ - Stochastic trend: same as above. - Deterministic trend: linear trend + quadratic trend. If a time series has a stochastic trend, its future observations diverge and cannot converge to a bounded range. Therefore, before modeling, we need to test the time series to determine whether a stochastic trend is present. Such tests are called unit root tests. Similarly, deterministic trends also cause divergence of observations, so we must remove both types of trends before modeling to ensure stationarity. 
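The three variants above are easy to simulate with cumulative sums (the drift and trend coefficients here are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
eps = rng.normal(size=n)
t = np.arange(1, n + 1)

rw = np.cumsum(eps)                        # y_t = y_{t-1} + eps_t  (y_0 = 0)
rwd = np.cumsum(0.2 + eps)                 # drift c = 0.2 adds a linear trend
rwd_dt = np.cumsum(0.1 + 0.01 * t + eps)   # drift + deterministic trend terms
```

Because all three paths share the same shocks, their differences isolate the deterministic components: `rwd - rw` is exactly the linear drift $ct$, and `rwd_dt - rw` is the linear-plus-quadratic trend.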
Differencing Stochastic trends and linear deterministic trends can be removed via differencing to obtain a stationary white noise series: RW: $y_t - y_{t-1} = \\varepsilon_t$ RWD: $y_t - y_{t-1} = c + \\varepsilon_t$ However, non‑linear deterministic trends cannot be removed via simple differencing: RWD+DT: $y_t - y_{t-1} = c_1 + c_2 t + \\varepsilon_t$ Regression Approach For non‑linear deterministic trends, we can use linear regression to extract a stationary residual term and then build a model on the residuals. Using RWD+DT as an example: Regress $y_t$ on time $t$ to obtain $y_t = a + b t + e_t$. Compute residuals $e_t = y_t - a - b t$ to remove the trend. Model the stationary residual series $\\{e_t, e_{t-1}, ..., e_1\\}$. By applying data transformations and adding non‑linear terms, the regression approach can in principle handle arbitrary non‑linear trends. We can determine the order of the regression model via t‑tests or F‑tests of the regression coefficients. Differencing Transformations To transform a non‑stationary time series into a stationary one, we usually perform two operations: Stabilize the variance: transform the data via log or Box‑Cox. Stabilize the mean: remove trend and seasonality via differencing. Differencing removes stochastic trends and linear deterministic trends and prevents divergence of the series, thereby stabilizing the mean: First‑order difference $y&#39;_t = y_t - y_{t-1}$ Second‑order difference $y&#39;&#39;_t = y&#39;_t - y&#39;_{t-1} = y_t - 2y_{t-1} + y_{t-2}$ In addition to trend, differencing can also remove seasonal components (where $m$ is the seasonal length): Seasonal difference $y&#39;_t = y_t - y_{t-m}$ Seasonal + first difference $y^*_t = y&#39;_t - y&#39;_{t-1} = y_t - y_{t-1} - y_{t-m} + y_{t-m-1}$ For strongly seasonal time series, it is usually recommended to perform seasonal differencing first. If the differenced series is already stationary, there is no need for further differencing.
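In numpy, ordinary and seasonal differences are one-liners (the quadratic toy series is illustrative):

```python
import numpy as np

y = np.arange(48, dtype=float) ** 2   # hypothetical series with a quadratic trend
d1 = np.diff(y)                       # first difference: a linear trend remains
d2 = np.diff(y, n=2)                  # second difference: constant, trend removed
m = 12                                # seasonal length
ds = y[m:] - y[:-m]                   # seasonal difference at lag m
ds1 = np.diff(ds)                     # seasonal + first difference
```

Note how each difference shortens the series by one (or $m$) observations, which is the sample-size cost of differencing mentioned below.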
Over‑differencing introduces unnecessary noise. Take RWD as an example: First‑order difference $y&#39;_t = y_t - y_{t-1} = c + \\varepsilon_t$ Second‑order difference $y&#39;&#39;_t = y&#39;_t - y&#39;_{t-1} = \\varepsilon_t - \\varepsilon_{t-1}$ We see that the variance increases significantly after second‑order differencing: First‑order: $Var(y&#39;_t) = Var(c) + Var(\\varepsilon_t) = \\sigma_{\\varepsilon}^2$ Second‑order: $Var(y&#39;&#39;_t) = Var(\\varepsilon_t) + Var(\\varepsilon_{t-1}) = 2\\sigma_{\\varepsilon}^2$ Moreover, each differencing step removes one usable data point, so too many differences hurt sample quality. Backshift Operator The backshift notation represents shifting an observation back by one time period: First backshift $B y_t = y_{t-1}$ Second backshift $B^2 y_t = y_{t-2}$ Seasonal backshift $B^m y_t = y_{t-m}$ Using the backshift operator, differencing can be written as: First‑order difference $y&#39;_t = y_t - y_{t-1} = (1 - B)y_t$ Second‑order difference $y&#39;&#39;_t = y&#39;_t - y&#39;_{t-1} = y_t - 2y_{t-1} + y_{t-2} = (1 - B)^2 y_t$ $d$‑th‑order difference $y^d_t = (1 - B)^d y_t$ Seasonal difference $y&#39;_t = y_t - y_{t-m} = (1 - B^m)y_t$ Seasonal + first difference $y^*_t = y&#39;_t - y&#39;_{t-1} = (1 - B)(1 - B^m)y_t$ Backshift operators can be multiplied, which makes it easy to see the actual effect of combined differences: $$ (1 - B)(1 - B^m)y_t = (1 - B - B^m + B^{m+1})y_t $$ This property can also be used to determine whether a model contains redundant terms. Suppose we have an ARMA model $y_t = 0.5 y_{t-1} + 0.24 y_{t-2} + e_t + 0.6 e_{t-1} + 0.09 e_{t-2}$. Using the backshift notation, this becomes $(1 + 0.3B)(1 - 0.8B) y_t = (1 + 0.3B)(1 + 0.3B) e_t$. By canceling common factors, it can be simplified to a lower‑order model $(1 - 0.8B) y_t = (1 + 0.3B) e_t$. AR Model An autoregressive model (AR) is a linear regression of the series on its own lagged values. The number of lagged terms is called the order of the AR model.
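Returning to the backshift example above: multiplying lag polynomials is just a convolution of their coefficient vectors, so the factorization can be checked numerically:

```python
import numpy as np

# Coefficients in increasing powers of B.
ar = np.convolve([1, 0.3], [1, -0.8])   # (1 + 0.3B)(1 - 0.8B)
ma = np.convolve([1, 0.3], [1, 0.3])    # (1 + 0.3B)^2

# ar is [1, -0.5, -0.24], i.e. y_t - 0.5 y_{t-1} - 0.24 y_{t-2};
# ma is [1, 0.6, 0.09], i.e. e_t + 0.6 e_{t-1} + 0.09 e_{t-2}.
```

Seeing the common factor $(1 + 0.3B)$ on both sides confirms the model can be reduced to the lower-order ARMA(1,1) given above.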
A $p$‑th‑order AR model can be written as: $$ \\text{AR}(p):\\quad y_t = c + \\phi_1 y_{t-1} + \\phi_2 y_{t-2} + \\cdots + \\phi_p y_{t-p} + \\varepsilon_t $$ $\\phi_i$ are autoregressive coefficients. $\\varepsilon_t$ is white noise. The simplest case is $\\text{AR}(1): y_t = c + \\phi_1 y_{t-1} + \\varepsilon_t$. $\\phi_1 = 0, c = 0 \\ \\to\\ y_t = \\varepsilon_t$ The $\\text{AR}(1)$ process reduces to WN. $\\phi_1 = 1, c = 0 \\ \\to\\ y_t = y_{t-1} + \\varepsilon_t$ The $\\text{AR}(1)$ process reduces to RW. $\\phi_1 = 1, c \\ne 0 \\ \\to\\ y_t = c + y_{t-1} + \\varepsilon_t$ The $\\text{AR}(1)$ process reduces to RWD. $|\\phi_1| &lt; 1$ The $\\text{AR}(1)$ process is stationary. $\\phi_1 &lt; 0$ The $\\text{AR}(1)$ process oscillates between positive and negative values. From these examples we see that only when the autoregressive coefficients satisfy certain conditions is the $\\text{AR}(p)$ process stationary. Stationarity For an $\\text{AR}(p)$ process we can construct a $p$‑dimensional coefficient matrix $A$: First row: contains the model parameters $\\phi_i$, representing the influence of each lagged value on the current value. Subsequent rows: form the lower part of an identity matrix, describing how the lagged variables $Y_{t-1}, ..., Y_{t-p}$ affect the state. $$ A = \\begin{bmatrix} \\phi_1 &amp; \\phi_2 &amp; \\cdots &amp; \\phi_p\\ 1 &amp; 0 &amp; \\cdots &amp; 0 \\ 0 &amp; 1 &amp; \\cdots &amp; 0 \\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\ 0 &amp; 0 &amp; \\cdots &amp; 1 \\end{bmatrix} $$ Let $\\lambda$ be an eigenvalue and construct the characteristic equation $\\det(A - \\lambda I) = 0$, which leads to $$ \\lambda^p - \\phi_1 \\lambda^{p-1} - \\phi_2 \\lambda^{p-2} - ... - \\phi_p = 0. $$ Let $z = 1/\\lambda$ to rewrite it as $$ 1 - \\phi_1 z - \\phi_2 z^2 - ... - \\phi_p z^p = 0. $$ Solving this equation yields the characteristic roots $\\lambda_1, \\lambda_2, ..., \\lambda_p$. 
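Numerically, these characteristic roots are just the eigenvalues of the companion matrix $A$ (a small sketch; the helper name is mine):

```python
import numpy as np

def ar_companion_roots(phi):
    """Eigenvalues of the AR companion matrix A built as above."""
    p = len(phi)
    A = np.zeros((p, p))
    A[0] = phi                     # first row: phi_1, ..., phi_p
    if p > 1:
        A[1:, :-1] = np.eye(p - 1) # shifted identity below
    return np.linalg.eigvals(A)

lam = ar_companion_roots([0.5, 0.3])
stationary = np.all(np.abs(lam) < 1)   # all roots inside the unit circle
```

For $\phi = (0.5, 0.3)$ both roots lie inside the unit circle, so this AR(2) is stationary.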
Their magnitudes describe how strongly past observations affect the current observation: $|\\lambda| &lt; 1$: impact decays; the series reverts to the mean. $|\\lambda| = 1$: impact persists; may cause a trend. $|\\lambda| &gt; 1$: impact grows; the series diverges. Since characteristic roots are often complex, we usually plot them on the complex plane using the unit circle (radius 1, centered at the origin): $|\\lambda| &lt; 1$: the root is inside the unit circle. $|\\lambda| \\ge 1$: the root lies on or outside the unit circle. Stationarity of an $\\text{AR}(p)$ model can be expressed in terms of the roots: Stationary model: all roots strictly inside the unit circle, $\\forall |\\lambda_i| &lt; 1$. Non‑stationary model: at least one root on or outside the unit circle, $\\exists |\\lambda_i| \\ge 1$ (a root exactly on the circle is a unit root). For $\\text{AR}(1)$ the characteristic equation is $$ \\lambda^1 - \\phi_1 \\lambda^0 = \\lambda - \\phi_1 = 0. $$ The root is $\\lambda = \\phi_1$. As long as $|\\phi_1| &lt; 1$, the model is stationary. For $\\text{AR}(2)$, the characteristic equation is $$ \\lambda^2 - \\phi_1 \\lambda^1 - \\phi_2 \\lambda^0 = \\lambda^2 - \\phi_1 \\lambda - \\phi_2 = 0. $$ According to the quadratic formula, when the discriminant $\\Delta = b^2 - 4ac = (-\\phi_1)^2 + 4 \\phi_2 \\ge 0$, there are two real roots $$ \\lambda = \\frac{\\phi_1 \\pm \\sqrt{\\phi_1^2 + 4 \\phi_2}}{2}. $$ If $|\\phi_2| &lt; 1$ and $\\phi_2 \\pm \\phi_1 &lt; 1$ hold simultaneously, the model is stationary. Moments For a stationary $\\text{AR}(p)$ process, the mean is $$ E(y_t) = E(c) + \\phi_1 E(y_{t-1}) + \\phi_2 E(y_{t-2}) + \\cdots + E(\\varepsilon_t). $$ For a stationary process, $E(y_t) = E(y_{t-1}) = \\cdots = \\mu$. Substituting gives $$ \\mu = E(c) + \\phi_1 \\mu + \\phi_2 \\mu + \\cdots + E(\\varepsilon_t), $$ which yields $$ \\mu = \\frac{c}{1 - \\phi_1 - \\phi_2 - \\cdots - \\phi_p}. $$ For the variance of a stationary $\\text{AR}(p)$ process, multiply the centered model by $(y_t - \\mu)$ and take expectations; because $y_t$ is correlated with its own lags, covariance terms appear rather than plain variances: $$ Var(y_t) = \\phi_1 Cov(y_t, y_{t-1}) + \\phi_2 Cov(y_t, y_{t-2}) + \\cdots + \\phi_p Cov(y_t, y_{t-p}) + \\sigma_{\\varepsilon}^2.
$$ For a stationary process, $Var(y_t) = Var(y_{t-1}) = \\cdots = \\sigma_y^2$ and $Cov(y_t, y_{t-i}) = \\gamma_i = \\rho_i \\sigma_y^2$, so $$ \\sigma_y^2 = \\phi_1 \\rho_1 \\sigma_y^2 + \\phi_2 \\rho_2 \\sigma_y^2 + \\cdots + \\phi_p \\rho_p \\sigma_y^2 + \\sigma_{\\varepsilon}^2 \\quad\\Rightarrow\\quad \\sigma_y^2 = \\frac{\\sigma_{\\varepsilon}^2}{1 - \\phi_1 \\rho_1 - \\phi_2 \\rho_2 - \\cdots - \\phi_p \\rho_p}. $$ (For $\\text{AR}(1)$, $\\rho_1 = \\phi_1$, giving the familiar $\\sigma_y^2 = \\sigma_{\\varepsilon}^2 / (1 - \\phi_1^2)$.) The autocovariance of a stationary $\\text{AR}(p)$ process can be computed recursively: $Cov(y_t, y_{t-1}) = \\phi_1 \\gamma_0 + \\phi_2 \\gamma_1 + \\cdots + \\phi_p \\gamma_{p-1}$ $Cov(y_t, y_{t-s}) = \\sum_{i=1}^p \\phi_i Cov(y_{t-i}, y_{t-s})$ Forecast Error Let $\\hat{y}_{t+l|t}$ denote the $l$‑step‑ahead forecast based on all observations up to time $t$. Given information set $I_t$, the mean squared forecast error is $$ MSE(\\hat{y}_{t+l}|I_t) = E[(y_{t+l} - \\hat{y}_{t+l})^2 | I_t]. $$ The conditional expectation $E[y_{t+l} | I_t]$ is the $l$‑step forecast that minimizes MSE, so $$ e_t(l) = y_{t+l} - \\hat{y}_{t+l|t} = y_{t+l} - E[y_{t+l} | I_t] $$ is the $l$‑step forecast error. One‑step‑ahead forecast for $\\text{AR}(p)$ Model: $$ y_{t+1} = c + \\phi_1 y_t + \\phi_2 y_{t-1} + \\cdots + \\phi_p y_{t-p+1} + \\varepsilon_{t+1}. $$ Conditional expectation: $$ \\hat{y}_{t+1|t} = E(y_{t+1} | I_t) = E(c + \\phi_1 y_t + \\phi_2 y_{t-1} + \\cdots + \\phi_p y_{t-p+1} + \\varepsilon_{t+1} | I_t). $$ Given $I_t = \\{y_t, y_{t-1}, ..., y_{t-p+1}\\}$, the $y_i$ $(i = t, t-1, ..., t-p+1)$ are known constants, so $$ \\hat{y}_{t+1|t} = c + \\phi_1 y_t + \\phi_2 y_{t-1} + \\cdots + \\phi_p y_{t-p+1}. $$ One‑step forecast error: $e_t(1) = y_{t+1} - \\hat{y}_{t+1|t} = \\varepsilon_{t+1}$. Variance: $Var(e_t(1)) = Var(\\varepsilon_{t+1}) = \\sigma_\\varepsilon^2$. If $\\varepsilon_t$ is normal, the 95% one‑step prediction interval for $y_{t+1}$ is $\\hat{y}_{t+1|t} \\pm 1.96 \\sigma_\\varepsilon$. Two‑step‑ahead forecast for $\\text{AR}(p)$ Model: $$ y_{t+2} = c + \\phi_1 y_{t+1} + \\phi_2 y_t + \\cdots + \\phi_p y_{t-p+2} + \\varepsilon_{t+2}. $$ Conditional expectation: $$ \\hat{y}_{t+2|t} = E(y_{t+2} | I_t) = E(c + \\phi_1 y_{t+1} + \\phi_2 y_t + \\cdots + \\phi_p y_{t-p+2} + \\varepsilon_{t+2} | I_t).
$$ Here $y_{t+1}$ is unknown, so we replace it with its forecast $\\hat{y}_{t+1|t}$: $$ \\hat{y}_{t+2|t} = c + \\phi_1 \\hat{y}_{t+1|t} + \\phi_2 y_t + \\cdots + \\phi_p y_{t-p+2}. $$ Two‑step forecast error: $e_t(2) = y_{t+2} - \\hat{y}_{t+2|t} = \\phi_1 \\varepsilon_{t+1} + \\varepsilon_{t+2}$. Variance: $Var(e_t(2)) = Var(\\phi_1 \\varepsilon_{t+1} + \\varepsilon_{t+2}) = (\\phi_1^2 + 1) \\sigma_\\varepsilon^2$. If $\\varepsilon_t$ is normal, the 95% two‑step prediction interval for $y_{t+2}$ is $\\hat{y}_{t+2|t} \\pm 1.96 \\sqrt{(\\phi_1^2 + 1) \\sigma_\\varepsilon^2}$. $l$‑step‑ahead forecast for $\\text{AR}(p)$ Model: $$ y_{t+l} = c + \\phi_1 y_{t+l-1} + \\phi_2 y_{t+l-2} + \\cdots + \\phi_p y_{t-p+l} + \\varepsilon_{t+l}. $$ Conditional expectation: $$ \\hat{y}_{t+l|t} = E(y_{t+l} | I_t). $$ Given $I_t = \\{y_t, y_{t-1}, ..., y_{t-p+1}\\}$, we have $$ \\hat{y}_{t+l|t} = c + \\sum_{i=1}^p \\phi_i \\hat{y}_{t+l-i|t}, $$ with the convention $\\hat{y}_{t+j|t} = y_{t+j}$ for $j \\le 0$. As $l \\to \\infty$, $\\hat{y}_{t+l|t}$ converges to $E(y_t)$, i.e. long‑term forecasts converge to the unconditional mean $\\mu$. $l$‑step forecast error: $$ \\begin{aligned} e_t(l) &amp;= y_{t+l} - \\hat{y}_{t+l|t} \\ &amp;= \\sum_{i=1}^p \\phi_i \\big(y_{t+l-i} - \\hat{y}_{t+l-i|t}\\big) + \\varepsilon_{t+l} \\ &amp;= \\sum_{i=1}^p \\phi_i e_t(l-i) + \\varepsilon_{t+l}, \\end{aligned} $$ with $e_t(j) = 0$ for $j \\le 0$. Unrolling this recursion expresses the error in terms of the $\\text{MA}(\\infty)$ weights $\\psi_j$ of the process ($\\psi_0 = 1$; for $\\text{AR}(1)$, $\\psi_j = \\phi_1^j$): $$ e_t(l) = \\sum_{j=0}^{l-1} \\psi_j \\varepsilon_{t+l-j}. $$ For $l = 1$ this reduces to $\\sigma_\\varepsilon^2$ below, and for $l = 2$ to $(1 + \\phi_1^2)\\sigma_\\varepsilon^2$, matching the results above. Variance of $l$‑step forecast error: $$ \\begin{aligned} Var(e_t(l)) &amp;= Var\\Big(\\sum_{j=0}^{l-1} \\psi_j \\varepsilon_{t+l-j}\\Big) \\ &amp;= (\\psi_0^2 + \\psi_1^2 + \\cdots + \\psi_{l-1}^2) \\sigma_\\varepsilon^2.
\\end{aligned} $$ Autocorrelation Function The autocorrelation function (ACF) captures indirect correlation: the cumulative impact of all random variables $y_{t-1}, ..., y_{t-s}$ within lag $s$ on $y_t$. The partial autocorrelation function (PACF) captures direct correlation: the pure correlation between $y_t$ and $y_{t-s}$ after removing the effects of intermediate lags. For an $\\text{AR}(p)$ model, the autocovariance is $$ \\begin{aligned} \\gamma_s &amp;= Cov(y_t, y_{t-s}) \\ &amp;= E[(y_t - c)(y_{t-s} - c)] \\ &amp;= E[(\\phi_1 y_{t-1} + \\phi_2 y_{t-2} + \\cdots + \\phi_p y_{t-p} + \\varepsilon_t)(y_{t-s} - c)] \\ &amp;= \\phi_1 E(y_{t-1} y_{t-s}) + \\phi_2 E(y_{t-2} y_{t-s}) + \\cdots + \\phi_p E(y_{t-p} y_{t-s}) + E(\\varepsilon_t y_{t-s}) \\ &amp;= \\phi_1 \\gamma_{s-1} + \\phi_2 \\gamma_{s-2} + \\cdots + \\phi_p \\gamma_{s-p} + E(\\varepsilon_t y_{t-s}). \\end{aligned} $$ The autocorrelation is $$ \\rho_s = \\frac{\\gamma_s}{\\gamma_0} = \\phi_1 \\rho_{s-1} + \\phi_2 \\rho_{s-2} + \\cdots + \\phi_p \\rho_{s-p} + \\frac{E(\\varepsilon_t y_{t-s})}{\\gamma_0}, $$ where $\\gamma_0 = Var(y_t)$ is simply the variance of the series. When $s = 0$, $$ 1 = \\rho_0 = \\phi_1 \\rho_1 + \\phi_2 \\rho_2 + \\cdots + \\phi_p \\rho_p + \\frac{E(\\varepsilon_t y_t)}{\\gamma_0}. $$ Autocorrelation is symmetric, so the above holds as written. Since $\\varepsilon_t$ is uncorrelated with $y_{t-1}, y_{t-2}, ...$, but correlated with $y_t$, we have $E(\\varepsilon_t y_t) \\ne E(\\varepsilon_t) E(y_t)$. 
Expanding, $$ \\begin{aligned} E(\\varepsilon_t y_t) &amp;= E\\big[\\varepsilon_t (c + \\phi_1 y_{t-1} + \\phi_2 y_{t-2} + \\cdots + \\phi_p y_{t-p}) + \\varepsilon_t^2\\big] \\ &amp;= E\\big[\\varepsilon_t (c + \\phi_1 y_{t-1} + \\phi_2 y_{t-2} + \\cdots + \\phi_p y_{t-p})\\big] + E(\\varepsilon_t^2) \\ &amp;= E(\\varepsilon_t) E(c + \\phi_1 y_{t-1} + \\phi_2 y_{t-2} + \\cdots + \\phi_p y_{t-p}) + E(\\varepsilon_t^2) \\ &amp;= 0 \\times E(c + \\phi_1 y_{t-1} + \\phi_2 y_{t-2} + \\cdots + \\phi_p y_{t-p}) + \\sigma_\\varepsilon^2. \\end{aligned} $$ Thus, $$ \\rho_0 = \\phi_1 \\rho_1 + \\phi_2 \\rho_2 + \\cdots + \\phi_p \\rho_p + \\frac{\\sigma_\\varepsilon^2}{\\gamma_0}. $$ When $s &gt; 0$, $y_{t-s}$ cannot be $y_t$, so $\\varepsilon_t$ is uncorrelated with $y_{t-s}$ and $$ E(\\varepsilon_t y_{t-s}) = E(\\varepsilon_t) E(y_{t-s}) = 0 \\times E(y_{t-s}) = 0. $$ Therefore, $$ \\rho_s = \\phi_1 \\rho_{s-1} + \\phi_2 \\rho_{s-2} + \\cdots + \\phi_p \\rho_{s-p}, \\quad s &gt; 0. $$ Now introduce coefficients $\\alpha_i$ to represent the direct relationship between $y_t$ and $y_{t-i}$ (lag $i$). From $\\rho_s = \\phi_1 \\rho_{s-1} + \\cdots + \\phi_p \\rho_{s-p}$, we obtain the linear system: $$ \\begin{cases} \\rho_1 = \\alpha_1 \\rho_0 + \\cdots + \\alpha_p \\rho_{p-1} \\ \\rho_2 = \\alpha_1 \\rho_1 + \\cdots + \\alpha_p \\rho_{p-2} \\ \\vdots \\ \\rho_s = \\alpha_1 \\rho_{s-1} + \\cdots + \\alpha_p \\rho_{s-p} \\end{cases} \\ \\to\\begin{bmatrix} \\rho_0 &amp; \\rho_1 &amp; \\cdots &amp; \\rho_{p-1} \\ \\rho_1 &amp; \\rho_0 &amp; \\cdots &amp; \\rho_{p-2} \\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\ \\rho_{s-1} &amp; \\rho_{s-2} &amp; \\cdots &amp; \\rho_{s-p} \\end{bmatrix} \\begin{bmatrix} \\alpha_1 \\ \\alpha_2 \\ \\vdots \\ \\alpha_p \\end{bmatrix} = \\begin{bmatrix} \\rho_1 \\ \\rho_2 \\ \\vdots \\ \\rho_s \\end{bmatrix} \\ \\to\\ R \\boldsymbol{\\alpha} = \\boldsymbol{\\rho}.
$$ Solving $\\boldsymbol{\\alpha} = R^{-1} \\boldsymbol{\\rho}$ gives the partial autocorrelations $\\alpha_1, ..., \\alpha_p$. For $s &gt; p$, the autoregressive coefficient of $y_{t-s}$ is zero, so there is no direct correlation between $y_t$ and $y_{t-s}$. The PACF of an $\\text{AR}(p)$ model is $$ \\alpha_s = \\begin{cases} 1, &amp; s = 0, \\ \\alpha_s, &amp; 1 \\le s \\le p, \\ 0, &amp; s &gt; p. \\end{cases} $$ Parameter Estimation Yule–Walker For a zero‑mean $\\text{AR}(p)$ process, move terms to the left and multiply both sides by $y_{t-h}$ $(h \\ge 0)$: $$ y_{t-h} (y_t - \\phi_1 y_{t-1} - \\phi_2 y_{t-2} - \\cdots - \\phi_p y_{t-p}) = y_{t-h} \\varepsilon_t. $$ Take expectation and express in terms of autocovariances $\\gamma_i = E(y_t y_{t-i})$: $$ \\gamma_h - \\sum_{i=1}^p \\phi_i \\gamma_{h-i} = E(y_{t-h} \\varepsilon_t) = \\begin{cases} \\sigma_\\varepsilon^2, &amp; h = 0, \\ 0, &amp; h \\ge 1. \\end{cases} $$ For $h = 1, ..., p$ we have $$ \\gamma_h - \\sum_{i=1}^p \\phi_i \\gamma_{h-i} = 0. $$ This gives the Yule–Walker equations: $$ \\begin{bmatrix} \\gamma_0 &amp; \\gamma_1 &amp; \\cdots &amp; \\gamma_{p-1} \\ \\gamma_1 &amp; \\gamma_0 &amp; \\cdots &amp; \\gamma_{p-2} \\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\ \\gamma_{p-1} &amp; \\gamma_{p-2} &amp; \\cdots &amp; \\gamma_0 \\end{bmatrix} \\begin{bmatrix} \\phi_1 \\ \\phi_2 \\ \\vdots \\ \\phi_p \\end{bmatrix} = \\begin{bmatrix} \\gamma_1 \\ \\gamma_2 \\ \\vdots \\ \\gamma_p \\end{bmatrix} \\ \\to\\ \\boldsymbol{\\Gamma} \\boldsymbol{\\phi} = \\boldsymbol{\\gamma}. $$ Plug in the sample autocovariances $\\hat{\\gamma}_h = \\frac1n\\sum_{t=1}^{n-|h|} y_t y_{t+|h|}$ to obtain $$ \\boldsymbol{\\hat{\\phi}} = \\boldsymbol{\\hat\\Gamma}^{-1} \\boldsymbol{\\hat{\\gamma}}. $$ From $\\gamma_0 - \\sum_{i=1}^p \\phi_i \\gamma_{-i} = \\sigma_\\varepsilon^2$ (with $\\gamma_{-i} = \\gamma_i$) we obtain $$ \\hat{\\sigma}_\\varepsilon^2 = \\hat\\gamma_0 - \\sum_{i=1}^p \\hat\\phi_i \\hat\\gamma_i.
$$ Burg If the matrix $\\boldsymbol{\\Gamma}$ is ill‑conditioned, the Yule–Walker method becomes very sensitive to outliers, making the estimates $\\boldsymbol{\\hat{\\phi}}$ unstable. To improve stability, the Burg algorithm is often used to estimate $\\text{AR}(p)$ parameters, especially for short series. Burg’s method estimates AR parameters by minimizing both forward and backward prediction errors. Consider two $\\text{AR}(k)$ models: Forward model (predicting current from past): $$ y_t^+ = \\phi_1 y_{t-1} + \\cdots + \\phi_k y_{t-k} + \\varepsilon_t^+, \\quad t \\in [k, n]. $$ The errors are called forward errors and reflect predictive performance on future data. Backward model (predicting current from future): $$ y_t^- = \\phi_1 y_{t+1} + \\cdots + \\phi_k y_{t+k} + \\varepsilon_t^-, \\quad t \\in [0, n-k]. $$ The errors are called backward errors and reflect fit on past data. Burg minimizes the sum of forward and backward squared errors: Forward error: $$ F_k = \\sum_{t=k}^n (y_t - y_t^+)^2 = \\sum_{t=k}^n \\Big(y_t - \\sum_{i=1}^k \\phi_i y_{t-i}\\Big)^2. $$ Backward error: $$ B_k = \\sum_{t=0}^{n-k} (y_t - y_t^-)^2 = \\sum_{t=0}^{n-k} \\Big(y_t - \\sum_{i=1}^k \\phi_i y_{t+i}\\Big)^2. $$ Write the error‑filter coefficients as $a_0 = 1$, $a_i = -\\phi_i$. Define $F_k = \\sum_{t=k}^n (f_k(t))^2$, where $f_k(t) = \\sum_{i=0}^k a_i y_{t-i}$. $B_k = \\sum_{t=0}^{n-k} (b_k(t))^2$, where $b_k(t) = \\sum_{i=0}^k a_i y_{t+i}$. Burg uses a recursive update $$ A_{k+1} = A_k + \\mu V_k, $$ where $A_k = [1, a_1, ..., a_k, 0]$ (padded with a trailing zero) and $V_k = [0, a_k, ..., a_2, a_1, 1]$. With the conventions $a_0 = 1$ and $a_{k+1} = 0$, the update can be written component‑wise as $$ a_i&#39; = a_i + \\mu a_{k+1-i}, $$ so the new last coefficient is $a_{k+1}&#39; = \\mu$ (the reflection coefficient). To ensure $A_{k+1}$ is better than $A_k$, we choose $\\mu$ to minimize $F_{k+1} + B_{k+1}$. Expanding $F_{k+1} + B_{k+1} = \\sum_{t=k+1}^n f_{k+1}(t)^2 + \\sum_{t=0}^{n-k-1} b_{k+1}(t)^2$ yields $f_{k+1}(t) = f_k(t) + \\mu b_k(t-k-1)$, $b_{k+1}(t) = b_k(t) + \\mu f_k(t+k+1)$.
Setting $\\partial(F_{k+1} + B_{k+1})/\\partial \\mu = 0$ gives $$ \\mu = \\frac{-2 \\sum_{t=0}^{n-k-1} f_k(t+k+1) b_k(t)}{\\sum_{t=k+1}^n f_k(t)^2 + \\sum_{t=0}^{n-k-1} b_k(t)^2}. $$ The overall Burg algorithm is: Choose order $p$. Initialize $A_0 = [1]$ and $f_0(t) = b_0(t) = y_t$. For $k = 0, ..., p-1$: Compute $\\mu$ and update $A_{k+1}$. Update $f_{k+1}(t)$ for $t \\in [k+1, n]$. Update $b_{k+1}(t)$ for $t \\in [0, n-k-1]$. A detailed derivation can be found in, e.g., https://c.mql5.com/3/133/Tutorial_on_Burg_smethod_algorithm_recursion.pdf MA Model A moving average model (MA) is a regression model on past forecast errors. The number of error terms is called the order of the MA model. A $q$‑th‑order MA model is $$ \\text{MA}(q):\\quad y_t = c + \\varepsilon_t + \\theta_1 \\varepsilon_{t-1} + \\theta_2 \\varepsilon_{t-2} + \\cdots + \\theta_q \\varepsilon_{t-q}, $$ where $\\theta_i$ are error weights and $\\varepsilon_t$ is white noise. Moments For an $\\text{MA}(q)$ process: Mean: $E(y_t) = c$. Variance: $Var(y_t) = (1 + \\theta_1^2 + \\theta_2^2 + \\cdots + \\theta_q^2) \\sigma_\\varepsilon^2$. Autocovariance: $$ Cov(y_t, y_{t-s}) = E[(y_t - c)(y_{t-s} - c)] = \\begin{cases} (1 + \\theta_1^2 + \\cdots + \\theta_q^2) \\sigma_\\varepsilon^2, &amp; s = 0, \\ (\\theta_s + \\theta_1 \\theta_{s+1} + \\cdots + \\theta_{q-s} \\theta_q) \\sigma_\\varepsilon^2, &amp; 1 \\le s \\le q, \\ 0, &amp; s &gt; q. \\end{cases} $$ A finite‑order $\\text{MA}(q)$ model is always stationary, with no constraints on $\\theta$ needed. Invertibility A stationary $\\text{AR}(p)$ process can be rewritten as an $\\text{MA}(\\infty)$ process. 
For example, $$ \\begin{aligned} y_t &amp;= \\phi_1 y_{t-1} + \\varepsilon_t \\ &amp;= \\phi_1(\\phi_1 y_{t-2} + \\varepsilon_{t-1}) + \\varepsilon_t \\ &amp;= \\phi_1^2 y_{t-2} + \\phi_1 \\varepsilon_{t-1} + \\varepsilon_t \\ &amp;= \\phi_1^3 y_{t-3} + \\phi_1^2 \\varepsilon_{t-2} + \\phi_1 \\varepsilon_{t-1} + \\varepsilon_t \\ &amp;= \\cdots. \\end{aligned} $$ If $|\\phi_1| &lt; 1$, the remainder $\\phi_1^k y_{t-k}$ vanishes as $k \\to \\infty$, leaving a convergent $\\text{MA}(\\infty)$ representation. Conversely, an invertible $\\text{MA}(q)$ can be represented as a convergent $\\text{AR}(\\infty)$: $$ \\begin{aligned} y_t &amp;= \\varepsilon_t + \\theta_1 \\varepsilon_{t-1} \\ &amp;= \\varepsilon_t + \\theta_1 (y_{t-1} - \\theta_1 \\varepsilon_{t-2}) \\ &amp;= \\varepsilon_t + \\theta_1 y_{t-1} - \\theta_1^2 \\varepsilon_{t-2} \\ &amp;= \\varepsilon_t + \\theta_1 y_{t-1} - \\theta_1^2 y_{t-2} + \\theta_1^3 \\varepsilon_{t-3} \\ &amp;= \\cdots \\ &amp;= \\theta_1 y_{t-1} - \\theta_1^2 y_{t-2} + \\theta_1^3 y_{t-3} - \\cdots + \\varepsilon_t. \\end{aligned} $$ Invertibility of $\\text{MA}(q)$ can also be expressed via the characteristic equation $$ 1 - \\theta_1 z - \\theta_2 z^2 - \\cdots - \\theta_q z^q = 0. $$ The condition is analogous to the AR case: Invertible: all roots satisfy $|\\lambda_i| &lt; 1$. Non‑invertible: at least one root satisfies $|\\lambda_i| \\ge 1$. For $\\text{MA}(1)$, the characteristic equation is $$ \\lambda^1 - \\theta_1 \\lambda^0 = \\lambda - \\theta_1 = 0, $$ so $|\\theta_1| &lt; 1$ ensures invertibility. For $\\text{MA}(2)$, $$ \\lambda^2 - \\theta_1 \\lambda^1 - \\theta_2 \\lambda^0 = \\lambda^2 - \\theta_1 \\lambda - \\theta_2 = 0, $$ and invertibility holds if $|\\theta_2| &lt; 1$ and $\\theta_2 \\pm \\theta_1 &lt; 1$. An invertible and a non‑invertible $\\text{MA}(q)$ can produce the same first and second moments. For example, consider an invertible $\\text{MA}(1)$: $$ y_t = c + \\varepsilon_t + \\theta_1 \\varepsilon_{t-1}, \\quad \\varepsilon_t \\sim D(0, \\sigma_\\varepsilon^2). 
$$ Mean: $E(y_t) = c$. Variance: $Var(y_t) = \\sigma_\\varepsilon^2 + \\theta_1^2 \\sigma_\\varepsilon^2$. And a non‑invertible counterpart: $$ y_t^* = c + \\varepsilon_t^* + \\frac{1}{\\theta_1} \\varepsilon_{t-1}^*, \\quad \\varepsilon_t^* \\sim D(0, \\theta_1^2 \\sigma_\\varepsilon^2). $$ Mean: $E(y_t^*) = c$. Variance: $Var(y_t^*) = \\theta_1^2 \\sigma_\\varepsilon^2 + \\sigma_\\varepsilon^2$. However, the two processes fundamentally differ in how $\\varepsilon_t$ is computed: Non‑invertible model: $\\varepsilon_t$ depends on future values $y_{t+1}, y_{t+2}, ...$. Invertible model: $\\varepsilon_t$ depends only on current and past values $y_t, y_{t-1}, ...$. When performing forecasting or maximum likelihood estimation, we need to compute $\\varepsilon_t$, and an invertible $\\text{MA}(q)$ is much more convenient. In other words, only invertible $\\text{MA}(q)$ models are usable for forecasting. Forecast Error $\\text{MA}(1)$ Model: $$ y_{t+1} = c + \\varepsilon_{t+1} + \\theta_1 \\varepsilon_t. $$ Conditional expectation: $$ \\hat{y}_{t+1|t} = c + E(\\varepsilon_{t+1} | I_t) + \\theta_1 E(\\varepsilon_t | I_t). $$ Assume we have an invertible model. Given $I_t = \\{y_t, y_{t-1}, ..., y_1\\}$, $c, \\theta_1, y_i$ are constants, and we can compute the residuals $\\varepsilon_t$ recursively: $y_1 = c + \\varepsilon_1 \\ \\to\\ \\varepsilon_1 = y_1 - c$. $y_2 = c + \\varepsilon_2 + \\theta_1 \\varepsilon_1 \\ \\to\\ \\varepsilon_2 = y_2 - c - \\theta_1 (y_1 - c)$. ... $y_t = c + \\varepsilon_t + \\theta_1 \\varepsilon_{t-1} \\ \\to\\ \\varepsilon_t = y_t - c - \\theta_1 \\varepsilon_{t-1}$. Unconditionally, $E(\\varepsilon_t) = 0$, but under $I_t$ the realized $\\varepsilon_i$ $(i = 1, ..., t)$ are constants, so $$ E(\\varepsilon_{t+i} | I_t) = \\begin{cases} 0, &amp; i &gt; 0, \\ \\varepsilon_{t+i}, &amp; i \\le 0. \\end{cases} $$ Hence $$ \\hat{y}_{t+1|t} = c + \\theta_1 \\varepsilon_t. $$ One‑step forecast error: $e_t(1) = y_{t+1} - \\hat{y}_{t+1|t} = \\varepsilon_{t+1}$. 
Variance: $Var(e_t(1)) = \\sigma_\\varepsilon^2$. For the two‑step forecast, $$ \\hat{y}_{t+2|t} = c + \\theta_1 E(\\varepsilon_{t+1} | I_t) = c. $$ In general, for $l \\ge 2$, $$ \\hat{y}_{t+l|t} = c. $$ $\\text{MA}(2)$ Model: $$ y_{t+1} = c + \\varepsilon_{t+1} + \\theta_1 \\varepsilon_t + \\theta_2 \\varepsilon_{t-1}. $$ One‑step conditional expectation: $$ \\hat{y}_{t+1|t} = c + \\theta_1 \\varepsilon_t + \\theta_2 \\varepsilon_{t-1}. $$ One‑step forecast error: $e_t(1) = \\varepsilon_{t+1}$. Variance: $Var(e_t(1)) = \\sigma_\\varepsilon^2$. Two‑step model: $$ y_{t+2} = c + \\varepsilon_{t+2} + \\theta_1 \\varepsilon_{t+1} + \\theta_2 \\varepsilon_t. $$ Two‑step conditional expectation: $$ \\hat{y}_{t+2|t} = c + \\theta_2 \\varepsilon_t. $$ Two‑step forecast error: $e_t(2) = \\varepsilon_{t+2} + \\theta_1 \\varepsilon_{t+1}$. Variance: $Var(e_t(2)) = (1 + \\theta_1^2) \\sigma_\\varepsilon^2$. For $l \\ge 3$, $\\hat{y}_{t+l|t} = c$. General $\\text{MA}(q)$ Model: $$ y_{t+l} = c + \\varepsilon_{t+l} + \\theta_1 \\varepsilon_{t+l-1} + \\theta_2 \\varepsilon_{t+l-2} + \\cdots + \\theta_q \\varepsilon_{t+l-q}. $$ Given $I_t$, we have $$ \\begin{aligned} \\hat{y}_{t+l|t} &amp;= E(y_{t+l} | I_t) \\ &amp;= c + E(\\varepsilon_{t+l} | I_t) + \\theta_1 E(\\varepsilon_{t+l-1} | I_t) + \\cdots + \\theta_q E(\\varepsilon_{t+l-q} | I_t). \\end{aligned} $$ Thus $$ \\hat{y}_{t+l|t} = \\begin{cases} c + \\theta_l \\varepsilon_t + \\theta_{l+1} \\varepsilon_{t-1} + \\cdots + \\theta_q \\varepsilon_{t+l-q}, &amp; l \\le q, \\ c, &amp; l &gt; q. \\end{cases} $$ Forecast error (with $\\theta_0 = 1$ and $\\theta_j = 0$ for $j &gt; q$; only the unrealized errors $\\varepsilon_{t+1}, ..., \\varepsilon_{t+l}$ remain): $$ e_t(l) = y_{t+l} - \\hat{y}_{t+l|t} = \\varepsilon_{t+l} + \\theta_1 \\varepsilon_{t+l-1} + \\cdots + \\theta_{l-1} \\varepsilon_{t+1} = \\sum_{i=0}^{l-1} \\theta_i \\varepsilon_{t+l-i}. $$ Variance: $$ Var(e_t(l)) = (1 + \\theta_1^2 + \\theta_2^2 + \\cdots + \\theta_{l-1}^2) \\sigma_\\varepsilon^2. $$ If $\\varepsilon_t$ is normal, the 95% $l$‑step prediction interval for $y_{t+l}$ is $$ \\hat{y}_{t+l|t} \\pm 1.96 \\sqrt{Var(e_t(l))}. 
$$ Autocorrelation Function For $\\text{MA}(q)$, $$ \\gamma_s = Cov(y_t, y_{t-s}) = E[(y_t - c)(y_{t-s} - c)] = \\begin{cases} (1 + \\theta_1^2 + \\cdots + \\theta_q^2) \\sigma_\\varepsilon^2, &amp; s = 0, \\ (\\theta_s + \\theta_1 \\theta_{s+1} + \\cdots + \\theta_{q-s} \\theta_q) \\sigma_\\varepsilon^2, &amp; 1 \\le s \\le q, \\ 0, &amp; s &gt; q. \\end{cases} $$ The autocorrelation is $$ \\rho_s = \\begin{cases} 1, &amp; s = 0, \\ \\dfrac{\\theta_s + \\theta_1 \\theta_{s+1} + \\cdots + \\theta_{q-s} \\theta_q}{1 + \\theta_1^2 + \\cdots + \\theta_q^2}, &amp; 1 \\le s \\le q, \\ 0, &amp; s &gt; q. \\end{cases} $$ Since an invertible $\\text{MA}(q)$ can be represented as an $\\text{AR}(\\infty)$, the PACF of $\\text{MA}(q)$ is an infinite‑length tail function and is usually not written out explicitly. Parameter Estimation Innovations Algorithm For a zero‑mean stationary process, consider the best linear one‑step predictor based on all $t$ available values (an $\\text{AR}(t)$ fit): One‑step forecast: $\\hat{y}_{t+1} = \\sum_{i=0}^t \\phi_{t,i} y_{t-i} = \\boldsymbol{\\phi_t y_t}$. One‑step forecast error: $u_{t+1} = y_{t+1} - \\hat{y}_{t+1}$. Forecast variance: $v_t = Var(u_{t+1}) = E[(y_{t+1} - \\hat{y}_{t+1})^2] = \\gamma_0 - \\boldsymbol{\\phi_t \\gamma_t}$. $v_t$ can be viewed as a mean‑squared error (MSE) function. As long as $v_{t+1} &lt; v_t$, the estimate $\\boldsymbol{\\hat{\\phi}_{t+1}}$ is better than $\\boldsymbol{\\hat{\\phi}_t}$. The one‑step forecast error $u_{t+1}$ is the innovation—the new information in $y_{t+1}$ not contained in $I_t = \\{y_t, ..., y_1\\}$. Thus $u_{t+1}$ is uncorrelated with $I_t$, and $u_t$ is uncorrelated with $u_s$ for $t \\ne s$. Given observations $\\{y_t, ..., y_1\\}$ and forecasts $\\{\\hat{y}_t, ..., \\hat{y}_1\\}$, with initial best estimate $\\hat{y}_1 = 0$, we can express the forecast $\\hat{y}_{t+1}$ as a linear combination of past innovations: $$ \\hat{y}_{t+1} = \\sum_{i=1}^t \\theta_{t,i} (y_{t+1-i} - \\hat{y}_{t+1-i}). 
$$ This is the innovations algorithm, whose coefficients can be updated iteratively: Linear prediction coefficients: $$ \\theta_{t,t-k} = \\frac{\\gamma_{t-k} - \\sum_{j=0}^{k-1} \\theta_{k,k-j} \\theta_{t,t-j} v_j}{v_k}. $$ Forecast variance: $$ v_t = \\gamma_0 - \\sum_{i=0}^{t-1} \\theta_{t,t-i}^2 v_i. $$ For clarity, here are the first three iterations (with $v_0 = \\gamma_0$): First iteration: $\\theta_{1,1} = \\dfrac{\\gamma_1}{v_0}$. $v_1 = \\gamma_0 - \\theta_{1,1}^2 v_0$. Second iteration: $\\theta_{2,2} = \\dfrac{\\gamma_2}{v_0}$. $\\theta_{2,1} = \\dfrac{\\gamma_1 - \\theta_{1,1} \\theta_{2,2} v_0}{v_1}$. $v_2 = \\gamma_0 - \\theta_{2,2}^2 v_0 - \\theta_{2,1}^2 v_1$. Third iteration: $\\theta_{3,3} = \\dfrac{\\gamma_3}{v_0}$. $\\theta_{3,2} = \\dfrac{\\gamma_2 - \\theta_{1,1} \\theta_{3,3} v_0}{v_1}$. $\\theta_{3,1} = \\dfrac{\\gamma_1 - (\\theta_{2,2} \\theta_{3,3} v_0 + \\theta_{2,1} \\theta_{3,2} v_1)}{v_2}$. $v_3 = \\gamma_0 - \\theta_{3,3}^2 v_0 - \\theta_{3,2}^2 v_1 - \\theta_{3,1}^2 v_2$. The innovations algorithm is often used to estimate $\\text{MA}(q)$ parameters. Steps: Choose order $q$. Compute autocovariances $\\gamma_0, ..., \\gamma_q$. Initialize $v_0 = \\gamma_0$. For $m = 1, ..., q$: iteratively compute $\\theta_{m,m}, ..., \\theta_{m,1}$ and $v_m$. Hannan–Rissanen For an $\\text{AR}(p)$ model, one can estimate parameters by least squares on $$ \\begin{bmatrix} y_p &amp; y_{p-1} &amp; \\cdots &amp; y_1 \\ y_{p+1} &amp; y_p &amp; \\cdots &amp; y_2 \\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\ y_{n-1} &amp; y_{n-2} &amp; \\cdots &amp; y_{n-p} \\end{bmatrix} \\begin{bmatrix} \\phi_1 \\ \\phi_2 \\ \\vdots \\ \\phi_p \\end{bmatrix} = \\begin{bmatrix} y_{p+1} \\ y_{p+2} \\ \\vdots \\ y_n \\end{bmatrix} \\ \\to\\ \\boldsymbol{Y \\phi = y}. $$ This does not work directly for $\\text{MA}(q)$, because $y_t$ is observable but $\\varepsilon_t$ is not. 
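The recursion above can be sketched in a few lines of Python. This is a minimal sketch; the function name `innovations` and the dict-based storage of the $\theta_{m,j}$ are our own choices, not from any library:

```python
import numpy as np

def innovations(gamma, q):
    """Innovations algorithm: from autocovariances gamma[0..q], compute
    the coefficients theta[m][j] and one-step forecast variances v[m]."""
    v = np.zeros(q + 1)
    v[0] = gamma[0]
    theta = [dict() for _ in range(q + 1)]
    for m in range(1, q + 1):
        for k in range(m):
            # theta_{m,m-k} = (gamma_{m-k} - sum_{j<k} theta_{k,k-j} theta_{m,m-j} v_j) / v_k
            s = sum(theta[k].get(k - j, 0.0) * theta[m].get(m - j, 0.0) * v[j]
                    for j in range(k))
            theta[m][m - k] = (gamma[m - k] - s) / v[k]
        # v_m = gamma_0 - sum_{i<m} theta_{m,m-i}^2 v_i
        v[m] = gamma[0] - sum(theta[m][m - i] ** 2 * v[i] for i in range(m))
    return theta, v

# MA(1) with theta_1 = 0.5 and unit noise variance:
# gamma_0 = 1.25, gamma_1 = 0.5, gamma_h = 0 for h > 1
theta, v = innovations([1.25, 0.5, 0.0, 0.0], 3)
```

For this MA(1), the iterates $\theta_{m,1}$ converge toward $0.5$ and $v_m$ toward $1$ as $m$ grows, which is exactly why the recursion can be read as an MA parameter estimator.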
The Hannan–Rissanen algorithm estimates an $\\text{ARMA}(p,q)$ process by first approximating the errors with residuals $e_t$ and then regressing on both lags and residuals. Steps: Estimate past errors Fit a high‑order $\\text{AR}(m)$ model with $m &gt; \\max(p, q)$ using Yule–Walker to obtain $\\hat{\\phi}_1, ..., \\hat{\\phi}_m$, and compute residuals $$ e_t = y_t - \\hat{\\phi}_1 y_{t-1} - \\cdots - \\hat{\\phi}_m y_{t-m}. $$ Joint regression Build the matrix $$ \\boldsymbol{A} = \\begin{bmatrix} y_{m+q} &amp; y_{m+q-1} &amp; \\cdots &amp; y_{m+q+1-p} &amp; e_{m+q} &amp; e_{m+q-1} &amp; \\cdots &amp; e_{m+1} \\ y_{m+q+1} &amp; y_{m+q} &amp; \\cdots &amp; y_{m+q+2-p} &amp; e_{m+q+1} &amp; e_{m+q} &amp; \\cdots &amp; e_{m+2} \\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots &amp; \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\ y_{n-1} &amp; y_{n-2} &amp; \\cdots &amp; y_{n-p} &amp; e_{n-1} &amp; e_{n-2} &amp; \\cdots &amp; e_{n-q} \\end{bmatrix}. $$ Define $\\boldsymbol{\\beta} = (\\phi_1, ..., \\phi_p, \\theta_1, ..., \\theta_q)$, solve $\\boldsymbol{A \\beta = y}$ to get $\\boldsymbol{\\hat{\\beta}}$, and compute ARMA residuals $$ \\tilde{e}_t = \\begin{cases} 0, &amp; t \\le \\max(p, q), \\ y_t - \\sum_{j=1}^p \\hat{\\phi}_j y_{t-j} - \\sum_{j=1}^q \\hat{\\theta}_j \\tilde{e}_{t-j}, &amp; t &gt; \\max(p, q). 
\\end{cases} $$ Refine parameter estimates Using $\\tilde{e}_t$, construct $$ \\boldsymbol{\\tilde{A}} = \\begin{bmatrix} v_{t-1} &amp; v_{t-2} &amp; \\cdots &amp; v_{t-p} &amp; w_{t-1} &amp; w_{t-2} &amp; \\cdots &amp; w_{t-q} \\ v_{t-2} &amp; v_{t-3} &amp; \\cdots &amp; v_{t-p-1} &amp; w_{t-2} &amp; w_{t-3} &amp; \\cdots &amp; w_{t-q-1} \\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots &amp; \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\ v_{n-1} &amp; v_{n-2} &amp; \\cdots &amp; v_{n-p} &amp; w_{n-1} &amp; w_{n-2} &amp; \\cdots &amp; w_{n-q} \\end{bmatrix}, $$ where $v_t = \\sum_{j=1}^p \\hat{\\phi}_j v_{t-j} + \\tilde{e}_t$, $w_t = -\\sum_{j=1}^q \\hat{\\theta}_j w_{t-j} + \\tilde{e}_t$. Solve $\\boldsymbol{\\tilde{A} \\beta^\\dagger = \\tilde{e}}$ to obtain $\\boldsymbol{\\hat{\\beta}^\\dagger}$ and update the parameters as $\\boldsymbol{\\tilde{\\beta}} = \\boldsymbol{\\hat{\\beta}} + \\boldsymbol{\\hat{\\beta}^\\dagger}$. This third step is a bias correction and is applied only if the step‑2 estimates correspond to a stationary AR part and an invertible MA part. Unit Root Tests Before analyzing or modeling a time series, we need to determine whether it contains trends. Unlike deterministic linear trends, stochastic trends cannot be reliably judged by eye. When differencing is required, we must also choose an appropriate differencing order: remove the trend without over‑differencing, which would increase model error. Unit roots provide a good basis for these decisions: If an $\\text{AR}(p)$ process has roots close to the unit circle, the series needs differencing. If an $\\text{MA}(q)$ process has roots close to the unit circle, the series is likely over‑differenced. A time series without unit roots but with a trend is called trend‑stationary. Its trend can be removed via regression without differencing, avoiding unnecessary noise. A time series with unit roots is called a unit root process, or difference‑stationary. 
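The two kinds of trend can be made concrete with a small simulation (illustrative only; the drift and trend coefficients are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
eps = rng.normal(size=T)

# Trend-stationary: deterministic trend plus stationary noise.
# Regressing out the trend leaves a stationary residual.
trend_stationary = 0.05 * np.arange(T) + eps

# Difference-stationary: a random walk with drift (unit root).
# Only differencing removes its stochastic trend.
unit_root = np.cumsum(0.05 + eps)

# Differencing the unit root process recovers the drift plus white noise.
diffs = np.diff(unit_root)
```

Both series drift upward and can look similar to the eye, which is exactly why a formal test is needed.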
If we apply OLS directly to a unit root process, the estimates are biased and the t‑statistics for the coefficients are invalid. We cannot use regression to remove the trend and must rely on differencing. Unit root tests are statistical tools to detect these roots. An important application is to select the differencing order: repeatedly difference the data until unit roots are no longer significant. ADF Test Letting $\\phi_1 = 1$ in an $\\text{AR}(1)$ process yields three common non‑stationary forms: $y_t = \\phi_1 y_{t-1} + \\varepsilon_t$ (RW) $y_t = c + \\phi_1 y_{t-1} + \\varepsilon_t$ (RWD) $y_t = c_1 + c_2 t + \\phi_1 y_{t-1} + \\varepsilon_t$ (RWD+DT) If we used standard t‑tests on $\\phi_1$: Null: $H_0: \\phi_1 = 1$. Estimate $\\hat{\\phi}_1$ by OLS. Compute $t = (\\hat{\\phi}_1 - 1)/SE(\\hat{\\phi}_1)$. Compare with critical values from the t‑distribution. The problem is that when $\\phi_1 = 1$, the regressor $y_{t-1}$ is non‑stationary (its variance grows with $t$), so the usual OLS asymptotics fail: $\\hat{\\phi}_1$ is biased in finite samples and the t‑statistic does not follow a t‑distribution. The Dickey–Fuller (DF) test solves this by reparameterizing. Starting from $$ y_t = \\phi_1 y_{t-1} + \\varepsilon_t, $$ subtract $y_{t-1}$ from both sides: $\\Delta y_t = \\gamma y_{t-1} + \\varepsilon_t$. $\\Delta y_t = c + \\gamma y_{t-1} + \\varepsilon_t$. $\\Delta y_t = c_1 + c_2 t + \\gamma y_{t-1} + \\varepsilon_t$. where $\\Delta y_t = y_t - y_{t-1}$ and $\\gamma = \\phi_1 - 1$. The DF test uses $H_0: \\gamma = 0$ (unit root present). $H_1: \\gamma &lt; 0$ (no unit root). Estimate $\\hat{\\gamma}$ via OLS and compute $$ \\tau = \\frac{\\hat{\\gamma}}{SE(\\hat{\\gamma})}. $$ The distribution of $\\tau$ is non‑standard. Dickey and Fuller computed its critical values via Monte Carlo simulation. Note that the three forms above (no constant, constant, constant + trend) have different critical value tables. If the DF statistic is less than the (one‑sided) critical value, we reject $H_0$ and conclude there is no unit root. 
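The τ statistic for the simplest (no-constant) form is just an OLS ratio and can be computed directly. A minimal sketch, with our own helper name `df_tau`; the critical values still have to come from the Dickey–Fuller tables, not from the t-distribution:

```python
import numpy as np

def df_tau(y):
    """DF tau for the no-constant form: regress dy_t on y_{t-1},
    return gamma_hat / SE(gamma_hat)."""
    dy = np.diff(y)
    x = y[:-1]
    gamma = (x @ dy) / (x @ x)            # OLS estimate of gamma = phi_1 - 1
    resid = dy - gamma * x
    s2 = (resid @ resid) / (len(dy) - 1)  # residual variance
    se = np.sqrt(s2 / (x @ x))            # standard error of gamma_hat
    return gamma / se

rng = np.random.default_rng(0)
eps = rng.normal(size=500)
tau_stationary = df_tau(eps)             # white noise: strongly negative tau
tau_unit_root = df_tau(np.cumsum(eps))   # random walk: tau stays near zero
```

Comparing the two values against a DF table (rather than t critical values) is the whole test.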
DF assumes an $\\text{AR}(1)$ process and requires uncorrelated errors. For higher‑order $\\text{AR}(p)$, unmodeled dynamics leak into the error term, causing autocorrelation. The augmented Dickey–Fuller (ADF) test extends DF by adding lagged differences to absorb serial correlation in the errors. Starting from an $\\text{AR}(p)$ model, we repeatedly add and subtract terms: Add and subtract $\\phi_p y_{t-p+1}$: $$ y_t = c + \\phi_1 y_{t-1} + \\cdots + (\\phi_{p-1} + \\phi_p) y_{t-p+1} - \\phi_p \\Delta y_{t-p+1} + \\varepsilon_t. $$ Add and subtract $(\\phi_{p-1} + \\phi_p) y_{t-p+2}$: $$ y_t = c + \\phi_1 y_{t-1} + \\cdots + (\\phi_{p-2} + \\phi_{p-1} + \\phi_p) y_{t-p+2} - (\\phi_{p-1} + \\phi_p) \\Delta y_{t-p+2} - \\phi_p \\Delta y_{t-p+1} + \\varepsilon_t. $$ Add and subtract $(\\phi_{p-2} + \\phi_{p-1} + \\phi_p) y_{t-p+3}$, and so on. Eventually we obtain $$ y_t = c + \\Big(\\sum_{i=1}^p \\phi_i\\Big) y_{t-1} - \\sum_{i=2}^p \\Big(\\sum_{j=i}^p \\phi_j\\Big) \\Delta y_{t-i+1} + \\varepsilon_t. $$ In differenced form, $$ \\Delta y_t = c + \\gamma y_{t-1} + \\sum_{i=2}^p \\beta_i \\Delta y_{t-i+1} + \\varepsilon_t, $$ where $\\gamma = -\\big(1 - \\sum_{i=1}^p \\phi_i\\big)$ and $\\beta_i = -\\sum_{j=i}^p \\phi_j$. ADF supports the following three forms: $\\Delta y_t = \\gamma y_{t-1} + \\sum_{i=2}^p \\beta_i \\Delta y_{t-i+1} + \\varepsilon_t$. $\\Delta y_t = c + \\gamma y_{t-1} + \\sum_{i=2}^p \\beta_i \\Delta y_{t-i+1} + \\varepsilon_t$. $\\Delta y_t = c_1 + c_2 t + \\gamma y_{t-1} + \\sum_{i=2}^p \\beta_i \\Delta y_{t-i+1} + \\varepsilon_t$. These three correspond to different critical value tables and must be specified when testing. A necessary condition for stationarity of $\\text{AR}(p)$ is $\\sum_{i=1}^p \\phi_i &lt; 1$, so ADF uses the same null and alternative as DF: $H_0: \\gamma = 0$ (unit root). $H_1: \\gamma &lt; 0$ (no unit root). When using ADF we must choose a lag order $p$ and a deterministic form. 
Lag selection can be done by: Starting from a large $p$ and gradually reducing it until all lagged differences are insignificant (if all are insignificant even for large $p$, a unit root is likely present). Using information criteria. An unknown $\\text{ARIMA}(p, d, q)$ process can be approximated by an $\\text{ARIMA}(n, 1, 0)$ process with $n$ on the order of $T^{1/3}$. Thus ADF can also handle data‑generating processes that contain unknown $\\text{MA}(q)$ components. However, ADF often has low power, especially for small samples and processes with strong autocorrelation. KPSS Test ADF has a key limitation: the unit root is the null hypothesis, and low power means that non‑unit root processes are often misclassified as unit root. The KPSS test (Kwiatkowski–Phillips–Schmidt–Shin) addresses this by reversing the roles: Null: stationarity. Alternative: unit root. LBI Test Consider the following state‑space representation of a dynamic system: Observation equation: $y_t = x_t \\beta_t + z_t&#39; \\gamma + \\varepsilon_t$. State equation: $\\beta_t = \\beta_{t-1} + u_t$. The observation has two parts: $z_t&#39; \\gamma$: time‑invariant component (constant regression coefficients $\\gamma$). $x_t \\beta_t$: time‑varying component (coefficients $\\beta_t$ change over time). There are two error terms: Measurement error $\\varepsilon_t \\sim N(0, \\sigma_\\varepsilon^2)$ with $\\sigma_\\varepsilon^2 &gt; 0$. Dynamic noise $u_t \\sim N(0, \\sigma_u^2)$ with $\\sigma_u^2 \\ge 0$. This is a varying‑coefficient regression model: $y_t \\in \\mathbb{R}$ is the dependent variable. $x_t, z_t \\in \\mathbb{R}^n$ are regressors. $\\beta_t, \\gamma \\in \\mathbb{R}^n$ are coefficients. $\\sigma_\\varepsilon^2$ is the variance of the observation error. $\\sigma_u^2$ is the variance of the state innovation. When $\\sigma_u^2 &gt; 0$, the time‑varying coefficient $\\beta_t$ follows a random walk. 
If $y_t$ is normal, the LBI (Locally Best Invariant) test can be used to test the constancy of $\\beta_t$: $H_0: \\rho = \\sigma_u^2 / \\sigma_\\varepsilon^2 = 0$ ($\\beta_t$ is constant). $H_1: \\rho &gt; 0$ ($\\beta_t$ varies over time). The KPSS test is a special case with $x_t = 1$, $\\beta_t = r_t$, $z_t = t$, $\\gamma = \\xi$: $$ y_t = \\xi t + r_t + \\varepsilon_t, $$ where Deterministic trend: $\\xi t$. Random walk: $r_t = r_{t-1} + u_t$, with initial value $r_0$ and $u_t \\sim N(0, \\sigma_u^2)$. Stationary error: $\\varepsilon_t \\sim N(0, \\sigma_\\varepsilon^2)$. The KPSS hypotheses are $H_0: \\sigma_u^2 = 0$ (stationary). $H_1: \\sigma_u^2 \\ne 0$ (non‑stationary). KPSS supports two cases: $\\xi \\ne 0$: $H_0$ is trend stationarity. $\\xi = 0$: $H_0$ is level stationarity. When $\\varepsilon_t \\sim N(0, \\sigma_\\varepsilon^2)$, KPSS is a standard LM test (LBI is a special case). Steps: Compute residuals $e_i$: If $\\xi \\ne 0$, regress $y_t$ on $t$ by OLS ($y_t = r_0 + \\xi t + e_t$) and take the residuals. If $\\xi = 0$, use $e_i = y_i - \\bar{y}$. Compute the LM statistic: Cumulative sum of residuals: $S_t = \\sum_{i=1}^t e_i$. Residual variance: $\\hat{\\sigma}_\\varepsilon^2 = Var(e_i)$. LM statistic: $LM = \\sum_{t=1}^T S_t^2 / \\hat{\\sigma}_\\varepsilon^2$. Compare to critical values (or p‑values) to decide whether to reject $H_0$. In practice, errors are usually serially correlated, so $\\varepsilon_t \\sim N(0, \\sigma_\\varepsilon^2)$ is too strong. KPSS therefore uses a weaker asymptotic framework. Define the long‑run variance $\\sigma^2 = \\lim_{T \\to \\infty} E(S_T^2)/T$. Estimate it via a Newey–West‑type estimator $$ s^2(l) = T^{-1} \\sum_{t=1}^T e_t^2 + 2T^{-1} \\sum_{s=1}^l w(s, l) \\sum_{t=s+1}^T e_t e_{t-s}, $$ where $w(s, l) = 1 - s/(l+1)$ is a Bartlett kernel and $l = o(T^{1/2})$ is the lag truncation. Use $s^2(l)$ in place of $\\hat{\\sigma}_\\varepsilon^2$ in the LM statistic. Normalize by $T^{-2}$: $\\eta = T^{-2} \\sum S_t^2$. 
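Putting the pieces together (demeaned residuals, partial sums, Newey–West long-run variance), the level-stationarity statistic fits in a few lines. A minimal sketch with our own helper name `kpss_stat`; in practice statsmodels' `kpss` does the same job:

```python
import numpy as np

def kpss_stat(y, lags):
    """KPSS statistic for level stationarity: T^{-2} sum S_t^2 / s^2(l)."""
    e = y - y.mean()              # residuals for the xi = 0 case
    T = len(e)
    S = np.cumsum(e)              # partial sums S_t
    s2 = (e @ e) / T              # T^{-1} sum e_t^2
    for s in range(1, lags + 1):
        w = 1 - s / (lags + 1)    # Bartlett kernel w(s, l)
        s2 += (2 / T) * w * (e[s:] @ e[:-s])
    return (S @ S) / (T ** 2 * s2)

rng = np.random.default_rng(0)
eta_noise = kpss_stat(rng.normal(size=500), lags=5)           # small: keep H0
eta_rw = kpss_stat(np.cumsum(rng.normal(size=500)), lags=5)   # large: reject H0
```

The resulting values are compared against the critical value table given next.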
Define $\\hat{\\eta}_\\mu = \\eta_\\mu / s^2(l)$ (level stationarity). $\\hat{\\eta}_\\tau = \\eta_\\tau / s^2(l)$ (trend stationarity). The asymptotic distributions are $\\eta_\\mu \\to \\sigma^2 \\int_0^1 V(r)^2 dr$. $\\eta_\\tau \\to \\sigma^2 \\int_0^1 V_2(r)^2 dr$. Here $V(r) = W(r) - r W(1)$ is a Brownian bridge with $V(r) \\sim N(0, r(1-r))$ on $[0,1]$, and $V_2(r)$ is the second‑level Brownian bridge that arises after detrending. Simulations give critical values for $\\int_0^1 V(r)^2 dr$ and $\\int_0^1 V_2(r)^2 dr$, leading to the KPSS table:

| Statistic &amp; quantile | 0.10 | 0.05 | 0.025 | 0.01 |
| :--- | :---: | :---: | :---: | :---: |
| $\\eta_\\mu$: $\\int_0^1 V(r)^2 dr$ | 0.347 | 0.463 | 0.574 | 0.739 |
| $\\eta_\\tau$: $\\int_0^1 V_2(r)^2 dr$ | 0.119 | 0.146 | 0.176 | 0.216 |

By comparing $\\hat{\\eta}_\\mu$ or $\\hat{\\eta}_\\tau$ to these critical values, we obtain the p‑value and decide whether to reject stationarity. CH Test The Canova–Hansen (CH) test is an LM‑type test for seasonal unit roots and non‑constant seasonal patterns. Model: $$ y_t = \\mu + x_t&#39; \\beta + S_t + \\varepsilon_t, $$ where $x_t, \\beta \\in \\mathbb{R}^k$ are regressors and coefficients. $\\varepsilon_t \\sim N(0, \\sigma^2)$ may be heteroscedastic. $S_t$ is a deterministic seasonal component with period $s$ (an even integer, so that $q = s/2$ is whole). Two ways to model $S_t$: Using $s$ seasonal dummies $S_t = d_t&#39; \\alpha$: $d_t \\in \\mathbb{R}^s$ are seasonal dummies. $\\alpha \\in \\mathbb{R}^s$ are coefficients. Using $2q - 1$ trigonometric regressors $S_t = f_t&#39; \\gamma = \\sum_{j=1}^q f_{jt}&#39; \\gamma_j$ with $q = s/2$: $\\gamma \\in \\mathbb{R}^{2q-1}$ are the stacked coefficients. Trigonometric terms $$ f_{jt} = \\begin{cases} (\\cos(2 \\pi j t / s), \\sin(2 \\pi j t / s)), &amp; j &lt; q, \\ (\\cos(\\pi t)), &amp; j = q. \\end{cases} $$ To ensure test power, we require: The tested series $y_t$ should not contain non‑seasonal unit roots; otherwise CH loses power. 
If $x_t$ includes lags $y_{t-1}, y_{t-2}, ...$, the AR part should not absorb seasonal patterns; usually we limit this to first‑order lags. Rewriting the regressions: Dummy seasonal model: $y_t = x_t&#39; \\beta + d_t&#39; \\alpha + e_t$. Trigonometric seasonal model: $y_t = \\mu + x_t&#39; \\beta + f_t&#39; \\gamma + e_t$. After auxiliary regression, the residuals $\\hat{e}_t$ are (asymptotically) the same in both specifications. Robust Covariance Estimation Let $D_n = [d_1 e_1, ..., d_n e_n]$, and $\\Omega = \\lim_{n \\to \\infty} n^{-1} E(D_n D_n&#39;)$. $F_n = [f_1 e_1, ..., f_n e_n]$, and $\\Omega^f = \\lim_{n \\to \\infty} n^{-1} E(F_n F_n&#39;)$. Due to seasonality, $\\hat{e}_t$ is often heteroscedastic and serially correlated, so we use robust estimators: $\\hat{\\Omega} = \\sum_{k=-m}^m w(k/m) \\frac{1}{n} \\sum_i d_{i+k} \\hat{e}_{i+k} d_i&#39; \\hat{e}_i$. $\\hat{\\Omega}^f = \\sum_{k=-m}^m w(k/m) \\frac{1}{n} \\sum_i f_{i+k} \\hat{e}_{i+k} f_i&#39; \\hat{e}_i$. Here $w(\\cdot)$ is a kernel ensuring positive semidefiniteness. Seasonal Unit Root Test Assume the trigonometric coefficients follow a random walk $$ \\gamma_t = \\gamma_{t-1} + u_t. $$ If $E(u_t u_t&#39;)$ is full rank, all frequencies in $y_t$ have seasonal unit roots. Let $A \\in \\mathbb{R}^{(s-1) \\times a}$ be a full‑rank selection matrix to choose $a$ frequencies of interest: $A = I_{s-1}$: test all of $\\gamma$. $A = (\\tilde{0}, 1)$: test the last frequency $\\pi$ (period 2). $A = (\\tilde{0}, I_2, \\tilde{0})$: test a specific frequency $j \\pi / q$. The random walk can be written as $$ A&#39; \\gamma_t = A&#39; \\gamma_{t-1} + u_t, $$ with $E(u_t u_t&#39;) = \\tau^2 G$, where $\\tau^2 \\ge 0$ and $G = (A&#39; \\Omega^f A)^{-1}$. $\\tau^2 = 0$: coefficients are constant ($\\gamma_t = \\gamma_0$). $\\tau^2 &gt; 0$: the selected seasonal frequencies have unit roots. CH hypotheses: $H_0: \\tau^2 = 0$ (no seasonal unit roots). $H_1: \\tau^2 &gt; 0$ (seasonal unit roots present). 
The LM statistic is $$ L = \\frac{1}{n^2} \\operatorname{tr}\\Big[ (A&#39; \\hat{\\Omega}^f A)^{-1} A&#39; \\Big( \\sum_{i=1}^n \\hat{F}_i \\hat{F}_i&#39; \\Big) A \\Big], $$ where $\\hat{F}_i = \\sum_{t=1}^i f_t \\hat{e}_t$ is the cumulative sum process. Under $H_0$, $L$ converges to a (Cramér–)von Mises distribution $\\text{VM}(a)$, where $a = \\text{rank}(A)$. Non‑constant Seasonal Pattern Test Similarly, for the dummy seasonal model with $\\alpha_t$ following $$ A&#39; \\alpha_t = A&#39; \\alpha_{t-1} + u_t, $$ and $E(u_t u_t&#39;) = \\tau^2 G$, we have $\\tau^2 = 0$: constant seasonal pattern ($\\alpha_t = \\alpha_0$). $\\tau^2 \\ne 0$: non‑constant seasonal pattern (e.g., random walk or occasional breaks). CH now tests $H_0: \\tau^2 = 0$ (constant seasonal pattern). $H_1: \\tau^2 \\ne 0$ (non‑constant seasonal pattern). The LM statistic is $$ L = \\frac{1}{n^2} \\operatorname{tr}\\Big[ (A&#39; \\hat{\\Omega} A)^{-1} A&#39; \\Big( \\sum_{i=1}^n \\hat{D}_i \\hat{D}_i&#39; \\Big) A \\Big], $$ with $\\hat{D}_i = \\sum_{t=1}^i d_t \\hat{e}_t$ and the same $\\text{VM}(a)$ asymptotic distribution. von Mises Distribution Critical values for $\\text{VM}(p)$ with degrees of freedom $p$ are:

| $p$ | 1% | 2.5% | 5% | 7.5% | 10% | 20% |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 0.748 | 0.593 | 0.470 | 0.398 | 0.353 | 0.243 |
| 2 | 1.070 | 0.898 | 0.749 | 0.670 | 0.610 | 0.469 |
| 3 | 1.350 | 1.160 | 1.010 | 0.913 | 0.846 | 0.679 |
| 4 | 1.600 | 1.390 | 1.240 | 1.140 | 1.070 | 0.883 |
| 5 | 1.880 | 1.630 | 1.470 | 1.360 | 1.280 | 1.080 |
| 6 | 2.120 | 1.890 | 1.680 | 1.580 | 1.490 | 1.280 |
| 7 | 2.350 | 2.100 | 1.900 | 1.780 | 1.690 | 1.460 |
| 8 | 2.590 | 2.330 | 2.110 | 1.990 | 1.890 | 1.660 |
| 9 | 2.820 | 2.550 | 2.320 | 2.190 | 2.100 | 1.850 |
| 10 | 3.050 | 2.760 | 2.540 | 2.400 | 2.290 | 2.030 |
| 11 | 3.270 | 2.990 | 2.750 | 2.600 | 2.490 | 2.220 |
| 12 | 3.510 | 3.180 | 2.960 | 2.810 | 2.690 | 2.410 |

Response surface regressions can be used to obtain p‑values for arbitrary sample sizes and seasonal periods, e.g.: https://github.com/GeoBosh/uroot/blob/master/R/ch-rs-pvalue. ARMA and ARIMA Models An autoregressive moving average (ARMA) model combines both lags and past errors: $$ \\text{ARMA}(p, q):\\quad y_t = c + \\phi_1 y_{t-1} + \\cdots + \\phi_p y_{t-p} + \\theta_1 \\varepsilon_{t-1} + \\cdots + \\theta_q \\varepsilon_{t-q} + \\varepsilon_t, $$ $\\phi_i$ are AR coefficients. $\\theta_i$ are MA coefficients. $\\varepsilon_t$ is white noise. 
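A simple way to get a feel for ARMA dynamics is to simulate one. This is a minimal sketch; `simulate_arma` is our own helper and the burn-in length is an arbitrary choice:

```python
import numpy as np

def simulate_arma(phi, theta, n, c=0.0, sigma=1.0, seed=0):
    """Simulate y_t = c + sum_i phi_i y_{t-i} + eps_t + sum_i theta_i eps_{t-i}."""
    rng = np.random.default_rng(seed)
    burn = 200                                  # discard the start-up transient
    eps = rng.normal(0.0, sigma, n + burn)
    y = np.zeros(n + burn)
    for t in range(n + burn):
        ar = sum(phi[i] * y[t - 1 - i] for i in range(len(phi)) if t - 1 - i >= 0)
        ma = sum(theta[i] * eps[t - 1 - i] for i in range(len(theta)) if t - 1 - i >= 0)
        y[t] = c + ar + eps[t] + ma
    return y[burn:]

# ARMA(1,1) with phi_1 = 0.6, theta_1 = 0.3. Its theoretical variance is
# (1 + 2*phi*theta + theta^2) / (1 - phi^2) * sigma^2 = 1.45 / 0.64, about 2.27.
y = simulate_arma([0.6], [0.3], n=2000)
```

The sample variance of `y` should land near the theoretical value, which is a quick sanity check on the simulator.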
The AR coefficients must satisfy stationarity and the MA coefficients must satisfy invertibility. ARMA models are less interpretable than pure AR or MA models, but they can capture rich patterns in time series. Often we difference the data before fitting ARMA to remove trends. Incorporating differencing into the model yields the ARIMA (Autoregressive Integrated Moving Average) model $\\text{ARIMA}(p, d, q)$, where $d$ is the differencing order. $\\text{ARIMA}(0, 0, 0), c = 0 \\ \\to\\ \\text{WN}$. $\\text{ARIMA}(0, 1, 0), c = 0 \\ \\to\\ \\text{RW}$. $\\text{ARIMA}(0, 1, 0), c \\ne 0 \\ \\to\\ \\text{RWD}$. $\\text{ARIMA}(p, 0, q) \\ \\to\\ \\text{ARMA}(p, q)$. The long‑run forecast behavior depends on the constant $c$ and differencing order $d$: $c = 0, d = 0$: forecasts converge to 0. $c = 0, d = 1$: forecasts converge to a nonzero constant. $c = 0, d = 2$: forecasts show a linear trend. $c \\ne 0, d = 0$: forecasts converge to the mean. $c \\ne 0, d = 1$: forecasts show a linear trend. $c \\ne 0, d = 2$: forecasts show a quadratic trend. The larger $d$ is, the faster the prediction interval widens: $d = 0$: forecast standard deviation equals that of historical data. $d = 1$: forecast standard deviation grows linearly with horizon. $d = 2$: forecast standard deviation grows faster than linearly. Using the backshift operator, $$ y_t = c + \\phi_1 B y_t + \\cdots + \\phi_p B^p y_t + \\theta_1 B \\varepsilon_t + \\cdots + \\theta_q B^q \\varepsilon_t + \\varepsilon_t. $$ Rearranging, $$ (1 - \\phi_1 B - \\cdots - \\phi_p B^p) y_t = c + (1 + \\theta_1 B + \\cdots + \\theta_q B^q) \\varepsilon_t. $$ Let $$ \\phi(B) = 1 - \\phi_1 B - \\cdots - \\phi_p B^p, \\quad \\theta(B) = 1 + \\theta_1 B + \\cdots + \\theta_q B^q, $$ then $$ \\phi(B) y_t = c + \\theta(B) \\varepsilon_t. 
$$ For $\\text{ARIMA}(p, d, q)$, apply differencing $y_t&#39; = (1 - B)^d y_t$: $$ (1 - \\phi_1 B - \\cdots - \\phi_p B^p) (1 - B)^d y_t = c + (1 + \\theta_1 B + \\cdots + \\theta_q B^q) \\varepsilon_t. $$ That is, $$ \\phi(B) (1 - B)^d y_t = c + \\theta(B) \\varepsilon_t, $$ with typically $d \\le 2$. For seasonal data, a simple approach is seasonal differencing $(1 - B^m) y_t$ with period $m$. A more general approach is the seasonal ARIMA model $\\text{SARIMA}(p, d, q)(P, D, Q)_m$, which adds seasonal differencing $(1 - B^m)^D$ and seasonal AR/MA polynomials $\\Phi(B^m)$, $\\Theta(B^m)$: $$ \\phi(B) \\Phi(B^m) (1 - B)^d (1 - B^m)^D y_t = c + \\theta(B) \\Theta(B^m) \\varepsilon_t, $$ usually with $D \\le 1$ and $P, Q \\le 3$. Traditional seasonal decomposition assumes the seasonal pattern repeats identically each cycle. SARIMA, in contrast, allows randomness in the seasonal pattern and models it via $\\Phi(B^m)$ and $\\Theta(B^m)$. Note that SARIMA is not suitable for multiple seasonalities or very large seasonal periods. Define $\\Delta^d \\Delta_m^D y_t = (1 - B)^d (1 - B^m)^D y_t$, $\\phi^*(B)$ as the combined AR polynomial of order $p + mP$, $\\theta^*(B)$ as the combined MA polynomial of order $q + mQ$. Then $$ \\phi^*(B) \\Delta^d \\Delta_m^D y_t = c + \\theta^*(B) \\varepsilon_t, $$ so $\\Delta^d \\Delta_m^D y_t$ follows an $\\text{ARMA}(p + mP, q + mQ)$ process with some zero coefficients. Hence SARIMA and ARMA can share the same likelihood machinery. Properties If we view an ARMA model as an operator $\\mathcal{F}(\\cdot)$ on the input $x(t)$ (the past values $y_{t-1}, ..., y_1$), then ARMA defines a linear system: $$ \\mathcal{F}[\\lambda_1 x_1(t) + \\lambda_2 x_2(t)] = \\lambda_1 \\mathcal{F}[x_1(t)] + \\lambda_2 \\mathcal{F}[x_2(t)]. $$ If the parameters are constant, the mapping does not change over time, so ARMA is time‑invariant: $$ y(t) = \\mathcal{F}[x(t)] \\ \\Rightarrow\\ y(t-\\tau) = \\mathcal{F}[x(t-\\tau)]. 
$$ If input and output share the same distribution, the system is stable. If the output depends only on current and past inputs and not on future inputs, the system is causal. Define $\\phi(z) = 1 - \\phi_1 z - \\cdots - \\phi_p z^p$. $\\theta(z) = 1 + \\theta_1 z + \\cdots + \\theta_q z^q$. Then Stationarity: $\\phi(z) \\ne 0$ for $|z| = 1$. Causality: $\\phi(z) \\ne 0$ for $|z| \\le 1$. Invertibility: $\\theta(z) \\ne 0$ for $|z| \\le 1$. For a causal and invertible ARMA$(p, q)$, we can write $$ y_t = \\sum_{i=0}^\\infty \\psi_i \\varepsilon_{t-i}, $$ with $$ \\psi_i = \\begin{cases} 1, &amp; i = 0, \\\\ \\theta_i + \\sum_{j=1}^{\\min(p, i)} \\phi_j \\psi_{i-j}, &amp; i \\ge 1, \\end{cases} $$ where $\\theta_i = 0$ for $i &gt; q$. Because $\\varepsilon_t \\overset{iid}{\\sim} N(0, \\sigma_\\varepsilon^2)$, $$ Var(y_t) = \\sum_{j=0}^\\infty Var(\\psi_j \\varepsilon_{t-j}) = \\sigma_\\varepsilon^2 \\sum_{j=0}^\\infty \\psi_j^2. $$ For a causal ARMA$(p, q)$, let $m = \\max(p, q)$ and define $$ w_t = \\begin{cases} \\sigma^{-1} y_t, &amp; t = 1, ..., m, \\\\ \\sigma^{-1} \\phi(B) y_t, &amp; t &gt; m. \\end{cases} $$ Then the autocovariance of $w_t$ is $$ \\gamma_{w(i,j)} = \\begin{cases} \\sigma^{-2} \\gamma_{y(i-j)}, &amp; 1 \\le i, j \\le m, \\\\ \\sigma^{-2} \\big[\\gamma_{y(i-j)} - \\sum_{r=1}^p \\phi_r \\gamma_{y(r - |i-j|)}\\big], &amp; \\min(i, j) \\le m &lt; \\max(i, j) \\le 2m, \\\\ \\sum_{r=0}^q \\theta_r \\theta_{r + |i-j|}, &amp; \\min(i, j) &gt; m, \\\\ 0, &amp; \\text{otherwise}, \\end{cases} $$ with $\\theta_0 = 1$. Given $\\gamma_w$, we can apply the innovations algorithm to obtain $\\theta_{t,i}$ and $r_t$: Innovations equation $$ \\hat{w}_{t+1} = \\begin{cases} \\sum_{i=1}^t \\theta_{t,i} (w_{t+1-i} - \\hat{w}_{t+1-i}), &amp; 1 \\le t &lt; m, \\\\ \\sum_{i=1}^q \\theta_{t,i} (w_{t+1-i} - \\hat{w}_{t+1-i}), &amp; t \\ge m. \\end{cases} $$ Forecast variance $r_t = E[(w_{t+1} - \\hat{w}_{t+1})^2]$. 
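The innovations recursion above can be sketched as a small stand‑alone Go program (a minimal implementation of the textbook algorithm; the white‑noise autocovariance in `main` is a toy input, not from the post):

```go
package main

import "fmt"

// innovations runs the innovations algorithm on an autocovariance
// function kappa(i, j) (1-indexed), returning the coefficient rows
// theta[n][k] = θ_{n,k} and the one-step prediction variances v[n] = r_n.
func innovations(kappa func(i, j int) float64, N int) (theta [][]float64, v []float64) {
	theta = make([][]float64, N)
	v = make([]float64, N)
	v[0] = kappa(1, 1)
	for n := 1; n < N; n++ {
		theta[n] = make([]float64, n+1) // theta[n][k] holds θ_{n,k}, k = 1..n
		for k := 0; k < n; k++ {
			sum := kappa(n+1, k+1)
			for j := 0; j < k; j++ {
				sum -= theta[k][k-j] * theta[n][n-j] * v[j]
			}
			theta[n][n-k] = sum / v[k]
		}
		v[n] = kappa(n+1, n+1)
		for j := 0; j < n; j++ {
			v[n] -= theta[n][n-j] * theta[n][n-j] * v[j]
		}
	}
	return theta, v
}

func main() {
	// White noise: kappa(i, j) = 1 if i == j, else 0.
	wn := func(i, j int) float64 {
		if i == j {
			return 1
		}
		return 0
	}
	_, v := innovations(wn, 5)
	fmt.Println(v) // [1 1 1 1 1]: for white noise every one-step variance stays at 1
}
```

For an MA(1) autocovariance the same recursion makes $r_t$ decrease monotonically toward the innovation variance, matching the forecast-variance interpretation above.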
The error variance satisfies $$ v_t = E[(y_{t+1} - \\hat{y}_{t+1})^2] = \\sigma_\\varepsilon^2 E[(w_{t+1} - \\hat{w}_{t+1})^2] = \\sigma_\\varepsilon^2 r_t. $$ Forecast Error For ARMA$(p, q)$, the $l$‑step‑ahead model is $$ y_{t+l} = c + \\phi_1 y_{t+l-1} + \\cdots + \\phi_p y_{t-p+l} + \\varepsilon_{t+l} + \\theta_1 \\varepsilon_{t+l-1} + \\cdots + \\theta_q \\varepsilon_{t+l-q}. $$ The conditional expectation is $$ \\begin{aligned} \\hat{y}_{t+l|t} &amp;= E(y_{t+l} | I_t) \\\\ &amp;= c + E(\\varepsilon_{t+l} | I_t) \\\\ &amp;\\quad + E(\\phi_1 y_{t+l-1} + \\cdots + \\phi_p y_{t-p+l} | I_t) \\\\ &amp;\\quad + E(\\theta_1 \\varepsilon_{t+l-1} + \\cdots + \\theta_q \\varepsilon_{t+l-q} | I_t). \\end{aligned} $$ Given $I_t = \\{y_t, y_{t-1}, ..., y_1\\}$: $E(\\varepsilon_{t+l} | I_t) = 0$. $E(\\phi_1 y_{t+l-1} + \\cdots + \\phi_p y_{t-p+l} | I_t) = \\sum_{i=1}^p \\phi_i \\hat{y}_{t+l-i|t}$. $E(\\theta_1 \\varepsilon_{t+l-1} + \\cdots + \\theta_q \\varepsilon_{t+l-q} | I_t) = \\sum_{i=l}^q \\theta_i \\varepsilon_{t+l-i}$ for $l \\le q$ and 0 for $l &gt; q$ (only past shocks are known). Forecast errors: One‑step: $$ e_t(1) = y_{t+1} - \\hat{y}_{t+1|t} = \\varepsilon_{t+1}. $$ Two‑step (ARMA(1, 1) as illustration): $$ e_t(2) = (\\phi_1 + \\theta_1) \\varepsilon_{t+1} + \\varepsilon_{t+2}. $$ Three‑step (general $\\phi_2, \\theta_2$ allowed): $$ e_t(3) = (\\phi_1 (\\phi_1 + \\theta_1) + \\phi_2 + \\theta_2) \\varepsilon_{t+1} + (\\phi_1 + \\theta_1) \\varepsilon_{t+2} + \\varepsilon_{t+3}. $$ In general, $$ e_t(l) = y_{t+l} - \\hat{y}_{t+l|t} = \\sum_{i=1}^{l-1} \\psi_i \\varepsilon_{t+l-i} + \\varepsilon_{t+l}, $$ where $$ \\psi_j = \\begin{cases} 1, &amp; j = 0, \\\\ \\theta_j + \\sum_{i=1}^{\\min(p, j)} \\phi_i \\psi_{j-i}, &amp; 1 \\le j \\le q, \\\\ \\sum_{i=1}^{\\min(p, j)} \\phi_i \\psi_{j-i}, &amp; j &gt; q. \\end{cases} $$ The variance is $$ Var(e_t(l)) = (\\psi_1^2 + \\psi_2^2 + \\cdots + \\psi_{l-1}^2 + 1) \\sigma_\\varepsilon^2. $$ Parameter Estimation and Likelihood One common approach is conditional sum of squares (CSS), followed by maximum likelihood using numerical optimization. 
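The $\psi$‑weight recursion and the forecast‑error variance above can be sketched in Go (a minimal stand‑alone example; the ARMA(1,1) coefficients in `main` are made up for illustration):

```go
package main

import "fmt"

// psiWeights returns ψ_0..ψ_{n-1} for an ARMA model with AR coefficients
// phi and MA coefficients theta, using ψ_0 = 1 and
// ψ_j = θ_j + Σ_{i=1}^{min(p,j)} φ_i ψ_{j-i}, with θ_j = 0 for j > q.
func psiWeights(phi, theta []float64, n int) []float64 {
	psi := make([]float64, n)
	psi[0] = 1
	for j := 1; j < n; j++ {
		if j <= len(theta) {
			psi[j] = theta[j-1]
		}
		for i := 1; i <= len(phi) && i <= j; i++ {
			psi[j] += phi[i-1] * psi[j-i]
		}
	}
	return psi
}

// forecastVar returns Var(e_t(l)) / σ² = 1 + ψ_1² + … + ψ_{l-1}².
func forecastVar(psi []float64, l int) float64 {
	v := 0.0
	for j := 0; j < l; j++ {
		v += psi[j] * psi[j]
	}
	return v
}

func main() {
	phi := []float64{0.5}   // hypothetical AR(1) coefficient
	theta := []float64{0.3} // hypothetical MA(1) coefficient
	psi := psiWeights(phi, theta, 4)
	fmt.Println(psi) // ψ_1 = φ_1 + θ_1, then ψ_j = φ_1 ψ_{j-1}, as in the two-step example
	fmt.Println(forecastVar(psi, 3))
}
```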
A reference implementation can be found in R’s arima source code, e.g.: https://github.com/SurajGupta/r-source/blob/a28e609e72ed7c47f6ddfbb86c85279a0750f0b7/src/library/stats/src/arima.c#L753 https://github.com/SurajGupta/r-source/blob/master/src/library/stats/R/arima.R#L248 Given ARMA$(p, q)$, the one‑step forecast is $$ \\hat{y}_{t+1} = \\begin{cases} \\sum_{j=1}^t \\theta_{t,j} (y_{t+1-j} - \\hat{y}_{t+1-j}), &amp; 1 \\le t &lt; \\max(p, q), \\\\ \\sum_{j=1}^p \\phi_j y_{t+1-j} + \\sum_{j=1}^q \\theta_{t,j} (y_{t+1-j} - \\hat{y}_{t+1-j}), &amp; t \\ge \\max(p, q), \\end{cases} $$ where the $\\theta_{t,j}$ are the innovations coefficients. Let $v_t = E[(y_{t+1} - \\hat{y}_{t+1})^2]$. Conditionally, $$ y_t | y_{t-1}, ..., y_1 \\sim N(\\hat{y}_t, v_{t-1}), $$ with density $$ P(y_t | y_{t-1}, ..., y_1) = \\frac{1}{\\sqrt{2 \\pi v_{t-1}}} \\exp\\Big(-\\frac{(y_t - \\hat{y}_t)^2}{2 v_{t-1}}\\Big). $$ Let $v_t = \\sigma^2 r_t$ and $S(\\phi, \\theta) = \\sum_{t=1}^n \\frac{(y_t - \\hat{y}_t)^2}{r_{t-1}}$. The likelihood is $$ L(\\phi, \\theta, \\sigma^2) = (2 \\pi \\sigma^2)^{-n/2} \\Big(\\prod_{t=1}^n r_{t-1}\\Big)^{-1/2} \\exp\\Big(-\\frac{S(\\phi, \\theta)}{2 \\sigma^2}\\Big). $$ Minus twice the log‑likelihood is $$ -2 \\ell(\\phi, \\theta, \\sigma^2) = n \\log(2 \\pi \\sigma^2) + \\sum_{t=1}^n \\log r_{t-1} + \\frac{S(\\phi, \\theta)}{\\sigma^2}. $$ Differentiating w.r.t. $\\sigma^2$ gives $$ \\hat{\\sigma}^2 = \\frac{S(\\hat{\\phi}, \\hat{\\theta})}{n}. $$ Substituting and dropping constants yields the concentrated log‑likelihood $$ \\ell(\\phi, \\theta) = \\log\\Big(\\frac{S(\\phi, \\theta)}{n}\\Big) + \\frac{1}{n} \\sum_{t=1}^n \\log r_{t-1}. $$ The minimizer of $\\ell(\\phi, \\theta)$ has no closed form, so numerical optimization is used. From the likelihood we can compute information criteria: $AIC = -2 \\log L + 2(k + 1)$. $BIC = -2 \\log L + \\log(n)(k + 1)$. $AIC_c = AIC + \\dfrac{2(k + 1)(k + 2)}{n - k - 2}$. with $$ k = \\begin{cases} p + q + P + Q, &amp; c = 0, \\\\ p + q + P + Q + 1, &amp; c \\ne 0. 
\\end{cases} $$ Modeling Workflow Two main workflows: Box–Jenkins. Step‑wise. Box–Jenkins (manual, mainly for AR and MA): Inspect and transform data. Apply log or Box–Cox to stabilize variance. Difference to remove deterministic trends. Plot ACF and PACF. MA$(q)$: ACF cuts off at lag $q$. AR$(p)$: PACF cuts off at lag $p$. Estimate parameters using Yule–Walker, Burg, etc. Select model via AIC. Step‑wise (automatic, for general ARIMA/SARIMA): Use the CH (Canova–Hansen) test to choose the seasonal differencing order $D$. Use the KPSS test to choose the non‑seasonal differencing order $d$. Form the stationary series $\\nabla y_t = (1 - B)^d (1 - B^m)^D y_t$. Fit ARMA$(p + mP, q + mQ)$ to $\\nabla y_t$. Explore different $p, q, P, Q$ and select via information criteria. Check residuals via Ljung–Box tests. A practical reference: https://otexts.com/fpp3/arima-r.html Modeling Residuals In standard linear regression we assume white noise errors. In practice, residuals $e = y - \\hat{y}$ often show autocorrelation. We then model the error term $\\eta_t$ as a zero‑mean stationary process, e.g. an ARMA$(p, q)$: $$ \\begin{aligned} y_t &amp;= \\beta_1 x_{1,t} + \\cdots + \\beta_k x_{k,t} + \\eta_t, \\\\ \\phi(B) \\eta_t &amp;= \\theta(B) \\varepsilon_t, \\\\ \\varepsilon_t &amp;\\sim \\text{WN}(0, \\sigma^2). \\end{aligned} $$ If $\\eta_t$ is non‑stationary, we can difference $y_t, x_t, \\eta_t$ and use ARIMA$(p, d, q)$ instead: $$ \\begin{aligned} y_t&#39; &amp;= \\beta_1 x_{1,t}&#39; + \\cdots + \\beta_k x_{k,t}&#39; + \\eta_t&#39;, \\\\ \\phi(B) \\eta_t&#39; &amp;= \\theta(B) \\varepsilon_t. \\end{aligned} $$ The OLS and GLS estimates are OLS (ignoring error correlation): $$ \\boldsymbol{\\hat{\\beta}}_{\\text{OLS}} = (X&#39; X)^{-1} X&#39; y, \\quad Cov(\\boldsymbol{\\hat{\\beta}}_{\\text{OLS}}) = (X&#39; X)^{-1} X&#39; \\Gamma_n X (X&#39; X)^{-1}, $$ where $\\Gamma_n = E(\\boldsymbol{\\eta \\eta&#39;})$. 
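The sandwich covariance $(X&#39;X)^{-1}X&#39;\Gamma_n X(X&#39;X)^{-1}$ above can be checked numerically; a minimal stand‑alone Go sketch (toy 2‑column design matrix, chosen only for illustration) verifies that it collapses to $\sigma^2(X&#39;X)^{-1}$ when $\Gamma_n = \sigma^2 I$:

```go
package main

import "fmt"

// matMul multiplies an (m×k) by a (k×n) matrix, both stored row-major.
func matMul(a []float64, m, k int, b []float64, n int) []float64 {
	c := make([]float64, m*n)
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			for l := 0; l < k; l++ {
				c[i*n+j] += a[i*k+l] * b[l*n+j]
			}
		}
	}
	return c
}

// transpose returns the transpose of an (m×n) row-major matrix.
func transpose(a []float64, m, n int) []float64 {
	t := make([]float64, m*n)
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			t[j*m+i] = a[i*n+j]
		}
	}
	return t
}

// inv2 inverts a 2×2 matrix.
func inv2(a []float64) []float64 {
	d := a[0]*a[3] - a[1]*a[2]
	return []float64{a[3] / d, -a[1] / d, -a[2] / d, a[0] / d}
}

// sandwich computes (X'X)⁻¹ X' Γ X (X'X)⁻¹ for a 2-column X with n rows.
func sandwich(X []float64, n int, Gamma []float64) []float64 {
	Xt := transpose(X, n, 2)
	XtXinv := inv2(matMul(Xt, 2, n, X, 2))
	inner := matMul(matMul(Xt, 2, n, Gamma, n), 2, n, X, 2)
	return matMul(matMul(XtXinv, 2, 2, inner, 2), 2, 2, XtXinv, 2)
}

func main() {
	X := []float64{1, 1, 1, 2, 1, 3, 1, 4} // toy 4×2 design matrix
	sigma2 := 2.0
	Gamma := make([]float64, 16) // Γ = σ²I: no error autocorrelation
	for i := 0; i < 4; i++ {
		Gamma[i*4+i] = sigma2
	}
	fmt.Println(sandwich(X, 4, Gamma))
	fmt.Println(inv2(matMul(transpose(X, 4, 2), 2, 4, X, 2))) // σ² times this matrix
}
```

With an autocorrelated $\Gamma_n$ (e.g. built from a fitted ARMA model for $\eta_t$), the two matrices no longer agree, which is exactly why the plain OLS covariance is misleading there.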
GLS (using $\\Gamma_n^{-1}$ as weight): $$ \\boldsymbol{\\hat{\\beta}}_{\\text{GLS}} = (X&#39; \\Gamma_n^{-1} X)^{-1} X&#39; \\Gamma_n^{-1} y, \\quad Cov(\\boldsymbol{\\hat{\\beta}}_{\\text{GLS}}) = (X&#39; \\Gamma_n^{-1} X)^{-1}. $$ GLS provides the best linear unbiased estimator if $\\Gamma_n$ is known. In practice, $\\Gamma_n$ is built from the fitted ARMA$(p, q)$ model for $\\eta_t$. "},{"slug":"linear-regression","title":"Linear Regression","tags":["Statistics","TimeSeriesAnalysis"],"content":"Linear regression provides a simple yet powerful way to quantify relationships between variables. Its core idea is to find a linear equation that describes the relationship between two or more variables, and then use that relationship for prediction or analysis. Although real-world relationships are often more complex, linear regression remains a cornerstone of many data analysis and forecasting tasks and serves as the foundation for more advanced models. Model specification A model typically consists of two types of variables: Exogenous variables Known variables determined by factors outside the model, usually denoted by a vector $x \\in \\mathbb{R}^k$. Endogenous variables Unknown variables determined within the model, usually denoted by a scalar $y \\in \\mathbb{R}$. 
If $x$ and $y$ satisfy a linear relationship, we can model them using linear regression: $$y = \\beta_0 + \\beta_1x_1 + \\cdots + \\beta_kx_k + \\varepsilon$$ $\\beta \\in \\mathbb R^{k+1}$ are the regression coefficients. $\\varepsilon \\in \\mathbb R$ is the random error. In matrix form, $\\boldsymbol{y = X\\beta + \\varepsilon}$ $\\boldsymbol y = (y_1, \\ldots, y_n)$ $\\boldsymbol \\beta = (\\beta_0, \\beta_1, \\ldots, \\beta_k)$ $\\boldsymbol \\varepsilon = (\\varepsilon_1, \\ldots, \\varepsilon_n)$ $\\boldsymbol X = \\begin{bmatrix} 1 &amp; x_{1,1} &amp; \\cdots &amp; x_{k,1} \\\\ 1 &amp; x_{1,2} &amp; \\cdots &amp; x_{k,2} \\\\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\ 1 &amp; x_{1,n} &amp; \\cdots &amp; x_{k,n} \\end{bmatrix}$ To make the model well-defined, we usually assume the error term is white noise, i.e. $\\boldsymbol \\varepsilon \\sim \\text{NID}(0,\\sigma^2)$. This means the errors come from random noise in the system from which the model cannot extract additional information: Independently and identically distributed. Zero mean: the systematic mean is captured by the intercept $\\beta_0$. Constant variance $\\sigma^2$: For survey data, $\\sigma^2$ depends on the population and sampling process. For sensor data, $\\sigma^2$ is determined by sensor precision. Once we estimate the coefficients $\\boldsymbol \\beta$, we obtain fitted values $\\boldsymbol{\\hat y = X\\hat\\beta}$. The difference between observations and fitted values $\\boldsymbol{e = y -\\hat y} = (e_1, \\ldots, e_n)$ is called the residual. The error term is a modeling assumption specified before estimation, whereas the residuals are realized errors after fitting the model. To judge model quality, we check whether the residuals are consistent with the original error assumptions. Common violations of model assumptions include: Non-zero mean: missing an intercept term. Nonlinear trends: linear specification is inadequate; need nonlinear terms. Autocorrelation. Heteroscedasticity. 
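The residual checks above can be sketched as two quick diagnostics in Go (a stand‑alone toy; the residual series in `main` is fabricated for illustration): the sample mean should be near zero, and the lag‑1 autocorrelation should be small.

```go
package main

import "fmt"

// mean returns the sample mean of e.
func mean(e []float64) float64 {
	s := 0.0
	for _, v := range e {
		s += v
	}
	return s / float64(len(e))
}

// lag1Autocorr returns the lag-1 sample autocorrelation of e.
func lag1Autocorr(e []float64) float64 {
	m := mean(e)
	var num, den float64
	for i := 0; i < len(e); i++ {
		den += (e[i] - m) * (e[i] - m)
		if i > 0 {
			num += (e[i] - m) * (e[i-1] - m)
		}
	}
	return num / den
}

func main() {
	// Alternating residuals: zero mean, but strongly negatively
	// autocorrelated, i.e. the white-noise assumption is violated.
	e := []float64{1, -1, 1, -1, 1, -1, 1, -1}
	fmt.Println(mean(e))         // 0
	fmt.Println(lag1Autocorr(e)) // -0.875, far from 0
}
```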
Parameter estimation There are two standard ways to estimate $\\boldsymbol \\beta$: Ordinary least squares (OLS) Maximum likelihood estimation (MLE) Under the assumption $\\boldsymbol \\varepsilon \\sim \\text{NID}(0,\\sigma^2)$, OLS and MLE yield the same estimator. OLS The model’s lack of fit is measured by the residual sum of squares $$\\sum_{i=1}^n e_i^2 = \\sum_{i=1}^n (y_i-\\hat y_i)^2$$ We seek parameters $\\boldsymbol \\beta$ that minimize this quantity: $$\\min_\\beta\\big[(\\boldsymbol {y -X \\beta})&#39;(\\boldsymbol {y -X \\beta})\\big]$$ Expand: $(\\boldsymbol {y -X \\beta})&#39;(\\boldsymbol {y -X \\beta}) = \\boldsymbol{y}&#39;\\boldsymbol{y} - 2\\boldsymbol{\\beta}&#39;\\boldsymbol{X}&#39;\\boldsymbol{y} + \\boldsymbol{\\beta}&#39;\\boldsymbol{X}&#39;\\boldsymbol{X}\\boldsymbol{\\beta}$ Differentiate w.r.t. $\\beta$: $\\dfrac{\\partial}{\\partial \\beta}[(\\boldsymbol {y -X \\beta})&#39;(\\boldsymbol {y -X \\beta})] = -2\\boldsymbol{X&#39;y} + 2\\boldsymbol{X&#39;X\\beta}$ Solve for $\\boldsymbol\\beta$: $$\\hat{\\boldsymbol\\beta} = (\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\boldsymbol y$$ MLE Since the errors are normal $\\varepsilon_i \\sim \\mathcal{N}(0, \\sigma^2)$, $$P(\\varepsilon_i)=\\frac{1}{\\sqrt{2\\pi\\sigma^2}}\\exp\\bigg(-\\frac{(\\varepsilon_i-0)^2}{2\\sigma^2}\\bigg) $$ The probability of observing $(x_i, y_i)$ given $\\boldsymbol\\beta$ is $$P(x_i,y_i|\\beta)=\\frac{1}{\\sqrt{2\\pi\\sigma^2}}\\exp\\bigg(-\\frac{(y_i-x_i\\beta)^2}{2\\sigma^2}\\bigg)$$ Thus the sample follows $$\\boldsymbol y \\sim \\mathcal{N}(\\boldsymbol X \\boldsymbol\\beta, \\sigma^2I)$$ The joint density (likelihood) of the whole sample is $$L(\\beta,\\sigma^2) = \\prod_{i=1}^n P(x_i,y_i|\\beta) = \\left(\\frac{1}{\\sqrt{2\\pi\\sigma^2}}\\right)^n\\exp\\bigg(-\\frac{1}{2\\sigma^2}(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)&#39;(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)\\bigg)$$ We want parameters that maximize $L(\\beta,\\sigma^2)$. 
For convenience, consider the log-likelihood $$ \\ell(\\beta,\\sigma^2) = \\log L(\\beta,\\sigma^2) = -\\frac{n}{2}\\log(2\\pi\\sigma^2) -\\frac{1}{2\\sigma^2}(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)&#39;(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta) $$ Differentiating w.r.t. $\\beta$ and $\\sigma^2$ gives $\\dfrac{\\partial \\ell}{\\partial \\beta} = \\dfrac{1}{\\sigma^2}\\boldsymbol{X&#39;}(\\boldsymbol{y-X\\beta})$ $\\dfrac{\\partial \\ell}{\\partial \\sigma^2} = -\\dfrac{n}{2\\sigma^2} + \\dfrac{1}{2(\\sigma^2)^2}(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)&#39;(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)$ Solving yields $\\hat{\\boldsymbol\\beta} = (\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\boldsymbol y$ $\\hat\\sigma^2 = \\dfrac{1}{n}(\\boldsymbol y -\\boldsymbol X\\hat{\\boldsymbol\\beta})&#39;(\\boldsymbol y -\\boldsymbol X\\hat{\\boldsymbol\\beta}) = \\dfrac{1}{n}\\sum_{i=1}^n e_i^2$ $\\log L = -\\dfrac{n}{2}\\log(2\\pi\\hat\\sigma^2)-\\dfrac{1}{2\\hat\\sigma^2}(n\\hat\\sigma^2) = -\\dfrac{n}{2}\\log\\big(2\\pi\\tfrac{\\sum_{i=1}^n e_i^2}{n}\\big) - \\dfrac{n}{2} = -\\dfrac{n}{2}\\big[\\log(2\\pi)+\\log\\big(\\tfrac{\\sum_{i=1}^n e_i^2}{n}\\big) + 1\\big]$ Prediction When using the model for prediction, we need to account for several sources of uncertainty: Model error $Var(\\boldsymbol\\varepsilon)$: random noise in the system. Estimation error $Var(\\hat{\\boldsymbol\\beta})$: difference between the estimator $\\hat{\\boldsymbol\\beta}$ and the true parameter $\\boldsymbol \\beta$. Prediction error $Var(y^*)$: difference between the predicted value $y^*$ and the true outcome $y$. Model error Because the model is an abstraction of reality, error is unavoidable. Model error measures the discrepancy between true values and fitted (or predicted) values and is typically assessed via residuals. 
In practice, we estimate the error variance $Var(\\boldsymbol\\varepsilon) = \\sigma^2$ by the unbiased residual variance $$\\hat\\sigma^2 = \\frac{1}{n-k-1}(\\boldsymbol y -\\boldsymbol X\\hat{\\boldsymbol\\beta})&#39;(\\boldsymbol y -\\boldsymbol X\\hat{\\boldsymbol\\beta}) = \\frac{1}{n-k-1}\\sum_{i=1}^n e_i^2$$ Estimation error From the white-noise assumption we have $E(\\boldsymbol\\varepsilon) = 0$. $Var(\\boldsymbol\\varepsilon) = \\sigma^2$. Since $\\hat{\\boldsymbol\\beta}$ is an unbiased estimator of $\\boldsymbol\\beta$, we have $E(\\hat{\\boldsymbol\\beta}) = \\boldsymbol\\beta$. Substitute $\\boldsymbol {y = X\\beta + \\varepsilon}$ into $\\hat{\\boldsymbol\\beta} = (\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\boldsymbol y$: $$ \\begin{aligned}\\hat{\\boldsymbol\\beta} &amp;=(\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;(\\boldsymbol{X\\beta + \\varepsilon}) \\\\ &amp;=(\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\boldsymbol{X\\beta} +(\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\boldsymbol\\varepsilon \\\\ &amp;=\\boldsymbol{\\beta} +(\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\boldsymbol\\varepsilon\\end{aligned} $$ Then $$ \\begin{aligned}Var(\\hat{\\boldsymbol\\beta}) &amp;= E(\\hat{\\boldsymbol\\beta}^2) - E^2(\\hat{\\boldsymbol\\beta}) \\\\ &amp;= E(\\hat{\\boldsymbol\\beta}^2) - \\boldsymbol\\beta^2 \\\\ &amp;= E\\big[\\big(\\boldsymbol{\\beta} +(\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\boldsymbol\\varepsilon\\big)^2\\big]- \\boldsymbol\\beta^2 \\\\ &amp;= \\boldsymbol\\beta^2 + E\\big[\\big((\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\boldsymbol\\varepsilon\\big)^2\\big]- \\boldsymbol\\beta^2 \\\\ &amp;= \\big((\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\big)^2E(\\boldsymbol\\varepsilon^2)\\end{aligned} $$ where the cross term vanishes because $E(\\boldsymbol\\varepsilon) = 0$. Here $E(\\boldsymbol\\varepsilon^2) = Var(\\boldsymbol\\varepsilon) + E^2(\\boldsymbol\\varepsilon) = \\sigma^2$. 
$\\big((\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\big)^2 = (\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\boldsymbol X \\big((\\boldsymbol{X&#39; X})^{-1}\\big)&#39; = (\\boldsymbol{X&#39; X})^{-1}$, using the symmetry of $\\boldsymbol{X&#39;X}$. Thus $$Var(\\hat{\\boldsymbol\\beta}) = \\sigma^2(\\boldsymbol{X&#39; X})^{-1},$$ estimated in practice by $\\hat\\sigma^2(\\boldsymbol{X&#39; X})^{-1}$. Prediction error Given $\\hat{\\boldsymbol \\beta}$ and a predictor vector $\\boldsymbol x^*$, the predicted mean is $$\\hat{y}^* = E(y^*|\\boldsymbol y,\\boldsymbol X, \\boldsymbol x^*) = \\boldsymbol x^* \\hat{\\boldsymbol\\beta} = \\boldsymbol x^*(\\boldsymbol X&#39;\\boldsymbol X)^{-1}\\boldsymbol X&#39;\\boldsymbol y$$ Prediction uncertainty arises from the estimation error in $\\hat{\\boldsymbol \\beta}$, i.e. $Var(\\hat{\\boldsymbol\\beta}) = \\sigma^2(\\boldsymbol{X&#39; X})^{-1}$, so the variance of the predicted mean is $$Var(y^*|\\boldsymbol X, \\boldsymbol x^*) = Var(\\boldsymbol x^* \\hat{\\boldsymbol\\beta}|\\boldsymbol X) =\\boldsymbol x^* Var(\\hat{\\boldsymbol\\beta}) (\\boldsymbol x^*)&#39; = \\hat\\sigma^2\\boldsymbol x^*(\\boldsymbol X&#39;\\boldsymbol X)^{-1}(\\boldsymbol x^*)&#39;$$ In addition, we need to account for model error $Var(\\boldsymbol\\varepsilon) = \\sigma^2$ when predicting an individual observation: $$Var(y^*|\\boldsymbol X, \\boldsymbol x^*) + Var(\\boldsymbol\\varepsilon) = \\hat\\sigma^2 \\big[1 + \\boldsymbol x^*(\\boldsymbol X&#39;\\boldsymbol X)^{-1}(\\boldsymbol x^*)&#39; \\big]$$ From these variances we obtain the corresponding standard errors: Standard error of the predicted mean: $\\text{SE}_{\\text{mean}} = \\sqrt{Var(y^*|\\boldsymbol X, \\boldsymbol x^*) }$. Standard error of a predicted observation: $\\text{SE}_{\\text{obs}} = \\sqrt{Var(y^*|\\boldsymbol X, \\boldsymbol x^*) + Var(\\boldsymbol\\varepsilon)}$. Here we assume $\\boldsymbol x^*$ is known and do not treat it as a source of variance. If $\\boldsymbol x^*$ is itself estimated, its estimation error should also be included. 
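A minimal stand‑alone Go sketch of the two standard errors above (toy simple‑regression design and hypothetical numbers; 1.96 gives an approximate 95% interval):

```go
package main

import (
	"fmt"
	"math"
)

// predSE returns (SE_mean, SE_obs) at predictor row xStar for a
// 2-parameter model, given sigma2 = σ̂² and the 2×2 inverse of X'X.
func predSE(xStar [2]float64, xtxInv [4]float64, sigma2 float64) (float64, float64) {
	// quadratic form x* (X'X)⁻¹ (x*)'
	q := xStar[0]*(xtxInv[0]*xStar[0]+xtxInv[1]*xStar[1]) +
		xStar[1]*(xtxInv[2]*xStar[0]+xtxInv[3]*xStar[1])
	seMean := math.Sqrt(sigma2 * q)
	seObs := math.Sqrt(sigma2 * (1 + q))
	return seMean, seObs
}

func main() {
	// Hypothetical inputs: (X'X)⁻¹ for X with rows (1, x), x = 1..4, and σ̂² = 0.25.
	xtxInv := [4]float64{1.5, -0.5, -0.5, 0.2}
	seMean, seObs := predSE([2]float64{1, 2.5}, xtxInv, 0.25)
	fmt.Println(seMean, seObs) // SE_obs is always wider than SE_mean
	yHat := 3.0                // hypothetical point forecast
	fmt.Printf("≈95%% interval: [%.2f, %.2f]\n", yHat-1.96*seObs, yHat+1.96*seObs)
}
```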
Confidence intervals Predictions are always uncertain, and we quantify this uncertainty with confidence intervals. The theory behind this is the Central Limit Theorem: when the sample size is large enough, the sample mean is approximately normally distributed. The model prediction can be viewed as the mean of a normal distribution, and around this mean there is a symmetric probability interval. Via the quantile function, we connect probabilities and standard errors. A classic example is the 3-sigma rule: Probability of falling in $\\mu\\pm1\\sigma$ is 68.27%. Probability of falling in $\\mu\\pm2\\sigma$ is 95.45%. Probability of falling in $\\mu\\pm3\\sigma$ is 99.73%. When reporting model predictions, besides the point forecast $\\hat y$, we usually provide a confidence interval: $P\\big(y \\in [\\hat y \\pm 1.64\\,\\text{SE}_{\\text{obs}}]\\big) \\approx 90\\%$. $P\\big(y \\in [\\hat y \\pm 1.96\\,\\text{SE}_{\\text{obs}}]\\big) \\approx 95\\%$. $P\\big(y \\in [\\hat y \\pm 2.57\\,\\text{SE}_{\\text{obs}}]\\big) \\approx 99\\%$. The width of the interval depends on: Confidence level: higher confidence → wider interval → lower apparent precision. Model accuracy: better models yield narrower intervals at the same confidence level. Numerical methods LU decomposition To compute $\\hat{\\boldsymbol\\beta} = (\\boldsymbol{X&#39; X})^{-1}\\boldsymbol X&#39;\\boldsymbol y$, we need the inverse of $\\boldsymbol X&#39;\\boldsymbol X$. However, not every matrix is invertible; only square matrices with non-zero determinants have inverses. When there is multicollinearity, $\\boldsymbol X&#39;\\boldsymbol X$ becomes singular and non-invertible. One common way to compute inverses is Gaussian elimination. Another is LU decomposition. 
To compute $A^{-1}$ via LU decomposition: Factor $A$ into a lower-triangular matrix $L$ and an upper-triangular matrix $U$: $$A \\to LU = \\begin{bmatrix}\\ell_{11}&amp;0&amp;0\\\\\\ell_{21}&amp;\\ell_{22}&amp;0\\\\\\ell_{31}&amp;\\ell_{32}&amp;\\ell_{33}\\end{bmatrix} \\times \\begin{bmatrix}u_{11}&amp;u_{12}&amp;u_{13}\\\\0&amp;u_{22}&amp;u_{23}\\\\0&amp;0&amp;u_{33}\\end{bmatrix}$$ Replace $A$ with $LU$ and note $$A^{-1}=(LU)^{-1} = U^{-1} L^{-1}$$ Solve for the inverses of the triangular matrices $L^{-1}$ and $U^{-1}$: $LL^{-1}=I \\to \\begin{bmatrix}\\ell_{11}&amp;0&amp;0\\\\\\ell_{21}&amp;\\ell_{22}&amp;0\\\\\\ell_{31}&amp;\\ell_{32}&amp;\\ell_{33}\\end{bmatrix} \\times \\begin{bmatrix}x_{11}&amp;0&amp;0\\\\x_{21}&amp;x_{22}&amp;0\\\\x_{31}&amp;x_{32}&amp;x_{33}\\end{bmatrix} = I$ $UU^{-1}=I \\to \\begin{bmatrix}u_{11}&amp;u_{12}&amp;u_{13}\\\\0&amp;u_{22}&amp;u_{23}\\\\0&amp;0&amp;u_{33}\\end{bmatrix} \\times \\begin{bmatrix}y_{11}&amp;y_{12}&amp;y_{13}\\\\0&amp;y_{22}&amp;y_{23}\\\\0&amp;0&amp;y_{33}\\end{bmatrix} = I$ LU itself is computed using Gaussian elimination and has time complexity $O(n^3)$. But once $L$ and $U$ are available, solving systems is cheaper than recomputing a full inverse from scratch. 
package main import ( &quot;fmt&quot; &quot;gonum.org/v1/gonum/blas&quot; &quot;gonum.org/v1/gonum/blas/blas64&quot; &quot;gonum.org/v1/gonum/lapack/lapack64&quot; ) func main() { swap := make([]int, 3) work := make([]float64, 3) // Decompose A = LU A := blas64.General{Rows: 3, Cols: 3, Stride: 3, Data: []float64{ 2, -1, 0, -1, 2, -1, 0, -1, 2, }} if ok := lapack64.Getrf(A, swap); !ok { panic(&quot;LU decomposition unstable&quot;) } // Print L (unit lower triangle, stored below the diagonal of A) fmt.Printf(&quot;[1 0 0]\\n[%.2f 1 0]\\n[%.2f %.2f 1]\\n&quot;, A.Data[3], A.Data[6], A.Data[7]) // Print U fmt.Printf(&quot;[%.2f %.2f %.2f]\\n[0 %.2f %.2f]\\n[0 0 %.2f]\\n&quot;, A.Data[0], A.Data[1], A.Data[2], A.Data[4], A.Data[5], A.Data[8]) // Solve LL⁻ = I and UU⁻ = I, then A⁻ = U⁻L⁻ if ok := lapack64.Getri(A, swap, work, len(work)); !ok { panic(&quot;LU inverse failed&quot;) } fmt.Printf(&quot;[%.2f %.2f %.2f]\\n[%.2f %.2f %.2f]\\n[%.2f %.2f %.2f]\\n&quot;, A.Data[0], A.Data[1], A.Data[2], A.Data[3], A.Data[4], A.Data[5], A.Data[6], A.Data[7], A.Data[8]) // Verify A·A⁻ = I B := blas64.General{Rows: 3, Cols: 3, Stride: 3, Data: []float64{ 2, -1, 0, -1, 2, -1, 0, -1, 2, }} C := blas64.General{Rows: 3, Cols: 3, Stride: 3, Data: make([]float64, 9)} blas64.Gemm(blas.NoTrans, blas.NoTrans, 1, A, B, 0, C) fmt.Printf(&quot;[%.2f %.2f %.2f]\\n[%.2f %.2f %.2f]\\n[%.2f %.2f %.2f]\\n&quot;, C.Data[0], C.Data[1], C.Data[2], C.Data[3], C.Data[4], C.Data[5], C.Data[6], C.Data[7], C.Data[8]) } QR decomposition In practice, directly inverting high-dimensional matrices is usually a bad idea: The inverse of a sparse matrix may be dense, increasing memory and computation cost. Floating-point operations accumulate numerical error, which can hurt stability. In most applications, including regression, we only need to solve a system like $\\boldsymbol{X&#39;X\\beta} = \\boldsymbol{X&#39;y}$; computing the inverse is just one way to do that. 
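Solving a triangular system directly, rather than forming an inverse, can be sketched with plain back substitution (a stand‑alone toy example, independent of gonum):

```go
package main

import "fmt"

// backSubstitute solves Ux = b for an n×n upper-triangular U (row-major),
// working from the last equation upward.
func backSubstitute(U []float64, b []float64) []float64 {
	n := len(b)
	x := make([]float64, n)
	for i := n - 1; i >= 0; i-- {
		s := b[i]
		for j := i + 1; j < n; j++ {
			s -= U[i*n+j] * x[j]
		}
		x[i] = s / U[i*n+i]
	}
	return x
}

func main() {
	// U = [2 1; 0 4], b = (5, 8) → x = (1.5, 2)
	U := []float64{2, 1, 0, 4}
	fmt.Println(backSubstitute(U, []float64{5, 8}))
}
```

This $O(n^2)$ sweep is exactly what makes factor-then-solve cheaper than factor-then-invert-then-multiply.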
See for example: https://math.stackexchange.com/questions/3185211/what-does-qr-decomposition-have-to-do-with-least-squares-method Consider solving $A&#39;Ax=A&#39;b$ using QR decomposition: Factor $A$ as $A = QR$, where $Q$ is orthogonal and $R$ is upper triangular: $$A \\to QR = \\begin{bmatrix}a_{11}&amp;a_{12}&amp;a_{13}\\\\a_{21}&amp;a_{22}&amp;a_{23}\\\\a_{31}&amp;a_{32}&amp;a_{33}\\\\a_{41}&amp;a_{42}&amp;a_{43}\\end{bmatrix} \\to \\begin{bmatrix}q_{11}&amp;q_{12}&amp;q_{13}&amp;q_{14}\\\\q_{21}&amp;q_{22}&amp;q_{23}&amp;q_{24}\\\\q_{31}&amp;q_{32}&amp;q_{33}&amp;q_{34}\\\\q_{41}&amp;q_{42}&amp;q_{43}&amp;q_{44}\\end{bmatrix} \\times \\begin{bmatrix}r_{11}&amp;r_{12}&amp;r_{13}\\\\0&amp;r_{22}&amp;r_{23}\\\\0&amp;0&amp;r_{33}\\\\0&amp;0&amp;0\\end{bmatrix}$$ Substitute into the normal equations: $$(QR)&#39;(QR)x=(QR)&#39;b \\ \\to\\ R&#39;Q&#39;QRx=R&#39;Q&#39;b$$ Use $Q&#39;Q=I$ to get $$R&#39;Rx=R&#39;Q&#39;b \\ \\to\\ Rx=Q&#39;b$$ Solve the triangular system for $x$: $$\\begin{bmatrix}r_{11}&amp;r_{12}&amp;r_{13}\\\\0&amp;r_{22}&amp;r_{23}\\\\0&amp;0&amp;r_{33}\\\\0&amp;0&amp;0\\end{bmatrix} \\times \\begin{bmatrix}x_{1}\\\\x_{2}\\\\x_{3}\\end{bmatrix} = \\begin{bmatrix}q_{11}b_1+q_{21}b_2+q_{31}b_3+q_{41}b_4\\\\q_{12}b_1+q_{22}b_2+q_{32}b_3+q_{42}b_4\\\\q_{13}b_1+q_{23}b_2+q_{33}b_3+q_{43}b_4\\\\q_{14}b_1+q_{24}b_2+q_{34}b_3+q_{44}b_4\\end{bmatrix}$$ With QR, the covariance of the estimator can be written as $$Var(\\hat{\\boldsymbol\\beta}) / \\sigma^2 = (\\boldsymbol{X&#39; X})^{-1} =((QR)&#39;QR)^{-1} = (R&#39;Q&#39;QR)^{-1} = (R&#39;R)^{-1} = R^{-1}(R&#39;)^{-1}$$ package main import ( &quot;fmt&quot; &quot;gonum.org/v1/gonum/blas&quot; &quot;gonum.org/v1/gonum/blas/blas64&quot; &quot;gonum.org/v1/gonum/lapack/lapack64&quot; ) func main() { y := []float64{2, 3, 5, 7, 10} X := []float64{ 1, 1, 10, 1, 1, 2, 8, 0, 1, 3, 9, 1, 1, 4, 7, 1, 1, 5, 6, 0, } A := blas64.General{Rows: 5, Cols: 4, Stride: 4, Data: X} b := blas64.Vector{N: 5, Inc: 1, Data: y} // Decompose A = QR QR := blas64.General{Rows: 5, Cols: 4, Stride: 4, Data: make([]float64, len(A.Data))} copy(QR.Data, A.Data) work := []float64{0} tau := make([]float64, min(QR.Rows, QR.Cols)) lapack64.Geqrf(QR, tau, work, -1) work = make([]float64, int(work[0])) lapack64.Geqrf(QR, tau, work, len(work)) // Restore Q (thin, 5×4); R sits in the upper triangle of QR R := blas64.Triangular{Uplo: blas.Upper, Diag: blas.NonUnit, N: 4, Stride: 4, Data: QR.Data} Q := blas64.General{Rows: 5, Cols: 4, Stride: 4, Data: make([]float64, len(QR.Data))} copy(Q.Data, QR.Data) lapack64.Orgqr(Q, tau, work, -1) work = make([]float64, int(work[0])) lapack64.Orgqr(Q, tau, work, len(work)) // Print Q fmt.Printf(&quot;[%.2f %.2f %.2f %.2f]\\n&quot;+ &quot;[%.2f %.2f %.2f %.2f]\\n&quot;+ &quot;[%.2f %.2f %.2f %.2f]\\n&quot;+ &quot;[%.2f %.2f %.2f %.2f]\\n&quot;+ &quot;[%.2f %.2f %.2f %.2f]\\n&quot;, Q.Data[0], Q.Data[1], Q.Data[2], Q.Data[3], Q.Data[4], Q.Data[5], Q.Data[6], Q.Data[7], Q.Data[8], Q.Data[9], Q.Data[10], Q.Data[11], Q.Data[12], Q.Data[13], Q.Data[14], Q.Data[15], Q.Data[16], Q.Data[17], Q.Data[18], Q.Data[19]) // Print R fmt.Printf(&quot;[%.2f %.2f %.2f %.2f]\\n&quot;+ &quot;[0 %.2f %.2f %.2f]\\n&quot;+ &quot;[0 0 %.2f %.2f]\\n&quot;+ &quot;[0 0 0 %.2f]\\n&quot;, R.Data[0], R.Data[1], R.Data[2], R.Data[3], R.Data[5], R.Data[6], R.Data[7], R.Data[10], R.Data[11], R.Data[15]) // Calculate Qᵀb Qb := blas64.Vector{N: 4, Inc: 1, Data: make([]float64, 4)} blas64.Gemv(blas.Trans, 1, Q, b, 0, Qb) // Solve Rx = Qᵀb x := blas64.General{Rows: 4, Cols: 1, Stride: 1, Data: Qb.Data} if ok := lapack64.Trtrs(blas.NoTrans, R, x); !ok { panic(&quot;Solve X failed&quot;) } // Print x fmt.Printf(&quot;[%.2f %.2f %.2f %.2f]\\n&quot;, x.Data[0], x.Data[1], x.Data[2], x.Data[3]) // Calculate RᵀR RR := blas64.General{Rows: 4, Cols: 4, Stride: 4, Data: make([]float64, 16)} for i := 0; i &lt; R.N; i++ { for j := i; j &lt; R.N; j++ { RR.Data[i*R.N+j] = R.Data[i*R.N+j] } } blas64.Trmm(blas.Left, blas.Trans, 1, R, RR) // Calculate (RᵀR)⁻ swap := make([]int, 4) work = make([]float64, 4) if ok := lapack64.Getrf(RR, swap); !ok { panic(&quot;LU decomposition unstable&quot;) } if ok := lapack64.Getri(RR, swap, work, len(work)); !ok { panic(&quot;LU inverse failed&quot;) } 
// Print (RᵀR)⁻ fmt.Printf(&quot;[%.2f %.2f %.2f %.2f]\\n&quot;+ &quot;[%.2f %.2f %.2f %.2f]\\n&quot;+ &quot;[%.2f %.2f %.2f %.2f]\\n&quot;+ &quot;[%.2f %.2f %.2f %.2f]\\n&quot;, RR.Data[0], RR.Data[1], RR.Data[2], RR.Data[3], RR.Data[4], RR.Data[5], RR.Data[6], RR.Data[7], RR.Data[8], RR.Data[9], RR.Data[10], RR.Data[11], RR.Data[12], RR.Data[13], RR.Data[14], RR.Data[15]) // Calculate residual e = y - Aβ beta := blas64.Vector{N: 4, Inc: 1, Data: x.Data} residual := blas64.Vector{N: len(y), Inc: 1, Data: make([]float64, len(y))} copy(residual.Data, y) blas64.Gemv(blas.NoTrans, -1, A, beta, 1, residual) // Calculate unbiased variance σ² = Σe² / (n-k-1) freedomDeg := float64(len(y) - (beta.N - 1) - 1) unbiasedVar := blas64.Dot(residual, residual) / freedomDeg fmt.Printf(&quot;[%.2f %.2f %.2f %.2f %.2f] / %.2f -&gt; %.2f\\n&quot;, residual.Data[0], residual.Data[1], residual.Data[2], residual.Data[3], residual.Data[4], freedomDeg, unbiasedVar) // Predict at a new predictor row x* xStar := blas64.Vector{N: 4, Inc: 1, Data: []float64{1, 2, 1, 2}} yStar := blas64.Dot(beta, xStar) // Calculate (σ²(XᵀX)⁻)xᵀ xStarT := blas64.General{Rows: 4, Cols: 1, Stride: 1, Data: xStar.Data} predVar := blas64.General{Rows: beta.N, Cols: 1, Stride: 1, Data: make([]float64, beta.N)} blas64.Gemm(blas.NoTrans, blas.NoTrans, unbiasedVar, RR, xStarT, 0, predVar) // Calculate prediction variance x(σ²(XᵀX)⁻)xᵀ predVarT := blas64.Vector{N: beta.N, Inc: 1, Data: predVar.Data} yVar := blas64.Dot(xStar, predVarT) fmt.Printf(&quot;%.2f (var %.2f)\\n&quot;, yStar, yVar) } SVD decomposition When a matrix is not invertible, we can use its pseudoinverse $A^+$ as a substitute. The pseudoinverse satisfies the Moore–Penrose conditions: $AA^+A = A$ $A^+AA^+ = A^+$ $(AA^+)^* = AA^+$ $(A^+A)^* = A^+A$ Even if we cannot invert $A$, we can still solve systems using $A^+$. One way to construct $A^+$ is via SVD. 
Factor a matrix $A$ as $A = U\\Sigma V^*$ where $U$ and $V$ are orthogonal and $\\Sigma$ is diagonal: $$ A \\to U\\Sigma V^* = \\begin{bmatrix}a_{11}&amp;a_{12}\\\\a_{21}&amp;a_{22}\\\\a_{31}&amp;a_{32}\\end{bmatrix} \\to \\begin{bmatrix}u_{11}&amp;u_{12}&amp;u_{13}\\\\u_{21}&amp;u_{22}&amp;u_{23}\\\\u_{31}&amp;u_{32}&amp;u_{33}\\end{bmatrix} \\times \\begin{bmatrix}\\sigma_{1}&amp;0\\\\0&amp;\\sigma_{2}\\\\0&amp;0\\end{bmatrix} \\times \\begin{bmatrix}\\nu_{11}&amp;\\nu_{12}\\\\\\nu_{21}&amp;\\nu_{22}\\end{bmatrix} $$ Construct the pseudoinverse of $\\Sigma$: $$ \\Sigma^+ = \\begin{bmatrix}\\frac{1}{\\sigma_{1}}&amp;0&amp;0\\\\0&amp;\\frac{1}{\\sigma_{2}}&amp;0\\end{bmatrix} $$ Then define the pseudoinverse of $A$: $$ A^+ = V\\Sigma^+ U^* = \\begin{bmatrix}\\nu_{11}&amp;\\nu_{21}\\\\\\nu_{12}&amp;\\nu_{22}\\end{bmatrix} \\times \\begin{bmatrix}\\frac{1}{\\sigma_{1}}&amp;0&amp;0\\\\0&amp;\\frac{1}{\\sigma_{2}}&amp;0\\end{bmatrix} \\times \\begin{bmatrix}u_{11}&amp;u_{21}&amp;u_{31}\\\\u_{12}&amp;u_{22}&amp;u_{32}\\\\u_{13}&amp;u_{23}&amp;u_{33}\\end{bmatrix} $$ Implementation-wise, we handle two cases: If $\\dim(V) &lt; \\dim(U)$, $$\\begin{bmatrix}\\frac{\\nu_{11}}{\\sigma_1}&amp;\\frac{\\nu_{21}}{\\sigma_2}&amp;0\\\\\\frac{\\nu_{12}}{\\sigma_1}&amp;\\frac{\\nu_{22}}{\\sigma_2}&amp;0\\end{bmatrix} \\times \\begin{bmatrix}u_{11}&amp;u_{21}&amp;u_{31}\\\\u_{12}&amp;u_{22}&amp;u_{32}\\\\u_{13}&amp;u_{23}&amp;u_{33}\\end{bmatrix} \\to \\begin{bmatrix}\\frac{\\nu_{11}}{\\sigma_1}&amp;\\frac{\\nu_{21}}{\\sigma_2}\\\\\\frac{\\nu_{12}}{\\sigma_1}&amp;\\frac{\\nu_{22}}{\\sigma_2}\\end{bmatrix} \\times \\begin{bmatrix}u_{11}&amp;u_{21}\\\\u_{12}&amp;u_{22}\\\\u_{13}&amp;u_{23}\\end{bmatrix}$$ If $\\dim(U) &lt; \\dim(V)$, $$\\begin{bmatrix}\\nu_{11}&amp;\\nu_{21}&amp;\\nu_{31}\\\\\\nu_{12}&amp;\\nu_{22}&amp;\\nu_{32}\\\\\\nu_{13}&amp;\\nu_{23}&amp;\\nu_{33}\\end{bmatrix} \\times 
\\begin{bmatrix}\\frac{u_{11}}{\\sigma_1}&amp;\\frac{u_{21}}{\\sigma_1}\\\\\\frac{u_{12}}{\\sigma_2}&amp;\\frac{u_{22}}{\\sigma_2}\\\\0&amp;0\\end{bmatrix} \\to \\begin{bmatrix}\\nu_{11}&amp;\\nu_{21}\\\\\\nu_{12}&amp;\\nu_{22}\\\\\\nu_{13}&amp;\\nu_{23}\\end{bmatrix} \\times \\begin{bmatrix}\\frac{u_{11}}{\\sigma_1}&amp;\\frac{u_{21}}{\\sigma_1}\\\\\\frac{u_{12}}{\\sigma_2}&amp;\\frac{u_{22}}{\\sigma_2}\\end{bmatrix} $$ One can verify that $A^+$ satisfies the Moore–Penrose conditions: $AA^+A=U\\Sigma V^*V\\Sigma^+U^*U\\Sigma V^*=U\\Sigma\\Sigma^+\\Sigma V^*=U\\Sigma V^*=A$ $A^+AA^+=V\\Sigma^+U^*U\\Sigma V^*V\\Sigma^+U^*=V\\Sigma^+\\Sigma\\Sigma^+U^*=V\\Sigma^+U^*=A^+$ $(AA^+)^*=(U\\Sigma V^*V\\Sigma^+U^*)^*=(U\\Sigma\\Sigma^+U^*)^*=U\\Sigma\\Sigma^+U^*=AA^+$ $(A^+A)^*=(V\\Sigma^+U^*U\\Sigma V^*)^*=(V\\Sigma^+\\Sigma V^*)^*=V\\Sigma^+\\Sigma V^*=A^+A$ From the derivation in this post, OLS admits the representation $$\\hat{\\boldsymbol\\beta} = (\\boldsymbol{X^* X})^+\\boldsymbol X^*\\boldsymbol y = \\boldsymbol X^+\\boldsymbol y$$ so we can estimate regression parameters via the pseudoinverse constructed from SVD. 
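The construction above can be checked on the diagonal factor alone; a stand‑alone Go sketch (toy 3×2 $\Sigma$, values chosen arbitrarily) builds $\Sigma^+$ and verifies the first Moore–Penrose condition $\Sigma\Sigma^+\Sigma = \Sigma$:

```go
package main

import "fmt"

// matMul multiplies an (m×k) by a (k×n) row-major matrix.
func matMul(a []float64, m, k int, b []float64, n int) []float64 {
	c := make([]float64, m*n)
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			for l := 0; l < k; l++ {
				c[i*n+j] += a[i*k+l] * b[l*n+j]
			}
		}
	}
	return c
}

// pinvDiag returns the (n×m) pseudoinverse of an (m×n) diagonal Σ:
// transpose the shape and take reciprocals of the non-zero singular values.
func pinvDiag(sigma []float64, m, n int) []float64 {
	p := make([]float64, n*m)
	for i := 0; i < m && i < n; i++ {
		if sigma[i*n+i] != 0 {
			p[i*m+i] = 1 / sigma[i*n+i]
		}
	}
	return p
}

func main() {
	// Σ = [3 0; 0 2; 0 0] (3×2), so Σ⁺ = [1/3 0 0; 0 1/2 0] (2×3).
	sigma := []float64{3, 0, 0, 2, 0, 0}
	pinv := pinvDiag(sigma, 3, 2)
	check := matMul(matMul(sigma, 3, 2, pinv, 3), 3, 3, sigma, 2)
	fmt.Println(check) // should reproduce Σ
}
```

Zero singular values are simply left as zeros in $\Sigma^+$, which is what makes the pseudoinverse usable even when the matrix is singular.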
https://math.stackexchange.com/questions/4440503/moore-penrose-pseudoinverse-solves-the-least-squares-problem-svd-framework Using the pseudoinverse, we can simplify the covariance expression as $$ Var(\\hat{\\boldsymbol\\beta}) / \\sigma^2 = (\\boldsymbol{X&#39; X})^{-1} = (\\boldsymbol{X&#39; X})^{+}=\\boldsymbol X^+(\\boldsymbol X&#39;)^{+}=\\boldsymbol X^+(\\boldsymbol X^{+})&#39; $$

package main

import (
	&quot;fmt&quot;

	&quot;gonum.org/v1/gonum/blas&quot;
	&quot;gonum.org/v1/gonum/blas/blas64&quot;
	&quot;gonum.org/v1/gonum/lapack&quot;
	&quot;gonum.org/v1/gonum/lapack/lapack64&quot;
)

// printMat prints a row-major matrix row by row.
func printMat(m blas64.General) {
	for i := 0; i &lt; m.Rows; i++ {
		fmt.Printf(&quot;%.2f\\n&quot;, m.Data[i*m.Stride:i*m.Stride+m.Cols])
	}
}

func main() {
	y := []float64{2, 3, 5, 7, 10}
	X := []float64{
		1, 1, 10, 1,
		1, 2, 8, 0,
		1, 3, 9, 1,
		1, 4, 7, 1,
		1, 5, 6, 0,
	}
	A := blas64.General{Rows: 5, Cols: 4, Data: X, Stride: 4}
	b := blas64.Vector{N: len(y), Data: y, Inc: 1} // N must equal len(y), not the column count

	// Decompose A = UΣVᵀ
	SVD := blas64.General{Rows: 5, Cols: 4, Data: make([]float64, len(A.Data)), Stride: 4}
	copy(SVD.Data, A.Data)
	U := blas64.General{Rows: A.Rows, Cols: A.Rows, Data: make([]float64, A.Rows*A.Rows), Stride: A.Rows}
	V := blas64.General{Rows: A.Cols, Cols: A.Cols, Data: make([]float64, A.Cols*A.Cols), Stride: A.Cols}
	S := make([]float64, min(A.Rows, A.Cols))
	work := []float64{0}
	// workspace query, then the actual decomposition
	lapack64.Gesvd(lapack.SVDAll, lapack.SVDAll, SVD, U, V, S, work, -1)
	work = make([]float64, int(work[0]))
	if ok := lapack64.Gesvd(lapack.SVDAll, lapack.SVDAll, SVD, U, V, S, work, len(work)); !ok {
		panic(&quot;SVD decomposition failed&quot;)
	}
	printMat(U)              // U
	fmt.Printf(&quot;%.2f\\n&quot;, S) // diagonal of Σ
	printMat(V)              // Vᵀ

	// Calculate Σ⁺
	for i := 0; i &lt; len(S); i++ {
		if S[i] &gt; 0 {
			S[i] = 1 / S[i]
		}
	}
	if A.Rows &gt; A.Cols {
		// Calculate V = (Σ⁺ᵀVᵀ)ᵀ
		for i := 0; i &lt; len(V.Data); i++ {
			V.Data[i] *= S[i/A.Cols]
		}
		U.Cols = A.Cols // trim U
	} else {
		// Calculate U = (UΣ⁺ᵀ)ᵀ
		for i := 0; i &lt; len(U.Data); i++ {
			U.Data[i] *= S[i%A.Rows]
		}
		V.Rows = A.Rows // trim V
	}

	// Calculate A⁺ = VΣ⁺Uᵀ = Vᵀ x Uᵀ
	INV := blas64.General{Rows: A.Cols, Cols: A.Rows, Data: make([]float64, A.Cols*A.Rows), Stride: A.Rows}
	blas64.Gemm(blas.ConjTrans, blas.ConjTrans, 1, V, U, 0, INV)
	printMat(INV)

	// Calculate x = A⁺b
	x := blas64.Vector{N: 4, Data: make([]float64, 4), Inc: 1}
	blas64.Gemv(blas.NoTrans, 1, INV, b, 0, x)
	fmt.Printf(&quot;%.2f\\n&quot;, x.Data)

	// Calculate A⁺A⁺ᵀ = (XᵀX)⁻¹
	AA := blas64.General{Rows: 4, Cols: 4, Data: make([]float64, 16), Stride: 4}
	blas64.Gemm(blas.NoTrans, blas.Trans, 1, INV, INV, 0, AA)
	printMat(AA)

	// Calculate residual e = Xβ - y (the sign cancels in e²)
	beta := blas64.Vector{N: 4, Data: x.Data, Inc: 1}
	residual := blas64.Vector{N: len(y), Data: make([]float64, len(y)), Inc: 1}
	copy(residual.Data, y)
	blas64.Gemv(blas.NoTrans, 1, A, beta, -1, residual)

	// Calculate unbiased variance σ² = Σe² / (n-k-1)
	freedomDeg := float64(len(y) - (beta.N - 1) - 1)
	unbiasedVar := blas64.Dot(residual, residual) / freedomDeg
	fmt.Printf(&quot;%.2f / %.2f -&gt; %.2f\\n&quot;, residual.Data, freedomDeg, unbiasedVar)

	// Predict
	xStar := blas64.General{Rows: 2, Cols: 4, Data: []float64{
		1, 2, 1, 2,
		1, 2, 2, 1,
	}, Stride: 4}
	yStar := blas64.Vector{N: 2, Data: []float64{0, 0}, Inc: 1}
	blas64.Gemv(blas.NoTrans, 1, xStar, beta, 0, yStar)

	// Calculate prediction variance x(σ²(XᵀX)⁻¹)xᵀ
	predVar := blas64.General{Rows: xStar.Rows, Cols: beta.N, Data: make([]float64, beta.N*xStar.Rows), Stride: beta.N}
	blas64.Gemm(blas.NoTrans, blas.Trans, unbiasedVar, xStar, AA, 0, predVar)
	yVar := make([]float64, 2)
	for i := 0; i &lt; len(predVar.Data); i++ {
		yVar[i/beta.N] += predVar.Data[i] * xStar.Data[i]
	}
	fmt.Printf(&quot;%.2f ± %.2f, %.2f ± %.2f\\n&quot;, yStar.Data[0], yVar[0], yStar.Data[1], yVar[1])
}

Model diagnostics Coefficient of determination The coefficient of determination, or $R^2$, measures how well the regression model fits the data. Total sum of squares: $\\text{SS}_{\\text{tot}} = \\sum^n_i( y_i-\\bar y)^2$ Residual sum of squares: $\\text{SS}_{\\text{res}} = \\sum^n_i(y_i-\\hat y_i)^2 = \\sum_i e_i^2$ Coefficient of determination: $R^2 = 1 - \\dfrac{\\text{SS}_{\\text{res}}/n}{\\text{SS}_{\\text{tot}}/n} = 1 - \\dfrac{\\text{SS}_{\\text{res}}}{\\text{SS}_{\\text{tot}}}$ Interpretation: $\\text{SS}_{\\text{tot}}$: total variation in the data. $\\text{SS}_{\\text{res}}$: variation in the response not explained by the model. $R^2$: proportion of variance in the response explained by the model.
The range of $R^2$ is $[0, 1]$: $R^2 = 1$: perfect fit; the model explains all variation in $y$. $R^2 = 0$: the model explains none of the variation. Adding irrelevant predictors never decreases $R^2$, which can lead to overly complex and overfitted models. To penalize unnecessary parameters, we introduce degrees of freedom: Sample degrees of freedom: $n-1$. Residual degrees of freedom: $n-k-1$ (subtract $k$ regressors and 1 intercept). This leads to the adjusted $R^2$: $$\\bar R^2 = 1 - \\frac{\\text{SS}_{\\text{res}}/(n-k-1)}{\\text{SS}_{\\text{tot}}/(n-1)} = 1-\\frac{(1-R^2)(n-1)}{n-k-1}$$ Information criteria In practice, we often compare multiple candidate models. Information criteria provide a quantitative way to trade off goodness of fit and model complexity. Goodness of fit is usually represented by the log-likelihood. For linear regression, $$\\log(L) = -\\tfrac{n}{2}\\big(\\log(2\\pi)+\\log(\\tfrac{\\text{SS}_{\\text{res}}}{n}) + 1\\big)$$ Three common information criteria are: Akaike Information Criterion (AIC) $$AIC = -2\\log(L)+2k$$ Bayesian Information Criterion (BIC) $$BIC = -2\\log(L)+\\log(n)k$$ Corrected Akaike Information Criterion (AICc) $$AIC_C = AIC + \\frac{2k(k+1)}{n-k-1}$$ AIC: balances likelihood and number of parameters $k$, and is widely used for large samples. BIC: imposes a stronger penalty on model complexity and tends to favor more parsimonious models. It is popular when overfitting is a concern or when we believe the data come from some true underlying model. AICc: corrects AIC for small sample sizes, where AIC can favor overly complex models. If we mainly care about predictive performance rather than whether a particular model is “true”, AIC is often preferred; BIC is more common in econometrics. Condition number See for example: https://www.cnblogs.com/daniel-D/p/3219802.html Significance tests Once we have estimated coefficients, we can use hypothesis tests to assess model parsimony and variable importance.
t-test The t-statistic is used to test whether a single predictor has a significant effect on the response. Hypotheses: $H_0 : \\beta = 0$ (the predictor has no effect). $H_1 : \\beta \\ne 0$ (the predictor has an effect). t-statistic: $t = \\dfrac{\\hat\\beta - \\beta_0}{SE(\\hat\\beta)}$ $\\hat\\beta$: estimated coefficient. $\\beta_0$: hypothesized value (usually 0). $SE(\\hat\\beta)$: standard error of $\\hat\\beta$. Test procedure: Compute the degrees of freedom $n-k-1$. Look up the critical value for the chosen significance level and df. Perform a two-sided test and decide whether to reject $H_0$. Interpretation: Rejecting $H_0$ implies the predictor has a statistically significant effect. Failing to reject $H_0$ suggests the effect is not statistically significant. F-test The F-statistic is used to test the overall significance of the regression model. Residual diagnostics Durbin–Watson test: detects autocorrelation, especially first-order serial correlation. Shapiro–Wilk test: tests normality of residuals. Levene or Bartlett test: tests homogeneity of variance across groups. "},{"slug":"stl-decomposition","title":"STL Decomposition","tags":["Statistics","TimeSeriesAnalysis"],"content":"STL (Seasonal-Trend decomposition using Loess) is a robust time series decomposition algorithm. It uses Loess smoothing to accurately decompose a series into three components: trend, seasonality, and remainder, and is widely used for anomaly detection and forecasting. Model specification Time series typically have the following features: Trend Trend: long-term upward or downward movement. Seasonality Seasonal: fluctuations with a fixed frequency, usually within one year, with clear repeating patterns (e.g. temperature, tourist volume). Cycle Cyclic: fluctuations with non-fixed frequency, usually spanning more than one year, and the cycle length may change over time (e.g. macroeconomy). Decomposing a time series helps us understand it better. 
A series is usually decomposed into three parts: Trend–cycle component trend-cycle. Seasonal component seasonal. Remainder component remainder. There are two common decomposition forms: Additive: $$y_t = S_t + T_t + R_t$$ Multiplicative: $$y_t = S_t \\times T_t \\times R_t$$ When the magnitude of seasonal fluctuations or trend–cycle changes is unrelated to the level of the series, use an additive decomposition. When the magnitude of seasonal fluctuations or trend–cycle changes is proportional to the level of the series, use a multiplicative decomposition. You can first stabilize the series via a log transform and then apply additive decomposition on the transformed series: $$y_t = S_t \\times T_t \\times R_t \\ \\ \\to\\ \\ \\ \\log y_t = \\log S_t + \\log T_t + \\log R_t$$ If seasonality itself is not of interest, you can remove the seasonal component from the original data to obtain a seasonally adjusted series. For example, seasonally adjusting monthly unemployment data emphasizes changes in the underlying economic conditions rather than seasonal effects. Classical decomposition Use moving averages to obtain the trend component $$\\hat{T}_t$$. Remove the trend to obtain the de-trended series $$y_t - \\hat{T}_t $$ or $$y_t / \\hat{T}_t$$. Compute the seasonal component $$\\hat{S}_t$$ based on the de-trended series. Remove the seasonal component to obtain the remainder $$\\hat{R}_t$$. Trend component Let $m$ be the seasonal period. We can use moving averages to smooth out random noise and obtain the trend. When $m$ is odd, compute an $m\\text{-MA}$: $$\\hat{T}_t = \\frac{1}{m}\\sum_{j=-k}^k y_{t+j}, \\quad k = \\frac{m-1}{2}$$ When $m$ is even, compute a $(2\\times m)\\text{-MA}$: $$\\hat{T}_t = \\frac{1}{2m}\\sum_{j=-k}^{k-1}y_{t+j} + \\frac{1}{2m}\\sum_{j=-(k-1)}^k y_{t+j}, \\quad k = \\frac{m}{2}$$ Seasonal component If the seasonal period is $m$, take the average of de-trended values within each season.
This yields $m$ seasonal factors $$S^{(1)}, \\ldots , S^{(m)} $$, which must satisfy: Additive seasonality: $$S^{(1)} + \\cdots + S^{(m)} = 0 $$ Multiplicative seasonality: $$S^{(1)} + \\cdots + S^{(m)} = m$$

package main

import &quot;fmt&quot;

func main() {
	trend := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9}
	seasonal := []float64{1, 3, 9, 1, 3, 9, 1, 3, 9}
	remainder := []float64{0.47, 0.12, 0.33, 0.03, 0.18, 0.14, 0.1, 0.33, 0.45}
	mulTs := make([]float64, 9)
	addTs := make([]float64, 9)
	for i := 0; i &lt; 9; i++ {
		mulTs[i] = trend[i] * seasonal[i] * remainder[i]
		addTs[i] = trend[i] + seasonal[i] + remainder[i]
	}
	at, as, ar := decomposeClassic(addTs, 3, true)
	fmt.Printf(&quot;%v\\n&quot;, at)
	fmt.Printf(&quot;%v\\n&quot;, as)
	fmt.Printf(&quot;%v\\n&quot;, ar)
	mt, ms, mr := decomposeClassic(mulTs, 3, false)
	fmt.Printf(&quot;%v\\n&quot;, mt)
	fmt.Printf(&quot;%v\\n&quot;, ms)
	fmt.Printf(&quot;%v\\n&quot;, mr)
}

func decomposeClassic(ts []float64, period int, additive bool) (trend, seasonal, residual []float64) {
	trend = movingAvg(ts, period)
	deTrended := make([]float64, len(trend))
	k := (len(ts) - len(trend)) / 2
	if additive {
		for i, v := range ts[k : len(ts)-k] {
			deTrended[i] = v - trend[i]
		}
	} else {
		for i, v := range ts[k : len(ts)-k] {
			deTrended[i] = v / trend[i]
		}
	}
	summary := 0.
	seasonal = make([]float64, period)
	for i := 0; i &lt; period; i++ {
		n, m := 0., 0
		for j := i; j &lt; len(deTrended); j += period {
			n += deTrended[j]
			m++
		}
		seasonal[(i+k)%period] = n / float64(m)
		summary += seasonal[(i+k)%period]
	}
	mean := summary / float64(period)
	if additive {
		for i := 0; i &lt; period; i++ {
			seasonal[i] -= mean
		}
	} else {
		for i := 0; i &lt; period; i++ {
			seasonal[i] /= mean
		}
	}
	residual = make([]float64, len(trend))
	if additive {
		for i := 0; i &lt; len(residual); i++ {
			residual[i] = deTrended[i] - seasonal[(i+k)%period]
		}
	} else {
		for i := 0; i &lt; len(residual); i++ {
			residual[i] = ts[i+k] / trend[i] / seasonal[(i+k)%period]
		}
	}
	return trend, seasonal, residual
}

func movingAvg(x []float64, m int) []float64 {
	k := m / 2
	y := make([]float64, len(x)-k*2)
	if m%2 == 1 { // m-MA
		for i := k; i &lt; k+len(y); i++ {
			n := 0.
			for j := -k; j &lt;= k; j++ {
				n += x[i+j]
			}
			y[i-k] = n / float64(m)
		}
	} else { // (2 x m)-MA
		for i := k; i &lt; k+len(y); i++ {
			n := (x[i-k] + x[i+k]) / 2
			for j := 1 - k; j &lt;= k-1; j++ {
				n += x[i+j]
			}
			y[i-k] = n / float64(m)
		}
	}
	return y
}

Loess When fitting nonlinear data with linear regression, we often try to enhance the model with: Interaction terms: $$y = \\beta_0 + \\beta_1x_1+ \\beta_2x_2+ \\beta_3x_1x_2$$ Higher-order terms: $$y = \\beta_0 + \\beta_1x_1+ \\beta_2x_1^2+ \\beta_3 x_1^{-1}$$ Transcendental terms: $$y = \\beta_0 + \\beta_1x_1+ \\beta_2\\log x_1 + \\beta_3 \\sin x_1$$ However, all of these approaches require manual feature engineering (designing, selecting, and combining features), which has several drawbacks: The feature construction process is tedious and not easily automated; it becomes inefficient when there are many datasets. Manually designed features cannot easily adapt to changing data; once the data distribution shifts, previous features may no longer work. Locally Weighted Regression (LWR) was proposed to address these issues.
Its main advantages are: Automation: no parametric assumptions and no need for manual feature engineering. Stability: locally adaptive, robust to changes in data distribution and to outliers. LWR is a nonparametric regression method for modeling nonlinear relationships. The model does not contain explicit parameters $\\boldsymbol \\beta$ in the traditional sense; instead, it uses the sample set $(\\boldsymbol X,\\boldsymbol y)$ directly for prediction. Its core idea is to assign a weight function $w_i(x)$ to each sample $(x_i,y_i)$ and predict via $$y = w_1(x)y_1 + \\cdots + w_n(x)y_n = \\sum_{i=1}^n w_i(x) y_i$$ This allows the model to better fit local structure in the data and improve predictive accuracy. The weight function $w_i(x)$ measures the distance between the input $x$ and the training point $x_i$: The closer $x$ is to $x_i$, the larger the weight on $y_i$. The farther $x$ is from $x_i$, the smaller the weight on $y_i$. Common choices for $w_i(x)$ include: Gaussian: $w_i(x) = \\exp\\bigg(-\\frac{(x - x_i)^2}{2\\tau^2}\\bigg)$ Tri-cubic: $w_i(x) = \\begin{cases} \\left[1 - \\left(\\frac{|x - x_i|}{h}\\right)^3\\right]^3, &amp; \\text{if } |x - x_i| \\le h \\\\ 0, &amp; \\text{otherwise} \\end{cases}$ Bi-square: $w_i(x) = \\begin{cases} \\left[1 - \\left(\\frac{|x - x_i|}{h}\\right)^2\\right]^2, &amp; \\text{if } |x - x_i| \\le h \\\\ 0, &amp; \\text{otherwise} \\end{cases}$ The hyperparameters $\\tau$ and $h$ are called the bandwidth: Larger bandwidth: weights decay more slowly, the influence range is wider. Smaller bandwidth: weights decay faster, the influence range is narrower. Bandwidth has a direct impact on the final fit: Larger bandwidth: smoother curve, but may underfit. Smaller bandwidth: rougher curve, but may overfit. Parameter estimation LWR itself is nonparametric, but for a given query point $x$ we can compute the weights $w_1(x), \\cdots, w_n(x)$. Then, using OLS or MLE, we estimate a local parameter vector $\\hat{\\boldsymbol\\beta}$.
Each time $x$ changes, we recompute $\\hat{\\boldsymbol\\beta}$. OLS We can view LWR as fitting a local linear model around each point $(x_i,y_i)$ by minimizing a weighted squared error. The weighted residual sum of squares is $$\\sum_{i=1}^n w_i(x)e_i^2 = \\sum_{i=1}^n w_i(x) (y_i - x_i^T \\boldsymbol\\beta)^2$$ In matrix form, $(\\boldsymbol {y -X \\beta})&#39;\\boldsymbol W(\\boldsymbol {y -X \\beta})$ $\\boldsymbol y = (y_1, \\ldots, y_n)$ $\\boldsymbol \\beta = (\\beta_1, \\ldots, \\beta_k)$ $\\boldsymbol X = \\begin{bmatrix} 1 &amp; x_{1,1} &amp; \\cdots &amp; x_{k,1} \\\\ 1 &amp; x_{1,2} &amp; \\cdots &amp; x_{k,2} \\\\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\ 1 &amp; x_{1,n} &amp; \\cdots &amp; x_{k,n} \\end{bmatrix}$ $\\boldsymbol W = \\begin{bmatrix} w_1(x) &amp; 0 &amp;\\cdots &amp; 0 \\\\ 0 &amp; w_2(x) &amp; \\cdots &amp; 0 \\\\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\ 0 &amp; 0 &amp;\\cdots &amp; w_n(x) \\end{bmatrix}$ Thus LWR reduces to a weighted least squares problem: find $\\boldsymbol\\beta$ that minimizes $$\\min_\\beta\\big[(\\boldsymbol {y -X \\beta})&#39;\\boldsymbol W(\\boldsymbol {y -X \\beta})\\big]$$ Expand: $$(\\boldsymbol {y -X \\beta})&#39;\\boldsymbol W(\\boldsymbol {y -X \\beta}) = \\boldsymbol{y}&#39;\\boldsymbol{W}\\boldsymbol{y} - 2\\boldsymbol{\\beta}&#39;\\boldsymbol{X}&#39;\\boldsymbol{W}\\boldsymbol{y} + \\boldsymbol{\\beta}&#39;\\boldsymbol{X}&#39;\\boldsymbol{W}\\boldsymbol{X}\\boldsymbol{\\beta}$$ Differentiate w.r.t.
$\\boldsymbol\\beta$: $$\\frac{\\partial}{\\partial \\boldsymbol{\\beta}} (\\boldsymbol {y -X \\beta})&#39;\\boldsymbol W(\\boldsymbol {y -X \\beta}) = -2\\boldsymbol{X}&#39;\\boldsymbol{W}\\boldsymbol{y} + 2\\boldsymbol{X}&#39;\\boldsymbol{W}\\boldsymbol{X}\\boldsymbol{\\beta}$$ Solve for $\\boldsymbol\\beta$: $$\\hat{\\boldsymbol\\beta} = (\\boldsymbol{X&#39;WX})^{-1}\\boldsymbol{X&#39;Wy}$$ MLE Assume $(\\boldsymbol X, \\boldsymbol y)$ follow a normal model $\\boldsymbol y \\sim \\mathcal{N}(\\boldsymbol X \\boldsymbol\\beta, \\sigma^2I)$. Then the joint density (likelihood) of the sample is $$L(\\beta,\\sigma^2) = \\prod_{i=1}^n P(x_i,y_i|\\beta) = \\left(\\frac{1}{\\sqrt{2\\pi\\sigma^2}}\\right)^n\\exp\\bigg(-\\frac{1}{2\\sigma^2}(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)&#39;\\boldsymbol W(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)\\bigg)$$ Taking logs gives the log-likelihood $$\\ell(\\beta,\\sigma^2) = \\log L(\\beta,\\sigma^2) = -\\frac{n}{2}\\log(2\\pi\\sigma^2)-\\frac{1}{2\\sigma^2}(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)&#39;\\boldsymbol W(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)$$ Differentiate w.r.t. 
$\\beta$ and $\\sigma^2$: $$\\frac{\\partial \\ell }{\\partial \\beta} = \\frac{1}{\\sigma^2}\\boldsymbol{X&#39;W}(\\boldsymbol{y-X\\beta})$$ $$\\frac{\\partial \\ell }{\\partial \\sigma^2} = -\\frac{n}{2\\sigma^2} + \\frac{1}{2\\sigma^4}(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)&#39;\\boldsymbol W(\\boldsymbol y -\\boldsymbol X\\boldsymbol\\beta)$$ Solving yields $\\hat{\\boldsymbol\\beta} = (\\boldsymbol{X&#39;WX})^{-1}\\boldsymbol{X&#39;Wy}$ $\\hat\\sigma^2 = \\frac{1}{n}(\\boldsymbol y -\\boldsymbol X\\hat{\\boldsymbol\\beta})&#39;\\boldsymbol W(\\boldsymbol y -\\boldsymbol X\\hat{\\boldsymbol\\beta}) = \\frac{1}{n}\\sum^n_i w_i(x)e_i^2$ $\\log L = -\\frac{n}{2}\\bigg[\\log(2\\pi)+\\log\\Big(\\frac{\\sum^n_i w_i(x)e_i^2}n\\Big) + 1\\bigg]$ Sample code

package main

import (
	&quot;fmt&quot;
	&quot;math&quot;

	&quot;gonum.org/v1/gonum/blas/blas64&quot;
	&quot;gonum.org/v1/gonum/mat&quot;
)

type Gaussian struct {
	tau float64
}

func (g Gaussian) Weight(x blas64.General, x0 blas64.Vector) []float64 {
	weight := make([]float64, x.Rows)
	for i := 0; i &lt; x.Rows; i++ {
		sum, pos, end := 0., i*x.Stride, x0.Inc*x0.N
		for j := 0; j &lt; end; j += x0.Inc {
			v := x.Data[pos] - x0.Data[j]
			sum += v * v
			pos++
		}
		// Gaussian kernel w = exp(-(x-xi)²/(2τ²));
		// sum already holds the squared distance
		weight[i] = math.Exp(-sum / (2 * g.tau * g.tau))
	}
	return weight
}

func main() {
	y := mat.NewVecDense(5, []float64{2, 3, 5, 7, 10})
	X := mat.NewDense(5, 4, []float64{
		1, 1, 10, 1,
		1, 2, 8, 0,
		1, 3, 9, 1,
		1, 4, 7, 1,
		1, 5, 6, 0,
	})
	x0 := mat.NewVecDense(4, []float64{1, 2, 1, 2})
	w := Gaussian{1}.Weight(X.RawMatrix(), x0.RawVector())
	W := mat.NewDiagDense(len(w), w)
	fmt.Printf(&quot;%v\\n&quot;, w)

	// β̂ = (X&#39;WX)⁻¹X&#39;Wy
	n, k := X.Dims()
	XW := mat.NewDense(k, n, make([]float64, k*n))
	XW.Mul(X.T(), W)
	XWY := mat.NewVecDense(k, make([]float64, k))
	XWY.MulVec(XW, y)
	XWX := mat.NewDense(k, k, make([]float64, k*k))
	XWX.Mul(XW, X)
	INV := mat.NewDense(k, k, make([]float64, k*k))
	if err := INV.Inverse(XWX); err != nil {
		panic(err)
	}
	beta := mat.NewVecDense(k, make([]float64, k))
	beta.MulVec(INV, XWY)
	fmt.Printf(&quot;%v\\n&quot;, mat.Formatted(beta))
	y0 := mat.Dot(beta, x0)
	fmt.Printf(&quot;%v -&gt; %v\\n&quot;, mat.Formatted(x0.T()), y0)
}

STL decomposition STL (Seasonal-Trend decomposition using LOESS) estimates trend and seasonal components using local regression (LOESS). Compared with classical decomposition, STL has several advantages: Better handles nonlinear trends and non-stationary seasonality. Supports non-integer seasonal periods, offering greater flexibility. Robust to outliers, reducing their impact on the decomposition. LOESS smoothing A key improvement in STL over classical decomposition is replacing simple moving averages with LOESS. This makes the method more sensitive to local features and helps preserve important details in the data. q-neighbourhood weights Let $\\lambda$ denote distances between points, sorted from nearest to farthest as $\\lambda_1(x), \\ldots, \\lambda_n(x)$. Define the q-neighbourhood distance $\\lambda_q(x)$ as: If $q \\le n$, then $\\lambda_q(x)$ is the distance to the $q$-th nearest point. If $q &gt; n$, then $\\lambda_q(x) = \\lambda_n(x)\\frac{q}{n}$. Use $\\lambda_q(x)$ as the bandwidth $h$ in the tri-cubic kernel to obtain the q-neighbourhood weight: $$w_i(x) = \\begin{cases} \\left[1 - \\left(\\frac{|x - x_i|}{\\lambda_q(x)}\\right)^3\\right]^3, &amp; \\text{if } |x - x_i| &lt; \\lambda_q(x) \\\\ 0, &amp; \\text{otherwise} \\end{cases}$$ If $|x_i - x| \\ge \\lambda_q(x)$, then $w_i(x) = 0$. If $|x_i - x| &lt; \\lambda_q(x)$, then $w_i(x) &gt; 0$. d-th order polynomial fit After obtaining the weights via $q$, we fit a local polynomial to the data. Depending on the shape of the curve, different polynomial orders can be used: For relatively smooth curves, use a first-order polynomial (locally linear fit). For more curved patterns, use a second-order polynomial (locally quadratic fit). In practice, STL usually uses a first-order polynomial.
Overall design STL consists of two nested loops: Inner loop: updates seasonal and trend components using LOESS smoothing. Outer loop: runs the inner loop and updates robustness weights. Robustness weights reduce the influence of transient outliers on the seasonal and trend components. STL is additive by default, but a multiplicative decomposition can be obtained by applying a log or Box–Cox transform to the original data. Algorithm parameters Period length The series is split into cycle-subseries according to the seasonal period $n_{(p)}$. LOESS smoothing parameters $n_{(s)}$: span for smoothing the seasonal component. $n_{(t)}$: span for smoothing the trend component. $n_{(\\ell)}$: span for the low-pass filter. Iteration counts $n_{(i)}$: number of inner iterations. $n_{(o)}$: number of outer iterations. Inner loop For the first inner loop, initialize the trend as $T^{(0)} = 0$. After the $k$-th inner loop, we have seasonal and trend components $S^{(k)}$ and $T^{(k)}$. The $(k+1)$-th inner loop consists of: Detrending Compute the de-trended series $Y - T^{(k)}$. Cycle-subseries smoothing Apply LOESS with parameters $q = n_{(s)}, d=1$ to the de-trended series $Y - T^{(k)}$. The smoothed cycle-subseries form a temporary seasonal series $C^{(k+1)}$. Low-pass filtering of smoothed cycle-subseries Apply a low-pass filter to $C^{(k+1)}$ to obtain the low-frequency component $L^{(k+1)}$: Two moving averages with window length $n_{(p)}$. One moving average with window length 3. One LOESS smoothing with $q = n_{(\\ell)}, d=1$. Detrending of smoothed cycle-subseries Subtract the low-frequency component to get the pure seasonal component $$S^{(k+1)} = C^{(k+1)} - L^{(k+1)}$$ Deseasonalizing Compute the seasonally adjusted series $Y - S^{(k+1)}$. Trend smoothing Apply LOESS with $q = n_{(t)}, d=1$ to the seasonally adjusted series $Y - S^{(k+1)}$. This yields the new trend $T^{(k+1)}$. 
Outer loop After finishing the inner loop, compute the remainder $$R = Y - T - S$$ For outliers, the absolute value $|R|$ will be large. Define the outlier threshold as $h = 6\\,\\text{median}(|R|)$ and use it as the bandwidth of a bi-square weight to obtain robustness weights: $$\\rho_i = \\begin{cases} \\left[1 - \\left(\\frac{|R_i|}{h}\\right)^2\\right]^2, &amp; \\text{if } |R_i| &lt; h \\\\ 0, &amp; \\text{otherwise} \\end{cases}$$ If $|R_i| \\ge h$, treat the observation as an outlier and set its weight $\\rho_i = 0$. Implementation The reference CPython/Cython implementation is: https://github.com/statsmodels/statsmodels/blob/main/statsmodels/tsa/stl/_stl.pyx It follows the original Fortran code from the paper. Instead of performing full LOESS at every point, it combines LOESS and linear interpolation to improve computational efficiency. package main import ( &quot;errors&quot; &quot;fmt&quot; &quot;math&quot; &quot;slices&quot; ) func main() { co2 := []float64{315.58, 316.39, 316.79, 317.82, 318.39, 318.22, 316.68, 315.01, 314.02, 313.55, 315.02, 315.75, 316.52, 317.1, 317.79, 319.22, 320.08, 319.7, 318.27, 315.99, 314.24, 314.05, 315.05, 316.23, 316.92, 317.76, 318.54, 319.49, 320.64, 319.85, 318.7, 316.96, 315.17, 315.47, 316.19, 317.17, 318.12, 318.72, 319.79, 320.68, 321.28, 320.89, 319.79, 317.56, 316.46, 315.59, 316.85, 317.87, 318.87, 319.25, 320.13, 321.49, 322.34, 321.62, 319.85, 317.87, 316.36, 316.24, 317.13, 318.46, 319.57, 320.23, 320.89, 321.54, 322.2, 321.9, 320.42, 318.6, 316.73, 317.15, 317.94, 318.91, 319.73, 320.78, 321.23, 322.49, 322.59, 322.35, 321.61, 319.24, 318.23, 317.76, 319.36, 319.5, 320.35, 321.4, 322.22, 323.45, 323.8, 323.5, 322.16, 320.09, 318.26, 317.66, 319.47, 320.7, 322.06, 322.23, 322.78, 324.1, 324.63, 323.79, 322.34, 320.73, 319, 318.99, 320.41, 321.68, 322.3, 322.89, 323.59, 324.65, 325.3, 325.15, 323.88, 321.8, 319.99, 319.86, 320.88, 322.36, 323.59, 324.23, 325.34, 326.33, 327.03, 326.24, 325.39, 323.16, 321.87,
321.31, 322.34, 323.74, 324.61, 325.58, 326.55, 327.81, 327.82, 327.53, 326.29, 324.66, 323.12, 323.09, 324.01, 325.1, 326.12, 326.62, 327.16, 327.94, 329.15, 328.79, 327.53, 325.65, 323.6, 323.78, 325.13, 326.26, 326.93, 327.84, 327.96, 329.93, 330.25, 329.24, 328.13, 326.42, 324.97, 325.29, 326.56, 327.73, 328.73, 329.7, 330.46, 331.7, 332.66, 332.22, 331.02, 329.39, 327.58, 327.27, 328.3, 328.81, 329.44, 330.89, 331.62, 332.85, 333.29, 332.44, 331.35, 329.58, 327.58, 327.55, 328.56, 329.73, 330.45, 330.98, 331.63, 332.88, 333.63, 333.53, 331.9, 330.08, 328.59, 328.31, 329.44, 330.64, 331.62, 332.45, 333.36, 334.46, 334.84, 334.29, 333.04, 330.88, 329.23, 328.83, 330.18, 331.5, 332.8, 333.22, 334.54, 335.82, 336.45, 335.97, 334.65, 332.4, 331.28, 330.73, 332.05, 333.54, 334.65, 335.06, 336.32, 337.39, 337.66, 337.56, 336.24, 334.39, 332.43, 332.22, 333.61, 334.78, 335.88, 336.43, 337.61, 338.53, 339.06, 338.92, 337.39, 335.72, 333.64, 333.65, 335.07, 336.53, 337.82, 338.19, 339.89, 340.56, 341.22, 340.92, 339.26, 337.27, 335.66, 335.54, 336.71, 337.79, 338.79, 340.06, 340.93, 342.02, 342.65, 341.8, 340.01, 337.94, 336.17, 336.28, 337.76, 339.05, 340.18, 341.04, 342.16, 343.01, 343.64, 342.91, 341.72, 339.52, 337.75, 337.68, 339.14, 340.37, 341.32, 342.45, 343.05, 344.91, 345.77, 345.3, 343.98, 342.41, 339.89, 340.03, 341.19, 342.87, 343.74, 344.55, 345.28, 347, 347.37, 346.74, 345.36, 343.19, 340.97, 341.2, 342.76, 343.96, 344.82, 345.82, 347.24, 348.09, 348.66, 347.9, 346.27, 344.21, 342.88, 342.58, 343.99, 345.31, 345.98, 346.72, 347.63, 349.24, 349.83, 349.1, 347.52, 345.43, 344.48, 343.89, 345.29, 346.54, 347.66, 348.07, 349.12, 350.55, 351.34, 350.8, 349.1, 347.54, 346.2, 346.2, 347.44, 348.67} stl, err := NewSTL(12, true, nil, nil, nil) if err != nil { panic(err) } season, trend, residual, weight := stl.Fit(co2) fmt.Printf(&quot;%v\\n\\n&quot;, season) fmt.Printf(&quot;%v\\n\\n&quot;, trend) fmt.Printf(&quot;%v\\n\\n&quot;, residual) 
fmt.Printf(&quot;%v\\n\\n&quot;, weight) } type stlSmooth struct { len, deg, jmp int } type stlCtx struct { useRW bool value, season, trend, robust []float64 work [5][]float64 } type STL struct { period int robust bool seasonal, trend, lowPass stlSmooth } func NewSTL( period int, robust bool, seasonal, trend, lowPass *stlSmooth) (*STL, error) { if period &lt; 2 { return nil, errors.New(&quot;period must be at least 2&quot;) } if seasonal == nil { seasonal = &amp;stlSmooth{7, 1, 1} } if seasonal.len &lt; 3 || seasonal.len%2 == 0 { return nil, errors.New(&quot;seasonal.len must be an odd number of at least 3&quot;) } else if seasonal.jmp &lt; 1 { return nil, errors.New(&quot;seasonal.jmp must be a positive number&quot;) } if trend == nil { t := int(math.Ceil(1.5 * float64(period) / (1 - 1.5/float64(seasonal.len)))) trend = &amp;stlSmooth{t + (1 - t%2), 1, 1} } if trend.len &lt; 3 || trend.len%2 == 0 || trend.len &lt;= period { return nil, errors.New(&quot;trend.len must be an odd number of at least 3 and greater than period&quot;) } else if trend.jmp &lt; 1 { return nil, errors.New(&quot;trend.jmp must be a positive number&quot;) } if lowPass == nil { l := period + 1 lowPass = &amp;stlSmooth{l + (1 - l%2), 1, 1} } if lowPass.len &lt; 3 || lowPass.len%2 == 0 || lowPass.len &lt;= period { return nil, errors.New(&quot;lowPass.len must be an odd number of at least 3 and greater than period&quot;) } else if lowPass.jmp &lt; 1 { return nil, errors.New(&quot;lowPass.jmp must be a positive number&quot;) } return &amp;STL{ period, robust, *seasonal, *trend, *lowPass, }, nil } func (stl *STL) Fit(y []float64) ( season, trend, residual, weight []float64) { var innerIter, outerIter int if stl.robust { innerIter, outerIter = 2, 15 } else { innerIter, outerIter = 5, 0 } n := len(y) var work [5][]float64 for i := 0; i &lt; len(work); i++ { // reserve 2 x p space // temporary seasonal series C range [-p+1, n+p] work[i] = make([]float64, n+2*stl.period) } ctx := &amp;stlCtx{ false, y, make([]float64,
n), make([]float64, n), make([]float64, n), work, } for i := 0; i &lt; n; i++ { ctx.robust[i] = 1 } for k := 0; ; k++ { stl.decompose(ctx, innerIter) if k+1 &gt; outerIter { break } stl.robustWeight(ctx) ctx.useRW = true } residual = make([]float64, n) for i := 0; i &lt; n; i++ { residual[i] = ctx.value[i] - ctx.season[i] - ctx.trend[i] } return ctx.season, ctx.trend, residual, ctx.robust } // _onestp func (stl *STL) decompose(ctx *stlCtx, innerIter int) { y, trend, season, work := ctx.value, ctx.trend, ctx.season, ctx.work for n, j := len(ctx.value), 0; j &lt; innerIter; j++ { for i := 0; i &lt; n; i++ { work[0][i] = y[i] - trend[i] // step-1 detrending: work[0] = Y - T } stl.cycleSubSeries(ctx) // step-2 smoothing cycle-sub-series : work[1] = C = CycleSubSeries(work[0]) stl.lowPassFilter(ctx) // step-3 low-pass filtering cycle-sub-series: work[0] = L = LowPassFilter(work[1]) for i := 0; i &lt; n; i++ { // step-4 detrending cycle-sub-series : S = C - L = work[1] - work[0] season[i] = work[1][stl.period+i] - work[0][i] // step-5 deseasonalizing : work[0] = Y - S work[0][i] = y[i] - season[i] } // step-6 trend smoothing : T = smooth(work[0]) stl.trend.smooth(work[0], n, ctx.useRW, ctx.robust, trend, work[2]) } } // _ss func (stl *STL) cycleSubSeries(ctx *stlCtx) { n, period, work, weight := len(ctx.value), stl.period, ctx.work, ctx.robust deTrend, cycle, work1, work2, work3, work4 := work[0], work[1], work[2], work[3], work[4], ctx.season for j := 0; j &lt; period; j++ { k := (n-(j+1))/period + 1 for i := 0; i &lt; k; i++ { work1[i] = deTrend[i*period+j] } if ctx.useRW { for i := 0; i &lt; k; i++ { work3[i] = weight[i*period+j] } } stl.seasonal.smooth(work1, k, ctx.useRW, work3, work2[1:], work4) right := min(stl.seasonal.len, k) work2[0] = stl.seasonal.loess(work1, k, 0, 1, right, work4, ctx.useRW, work3) if math.IsNaN(work2[0]) { work2[0] = work2[1] } left := max(1, k-stl.seasonal.len+1) work2[k+1] = stl.seasonal.loess(work1, k, k+1, left, k, work4, ctx.useRW, 
			work3)
		if math.IsNaN(work2[k+1]) {
			work2[k+1] = work2[k]
		}
		for m := 0; m < k+2; m++ {
			cycle[m*period+j] = work2[m]
		}
	}
}

// _fts
func (stl *STL) lowPassFilter(ctx *stlCtx) {
	n, period, work := len(ctx.value)+2*stl.period, stl.period, ctx.work
	movingAvg(work[1], n, period, work[2])
	movingAvg(work[2], n-period+1, period, work[0])
	movingAvg(work[0], n-2*period+2, 3, work[2])
	stl.lowPass.smooth(work[2], len(ctx.value), false, work[3], work[0], work[4])
}

// _rwts
func (stl *STL) robustWeight(ctx *stlCtx) {
	y, n, trend, season, weight :=
		ctx.value, len(ctx.value), ctx.trend, ctx.season, ctx.robust
	for i := 0; i < n; i++ {
		weight[i] = math.Abs(y[i] - trend[i] - season[i])
	}
	sorted := ctx.work[0][:n]
	copy(sorted, weight)
	slices.Sort(sorted)
	a, b := sorted[n/2], sorted[n-(n/2)-1]
	c := 3.0 * (a + b) // outlier threshold = 6 * median
	if c == 0 {
		for i := 0; i < n; i++ {
			weight[i] = 1
		}
	} else {
		c1, c9 := .001*c, .999*c
		for i, w := range weight {
			if w <= c1 {
				weight[i] = 1
			} else if w <= c9 {
				w /= c
				w2 := 1 - (w * w)
				weight[i] = w2 * w2
			} else {
				weight[i] = 0 // outlier
			}
		}
	}
}

func movingAvg(x []float64, n, step int, avg []float64) {
	v, s := 0.0, float64(step)
	for i := 0; i < step; i++ {
		v += x[i]
	}
	avg[0] = v / s
	for j, k, m := 1, step, 0; j < n-step+1; j++ {
		v += x[k] - x[m]
		avg[j] = v / s
		k, m = k+1, m+1
	}
}

// _ess
func (smooth *stlSmooth) smooth(y []float64, n int, useRW bool,
	robust, ys, tmp []float64) {
	if n < 2 {
		ys[0] = y[0]
		return
	}
	// smooth the below positions with LOESS:
	// 1, 1+1*jmp, 1+2*jmp, 1+3*jmp, ..., N
	jmp := min(smooth.jmp, n-1)
	var left, right int
	if smooth.len >= n {
		left, right = 1, n
		for i := 0; i < n; i += jmp {
			ys[i] = smooth.loess(y, n, i+1, left, right, tmp, useRW, robust)
			if math.IsNaN(ys[i]) {
				ys[i] = y[i]
			}
		}
	} else if jmp == 1 {
		nsh := (smooth.len + 2) / 2
		left, right = 1, smooth.len
		for i := 0; i < n; i++ {
			if (i+1) > nsh && right != n {
				left, right = left+1, right+1
			}
			ys[i] = smooth.loess(y, n, i+1,
				left, right, tmp, useRW, robust)
			if math.IsNaN(ys[i]) {
				ys[i] = y[i]
			}
		}
	} else {
		nsh := (smooth.len + 1) / 2
		for i := 0; i < n; i += jmp {
			if (i + 1) < nsh {
				left, right = 1, smooth.len
			} else if (i + 1) >= (n - nsh + 1) {
				left, right = n-smooth.len+1, n
			} else {
				left, right = i+1-nsh+1, smooth.len+i+1-nsh
			}
			ys[i] = smooth.loess(y, n, i+1, left, right, tmp, useRW, robust)
			if math.IsNaN(ys[i]) {
				ys[i] = y[i]
			}
		}
	}
	if jmp == 1 {
		return // all positions are smoothed by LOESS
	}
	// the other positions are smoothed by linear interpolation
	for i := 0; i < (n - jmp); i += jmp {
		delta := (ys[i+jmp] - ys[i]) / float64(jmp)
		for j := i; j < i+jmp; j++ {
			ys[j] = ys[i] + delta*float64((j+1)-(i+1))
		}
	}
	// make sure position N is smoothed by LOESS
	k := ((n-1)/jmp)*jmp + 1
	if k != n {
		ys[n-1] = smooth.loess(y, n, n, left, right, tmp, useRW, robust)
		if math.IsNaN(ys[n-1]) {
			ys[n-1] = y[n-1]
		}
		if k != (n - 1) {
			delta := (ys[n-1] - ys[k-1]) / float64(n-k)
			for j := k; j < n; j++ {
				ys[j] = ys[k-1] + delta*float64((j+1)-k)
			}
		}
	}
}

// _est
func (smooth *stlSmooth) loess(y []float64, n int, x0, left, right int,
	w []float64, useWeight bool, weight []float64) float64 {
	// calculate the q-neighbourhood weight for point x0
	h := max(x0-left, right-x0)
	if smooth.len > n {
		h += (smooth.len - n) / 2
	}
	ws := 0.0
	h1, h9 := .001*float64(h), .999*float64(h)
	for j := left - 1; j < right; j++ {
		w[j] = 0
		r := math.Abs(float64(j + 1 - x0))
		if r <= h9 { // distance < q-neighbourhood
			if r <= h1 {
				w[j] = 1
			} else {
				u := r / float64(h)
				u3 := 1 - (u * u * u)
				w[j] = u3 * u3 * u3
			}
			if useWeight {
				w[j] *= weight[j] // apply robust weight to ignore outliers
			}
			ws += w[j]
		}
	}
	if ws <= 0 { // loess weight sum <= 0, can't smooth position x0, ignore...
		return math.NaN()
	}
	for j := left - 1; j < right; j++ {
		w[j] /= ws // normalize loess weight
	}
	// deg=0 : constant only
	// deg=1 : constant & trend
	if h > 0 && smooth.deg > 0 {
		a := 0.0 // weighted distance
		for j := left - 1; j < right; j++ {
			a += w[j] * float64(j+1)
		}
		b := float64(x0) - a
		c := 0.0
		for j := left - 1; j < right; j++ {
			v := float64(j+1) - a
			c += w[j] * v * v
		}
		rng := .001 * float64(n-1)
		if math.Sqrt(c) > rng {
			b /= c
			for j := left - 1; j < right; j++ {
				w[j] *= b*(float64(j+1)-a) + 1.0
			}
		}
	}
	ys := 0.0
	for j := left - 1; j < right; j++ {
		ys += w[j] * y[j]
	}
	return ys
}

MSTL Decomposition

MSTL is a robust and accurate seasonal-trend decomposition algorithm for time series with multiple seasonal periods. Compared with alternative approaches, MSTL is more computationally efficient and can handle massive volumes of time series data.

Algorithm parameters:
periods specifies the seasonal periods, one per seasonal component
windows controls the granularity of each seasonal component: the smaller the window, the faster the seasonal component is allowed to change; the larger the window, the slower it changes
lambda an optional Box-Cox coefficient

The algorithm proceeds as follows:
Fill in missing data
Apply a Box-Cox transform to the data (optional)
Repeatedly run STL decomposition to extract the seasonal components from the data
Return the seasonal components together with the trend component of the last decomposition

package main

import (
	"errors"
	"fmt"
	"golang.org/x/exp/rand"
	"gonum.org/v1/gonum/stat/distuv"
	"math"
	"sort"
)

func main() {
	t := make([]float64, 1000)
	norm := distuv.Normal{Mu: 0, Sigma: 1, Src: rand.NewSource(0)}
	for i := range t {
		v := float64(i + 1)
		trend := 0.0001*v*v + 100.
		dailySeason := 5 * math.Sin(2*math.Pi*v/24)
		weeklySeason := 10 * math.Sin(2*math.Pi*v/(24*7))
		noise := norm.Rand()
		t[i] = trend + dailySeason + weeklySeason + noise
	}
	stl, err1 := NewSTL(2, false, nil, nil, nil)
	if err1 != nil {
		panic(err1)
	}
	mstl, err2 := NewMSTL([]int{24, 24 * 7}, nil, math.NaN(), 0, stl)
	if err2 != nil {
		panic(err2)
	}
	season, trend, residual, weight := mstl.Fit(t)
	fmt.Printf("%v\n\n", season)
	fmt.Printf("%v\n\n", trend)
	fmt.Printf("%v\n\n", residual)
	fmt.Printf("%v\n\n", weight)
}

type MSTL struct {
	stl     STL
	season  [][2]int
	iterate int
	lambda  float64
}

func NewMSTL(
	periods, windows []int,
	lambda float64, iterate int, stl *STL) (*MSTL, error) {
	if periods == nil {
		return nil, errors.New("periods is required")
	}
	if windows == nil {
		windows = make([]int, len(periods))
		for i := 0; i < len(windows); i++ {
			windows[i] = 7 + 4*(i+1)
		}
	}
	if len(periods) != len(windows) {
		return nil, errors.New("periods and windows must have the same length")
	}
	var season [][2]int
	for i := 0; i < len(periods); i++ {
		season = append(season, [2]int{periods[i], windows[i]})
	}
	sort.SliceStable(season, func(i, j int) bool {
		a, b := season[i], season[j]
		if a[0] == b[0] {
			return a[1] < b[1]
		}
		return a[0] < b[0]
	})
	if iterate <= 0 {
		iterate = 2
	}
	return &MSTL{*stl, season, iterate, lambda}, nil
}

func (mstl *MSTL) Fit(y []float64) (
	season [][]float64, trend, residual, weight []float64) {
	for i, half := 0, len(y)/2; i < len(mstl.season); i++ {
		period := mstl.season[i][0]
		if period >= half {
			panic("a period is larger than half the length of the time series")
		}
	}
	deSeason := make([]float64, len(y))
	copy(deSeason, y)
	if !math.IsNaN(mstl.lambda) {
		// TODO: apply the Box-Cox transform with coefficient lambda
	}
	n := len(y)
	season = make([][]float64, len(mstl.season))
	for it := 0; it < mstl.iterate; it++ {
		for i := range mstl.season {
			// add the previous estimate of this seasonal
			// component back before re-estimating it
			if s := season[i]; s != nil {
				for j := 0; j < n; j++ {
					deSeason[j] += s[j]
				}
			}
			period, window := mstl.season[i][0], mstl.season[i][1]
			seasonSmooth := stlSmooth{window, mstl.stl.seasonal.deg, mstl.stl.seasonal.jmp}
			stl, err := NewSTL(period, mstl.stl.robust,
				&seasonSmooth, &mstl.stl.trend, &mstl.stl.lowPass)
			if err != nil {
				panic(err)
			}
			season[i], trend, residual, weight = stl.Fit(deSeason)
			// remove the freshly extracted seasonal component
			for j, s := 0, season[i]; j < n; j++ {
				deSeason[j] -= s[j]
			}
		}
	}
	for i := 0; i < n; i++ {
		residual[i] = deSeason[i] - trend[i]
	}
	return season, trend, residual, weight
}
"},{"slug":"how-statistics-works","title":"How Statistics Work","tags":["Statistics"],"content":"Statistics is the fundamental cornerstone of Machine Learning (ML) and Artificial Intelligence (AI). It provides the essential tools for understanding data, uncovering patterns, and quantifying uncertainty. From data preprocessing and model training to evaluation, every stage of ML and AI relies heavily on statistical principles. Without statistics, modern machine learning and artificial intelligence wouldn't exist; it offers the crucial theoretical foundation and methodological guidance for these transformative technologies.
Basic Concepts

Related Terms

Sample Space (Ω) refers to the complete set of all possible outcomes of a random experiment:
Rolling a die: $\Omega_{\text{dice}}=\{1,2,3,4,5,6\}$
Disk failure interval: $\Omega_{\text{MTBF}}=[0,∞)$
Requests per second: $\Omega_{\text{QPS}}=\{0,1,2,3,...\}$
Intraday stock price change: $\Omega_{\text{return}}=[-100\%,∞)$

Random Variable (X : Ω → R) is a function that maps each elementary outcome in the sample space to a real number:
The result of rolling a die is 5: $X_{\text{dice}} = 5$
Disk runs for more than 100,000 hours without failure: $X_{\text{MTBF}} > 100\text k$
QPS of a service at a certain time is 10k: $X_{\text{QPS}} = 10\text k$
Stock price drops: $X_{\text{return}} < 0$

Event is a subset of the sample space, used to describe outcomes of interest and their corresponding probabilities:
"Rolling an even number on a die": $\{2,4,6\} \subset\Omega_{\text{dice}}$, $P(X\text{ is even})$
"Rolling less than 3 on a die": $\{1,2\} \subset\Omega_{\text{dice}}$, $P(X<3)$
"Disk runs for more than 100,000 hours without failure": $[100\text k,∞) \subset\Omega_{\text{MTBF}}$, $P(X > 100\text k)$
"Disk fails between 50,000 and 80,000 hours": $[50\text k,80\text k] \subset\Omega_{\text{MTBF}}$, $P(50\text k ≤ X ≤ 80\text k)$

Probability Distribution

The sample space contains all possibilities, so the sum of probabilities over the entire sample space is $1$. An event is a subset of the sample space, so the probability of an event ranges from $0$ to $1$.
Depending on whether the sample space is countable, random variables can be divided into two types:
Discrete random variables: the sample space is countable, and the probability of a specific sample point can be calculated. For example: $P(X_{\text{dice}} = 5) = 1/6$
Continuous random variables: the sample space is uncountable, and only the probability over an interval can be calculated. For example: $P(X_{\text{return}} < 0) = 50\%$

When the sample space consists of countable sample points $\omega$, the probability of an event (a subset) is calculated as follows:
Iterate over each sample point $\omega$ in the subset
Calculate the probability $P(\omega)$ of each sample point
Sum the probabilities $∑P(\omega)$

However, this naive calculation of event probabilities only applies to discrete random variables, not continuous ones. To unify the calculation of event probabilities, we introduce the concept of probability distributions. A probability distribution is a function that describes the probability of a random variable $X$ taking on its various possible values. When events are expressed through random variables, the probability distribution lets us compute the probability of an event.
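The naive summation rule above can be sketched in Go (the language used for code elsewhere on this site); `dicePMF` and `eventProb` are illustrative names, not library functions:

```go
package main

import "fmt"

// dicePMF is the probability mass function of a fair die:
// every outcome in {1,...,6} has probability 1/6.
func dicePMF(omega int) float64 {
	if omega >= 1 && omega <= 6 {
		return 1.0 / 6.0
	}
	return 0
}

// eventProb computes P(event) for a discrete random variable by
// iterating over the sample points of the event and summing their
// probabilities, which is exactly the procedure described above.
func eventProb(event []int, pmf func(int) float64) float64 {
	p := 0.0
	for _, omega := range event {
		p += pmf(omega)
	}
	return p
}

func main() {
	fmt.Printf("%.4f\n", eventProb([]int{2, 4, 6}, dicePMF)) // P(even) = 0.5000
	fmt.Printf("%.4f\n", eventProb([]int{1, 2}, dicePMF))    // P(X<3) = 0.3333
}
```

For a continuous variable there is no such sum, which is exactly why the distribution functions below are needed.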
Probability distributions essentially summarize patterns observed in daily life:
If a die is fair, the result of rolling it follows a Uniform Distribution: $X_{\text{dice}} \sim \text{Uniform}(a,b)$, where $a$ and $b$ are the minimum and maximum values
If the failure rate is independent of usage time, disk MTBF follows an Exponential Distribution: $X_{\text{MTBF}} \sim \text{Exp}(\lambda)$, where $λ$ is the average failure rate
If each request is independent, the QPS of a service follows a Poisson Distribution: $X_{\text{QPS}} \sim \text{Poisson}(\lambda)$, where $λ$ is the mean QPS
If stock price changes are unpredictable, the log return follows a Normal Distribution: $\ln(X_{\text{return}}) \sim \text{Normal}(\mu,\sigma)$, where $μ$ is the mean log return (the drift), $σ$ is the volatility, and $\ln(X_{\text{return}}) =\ln (S_{t+1}/S_t) = \ln S_{t+1} − \ln S_t = ε_{t+1} \sim \text{Normal}(\mu,\sigma)$

Distribution Functions

To accurately describe the probability distribution of a random variable, three core functions are needed:
Probability Mass Function $p(x)$ (PMF): describes the probability of a discrete random variable taking a specific value: $p_{X_{\text{dice}}}(x) = \text{Uniform}(1,6)$, $p_{X_{\text{QPS}}}(x) = \text{Poisson}(10)$
Probability Density Function $f(x)$ (PDF): describes the likelihood of a continuous random variable near a specific value: $f_{X_{\text{MTBF}}} = \text{Exp}(1/10000)$, $f_{X_{\text{log-return}}} = \text{Normal}(100,0.1)$
Cumulative Distribution Function $F(x)$ (CDF): describes the probability that a random variable is less than or equal to a specific value:
$P(X≤x)=F_X(x)$: probability that $X$ is less than or equal to $x$
$P(X>x)=1−F_X(x)$: probability that $X$ is greater than $x$
$P(a<X≤b)=F_X(b)-F_X(a)$: probability that $X$ falls in the interval $(a,b]$

The inverse function of the CDF, $F_X^{-1}(x)$, is called the Inverse CDF or Quantile Function.
It is used to calculate the value $x = F_X^{-1}(p)$ such that $P(X≤x) = p$. Although it does not directly describe the probability distribution, it plays an important role in hypothesis testing.

Properties of Probability

In practice, we mainly focus on two core properties of random variables:
Expectation (Mean): the average of all possible values of a random variable, weighted by their probabilities
Continuous: $E(X)=∫xf(x)dx$
Discrete: $E(X)=∑_ix_ip(x_i)$
Variance: the average squared distance between all possible values and the mean (uncertainty)
Continuous: $Var(X)=E[(X−E(X))^2]=∫(x−μ)^2f(x)dx$
Discrete: $Var(X)=E[(X−E(X))^2]=∑_i(x_i−μ)^2 p(x_i)$

If the probability distribution of a random variable is known, its mean and variance can be read directly off the distribution. For example, for the discrete $\text{Uniform}(a,b)$:
$E(X)=\frac{a+b}2$
$Var(X)=\frac{(b-a+1)^2-1}{12}$
So for $X_\text{dice} \sim \text{Uniform}(1,6)$:
$E(X_\text{dice})=\frac{1+6}2=3.5$
$Var(X_\text{dice})=\frac{(6-1+1)^2-1}{12}=\frac{35}{12}\approx 2.92$

Sampling Survey

In practice, we often face the following situations:
The probability distribution of the random variable is unknown
The sample space is too large for a census

To study the properties of a random variable, we can only draw samples from the population:
Population mean: $$\mu = \frac{1}{N}\sum^N_ix_i$$
Population variance: $$\sigma^2 = \frac{1}{N}\sum^N_i(x_i -\mu)^2$$
Sample mean: $$\bar{x} = \frac{1}{n}\sum^n_ix_i$$
Sample variance (biased): $$\hat{\sigma}^2 = \frac{1}{n}\sum^n_i(x_i -\bar{x})^2$$
Sample variance (unbiased): $$s^2 = \frac{1}{n-1}\sum^n_i(x_i -\bar{x})^2$$

The purpose of sampling is to obtain estimates $\bar x,s^2$ that are as close as possible to the population values $\mu,\sigma^2$. When the number of samples $m$ is large enough, the sample mean approaches the population mean: $\mu = \frac{1}{m}\sum^m_{i}X_i\ (m \to \infty)$.
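A minimal Go sketch of the sample statistics above; `sampleStats` is an illustrative helper, not a library function:

```go
package main

import "fmt"

// sampleStats returns the sample mean, the biased variance
// estimate (divide by n), and the unbiased one (divide by n-1),
// matching the formulas above.
func sampleStats(x []float64) (mean, biased, unbiased float64) {
	n := float64(len(x))
	for _, v := range x {
		mean += v
	}
	mean /= n
	ss := 0.0 // sum of squared deviations from the sample mean
	for _, v := range x {
		d := v - mean
		ss += d * d
	}
	return mean, ss / n, ss / (n - 1)
}

func main() {
	x := []float64{2, 4, 4, 4, 5, 5, 7, 9}
	mean, biased, unbiased := sampleStats(x)
	fmt.Printf("%.4f %.4f %.4f\n", mean, biased, unbiased) // 5.0000 4.0000 4.5714
}
```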
The population size $N$ is usually very large, while $n$ is the sample size of a single survey. Because deviations are measured from the sample mean $\bar x$ rather than the true mean $\mu$, $\hat{\sigma}^2$ systematically underestimates the spread and is biased downward. When the sample size is small, the unbiased estimate $s^2$ is therefore used as the approximation for $\sigma^2$.

Normal Distribution

Central Limit Theorem

With the concept of random variables, we can study sample data as follows:
Assume the population follows a certain distribution
Estimate the parameters of that distribution from the sample
Analyze based on the probability distribution function

When there is not enough prior knowledge, it is usually assumed that the random variable follows a normal distribution. The theoretical basis for this assumption is Asymptotic Normality in statistics: as the sample size approaches infinity, the distribution of many sample statistics approaches a normal distribution. A special case of asymptotic normality is the Central Limit Theorem (CLT): for a random variable following any distribution (with finite variance), the sample mean tends to a normal distribution as the number of samples increases.

To help understand, here are some interactive experiments where you can adjust the sample size and observe the distribution:
Uniform distribution sampling
Exponential distribution sampling

Asymptotic normality and the central limit theorem are the foundation of many statistical applications: based on the sample mean and standard deviation, you can construct a confidence interval for the population mean even if the population distribution is unknown. This allows us to make normal-based statistical inferences for many non-normal data sets.

The formal statement of the central limit theorem (can be skipped): Given a random variable $X$ following any distribution with finite mean and variance.
Let $$Y_n$$ be the mean of $n$ samples of $X$: $$Y_n = \frac{1}{n}\sum^n_{i=1}X_i$$, with:
Expectation $$E[Y_n] = E[\frac{1}{n}\sum^n_{i=1}X_i] = \frac{1}{n}\sum^n_{i=1}E[X_i] = \frac{1}{n}nE[X] = E[X]$$
Variance $$Var[Y_n] = Var[\frac{1}{n}\sum^n_{i=1}X_i] = \frac{1}{n^2}\sum^n_{i=1}Var[X_i] = \frac{1}{n^2}nVar(X) = \frac{Var(X)}{n}$$
When $n$ is large (a common rule of thumb is $n > 30$), $$Y_n$$ approximately follows a normal distribution $$Y_n \sim \mathcal{N}(E[X],\frac{Var(X)}{n})$$, and the standardized form $$\frac{Y_n-E[X]}{\sqrt{Var(X)/n}} \sim \mathcal{N}(0,1)$$.

Normal Distribution

The Normal Distribution is one of the most important and common continuous probability distributions in probability and statistics. Its probability density function forms the characteristic bell curve: most data points are concentrated near the mean, and the further a value lies from the mean, the less likely it is to occur. The normal distribution is fully determined by two parameters:
Mean (μ): determines the center of the distribution
Standard Deviation (σ): determines the "width" or spread
Larger standard deviation: more spread out, flatter and wider bell curve
Smaller standard deviation: more concentrated, taller and narrower bell curve

The normal distribution gives rise to three commonly used distributions in statistics:

Z Distribution

Any normal random variable $X$ can be transformed into a standard normal variable $Z = \frac{X−μ}σ \sim \mathcal N(0,1)$. The value of $Z$ is called the z-score, indicating how many standard deviations $X$ is from its mean $μ$. We will use the z-score to construct confidence intervals.

Chi-Square Distribution

Suppose $Z_1,...,Z_k$ are $k$ independent standard normal random variables $\mathcal N(0,1)$. Their sum of squares follows a chi-square distribution with $k$ degrees of freedom: $Z_1^2+\cdots+Z_k^2 \sim χ^2(k)$. It is used to test independence and homogeneity between multiple random variables.
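The CLT statement above can be checked with a quick simulation: averaging $n$ draws from $\text{Uniform}(0,1)$, where $E[X]=1/2$ and $Var(X)=1/12$, should give sample means centered at $1/2$ with variance close to $\frac{1/12}{n}$. A sketch, with `simulateMeans` as an illustrative name:

```go
package main

import (
	"fmt"
	"math/rand"
)

// simulateMeans draws `trials` sample means, each averaging n
// Uniform(0,1) variables, and returns the mean and variance of
// those sample means.
func simulateMeans(n, trials int, seed int64) (mean, variance float64) {
	rng := rand.New(rand.NewSource(seed))
	means := make([]float64, trials)
	for t := range means {
		sum := 0.0
		for i := 0; i < n; i++ {
			sum += rng.Float64() // E[X] = 1/2, Var(X) = 1/12
		}
		means[t] = sum / float64(n)
	}
	for _, v := range means {
		mean += v
	}
	mean /= float64(trials)
	for _, v := range means {
		d := v - mean
		variance += d * d
	}
	variance /= float64(trials)
	return mean, variance
}

func main() {
	// CLT: Yn ~ N(E[X], Var(X)/n) = N(0.5, (1/12)/30) ≈ N(0.5, 0.002778)
	m, v := simulateMeans(30, 200000, 1)
	fmt.Printf("E[Yn]≈%.4f  Var[Yn]≈%.6f\n", m, v)
}
```

The observed mean and variance land very close to the theoretical $E[X]$ and $\frac{Var(X)}{n}$, even though the underlying distribution is not normal.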
t Distribution

Given $Z \sim N(0,1)$ and $V \sim χ^2(k)$, the ratio $T=\frac{Z}{\sqrt{V/k}} \sim t(k)$ follows a t-distribution with $k$ degrees of freedom. It is mainly used for statistical inference on the mean of a normal population when the population standard deviation is unknown and the sample size is small.

Empirical Rule

The core value of the normal distribution is the Empirical Rule:
About 68% of data falls within 1 standard deviation of the mean ($μ±σ$)
About 95% falls within 2 standard deviations ($μ±2σ$)
About 99.7% falls within 3 standard deviations ($μ±3σ$)

Although the empirical rule is only approximate, it provides valuable insight and decision support in many practical applications. For example, the probability of a data point falling more than 3 standard deviations from the mean is less than 0.3%, so the 3-sigma rule can be used to quickly identify outliers.

In practice, 6-sigma is often used instead of 3-sigma:
3-sigma means a 99.73% pass rate, i.e., 2700 defects per million opportunities (DPMO)
6-sigma means a 99.99966% pass rate, i.e., only 3.4 defects per million opportunities

In many modern industries and services, a 3-sigma defect rate is unacceptable:
Healthcare: 2700 medical errors per million would harm many patients
Aerospace: 2700 defective aircraft parts per million would be catastrophic
Financial services: 2700 transaction errors per million would cause huge losses and a crisis of trust

Only the near-zero defect pursuit of 6-sigma meets today's high quality requirements.

Confidence Interval

In statistics, we often face the following errors:
The estimated mean $\bar x$ from the sample differs from the true population mean $\mu$
The estimated regression parameter $\hat \beta$ differs from the true value $\beta$
The predicted value $\hat y$ from regression differs from the actual value $y$

These errors are unknown and unavoidable due to the unknowability of the population.
To measure the effectiveness of a statistical task, we introduce the concept of a Confidence Interval:
Treat the estimate as the sample mean $\bar x$ of a normal distribution
Set a symmetric interval centered on the population mean $μ$
Calculate the probability that the sample mean $\bar x$ falls within this interval

The two sides of the confidence interval are called the Margin of Error:
The probability that $\bar x$ falls outside the margin is the Significance Level (α)
The probability that $\bar x$ falls inside the margin is the Confidence Level ($1-α$)

There are two equivalent ways to interpret the confidence level:
The probability that $\bar x$ falls within the confidence interval centered on $μ$
The probability that the confidence interval centered on $\bar x$ contains $μ$

With a fixed sample size $n$:
A higher confidence level means a larger interval: a higher probability of containing the mean, but a greater margin of error
A lower confidence level means a smaller interval: a lower probability of containing the mean, but a smaller margin of error

With the same confidence level $1-α$:
More samples mean smaller variance and less error between the sample and population means
Fewer samples mean larger variance and more error between the sample and population means

The formulaic expression for confidence level $1-α$ (can be skipped):
According to the CLT, the sample mean follows a normal distribution $$\bar{X} \sim \mathcal{N}(\mu,\frac{\sigma^2}{n})$$, and the standardized form $$\frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \sim \mathcal{N}(0,1)$$
Calculate the sample mean $$\bar{x}$$ and the standard error $$\frac{\sigma}{\sqrt{n}}$$ (assuming the population standard deviation $$\sigma$$ is known)
Given confidence level $$1-α$$, use $F_Z^{-1}(1-\frac{α}2)$ to calculate the corresponding z-score $$z_{(1-α)/2}$$
The margin for the sample mean is $$\mu - z_{(1-α)/2} \cdot
\frac{\sigma}{\sqrt{n}} < \bar{x} < \mu + z_{(1-α)/2} \cdot \frac{\sigma}{\sqrt{n}}$$
Rearranged: $$\bar{x} - z_{(1-α)/2} \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{(1-α)/2} \cdot \frac{\sigma}{\sqrt{n}}$$
So we are $1-α$ confident that the mean $\mu$ lies in the interval $$\bar{x} \pm z_{(1-α)/2} \cdot \frac{\sigma}{\sqrt{n}}$$.

Usually, the population standard deviation $$\sigma$$ in $$\frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}$$ is unknown, so the unbiased sample standard deviation $s$ is used instead. The resulting $$\frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}$$ no longer follows the Z distribution, but a t-distribution with $n-1$ degrees of freedom $$t_{n-1}$$. The statistic is replaced by the t-score, and the confidence interval becomes $$\bar{x} \pm t_{(1-α)/2} \cdot \frac{s}{\sqrt{n}}$$.

Hypothesis Testing

Hypothesis testing is a very important tool in statistical inference. It allows us to make judgments or inferences about population parameters based on sample data. Simply put, it uses sample data to judge whether a hypothesis about the population is true. There is always error between sample statistics and the true population value; hypothesis testing helps us make data-driven decisions under this uncertainty. It provides a rigorous framework for assessing whether observed phenomena are due to random fluctuation or a real effect.
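As a quick numerical sketch of the z-based interval above (the sample values are purely illustrative, and the 95% z-score 1.96 is hard-coded):

```go
package main

import (
	"fmt"
	"math"
)

// confidenceInterval returns the bounds x̄ ± z·σ/√n of a
// normal-based interval; z is the critical value for the chosen
// confidence level (1.96 for 95%).
func confidenceInterval(xbar, sigma float64, n int, z float64) (lo, hi float64) {
	margin := z * sigma / math.Sqrt(float64(n))
	return xbar - margin, xbar + margin
}

func main() {
	// hypothetical survey: x̄ = 68.442, known σ = 3, n = 10
	lo, hi := confidenceInterval(68.442, 3, 10, 1.96)
	fmt.Printf("95%% CI: [%.4f, %.4f]\n", lo, hi) // [66.5826, 70.3014]
}
```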
The core idea is proof by contradiction:
Propose a hypothesis $H_0$, then collect sample data
Calculate the probability of observing the sample data under $H_0$
If the probability is very small, $H_0$ is inconsistent with reality and is rejected

In hypothesis testing, two mutually exclusive hypotheses are proposed:
Null Hypothesis ($H_0$): usually the statement the researcher wants to reject, or a statement of "no effect" or "no difference"
Alternative Hypothesis ($H_1$): usually the statement the researcher wants to prove, the opposite of $H_0$
When $H_0$ is rejected, $H_1$ is accepted

Note that both $H_0$ and $H_1$ are hypotheses about the unknown population:
Hypotheses must be set in terms of population parameters, e.g., $$H_0: \mu = 12$$
Hypotheses cannot be set in terms of sample statistics, e.g., $$H_0: \bar{x} \ge 12$$

Significance Level

Let $H_0 = \text{true}$ denote that the null hypothesis actually holds in reality, and $H_0 = \text{false}$ that it does not. When making decisions, there are four possibilities:
Reject $H_0$ and $H_0 = \text{true}$ (Type I error)
Reject $H_0$ and $H_0 = \text{false}$ (Correct)
Fail to reject $H_0$ and $H_0 = \text{true}$ (Correct)
Fail to reject $H_0$ and $H_0 = \text{false}$ (Type II error)

There are two types of errors:
Type I Error: rejecting a true null hypothesis (False Positive)
Type II Error: failing to reject a false null hypothesis (False Negative)

Due to sampling error, both types of errors are unavoidable.
Their probabilities are:
Type I error: $$\alpha = P(\text{Reject } H_0|H_0)$$
Type II error: $$\beta = P(\text{Accept } H_0|H_1)$$

The significance level α determines whether to reject $H_0$. By setting the significance level, you can control the probability of these errors:
$H_0$ is set based on the population mean $\mu$
A lower α means a larger confidence interval $1-α$
The probability that the sample mean $\bar x$ falls outside the interval is then lower
This means stronger evidence is needed to reject $H_0$

Adjusting α affects both error probabilities:
Lower α: $H_0$ is more likely to be accepted and less likely to be rejected; reduces the Type I error risk, increases the Type II error risk
Higher α: $H_0$ is more likely to be rejected and less likely to be accepted; reduces the Type II error risk, increases the Type I error risk

Another important metric is the Power of a Test: $1-β = P(\text{Reject } H_0|H_1)$, i.e., the probability of correctly rejecting $H_0$ when $H_1$ is true. Higher power means higher sensitivity and a greater ability to detect real effects.
Test Statistics

When designing a test, you need to choose the appropriate statistical test based on the data type, distribution, sample size, and hypothesis type, for example:
Z-test: when the population standard deviation $\sigma$ is known and the sample size is large, tests whether the sample mean $\bar x$ is consistent with the population mean $\mu$ in $H_0$
t-test: when the population standard deviation $\sigma$ is unknown or the sample size is small, tests whether the sample mean $\bar x$ is consistent with the population mean $\mu$ in $H_0$
F-test: compares the variances of two samples to test whether their population variances are equal
Chi-square test: uses the chi-square statistic to test whether two samples follow the same population distribution

Different tests use different Test Statistics:
One-sample Z-test: $Z=\frac{\bar{x}-\mu}{\sigma/\sqrt n}$, where $μ$ is the hypothesized population mean, $σ$ the known population standard deviation, $\bar{x}$ the sample mean, and $n$ the sample size
One-sample t-test: $T=\frac{\bar x-\mu}{s/\sqrt n}$, where $μ$ is the hypothesized population mean, $\bar{x}$ the sample mean, $s$ the sample standard deviation, and $n$ the sample size
F-test: $F=\frac{s_1^2}{s_2^2}$, where $s_1^2$ is the larger and $s_2^2$ the smaller sample variance
Chi-square test: $\chi^2=\sum\frac{(O_i-E_i)^2}{E_i}$, where $O_i$ is the observed and $E_i$ the expected frequency

All test statistics are random variables that follow their respective probability distributions. After calculating the statistic, the next step is to use the significance level α to decide whether to reject $H_0$.

Types of Tests

For clarity, the following uses the Z mean test as an example.
Given a population standard deviation $σ = 3$ and a hypothesis about $μ = 66.7$, three types of tests can be designed:
Right-tailed test: $H_0: μ ≤ 66.7$, $H_1: μ > 66.7$
Left-tailed test: $H_0: μ ≥ 66.7$, $H_1: μ < 66.7$
Two-tailed test: $H_0: μ = 66.7$, $H_1: μ ≠ 66.7$

Which type to use depends on your research question and the direction of difference you are looking for:
Right-tailed: $H_1$ has the form parameter > value; the rejection region is on the right; used to prove a parameter has "increased" or is "greater than" a benchmark
Left-tailed: $H_1$ has the form parameter < value; the rejection region is on the left; used to prove a parameter has "decreased" or is "less than" a benchmark
Two-tailed: $H_1$ has the form parameter ≠ value; the rejection region is on both sides; used to prove a parameter is "different from" a benchmark, regardless of direction

Drawing Conclusions

After calculating the test statistic, there are two ways to make a decision:
Calculate the p-value and compare it to the significance level α
Calculate the critical value for α and compare the test statistic to the critical value

p-value

The p-value is the probability of observing the current data, or more extreme data, under $H_0 = \text{true}$. The p-value is compared to the preset significance level α:
$\text{p-value} ≤ α$: the result is rare under $H_0 = \text{true}$; strong evidence to reject $H_0$
$\text{p-value} > α$: the result is common under $H_0 = \text{true}$; not enough evidence to reject $H_0$

The p-value is essentially a probability and can be calculated using the CDF:
Z statistic: $X \sim Z$, use $F_X(x)$ to get the p-value for $x$
Chi-square statistic: $X \sim \chi^2$, use $F_X(x)$ to get the p-value for $x$

Let $X$ be the hypothesized population distribution and $x_{\text{obs}}$ the observed statistic.
The p-value for different test types:
Left-tailed: $\text{p-value}=P(X≤x_{\text{obs}}|H_0=\text{true})=F_X(x_{\text{obs}})$
Right-tailed: $\text{p-value}=P(X≥x_{\text{obs}}|H_0=\text{true})=1-F_X(x_{\text{obs}})$
Two-tailed (symmetric): $\text{p-value}=2×P(X≥|x_{\text{obs}}|\ |H_0=\text{true})=2×(1-F_X(|x_{\text{obs}}|))$

Returning to the Z mean test example, the decision process using the p-value:

Right-tailed: $H_1: \mu > 66.7$
Sample mean $\bar x = 68.442$
Statistic $Z = \frac{68.442 - 66.7}{3/\sqrt{10}} = 1.8362$
CDF $F_Z(1.8362) = 0.9668$
Right-tail p-value $P(\bar{X}>68.442|\mu=66.7) = 1 - 0.9668 = 0.0332$
Since p-value < 0.05, it is reasonable to reject $H_0$ and accept $H_1$

Two-tailed: $H_1: \mu \ne 66.7$
Sample mean $\bar x = 68.442$
Statistic $Z = \frac{68.442 - 66.7}{3/\sqrt{10}} = 1.8362$
CDF $F_Z(1.8362) = 0.9668$
Two-tailed p-value $P(|\bar{X}-66.7|>|68.442 - 66.7|\ |\mu=66.7) = (1-0.9668) \times 2 = 0.0663$
Since p-value > 0.05, there is not enough evidence to reject $H_0$

Left-tailed: $H_1: \mu < 66.7$
Sample mean $\bar x = 64.252$
Statistic $Z = \frac{64.252 - 66.7}{3/\sqrt{10}} = −2.581$
CDF $F_Z(−2.581) = 0.0049$
Left-tail p-value $P(\bar{X}<64.252|\mu=66.7) = 0.0049$
Since p-value < 0.05, it is reasonable to reject $H_0$ and accept $H_1$

Critical Value

Another intuitive approach is to compare against the Critical Value. The critical value has the same unit as the statistic, so they can be compared directly. To convert α to the corresponding critical value, use the Inverse CDF.
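The CDF lookups in the examples above can be done in plain Go: the standard normal CDF can be expressed with `math.Erfc` from the standard library. A minimal sketch:

```go
package main

import (
	"fmt"
	"math"
)

// normalCDF is the standard normal CDF Φ(x), written with the
// complementary error function: Φ(x) = erfc(-x/√2) / 2.
func normalCDF(x float64) float64 {
	return 0.5 * math.Erfc(-x/math.Sqrt2)
}

func main() {
	z := 1.8362 // observed Z statistic from the example above
	fmt.Printf("F_Z(z)     = %.4f\n", normalCDF(z))                 // 0.9668
	fmt.Printf("right tail = %.4f\n", 1-normalCDF(z))               // 0.0332
	fmt.Printf("two-tailed = %.4f\n", 2*(1-normalCDF(math.Abs(z)))) // 0.0663
}
```

The standard library has no inverse normal CDF; for critical values, either use a statistics library such as Gonum's distuv.Normal.Quantile, or the familiar table constants (1.645, 1.96, ...).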
For significance level α, the critical value for different test types:
Left-tailed: $k_{\alpha}=F^{−1}(α)$
Right-tailed: $k_{\alpha}=F^{−1}(1-α)$
Two-tailed (symmetric): $k_{\alpha 2}=F_X^{−1}(\frac{α}2),\ k_{\alpha 1}=F_X^{−1}(1-\frac{α}2)$

If the observed value $x_{\text{obs}}$ (not the standardized statistic) is more extreme than the critical value, we have enough reason to reject $H_0$:
Left-tailed: $x_{\text{obs}} < k_{\alpha}$
Right-tailed: $x_{\text{obs}} > k_{\alpha}$
Two-tailed: $x_{\text{obs}} < k_{\alpha 2} \text{ or } x_{\text{obs}} > k_{\alpha 1}$

Returning to the Z mean test example, the decision process using the critical value:

Right-tailed: $H_1: \mu > 66.7$
Sample mean $\bar x = 68.442$
Inverse CDF $F_Z^{-1}(1-0.05) = 1.645$
Map Z back to the sampling distribution $\mathcal N(66.7,\ 3^2/10)$: right-tail critical value $k_{\alpha} = 66.7 + 1.645 \times (3/\sqrt{10}) = 68.2607$
Since $68.442 > 68.2607$, it is reasonable to reject $H_0$ and accept $H_1$

Two-tailed: $H_1: \mu \ne 66.7$
Sample mean $\bar x = 68.442$
Inverse CDF: $F_{Z}^{-1}(0.05/2) = −1.96$, $F_{Z}^{-1}(1-0.05/2) = 1.96$
Map Z back to the sampling distribution $\mathcal N(66.7,\ 3^2/10)$: critical values $k_{\alpha 2} = 66.7 - 1.96 \times (3/\sqrt{10}) = 64.8406$, $k_{\alpha 1} = 66.7 + 1.96 \times (3/\sqrt{10}) = 68.5594$
Since $68.442 \in [64.8406, 68.5594]$, there is not enough evidence to reject $H_0$

Left-tailed: $H_1: \mu < 66.7$
Sample mean $\bar x = 64.252$
Inverse CDF $F_Z^{-1}(0.05) = -1.645$
Map Z back to the sampling distribution $\mathcal N(66.7,\ 3^2/10)$: left-tail critical value $k_{\alpha} = 66.7 - 1.645 \times (3/\sqrt{10}) = 65.1394$
Since $64.252 < 65.1394$, it is reasonable to reject $H_0$ and accept $H_1$

Summary of Process

To summarize, the basic steps of hypothesis testing are:
Propose two mutually exclusive hypotheses
Choose a significance level
Choose the appropriate statistical test method
Calculate the test statistic
Draw a conclusion based on the corresponding p-value or critical value

Note: Failing to reject the null
hypothesis does not mean it is true, only that there is not enough evidence to prove it is false. "},{"slug":"you-need-to-know-about-memory","title":"You Need To Know About Memory","tags":["OS"],"content":"With the assistance of operating systems and compilers, developers can craft efficient and stable code without delving into the intricacies of hardware. However, to fully harness the potential of hardware resources, programmers must gain a deeper understanding of hardware architectures and implementation principles. As computer hardware has evolved, CPU and memory have taken divergent paths. CPUs prioritize speed, while memory focuses on capacity. This disparity has resulted in a widening gap between CPU core frequencies and memory bus speeds. To bridge this gap, modern hardware architectures have introduced several significant innovations: Multi-level Caching: Multiple layers of high-speed storage are introduced between CPU cores and main memory to store copies of recently accessed instructions and data, enabling faster access. Non-Uniform Memory Access (NUMA): Memory is divided into independent blocks and managed by different CPU cores, preventing memory access bus frequencies from limiting CPU memory access efficiency. Cache Program code and data exhibit both temporal and spatial locality of reference, meaning that the same code or data is likely to be reused within a short time frame. Therefore, modern hardware architectures introduce a multi-level cache with limited space between the CPU and memory. Caches closer to the CPU are typically faster and smaller. 
block-beta columns 6 Core0:2 Core1:2 Core2:2 L1D0[&quot;L1 Data&quot;] L1I0[&quot;L1 Inst&quot;] L1D1[&quot;L1 Data&quot;] L1I1[&quot;L1 Inst&quot;] L1D2[&quot;L1 Data&quot;] L1I2[&quot;L1 Inst&quot;] L20[&quot;L2&quot;]:2 L21[&quot;L2&quot;]:2 L22[&quot;L2&quot;]:2 L3:6 Memory:6 classDef core0 fill:#e0eafd,stroke:#777; classDef core1 fill:#fee3df,stroke:#777; classDef core2 fill:#dff5e3,stroke:#777; classDef l3 fill:#fef1ce,stroke:#777; class Core0,L1D0,L1I0,L20 core0 class Core1,L1D1,L1I1,L21 core1 class Core2,L1D2,L1I2,L22 core2 class L3 l3 The L1 cache serves as a private memory haven for each CPU core. Typically divided into two distinct sections: Instruction Cache (L1i): Stores decoded machine instructions. Data Cache (L1d): Holds recently accessed data. L2 and L3 are unified caches with no clear functional division. L3 is shared by multiple cores, and whether L2 is shared depends on the specific CPU architecture. Cache Coherence The most basic storage unit in a cache is called a line, typically sized at 64 bytes: When a CPU accesses a piece of memory data, it loads adjacent data into the cache as well to improve access efficiency. When a CPU modifies data in the cache, it needs to synchronize the modification with memory. There are two optional strategies: Write-through: Synchronously update both the cache and memory. Write-back: Mark the cache line as dirty, and write it back to memory only when it is evicted. 
block-beta columns 3 space:3 block:cpu1[&quot;CPU A&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot;] thread1[&quot;Thread A&quot;] end space block:cpu2[&quot;CPU B&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot;] thread2[&quot;Thread B&quot;] end space:3 block:cache1[&quot;Cache A&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot;] line11[&quot;Line1&quot;] line12[&quot;Line2&quot;] line13[&quot;Line3&quot;] line14[&quot;Line4&quot;] end space block:cache2[&quot;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;Cache B&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot;] line21[&quot;Line1&quot;] line22[&quot;Line2&quot;] line23[&quot;Line3&quot;] line24[&quot;Line4&quot;] end space:3 block:memory[&quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Memory&quot;]:3 columns 4 A[&quot;64 Byte&quot;] B[&quot;...&quot;] C[&quot;...&quot;] D[&quot;...&quot;] end space:3 thread1 --&gt; line14 thread2 --&gt; line22 C --&gt; line14 C --&gt; line22 classDef title color:#777 classDef core0 fill:#e0eafd,stroke:#777; classDef core1 fill:#fee3df,stroke:#777; classDef core2 fill:#dff5e3,stroke:#777; class cpu1,cpu2,cache1,cache2,memory title class thread1,line14 core0 class thread2,line22 core1 class C core2 Although the write-back strategy effectively saves memory bandwidth, it introduces the problem of concurrent modifications with multiple copies: The same cache line corresponds to different copies in the exclusive caches of different CPUs, and different CPUs can concurrently modify this data. Therefore, the copy of this cache line in the caches of other CPUs needs to be invalidated and later reloaded. To establish a coherent cache view among multiple CPUs, a cache coherence protocol is needed. 
The cache coherence protocol orders write operations to the same location to ensure that all CPUs observe the state changes of that location in the same order. One popular cache coherence protocol is MESI, which sets one of the following four states for cache lines: Modified: The line is loaded only in the current CPU&#39;s cache, and its content is inconsistent with memory. Exclusive: The line is loaded only in the current CPU&#39;s cache, and its content is consistent with memory. Shared: The line is loaded in the caches of multiple CPUs, and its content is consistent with memory. Invalid: The cache line is invalid. State transitions are as follows: Initially, all cache line states are I. After loading data, their states become S or E. When the local CPU modifies it, its state becomes M. When other CPUs modify their copies, its state becomes I. Request For Ownership Distinguishing between S and E states primarily serves performance considerations: When modifying a cache line in the E state, its state can directly be transitioned to M. When modifying a cache line in the S state, an RFO (Request For Ownership) message needs to be sent to other CPUs. This notifies them to invalidate their corresponding local copies (I) and transmit their content back to the initiator of the RFO. Only then can the cache line&#39;s state be transitioned to M. Two scenarios frequently trigger RFO communication and should be avoided as much as possible: Parallel threads running on different cores accessing the same cache line data (e.g., false sharing, where unrelated variables land in the same cache line). The same thread alternating between different CPU cores, necessitating data movement between the local caches of different CPUs (e.g., thread scheduling that disregards CPU affinity). The communication of multi-threaded applications relies on RFO for memory synchronization. Hence, its concurrency is not only limited by the number of CPU cores but also by the communication latency introduced by memory synchronization. 
Careful program design is necessary to minimize accesses to the same memory location from different processors. Concurrent Modifications MESI cache coherence only guarantees that all CPUs observe writes to a location in the same order; it does not make a read-modify-write sequence such as an increment atomic. Suppose a memory location containing a counter with a value of 1 is in state S, and two threads need to perform increment operations on it simultaneously. The CPU does not need to wait for the cache line to transition to state E before fetching the value from the cache and adding it. Instead, it will directly add the current value 1 in the cache to get the new value 2. The new value will be ordered based on the MESI protocol, and once the cache line is available in state E, the new value 2 will be written to the cache line. If the cache reads by these two threads happen concurrently, one of the increment modifications will be lost: two increment operations have occurred, with an expected value of 3, but the actual result is 2. Concurrent writes to the same memory location can lead to unpredictable results. To ensure data integrity, atomic operation instructions provided by the CPU should be used. Special Address Spaces Certain special address spaces have no corresponding physical memory and cannot comply with the aforementioned write-back strategy. These spaces mainly fall into two categories: Mapped to peripheral memory (e.g., graphics card memory) Used to control peripherals themselves (e.g., microcontroller LED addresses) The former typically employs a write-combining strategy, where multiple consecutive write operations are written back to peripheral memory only once they are complete. The latter usually adopts an uncacheable strategy, and data in these addresses is not cached by the CPU. Performance Guidelines Instruction Reduction The larger the code size, the greater the pressure on the L1 instruction cache. 
Therefore, do not overuse loop unrolling and inlining unless they yield a significant performance improvement. There are two main criteria for evaluating whether a function should be inlined, and the product of the two can be used to estimate the increase in code size after inlining: Size of the function body Number of call sites For functions with few call sites or a small body, inlining is often beneficial. However, for small, high-frequency functions, their instructions are likely to already be present in the L1 cache. If the L1 content can be reused and the overall space occupied is reduced, the performance overhead introduced by additional function calls can usually be compensated for. In that case, you can choose to disable inlining to improve the cache hit rate and thus overall performance. GCC provides two compiler attributes, always_inline and noinline, for programmers to control whether to inline (note that always_inline requires the function to also be declared inline): void __attribute__((noinline)) my_function(int arg) { // Function body } static inline void __attribute__((always_inline)) my_function(int arg) { // Function body } Branch Elimination To improve processing efficiency, modern CPUs operate in a pipeline: prefetching and decoding subsequent instructions to be executed while executing the current instruction. 
block-beta columns 9 Cycle[&quot;Cycle&quot;]:2 1 2 3 4 5 6 7 Fetch:2 f1[&quot;A&quot;] f2[&quot;B&quot;] f3[&quot;C&quot;] f4[&quot; &quot;] f5[&quot; &quot;] f6[&quot; &quot;] f7[&quot; &quot;] Decode:2 d1[&quot; &quot;] d2[&quot;A&quot;] d3[&quot;B&quot;] d4[&quot;C&quot;] d5[&quot; &quot;] d6[&quot; &quot;] d7[&quot; &quot;] Execute:2 e1[&quot; &quot;] e2[&quot; &quot;] e3[&quot;A&quot;] e4[&quot;B&quot;] e5[&quot;C&quot;] e6[&quot; &quot;] e7[&quot; &quot;] Memory:2 m1[&quot; &quot;] m2[&quot; &quot;] m3[&quot; &quot;] m4[&quot;A&quot;] m5[&quot;B&quot;] m6[&quot;C&quot;] m7[&quot; &quot;] Write:2 w1[&quot; &quot;] w2[&quot; &quot;] w3[&quot; &quot;] w4[&quot; &quot;] w5[&quot;A&quot;] w6[&quot;B&quot;] w7[&quot;C&quot;] classDef empty color:#777,fill:none,stroke:none; classDef o fill:#bcc0c2,stroke:#666c73,stroke-width:2px classDef a fill:#e0eafd,stroke:#537ac5,stroke-width:2px classDef b fill:#fee3df,stroke:#d35f5c,stroke-width:2px classDef c fill:#dff5e3,stroke:#5a996c,stroke-width:2px class Cycle,1,2,3,4,5,6,7 empty class Fetch,Decode,Execute,Memory,Write o class f1,d2,e3,m4,w5 a class f2,d3,e4,m5,w6 b class f3,d4,e5,m6,w7 c However, when there are branch jumps in the code, the prefetched instructions may not be the ones that actually need to be executed. Take the following code for example: if (TEST()) // TEST() == false BR1() ... else BR2() ... The code for branch BR1 follows immediately after the conditional check TEST, so it will be prefetched into the pipeline. However, the branch that actually needs to be executed is BR2. At this point, pipeline will be stalled, and the CPU can only wait for the BR2 instruction to finish loading. 
block-beta columns 11 Cycle[&quot;Cycle&quot;]:2 1 2 3 4 5 6 7 8 9 Fetch:2 f1[&quot;TEST&quot;] f2[&quot;BR1&quot;] f3[&quot;...&quot;] f4[&quot;BR2&quot;] f5[&quot;...&quot;] f6[&quot; &quot;] f7[&quot; &quot;] f8[&quot; &quot;] f9[&quot; &quot;] Decode:2 d1[&quot; &quot;] d2[&quot;TEST&quot;] d3[&quot;BR1&quot;] d4[&quot;...&quot;] d5[&quot;BR2&quot;] d6[&quot;...&quot;] d7[&quot; &quot;] d8[&quot; &quot;] d9[&quot; &quot;] Execute:2 e1[&quot; &quot;] e2[&quot; &quot;] e3[&quot;TEST&quot;] e45[&quot;Stall&quot;]:2 e6[&quot;BR2&quot;] e7[&quot;...&quot;] e8[&quot; &quot;] e9[&quot; &quot;] Memory:2 m1[&quot; &quot;] m2[&quot; &quot;] m3[&quot; &quot;] m4[&quot;TEST&quot;] m56[&quot;Stall&quot;]:2 m7[&quot;BR2&quot;] m8[&quot;...&quot;] m9[&quot; &quot;] Write:2 w1[&quot; &quot;] w2[&quot; &quot;] w3[&quot; &quot;] w4[&quot; &quot;] w5[&quot;TEST&quot;] w67[&quot;Stall&quot;]:2 w8[&quot;BR2&quot;] w9[&quot;...&quot;] classDef empty color:#777,fill:none,stroke:none; classDef o fill:#bcc0c2,stroke:#666c73,stroke-width:2px classDef a fill:#e0eafd,stroke:#537ac5,stroke-width:2px classDef b fill:#fee3df,stroke:#d35f5c,stroke-width:2px classDef c fill:#dff5e3,stroke:#5a996c,stroke-width:2px classDef stall color:#777,fill:none,stroke-width:2px,stroke-dasharray: 3 5 class Cycle,1,2,3,4,5,6,7,8,9 empty class Fetch,Decode,Execute,Memory,Write o class f1,d2,e3,m4,w5 a class f2,f3,d3,d4 b class f4,f5,d5,d6,e6,e7,m7,m8,w8,w9 c class e45,m56,w67 stall To reduce idle time, modern CPUs use branch prediction to guess the target code of the jump and preload the corresponding instructions into the cache: When branch prediction is correct, it can effectively improve CPU execution efficiency. When branch prediction is wrong, the useless instructions will be loaded into the L1 cache, which will actually slow down the CPU&#39;s execution. 
Although the L1 instruction cache cannot be directly controlled, the instruction prefetch hit rate can be improved by reducing branches in the code: // branch int add_conditional(int a, unsigned int b) { if (b &lt; 16) a += b; return a; } // branchless: the mask -(b &lt; 16) is all ones when b &lt; 16, else zero int add_conditional(int a, unsigned int b) { a += b &amp; -(b &lt; 16); return a; } Code Layout Before instructions enter the L1 cache, their corresponding code is also prefetched into the L2 cache. When there are branches in the code, the code layout can affect its hit rate in the L2 cache. Take the following code as an example, where the function contains three adjacent code blocks A, B, and C. These code blocks will be loaded into the L2 cache in the form of cache lines. The branch condition in the function is likely to be false, which means that the probability of the B code block being executed is much lower than C. int branch_layout() { ... code block A ... if (I()) { // I represents conditional jump instruction ... code block B ... } ... code block C ... } If the B code block is large, a large amount of useless code will be prefetched into the L2 cache each time the function is called. And if the branch prediction is wrong, a large number of useless instructions will also be loaded into the L1 cache. This type of low-probability branch code can be extracted into a non-inlined function. Place it in a separate code block and reduce the low-probability branch to a single function call instruction. Programmers can use two GCC macros to hint to the compiler which code branch has a higher execution probability, and then turn on the -freorder-blocks optimization option during compilation to automatically adjust the code layout. 
#define unlikely(expr) __builtin_expect(!!(expr), 0) #define likely(expr) __builtin_expect(!!(expr), 1) Cache Line Alignment To ensure access efficiency, memory addresses of structures are aligned by default according to the following rules: The starting address of a field must be a multiple of n (n is the byte size of the field type). The starting address of a structure must be a multiple of the size of its widest field. However, for shared data that requires frequent read-write operations, the above alignment method is still not enough. Suppose two threads, A and B, respectively hold counters X and Y, and X and Y happen to be allocated in the same cache line. In this scenario, even if the data modified by these two threads are completely unrelated, they will severely impact each other&#39;s access performance. block-beta columns 3 thread1[&quot;Thread A&quot;] space thread2[&quot;Thread B&quot;] space:3 block:cache1 line11[&quot;Line1&quot;] line12[&quot;Line2&quot;] line13[&quot;Line3&quot;] line14[&quot;Line4&quot;] end space block:cache2 line21[&quot;Line1&quot;] line22[&quot;Line2&quot;] line23[&quot;Line3&quot;] line24[&quot;Line4&quot;] end block:memory:3 columns 4 A[&quot;64 Byte&quot;] B[&quot;...&quot;] block:C space X space Y end D[&quot;...&quot;] end thread1 --&gt; line14 thread2 --&gt; line22 X --&gt; line14 Y --&gt; line22 classDef core0 fill:#e0eafd,stroke:#777; classDef core1 fill:#fee3df,stroke:#777; classDef blk fill:#eee,stroke:#999; class thread1,line14,X core0 class thread2,line22,Y core1 class C blk For such cases, you can force the structure alignment to the length of a cache line through certain means: Ensure that the starting address of the allocated memory block is a multiple of 64. By padding with placeholder fields, ensure that the structure occupies an entire cache line. GCC provides the aligned compilation attribute to control memory alignment: struct strtype variable __attribute__((aligned(64))); struct strtype { ...members... 
} __attribute__((aligned(64))); For managed languages like Java, you would need to manually pad fields or use special annotations to instruct the JVM to align the structure to cache lines. Virtual Memory When a process starts, the operating system allocates a contiguous virtual memory space for it: When users allocate memory using malloc(), a segment of this space is mapped to specific physical memory. Accessing an unassociated virtual address results in a segmentation fault exception. This design significantly enhances machine resource utilization at the expense of some access efficiency: Each process dynamically allocates physical memory during runtime, avoiding space wastage. It allows memory overcommitment through swap operations, enabling more processes to run concurrently on a machine. The size of the virtual space is directly related to the length of virtual addresses: With a 32-bit virtual address length, the addressable virtual memory space is 4GB. With a 64-bit virtual address length, the theoretically addressable virtual memory space is 16EB. The entire virtual address space is divided into kernel space and user space, with user space mainly comprising two parts: Executable File Loading This part of the layout is related to the executable files generated by the compiler, typically consisting of three sections: Code contains machine instructions generated by the compiler, derived from the .text segment of the executable file. Data contains global static variables and their corresponding initial values, derived from the .data and .rodata segments of the executable file. BSS contains global static variables without initialized values, derived from the .bss segment of the executable file. The distinction between Data and BSS primarily aims to reduce the size of the executable file and avoid unnecessary initial zero values. 
To ensure system security, different memory segments correspond to different physical memory pages with different permissions: section execute read write .text ✔ ✔ ✘ .rodata ✘ ✔ ✘ .data ✘ ✔ ✔ .bss ✘ ✔ ✔ Runtime Dynamic Creation Stack The stack is used to store local variables in functions, managed by stack frames: Memory for a stack frame is automatically allocated on the stack when a function is called, or programmers can allocate space on the stack frame using alloca(). When a function returns, the stack frame is destroyed, and its associated memory is reclaimed. Heap The heap is used to store complex data structures that require dynamic allocation, such as linked lists and binary search trees. Programmers allocate and free space on the heap using malloc() and free() and are responsible for managing the memory lifecycle. For flexibility, in most cases, the stack and heap areas of a process are allocated at opposite ends of the address space. This layout allows either side to grow as much as possible. Shared Memory The memory space between them can be used to implement shared memory, facilitating memory sharing with other processes. Leveraging memory sharing, two common functionalities can be achieved: Inter-process communication by mapping virtual addresses of different processes to the same physical memory. By default, a child process&#39;s virtual addresses share the physical memory associated with the parent process, creating a new copy only when the child process modifies data. At this point, have you ever considered the question: Where should JIT compiler-generated code reside? Attempting to write it into the region where .text resides would cause the process to crash. Simply placing it on the heap would also result in a process crash when the instruction flow jumps to that memory page. 
To address this issue, the JIT compiler needs to perform the following operations: Allocate dedicated memory pages directly through system calls, preventing malloc() from handing out memory from those pages. Set the permissions of these memory pages to executable and copy the generated code into them. Memory Paging The smallest unit of memory allocated by the operating system is a page. The CPU maps virtual addresses to physical addresses through the Memory Management Unit (MMU). This mapping process involves a structure called a page table: A Level-1 page table divides the virtual memory address into two parts: index: The page associated with this virtual address offset: The offset of the corresponding physical address within the page The size of a memory page determines the length of the offset part: When the page size is 4MB, the length of the offset part is 22 bits. When the page size is 4KB, the length of the offset part is 12 bits. In a 32-bit system using 4KB memory pages, 20 bits are used to represent the index. If a contiguous pointer array is used as the page table, the page table occupies space of up to $2^{20} \\times 4$ bytes = 4MB. Due to the isolation of virtual memory, the OS maintains separate page tables for each process, which undoubtedly wastes a lot of memory space. To save space, multi-level page tables are usually used to maintain mapping relationships and achieve on-demand space allocation. Contiguous virtual addresses can share high-level page table space, reducing unnecessary memory allocation. To reduce the memory space occupied by page tables, memory should be allocated on contiguous virtual addresses as much as possible. However, this multi-level random access method performs poorly in terms of addressing efficiency. 
To improve addressing efficiency, modern CPUs introduce a hardware cache called the Translation Look-Aside Buffer (TLB), which caches the mapping relationship between index and page, avoiding the performance overhead of multiple jumps: When a lookup hits, the physical address is calculated based on the page and offset. When a lookup misses, the result is queried in the main memory&#39;s multi-level page table and cached in the TLB. The hit rate of the TLB directly affects memory access performance, and optimization methods can be divided into two categories: Reduce Context Switching Processes running on the same CPU core share TLB resources. However, because the virtual addresses between different processes are isolated, the same index points to different pages. Every time a context switch occurs, the TLB cache records need to be cleared. To improve cache hit rates, some CPUs add tag bits to the cache index at the hardware level to distinguish between different virtual address spaces: Distinguish between kernel space and user space, eliminating the need to clear the TLB on system calls. Distinguish between host space and virtual machine space, improving virtual machine execution efficiency. Distinguish between different process address spaces, eliminating the need to clear the TLB on process switches. Allocate Memory on Contiguous Virtual Addresses By allocating memory on contiguous addresses, unnecessary page table allocation can be avoided, reducing the memory space occupied by page tables. This also ensures that hot memory pages reside in the TLB, reducing page table lookups. Multithreading Optimization Threads from the same process share the same address space. When threads do not share any memory data, each CPU core has independent cache lines in its L1. When threads concurrently modify the cache, there is no need to send RFO requests, and there is no additional communication overhead. 
When multiple threads access the same memory location, the L1s of different CPU cores need to communicate with each other to ensure cache coherence. When threads concurrently modify the cache, a large number of RFO requests are sent, meaning that only one CPU can operate on its cache line copy at a time. Any other CPU cores will be delayed and unable to do anything. The more concurrent threads there are, the greater the synchronization overhead, and each additional processor will only bring more delay. To avoid this problem, read-write separation is required for data shared between multiple threads: group constants or variables that are only initialized once together. When this data is loaded into the cache, the cache line in which it resides will be in the S state for a long time and will not be affected by modification operations. One way to do this is to mark global static variables with the const keyword. When GCC compiles to generate an executable file: Variables marked with const will be placed in the .rodata section Variables without const will be placed in the .data section At runtime, variables in the same section are loaded into adjacent contiguous memory segments. Memory segments of different sections are independent of each other and do not affect each other. When the const keyword is not available, custom sections can be defined using compiler attributes: int foo = 1; int bar __attribute__((section(&quot;.data.ro&quot;))) = 2; int baz = 3; int xyzzy __attribute__((section(&quot;.data.ro&quot;))) = 4; NUMA Hypercube In a traditional Uniform Memory Access (UMA) system, all CPUs access memory through a unified Front Side Bus (FSB). The FSB connects the CPU to the Northbridge chipset, which then connects to the memory controller, and all memory accesses are made through this bus. This uniform design ensures that CPU access times to all memory locations are the same. 
In contrast, in a Non-Uniform Memory Access (NUMA) system, the memory control functionality of the Northbridge chipset is integrated into the CPU. To address the memory access latency caused by the FSB, NUMA divides memory into multiple nodes, each typically containing a group of CPU cores and closely associated memory. NUMA is an excellent example of applying divide-and-conquer techniques to solve complex problems. In this architecture, CPUs can directly communicate with local memory within the same node without going through a unified memory bus. Because the bus bandwidth is no longer a system bottleneck, this architecture has greater horizontal scalability, making it particularly suitable for large-scale commercial hardware. To accurately describe the communication topology between CPUs and memory, a structure called a hypercube is introduced: Each node contains a group of processor cores and their corresponding local memory. These nodes are interconnected through a communication network, forming a highly parallel system. With $C$ interconnections per node, the hypercube can accommodate up to $2^C$ interconnected nodes, with the maximum distance between any two nodes being $C$. This topology offers several advantages: Locality: Binding cores to local memory improves locality and reduces remote memory access latency. Scalability: Easy horizontal scaling avoids bottlenecks, enhancing overall system performance. Fault Tolerance: Distributed symmetry facilitates redundancy and backup, improving system reliability. OS Optimization NUMA introduces non-uniform memory access delays, depending on the memory location and the executing CPU core. Each core accesses its local node memory faster, while remote node access is slower. 
To manage this complexity, the OS requires NUMA awareness to optimize memory allocation and data layout, minimizing cross-node access latency. Shared Memory Management Shared libraries like libc.so typically reside in a specific set of physical memory pages. This implies remote access for most processors. A NUMA-friendly OS maintains independent libc.so copies per node, avoiding frequent remote access. Thread Scheduling During thread scheduling, ensure threads and their frequently accessed memory reside on the same NUMA node. Avoid frequent cross-node thread migrations. Consider memory, not just CPU load, during thread scheduling. Distribute memory-intensive threads across nodes to prevent memory exhaustion on specific nodes. Programming Optimization NUMA programming optimization techniques align with those for UMA: Large Contiguous Memory Access: Utilize local caches to mitigate remote access overhead by accessing large contiguous memory blocks. Pre-configure CPU Core and Thread Affinity: Avoid cross-NUMA node thread scheduling by pre-configuring CPU core and thread affinity. "},{"slug":"smid-varint-with-ffm-api","title":"Call SIMD Native Functions with FFM API","tags":["BackendDev","NumericalEncoding"],"content":"The recently released JDK 22 includes a stable version of the Foreign Function &amp; Memory (FFM) API, which is officially claimed to provide better performance than JNI. It just so happens that I have been thinking about using SIMD to optimize encoding algorithms recently, so I am ready to combine these two technologies to optimize VarInt encoding. SIMD Introduction Multithreading and concurrency can effectively improve application throughput, but in essence, each thread is only executing a serial stream of instructions. When there are too many threads, frequent context switching will consume hardware resources, which will ultimately lead to performance degradation instead of improvement. 
SIMD is an optimization method for parallel computing implemented at the hardware level: after putting 4 32bit values into a 128bit register, 4 calculations can be performed with one instruction. In addition to the need to transfer data between registers and memory, SIMD itself does not introduce any performance loss, and it is a reliable optimization method. block-beta columns 14 block:32bit[&quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;＋&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;4 x 32bit&quot;]:4 columns 4 1 2 3 4 space:4 5 6 7 8 end to1&lt;[&quot; &quot;]&gt;(right) block:128bit[&quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;＋&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;128bit&quot;]:4 columns 1 a[&quot;1&amp;emsp;&amp;emsp;2&amp;emsp;&amp;emsp;3&amp;emsp;&amp;emsp;4&quot;] space b[&quot;5&amp;emsp;&amp;emsp;6&amp;emsp;&amp;emsp;7&amp;emsp;&amp;emsp;8&quot;] end to2&lt;[&quot; &quot;]&gt;(right) block:sum[&quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;128bit&quot;]:4 columns 1 space c[&quot;6&amp;emsp;&amp;emsp;8&amp;emsp;&amp;emsp;10&amp;emsp;&amp;emsp;12&quot;] space end classDef empty color:#777,stroke-width:0px,fill:none class 32bit,128bit,sum empty SIMD can not only improve the CPU computing power per unit time, but also improve the hit rate of the L1 cache due to the reduction in the number of instructions and the batch loading of data, which further improves the computing efficiency. Instruction Set Some methods in JDK with the @IntrinsicCandidate annotation will be replaced with specific assembly implementations inside the JVM. For example, Long.numberOfLeadingZeros and Long.numberOfTrailingZeros may be replaced with lzcnt and tzcnt instructions. Whether or not the replacement can be done depends on whether the CPU on which the JVM is running supports these instructions. In addition to some special instructions (e.g. 
popcnt), most instructions exist in the form of instruction sets, and you can check the instruction sets supported by the CPU through lscpu: $ lscpu | grep Flags Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities This VarInt optimization mainly involves two instruction sets: BMI (Bit Manipulation Instructions) _pdep_u64: Deposits the bits of a 64-bit value into positions selected by a mask (used here to split data into groups of 7 bits) SSE (Streaming SIMD Extensions) _mm_set_epi8 : Loads 16 8-bit values into a 128-bit register _mm_cmpgt_epi8 : Compares data in registers in groups of 8 bits _mm_movemask_epi8 : Collects the most significant bit of each 8-bit group into an int (only the lowest 16 bits are valid) _mm_bsrli_si128 : Shifts the 128-bit register right by whole bytes Code Implementation Different compilers support assembly instructions in different ways. To ensure both the portability of SIMD code and coding efficiency, modern compilers provide intrinsic functions. Each SIMD instruction has a corresponding intrinsic function, which allows developers to use SIMD instructions from high-level languages such as C/C++. 
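As a small aside before diving into native code: the intrinsified JDK methods mentioned above already expose some of this bit-level power to plain Java. For instance, the number of bytes a VarInt will occupy can be derived from Long.numberOfLeadingZeros alone. A minimal sketch under that assumption (the class and helper names are mine, not from the article's code):

```java
public class VarIntLength {
    // One byte per started group of 7 significant bits.
    // Long.numberOfLeadingZeros is an @IntrinsicCandidate and is
    // typically compiled down to a single lzcnt instruction.
    static int varintLength(long value) {
        // `| 1` keeps value == 0 from yielding 64 leading zeros
        int significantBits = 64 - Long.numberOfLeadingZeros(value | 1);
        return (significantBits + 6) / 7;
    }

    public static void main(String[] args) {
        System.out.println(varintLength(127L));  // 1
        System.out.println(varintLength(128L));  // 2
        System.out.println(varintLength(-1L));   // 10: all 64 bits set
    }
}
```

This is no substitute for a SIMD encoder, but it is a handy baseline when native calls are not an option.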
The code below is likewise implemented with intrinsic functions and is adapted from varint-simd. If you are interested in the specific instructions, please check it out here. #include &lt;stdio.h&gt; #include &lt;immintrin.h&gt; #include &lt;x86intrin.h&gt; typedef unsigned char u8; typedef unsigned short u16; typedef unsigned long u64; void print_out(u8 val[]) { printf(&quot;out: %x %x %x %x %x %x %x %x %x %x %x %x \\n&quot;, val[0], val[1], val[2], val[3], val[4], val[5], val[6], val[7], val[8], val[9], val[10], val[11]); } int encode_varint64(u64 val, u8 out[]) { // Break the number into 7-bit parts and spread them out into a vector __m128i stage1 = _mm_set_epi64x( _pdep_u64(val &gt;&gt; 56, 0x000000000000017f), _pdep_u64(val, 0x7f7f7f7f7f7f7f7f)); // Create a mask for where there exist values // This signed comparison works because all MSBs should be cleared at this point // Also handle the special case when num == 0 __m128i minimum = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xFF); __m128i exists = _mm_or_si128(_mm_cmpgt_epi8(stage1, _mm_setzero_si128()), minimum); // Count the number of bytes used int bits = _mm_movemask_epi8(exists); u8 bytes = 1 + _bit_scan_reverse(bits); // Fill that many bytes into a vector __m128i ascend = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15); __m128i mask = _mm_cmplt_epi8(ascend, _mm_set1_epi8(bytes)); // Shift it down 1 byte so the last MSB is the only one set, and make sure only the MSB is set __m128i shift = _mm_bsrli_si128(mask, 1); __m128i msbmask = _mm_and_si128(shift, _mm_set1_epi8((u8)0x80)); // Merge the MSB bits into the vector __m128i merged = _mm_or_si128(stage1, msbmask); _mm_storeu_si128((__m128i*) out, merged); return bytes; } int main(int argc, char** argv) { u8 out[16]; encode_varint64(127, out); print_out(out); encode_varint64(128, out); print_out(out); encode_varint64(16383, out); print_out(out); encode_varint64(16384, out); print_out(out); encode_varint64(~0, out); 
print_out(out); return 0; } Execute gcc -mbmi2 -o test test.c &amp;&amp; ./test to run the code and you will get the following results: out: 7f 0 0 0 0 0 0 0 0 0 0 0 out: 80 1 0 0 0 0 0 0 0 0 0 0 out: ff 7f 0 0 0 0 0 0 0 0 0 0 out: 80 80 1 0 0 0 0 0 0 0 0 0 out: ff ff ff ff ff ff ff ff ff 1 0 0 FFM Wrapper Introduction The Foreign Function &amp; Memory API (FFM), developed under Project Panama, has been officially released as a standard feature in JDK 22. Panama aims to establish a unified and efficient calling mechanism between the JVM and native code; it also includes the Vector API proposal for encapsulating SIMD instructions. FFM provides two important APIs: Foreign-Memory Access API: An off-heap memory management mechanism based on Arena and MemorySegment. Foreign Linker API: A dynamic library calling mechanism based on Linker and MethodHandle. The Foreign-Memory Access API is intended to replace Unsafe and become the standard off-heap memory manager in the future. Due to safety considerations, it includes a large number of boundary and lifecycle checks, which may result in a performance decrease compared to direct use of Unsafe. Unfortunately, in order to promote the new API, a draft to deprecate Unsafe has already been proposed. The Foreign Linker API can be seen as the successor of JNI (Java Native Interface), but with a significant leap in ease of use. Users can directly access existing dynamic libraries without writing any C/C++ code: public class FFMTestStrlen { public static void main(String[] args) throws Throwable { // 1. Get a linker – the central element for accessing foreign functions Linker linker = Linker.nativeLinker(); // 2. Get a lookup object for commonly used libraries SymbolLookup stdlib = linker.defaultLookup(); // 3. Get the address of the &quot;strlen&quot; function in the C standard library MemorySegment strlenAddress = stdlib.find(&quot;strlen&quot;).orElseThrow(); // 4. 
Define the input and output parameters of the &quot;strlen&quot; function FunctionDescriptor descriptor = FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS); // 5. Get a handle to the &quot;strlen&quot; function MethodHandle strlen = linker.downcallHandle(strlenAddress, descriptor); // 6. Get a confined memory area (one that we can close explicitly) try (Arena offHeap = Arena.ofConfined()) { // 7. Convert the Java String to a C string and store it in off-heap memory MemorySegment str = offHeap.allocateFrom(&quot;Happy Coding!&quot;); // 8. Invoke the &quot;strlen&quot; function long len = (long) strlen.invoke(str); System.out.println(&quot;len = &quot; + len); } // 9. Off-heap memory is deallocated at end of try-with-resources } } If you don’t want to manually maintain a large amount of template code, you can try using the code generation tool jextract provided by FFM. Implementation Steps First, write C code and generate a dynamic link library. This step is very simple and does not involve any Java-related dependencies. 
Compilation gcc -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -O3 -mbmi2 varint64.c -shared -o /tmp/varint64.so -fPIC Code #include &lt;immintrin.h&gt; #include &lt;x86intrin.h&gt; extern int encodeLong(unsigned long value, unsigned char out[]) { // Break the number into 7-bit parts and spread them out into a vector unsigned long a = _pdep_u64(value, 0x7f7f7f7f7f7f7f7f); unsigned long b = _pdep_u64(value &gt;&gt; 56, 0x000000000000017f); __m128i stage1 = _mm_set_epi64x( _pdep_u64(value &gt;&gt; 56, 0x000000000000017f), _pdep_u64(value, 0x7f7f7f7f7f7f7f7f)); // Create a mask for where there exist values // This signed comparison works because all MSBs should be cleared at this point // Also handle the special case when num == 0 __m128i minimum = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xFF); __m128i exists = _mm_or_si128(_mm_cmpgt_epi8(stage1, _mm_setzero_si128()), minimum); // Count the number of bytes used int bits = _mm_movemask_epi8(exists); unsigned char bytes = 1 + _bit_scan_reverse(bits); // Fill that many bytes into a vector __m128i ascend = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15); __m128i mask = _mm_cmplt_epi8(ascend, _mm_set1_epi8(bytes)); // Shift it down 1 byte so the last MSB is the only one set, and make sure only the MSB is set __m128i shift = _mm_bsrli_si128(mask, 1); __m128i msbmask = _mm_and_si128(shift, _mm_set1_epi8((unsigned char)0x80)); // Merge the MSB bits into the vector __m128i merged = _mm_or_si128(stage1, msbmask); _mm_storeu_si128((__m128i*) out, merged); return bytes; } Then use FFM to write glue code as the calling stub of the link library. 
public class VarIntFFM { static final Arena LIBRARY_ARENA; static final SymbolLookup SYMBOL_LOOKUP; public static final FunctionDescriptor DESC = FunctionDescriptor.of( ValueLayout.JAVA_INT, ValueLayout.JAVA_LONG, ValueLayout.ADDRESS.withTargetLayout(MemoryLayout.sequenceLayout(Long.MAX_VALUE, JAVA_BYTE)) ); public static final MemorySegment ADDR; public static final MethodHandle HANDLE; static { try { LIBRARY_ARENA = Arena.ofAuto(); SYMBOL_LOOKUP = SymbolLookup.libraryLookup(Paths.get(&quot;/tmp/varint64.so&quot;), LIBRARY_ARENA) .or(SymbolLookup.loaderLookup()) .or(Linker.nativeLinker().defaultLookup()); ADDR = SYMBOL_LOOKUP.find(&quot;encodeLong&quot;).orElseThrow(); HANDLE = Linker.nativeLinker().downcallHandle(ADDR, DESC); } catch (Throwable e) { e.printStackTrace(); throw new RuntimeException(e); } } public static int encodeLong(long value, MemorySegment out) throws Throwable { return (int) HANDLE.invokeExact(value, out); } } Benchmark Finally, benchmark with JMH by simulating the following scenario: Call the native function, which writes the encoded data into the intermediate buffer buf, then copy the data from buf into the result array that will be returned to the user. 
public class VarIntPerf { @Benchmark @Warmup(time = 3, iterations = 3) @Measurement(time = 5, iterations = 3) public void testFFM() throws Throwable { byte[] result = new byte[10]; try (Arena offHeap = Arena.ofConfined()) { MemorySegment buf = offHeap.allocate(16, 16); for (int i=0; i&lt;1000; i++) { int n = VarIntFFM.encodeLong(i, buf); Unsafe.COPY_MEMORY.copyMemory(null, buf.address(), result, CodecSlice.BYTES_OFFSET, n); } } } @Benchmark @Warmup(time = 3, iterations = 3) @Measurement(time = 5, iterations = 3) public void testPlain() { byte[] result = new byte[10]; byte[] buf = new byte[10]; for (int i=0; i&lt;1000; i++) { int n = VarInt.encodeLong(i, buf, 0); System.arraycopy(buf, 0, result, 0, n); } } public static void main(String[] args) throws RunnerException { Options opt = new OptionsBuilder() .include(VarIntPerf.class.getSimpleName()) .forks(1) .build(); new Runner(opt).run(); } } Here we use maven-shade-plugin to generate benchmarks.jar : &lt;plugin&gt; &lt;groupId&gt;org.apache.maven.plugins&lt;/groupId&gt; &lt;artifactId&gt;maven-shade-plugin&lt;/artifactId&gt; &lt;version&gt;3.2.0&lt;/version&gt; &lt;executions&gt; &lt;execution&gt; &lt;phase&gt;package&lt;/phase&gt; &lt;goals&gt; &lt;goal&gt;shade&lt;/goal&gt; &lt;/goals&gt; &lt;configuration&gt; &lt;finalName&gt;benchmarks&lt;/finalName&gt; &lt;transformers&gt; &lt;transformer implementation=&quot;org.apache.maven.plugins.shade.resource.ManifestResourceTransformer&quot;&gt; &lt;mainClass&gt;org.openjdk.jmh.Main&lt;/mainClass&gt; &lt;/transformer&gt; &lt;/transformers&gt; &lt;/configuration&gt; &lt;/execution&gt; &lt;/executions&gt; &lt;/plugin&gt; Finally, execute the following command to start benchmarking: mvn clean package java -jar target/benchmarks.jar VarIntPerf Here are the results: Benchmark Mode Cnt Score Error Units VarIntPerf.testFFM thrpt 15 47364.288 ± 1190.314 ops/s VarIntPerf.testPlain thrpt 15 109338.097 ± 973.650 ops/s I have to say this result is quite disappointing. 
The overhead of FFM calls into native code is still considerable, and its context-switching overhead is still not of the same order of magnitude as JVM-internal calls. JNI Wrapper Next we will implement the same native function with JNI and use it as a performance comparison. Implementation Steps First, create a Java class as the call stub. public abstract class VarIntJNI { static { System.load(&quot;/tmp/varintjni.so&quot;); } public static native int encodeLong(long value, long pointer); } Then execute javac -h . VarIntJNI.java to generate the JNI header file. /* DO NOT EDIT THIS FILE - it is machine generated */ #include &lt;jni.h&gt; /* Header for class VarIntJNI */ #ifndef _Included_VarIntJNI #define _Included_VarIntJNI #ifdef __cplusplus extern &quot;C&quot; { #endif /* * Class: VarIntJNI * Method: encodeLong * Signature: (JJ)I */ JNIEXPORT jint JNICALL Java_VarIntJNI_encodeLong (JNIEnv *, jclass, jlong, jlong); #ifdef __cplusplus } #endif #endif Finally, compile the JNI version of the shared library. 
Compilation gcc -I $JAVA_HOME/include -I $JAVA_HOME/include/linux/ -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -O3 -mbmi2 varintjni.c -shared -o /tmp/varintjni.so -fPIC Code #include &lt;immintrin.h&gt; #include &lt;x86intrin.h&gt; #include &quot;varintjni.h&quot; JNIEXPORT jint JNICALL Java_VarIntJNI_encodeLong (JNIEnv * env, jclass obj, jlong value, jlong out) { // Break the number into 7-bit parts and spread them out into a vector unsigned long a = _pdep_u64(value, 0x7f7f7f7f7f7f7f7f); unsigned long b = _pdep_u64(value &gt;&gt; 56, 0x000000000000017f); __m128i stage1 = _mm_set_epi64x( _pdep_u64(value &gt;&gt; 56, 0x000000000000017f), _pdep_u64(value, 0x7f7f7f7f7f7f7f7f)); // Create a mask for where there exist values // This signed comparison works because all MSBs should be cleared at this point // Also handle the special case when num == 0 __m128i minimum = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xFF); __m128i exists = _mm_or_si128(_mm_cmpgt_epi8(stage1, _mm_setzero_si128()), minimum); // Count the number of bytes used int bits = _mm_movemask_epi8(exists); unsigned char bytes = 1 + _bit_scan_reverse(bits); // Fill that many bytes into a vector __m128i ascend = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15); __m128i mask = _mm_cmplt_epi8(ascend, _mm_set1_epi8(bytes)); // Shift it down 1 byte so the last MSB is the only one set, and make sure only the MSB is set __m128i shift = _mm_bsrli_si128(mask, 1); __m128i msbmask = _mm_and_si128(shift, _mm_set1_epi8((unsigned char)0x80)); // Merge the MSB bits into the vector __m128i merged = _mm_or_si128(stage1, msbmask); _mm_storeu_si128((__m128i*) out, merged); return bytes; } Benchmark To ensure fairness, we&#39;ll directly use Unsafe to allocate off-heap memory for communication, avoiding potential memory copying caused by JNI. 
public class VarIntPerf { static final sun.misc.Unsafe unsafe = (sun.misc.Unsafe) Unsafe.UNSAFE; @Benchmark @Warmup(time = 3, iterations = 3) @Measurement(time = 3, iterations = 3) public void testJNI() throws Throwable { byte[] result = new byte[16]; long address = unsafe.allocateMemory(16); for (int i=0; i&lt;1000; i++) { int n = VarIntJNI.encodeLong(i, address); Unsafe.COPY_MEMORY.copyMemory(null, address, result, CodecSlice.BYTES_OFFSET, n); } unsafe.freeMemory(address); } //... public static void main(String[] args) throws RunnerException { Options opt = new OptionsBuilder() .include(VarIntPerf.class.getSimpleName()) .forks(1) .build(); new Runner(opt).run(); } } Here are the results: Benchmark Mode Cnt Score Error Units VarIntPerf.testJNI thrpt 15 46634.574 ± 904.912 ops/s VarIntPerf.testFFM thrpt 15 44151.628 ± 815.886 ops/s VarIntPerf.testPlain thrpt 15 109448.067 ± 1382.802 ops/s While the performance of the JNI wrapper is still not ideal, it has a slight advantage over the FFM wrapper, which is consistent with the official documentation&#39;s expectations. The call paths for FFM and JNI can be divided into two categories: upcall: Calling Java code from native methods downcall: Calling native methods from Java code FFM mainly optimizes the upcall path, while the optimization for downcall is not very significant. This test only exercises the downcall path, so it cannot fully demonstrate FFM&#39;s advantage. Conclusion With the advent of FFM, accessing native code from the JVM is becoming increasingly easy. However, the context-switching overhead of FFM is still relatively high, so it is not suitable for high-frequency calls. There are two possible optimization methods: Process as much data as possible in a single call. But this means more memory allocation and data copying overhead. Implement as much functionality as possible in native code. But this means giving up the memory safety and ease of debugging provided by the JVM. 
SIMD is a means of eliminating computing hotspots, so it is typically only worthwhile in hotspot code: exactly the kind of high-frequency call path that FFM and JNI handle poorly. If you want to experience SIMD acceleration in a JVM language, it seems you can only pin your hopes on the Vector API, which is not yet mature. "},{"slug":"variable-length-numeric-compression","title":"Variable-Length Numerical Compression Encoding","tags":["BackendDev","NumericalEncoding"],"content":"In the vast world of compression encoding, a specific category exists for numerical compression. These encodings do not require constructing a data dictionary from sample data; instead, they can efficiently compress real-time time-series data without loss. They have shone brightly in the field of time-series data processing, saving users a significant amount of storage space and transmission bandwidth while enhancing the capacity to process large amounts of data. This article introduces some of the most commonly used encoding methods among them. Basic Concepts In the modern internet era, we constantly benefit from the convenience brought by compression encodings. According to whether precision is lost after decompression, compression algorithms can be divided into two categories: Lossy Compression: Reduces the volume of image and audio files by eliminating details through signal processing techniques such as filtering and transformations. Lossless Compression: Eliminates redundancy by scanning the data to construct a dictionary, then replacing the original data with dictionary indices. However, these two methods are not optimal solutions for time-series data. 
Time-series data possesses the following characteristics: High redundancy Small sample sizes High real-time requirements Such data is ubiquitous in our surroundings, such as the stock order book data shown in the image below: If conventional lossless compression algorithms are applied to this data, not only will compression efficiency not be achieved, but the introduction of compression dictionaries may also lead to data inflation. Special data requires a special compression method — variable-length numeric encoding. Variable-length numeric encoding is a family of encoding formats specifically designed for small-range numerical values. Due to its high compression efficiency and simple implementation, it has been widely used in search engines and columnar storage: Protobuf achieves efficient binary serialization through variable-length integer encoding. The time-series database InfluxDB utilizes variable-length floating-point encoding to achieve efficient data storage. Next, we will review two categories of commonly used encoding formats: Integer Encoding Varint: Unsigned small integers ZigZag: Signed small integers Simple8b: Repeated small integers Floating-Point Encoding Gorilla: Floating-point numbers close in value Chimp: Floating-point numbers with periodic regularity Integer Encoding Space Waste Commonly used integer data types can be divided into: int8 / byte int16 / short int32 / int int64 / long The larger the space occupied, the wider the range that can be expressed, but also the higher the storage overhead. Taking the volume in the order book data as an example: In popular stocks, there are many active retail investors, so the volume is usually small, requiring only int16 for storage. In less popular stocks, there are fewer active investors, and the volume may grow large enough to require int32 for storage. However, in practical applications, it is not feasible to model data for specific stocks. 
To avoid overflow, int32 must be selected to represent volume, which undoubtedly consumes more storage space and network bandwidth. Varint To find a more versatile representation, VarInt encoding emerged. For int32 data, its encoding rules are as follows: Divide the 32-bit data into n groups of 7 bits. Each group is stored in one byte: the first bit is a flag bit indicating whether more bytes follow, and the remaining 7 bits are data bits storing the grouped data. 12 → 1100 → 00001100 (1 byte) 289 → 1|00100001 → 00000010|10100001 (2 byte) 65990 → 1|00000001|11000110 → 00000100|10000011|11000110 (3 byte) Through this encoding rule, VarInt can convert an int32 into a byte array of length 1 to 5. Depending on the number of occupied bytes, the data range that the encoding can express is: 1 byte → 0 ~ 127 2 byte → 0 ~ 16383 3 byte → 0 ~ 2097151 4 byte → 0 ~ 268435455 5 byte → 0 ~ 4294967295 Obviously, when unsigned small integers occur frequently in the application scenario, VarInt can achieve decent compression, and even occasional large values will not cause overflow. This ensures storage and transmission efficiency while improving the generality of the data model. ZigZag However, VarInt also has a significant drawback: it is not friendly to negative numbers. Viewed in big-endian form, VarInt achieves compression by trimming leading zeroes. However, the first bit of a negative number is always nonzero, making the data incompressible and introducing unnecessary space overhead. Taking -1 as an example, its int32 two&#39;s-complement bit pattern corresponds to the unsigned integer 4294967295, which means it needs 5 bytes to encode. For negative integers, a better solution is ZigZag. ZigZag encoding solves this problem by adjusting the two&#39;s complement before VarInt encoding to maximize its leading zeroes. 
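Before moving on to ZigZag, the VarInt rules above can be sketched in a few lines of Java. The sketch emits the low-order group first, Protobuf-style; the class and method names are my own illustration:

```java
public class VarIntSketch {
    // Encode an unsigned 32-bit value; returns the number of bytes written.
    static int encode(int value, byte[] out) {
        int pos = 0;
        // While more than 7 significant bits remain, emit 7 data bits
        // with the continuation flag (0x80) set.
        while ((value & ~0x7F) != 0) {
            out[pos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out[pos++] = (byte) value; // final group, flag bit clear
        return pos;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[5];
        System.out.println(encode(12, buf));    // 1 byte:  0x0C
        System.out.println(encode(289, buf));   // 2 bytes: 0xA1 0x02
        System.out.println(encode(65990, buf)); // 3 bytes: 0xC6 0x83 0x04
    }
}
```

The byte sequences match the worked examples above, just written lowest group first.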
The implementation is not complicated; it only requires two mapping operations before and after VarInt encoding: Mapping before VarInt encoding: (n &lt;&lt; 1) ^ (n &gt;&gt; 31) Mapping after VarInt decoding: (n &gt;&gt;&gt; 1) ^ -(n &amp; 1) Let&#39;s get an intuitive feel for these two operations using -1 as an example: When n = -1, mapping before VarInt encoding: n = -1 -&gt; 11111111111111111111111111111111 a = n &lt;&lt; 1 -&gt; 11111111111111111111111111111110 b = n &gt;&gt; 31 -&gt; 11111111111111111111111111111111 a ^ b -&gt; 00000000000000000000000000000001 When n = -1, mapping after VarInt decoding: m = a ^ b -&gt; 00000000000000000000000000000001 a = m &gt;&gt;&gt; 1 -&gt; 00000000000000000000000000000000 b = -(m &amp; 1) -&gt; 11111111111111111111111111111111 a ^ b -&gt; 11111111111111111111111111111111 ZigZag mapping effectively increases the number of leading zeroes for small negative numbers, thereby improving compression efficiency. However, after adding the ZigZag mapping, the data range that the encoding can express becomes: 1 byte → -64 ~ 63 2 byte → -8192 ~ 8191 3 byte → -1048576 ~ 1048575 4 byte → -134217728 ~ 134217727 5 byte → -2147483648 ~ 2147483647 We can see that ZigZag encoding makes a trade-off: by giving up part of the encoding space for non-negative numbers, it greatly improves the compression of small negative values. Simple8b The two encoding methods above still share a limitation: each value occupies at least one whole byte. To overcome this limitation, a family of word-packing encodings emerged: they pack multiple integers according to certain rules into a more compact form to save storage space. One of these encodings is Simple8b, which divides 64 bits into two parts: selector(4bit) Specifies the number of integers stored in the remaining 60 bits and the length of their effective bits. payload(60bit) Stores multiple fixed-length integers. 
The following encoding table shows how Simple8b stores data under different selector values: selector value 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 integers coded 240 120 60 30 20 15 12 10 8 7 6 5 4 3 2 1 bits per integer 0 0 1 2 3 4 5 6 7 8 10 12 15 20 30 60 Simple8b&#39;s decoding method is simple and efficient, requiring only bitwise operations according to the table. However, its encoding process is relatively complex, traditionally implemented with backtracking, resulting in high time complexity. There is a more efficient implementation approach: enumerate all possible combinations in a 60 x 240 matrix, prune the impossible states, and finally obtain a one-dimensional lookup table with a length of 261. The table-lookup method reduces the encoding complexity from $O(N^2)$ to $O(N)$, achieving about 5 times the throughput improvement in our application. Floating-Point Encoding Gorilla Now, let&#39;s talk about encoding methods for floating-point numbers. Taking IEEE 754 floating-point numbers as an example, their storage structure is mainly divided into 3 parts: $S$: Sign bit $E$: Exponent biased with base 2 $F$: Significant digits of the number The corresponding floating-point number can be represented as: $(-1)^S \\times 2^{(E-127)} \\times 1.F$ Previously introduced encodings like VarInt perform well only when the number has enough leading zeroes. To ensure the compression ratio, both the sign and the exponent must be 0. However, that range can only represent floats in [0, 1), which is applicable to only a few scenarios. Since individual data points aren&#39;t easy to compress, what about a batch of data points? Engineers at Facebook delved into their time-series dataset and found a large number of completely identical binary bits in the floating-point data sequence. 
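That observation is easy to reproduce: XOR the IEEE 754 bit patterns of two adjacent, similar samples and almost every bit cancels to zero. A small sketch (the sample values 15.5 and 15.625 are my own illustration):

```java
public class XorBits {
    public static void main(String[] args) {
        long prev = Double.doubleToLongBits(15.5);
        long curr = Double.doubleToLongBits(15.625);
        long xor  = prev ^ curr;
        // Only a single bit differs; the long runs of zeros on both
        // sides are what Gorilla-style encodings can elide.
        System.out.println(Long.bitCount(xor));              // differing bits
        System.out.println(Long.numberOfLeadingZeros(xor));  // leading zeros
        System.out.println(Long.numberOfTrailingZeros(xor)); // trailing zeros
    }
}
```

Storing only the short meaningful block in the middle, plus its position, is the core trick of the algorithm described next.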
Inspired by the Delta2 encoding, they invented the Gorilla algorithm, which extracts the differences between adjacent data points through XOR operations, eliminating a large number of redundant bits. The effect of this algorithm is significant, compressing 2 hours of time-series data to only 1.37 bytes per data point. The entire encoding process is quite simple: Calculate the XOR value xor of two adjacent data points, and obtain its number of leading zeroes leading-zero and the length of its meaningful block block-size. If the leading-zero and block-size of two consecutive xor values differ, record these two values; if they are the same, do not record them. Finally, record the differing blocks of data. In our application, this algorithm compresses data to 33% of the original size, a compression rate exceeding 60%. Chimp Gorilla encoding is good, but it still has some shortcomings: Using a fixed 5-bit field to record leading zeroes may waste space. Considering only adjacent data points, it cannot recognize the periodicity in time-series data, resulting in suboptimal compression in specific scenarios. To address these issues, an algorithm named Chimp emerged. Its main improvements are twofold: Using a mapping function plus a 1-bit flag, it reduces the leading-zero encoding to 3 bits. It only records the leading-zero count when it differs from the previous one. To enhance the algorithm&#39;s adaptability, the Chimp research team also provided a variant called ChimpN. This variant adds a sliding window of length N on top of the original algorithm. When calculating XOR values, it can select from the window the reference value with the highest similarity in trailing zeroes, further improving compression efficiency. More... 
With the widespread adoption of SIMD instructions, a batch of SIMD-accelerated encoding algorithms has emerged, such as the integer encoding SIMD-FastPFOR and the floating-point encoding ALP. Although these algorithms have achieved good results in their respective fields, their performance advantages are not very significant in scenarios with small sample sizes. Moreover, SIMD&#39;s requirement for contiguous memory blocks makes these implementations difficult to combine with Java&#39;s Stream API. Therefore, we have implemented several commonly used algorithms in Java based on our own application requirements and open-sourced them here. We would be delighted if you gave them a try and shared your feedback. "},{"slug":"file-versioning-with-minio-and-mysql","title":"Multi-Version File Management System Based on Minio and MySQL","tags":["SystemDesign","BackendDev"],"content":"Recently, I was developing a simple multi-version file management system. The project was eventually abandoned, so I decided to share how the system was designed, hoping to help others in need. This article mainly introduces how to implement multi-version file storage based on Minio and MySQL. Basic Concepts OSS Object Storage Service (OSS) is a type of file storage service provided by cloud providers for storing large amounts of unstructured data, such as images, videos, and compressed files. 
Common OSS application scenarios include: Static website hosting Content Delivery Networks (CDNs) Data lake storage layers Data backup Minio is a popular open-source object storage service that is compatible with Amazon S3 and provides a variety of features, including: Fully compatible with S3 interface standards Supports multiple private deployment options (IDC, VPC, K8s) High reliability (multiple copies, erasure coding) High security (access control, object encryption) Rich features (event monitoring, space quotas, multi-version control, hot-cold data separation) In my case, the only downsides of Minio are its operational complexity and steep learning curve. CTE Due to the limitations of the data model, an RDBMS is not good at retrieving tree-structured data. When you have to do this, there are usually two solutions: Initiate multiple SQL queries, each querying only one layer of the tree. Use a varchar-type field to store the path and perform prefix matching when querying. However, the former requires multiple round trips and performs poorly, while the latter makes consistency difficult to maintain: updating a parent node&#39;s path affects all of its child nodes. Luckily, we now have CTEs (Common Table Expressions), a SQL feature that lets you define temporary named result sets. WITH cte_name (column1, column2, ...) AS ( -- CTE query definition SELECT column1, column2, ... FROM your_table WHERE conditions ) -- Main query that can reference the CTE SELECT * FROM cte_name; Advantages of CTEs: Good SQL readability, easy to maintain Named result sets in a CTE can be referenced repeatedly Easier to generate higher-performance query plans Support for recursive queries MySQL 8 introduced support for CTEs, including recursive CTEs, which greatly simplify SQL queries over tree-structured data. 
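As an illustration of the recursive form, the query below fetches an entire subtree in a single statement, computing each node's full path on the way down. The node table and its id, parent_id, and name columns are a hypothetical schema, not the one used later in this article:

```sql
WITH RECURSIVE subtree (id, parent_id, name, path) AS (
  -- Anchor member: the subtree root
  SELECT id, parent_id, name, CAST(name AS CHAR(1024))
    FROM node
   WHERE id = 42
  UNION ALL
  -- Recursive member: attach children of the rows found so far
  SELECT c.id, c.parent_id, c.name, CONCAT(s.path, '/', c.name)
    FROM node c
    JOIN subtree s ON c.parent_id = s.id
)
SELECT * FROM subtree;
```

One round trip replaces the query-per-layer approach, and no materialized path column needs to be kept consistent on renames.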
System Modeling Core features: Multi-tenant isolation and file sharing Searching and authorizing by file path Version merging and rollback operations Recording change logs erDiagram Asset ||--o{ Changelog : change ResourcePath ||--o{ Changelog : change ResourceCommit ||--o{ Changelog : change ResourceCommit ||--|{ Asset : refer ResourcePath ||..|| ResourceCommit : contains Change Log Due to security-audit requirements, every operation must record a change log. We create a global changelog table with JSON fields to ensure its versatility. This table uses a composite primary key: Change time: records when the operation occurred and ensures sequential insertion. Change ID: a long integer ID pre-generated by the application side, used to ensure uniqueness. Considerations behind this design: Generating IDs in the application enables asynchronous log recording, reducing transaction time. It is convenient to implement a one-to-one relationship between data versions and change logs. CREATE TABLE IF NOT EXISTS `changelog` ( `op_time` timestamp NOT NULL COMMENT &#39;PK&#39;, `op_code` bigint NOT NULL COMMENT &#39;PK&#39;, `op_type` tinyint unsigned NOT NULL COMMENT &#39;Change event type&#39;, `op_json` json NOT NULL DEFAULT (JSON_OBJECT()) COMMENT &#39;Change event json&#39;, PRIMARY KEY (`op_time`,`op_code`) ) ENGINE=InnoDB COMMENT=&#39;Changelog for data modification.&#39; PARTITION BY KEY(`op_time`) PARTITIONS 5; File Deduplication Due to the need to support file sharing, the system may contain a large number of identical files. To reduce storage pressure on Minio, files need to be deduplicated and reused. An asset table is set up to record file metadata, and file lifecycles are maintained through reference counting. 
CREATE TABLE IF NOT EXISTS `asset` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `bucket` varchar(50) NOT NULL COMMENT &#39;Bucket name&#39;, `checksum` varchar(255) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;MD5 checksum of the whole file&#39;, `file_size` int unsigned NOT NULL COMMENT &#39;File size (unit: byte)&#39;, `file_url` varchar(255) NOT NULL COMMENT &#39;Cloud file storage path&#39;, `mime_type` varchar(255) NOT NULL DEFAULT &#39;&#39; COMMENT &#39;MimeType&#39;, `extension_name` varchar(255) NOT NULL DEFAULT &#39;&#39; COMMENT &#39;File extension&#39;, `preview` varchar(255) NOT NULL DEFAULT &#39;&#39; COMMENT &#39;Preview Token&#39;, `ref_count` int unsigned NOT NULL DEFAULT 0 COMMENT &#39;Reference counting&#39;, `is_deleted` tinyint unsigned NOT NULL DEFAULT 0 COMMENT &#39;Delete tag(0:No,1:Yes)&#39;, `create_by` bigint NOT NULL COMMENT &#39;Create Code&#39;, `update_by` bigint NOT NULL COMMENT &#39;Last Update Code&#39;, `create_at` timestamp NOT NULL COMMENT &#39;Create Time&#39;, `update_at` timestamp NOT NULL COMMENT &#39;Update Time&#39;, UNIQUE KEY `uniq_bucket_checksum` (`bucket`,`checksum`), KEY `idx_file_url` (`file_url`(20)) ) ENGINE=InnoDB COMMENT=&#39;Basic information of objects stored in OSS.&#39;; Version Control To support merging and reverting between versions in the browser, the backend needs to maintain the dependency relationships between versions. Each commit may point to one or more parent nodes, so the change history can be displayed as a DAG (Directed Acyclic Graph). Each version corresponds to a unique file asset, and the reference relationship is maintained by recording the parent version&#39;s record ID. 
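The commit relationships can be sketched in memory (a hypothetical `CommitGraph`; the real system stores these edges as rows of the `resource_commit` table shown below, with a root commit's `main_parent` set to zero):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the commit DAG implied by main_parent / sub_parent.
public class CommitGraph {
    record Commit(long id, long mainParent, long subParent) {}

    private final Map<Long, Commit> commits = new HashMap<>();

    // subParent is non-zero only for merge commits.
    public void add(long id, long mainParent, long subParent) {
        commits.put(id, new Commit(id, mainParent, subParent));
    }

    // First-parent history, like `git log --first-parent`: follow main_parent
    // from HEAD back to the root commit (main_parent == 0 terminates the walk).
    public List<Long> firstParentHistory(long head) {
        List<Long> history = new ArrayList<>();
        for (long id = head; id != 0; ) {
            history.add(id);
            Commit c = commits.get(id);
            id = (c == null) ? 0 : c.mainParent();
        }
        return history;
    }
}
```

Rendering the full DAG additionally follows each merge commit's `sub_parent` edge.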
CREATE TABLE IF NOT EXISTS `resource_commit` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `asset_id` bigint unsigned NOT NULL COMMENT &#39;Asset ID&#39;, `asset_meta` json NOT NULL DEFAULT (JSON_OBJECT()) COMMENT &#39;Asset Meta&#39;, `commit_path` bigint unsigned NOT NULL COMMENT &#39;Resource path associated with this commit&#39;, `commit_root` bigint unsigned NOT NULL COMMENT &#39;Root resource of the associated resource&#39;, `commit_msg` varchar(50) NOT NULL COMMENT &#39;Commit message&#39;, `main_parent` bigint unsigned NOT NULL COMMENT &#39;Main parent commit&#39;, -- root commit points to zero `sub_parent` bigint unsigned NOT NULL COMMENT &#39;Sub parent commit resulting from merging&#39;, `ref_count` int unsigned NOT NULL DEFAULT 0 COMMENT &#39;Reference counting&#39;, `is_deleted` tinyint unsigned NOT NULL DEFAULT 0 COMMENT &#39;Delete Tag(0: No, 1: Yes)&#39;, `create_by` bigint NOT NULL COMMENT &#39;Create Code&#39;, `update_by` bigint NOT NULL COMMENT &#39;Last Update Code&#39;, `create_at` timestamp NOT NULL COMMENT &#39;Create Time&#39;, `update_at` timestamp NOT NULL COMMENT &#39;Update Time&#39;, KEY `idx_commit_path` (`commit_path`), KEY `idx_main_parent` (`main_parent`) ) ENGINE=InnoDB COMMENT=&#39;Relationship between resource versions.&#39;; Directory Structure All file entries are stored in a directory tree. To isolate the resources of different users, we assign an individual tree root node to each user. Instead of MySQL auto-increment IDs, we use UUIDs to facilitate subsequent PBAC authentication. The records in the table are divided into directories and files. File records have a HEAD pointer that points to the latest version record. To accelerate querying by file path, this table sets a unique index on the node name. When logically deleting, the top-level node must be renamed so that its name remains available for future use. After that, all of its child nodes become invisible to path-based queries. 
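The rename-on-delete step can be sketched as follows (the suffix format here is an assumption for illustration, not the system's actual convention):

```java
// Sketch of logical deletion for a tree node: rename the top-level node so its
// original name is immediately reusable under the same parent, then mark it deleted.
public class SoftDelete {
    public static class Node {
        public String nodeId;
        public String nodeName;
        public boolean deleted;

        public Node(String nodeId, String nodeName) {
            this.nodeId = nodeId;
            this.nodeName = nodeName;
        }
    }

    // Renaming avoids collisions with the UNIQUE (parent_id, node_name) index
    // when a new node with the same name is created later. The ".deleted.<id>"
    // suffix is illustrative; any collision-free rename works.
    public static void delete(Node node) {
        node.nodeName = node.nodeName + ".deleted." + node.nodeId;
        node.deleted = true; // child nodes become invisible to path queries
    }
}
```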
CREATE TABLE IF NOT EXISTS `resource_path` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `space_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Space ID&#39;, `node_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Node ID&#39;, `parent_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Parent ID&#39;, -- root path points to itself `node_name` varchar(50) NOT NULL COMMENT &#39;Node Name&#39;, `node_index` double NOT NULL DEFAULT 0 COMMENT &#39;Sort&#39;, `node_type` tinyint unsigned NOT NULL COMMENT &#39;Type (0:Dir,1:File)&#39;, `fork_root` bigint unsigned NOT NULL COMMENT &#39;Fork Root ID (maybe from another space)&#39;, `fork_parent` bigint unsigned NOT NULL COMMENT &#39;Fork Parent ID (maybe from another space)&#39;, `commit_init` bigint unsigned NOT NULL COMMENT &#39;Initial commit&#39;, -- never changes after init `commit_head` bigint unsigned NOT NULL COMMENT &#39;Commit head&#39;, `version_num` int unsigned NOT NULL DEFAULT 0 COMMENT &#39;Total number of commits&#39;, `is_deleted` tinyint unsigned NOT NULL DEFAULT 0 COMMENT &#39;Delete tag(0:No,1:Yes)&#39;, `is_rubbish` tinyint unsigned NOT NULL DEFAULT 0 COMMENT &#39;Recycle Bin Tag (0: No, 1: Yes)&#39;, `create_by` bigint NOT NULL COMMENT &#39;Create Code&#39;, `update_by` bigint NOT NULL COMMENT &#39;Last Update Code&#39;, `create_at` timestamp NOT NULL COMMENT &#39;Create Time&#39;, `update_at` timestamp NOT NULL COMMENT &#39;Update Time&#39;, UNIQUE KEY `uniq_node_id` (`node_id`), UNIQUE KEY `uniq_node_name` (`parent_id`,`node_name`), -- The uniqueness of parent_id is guaranteed by uniq_node_id KEY `idx_space_id` (`space_id`) ) ENGINE=InnoDB COMMENT=&#39;Path of a resource tree node.&#39;; Implementation Details CTE Queries With a given {nodeId}, all child nodes can be queried with the following statement: WITH RECURSIVE path_view AS ( SELECT *, {maxDepth} lv FROM resource_path WHERE parent_id = {nodeId} AND is_deleted = 0 UNION ALL SELECT p.*, lv - 1 
FROM path_view pv INNER JOIN resource_path p ON p.parent_id = pv.node_id WHERE lv &gt; 0 AND p.id &gt; {rootPk} AND p.is_deleted = 0 ) SELECT * FROM path_view ORDER BY lv DESC To avoid performance issues, we limit the maximum depth of the directory tree to {maxDepth}. Since the root node is inserted before all child nodes, we use the auto-increment ID of the root node {rootPk} to limit the query range. With a given file path, first convert it into a list of node names {pathNames}, and then use the following statement to query all child nodes: WITH RECURSIVE path_view AS ( SELECT *, 0 lv FROM resource_path WHERE id = {rootPk} UNION ALL SELECT p.*, lv + 1 FROM path_view pv INNER JOIN resource_path p ON p.parent_id = pv.node_id WHERE lv &lt; {pathNames.size} AND (p.node_name, lv) IN (({pathNames[1]},1), ({pathNames[2]},2), ...) ) SELECT * FROM path_view Use the auto-increment ID of the root node {rootPk} to specify the starting point of the query. Reduce SQL complexity using the multi-value matching feature of the IN clause. Version Updates File version update operations draw on concepts from Git: Commit: Create a new version and move the HEAD reference (optimistic locking can be achieved with it). @Data public class CommitOp { private Long dataId; // The asset pk referred to by this commit private String dataHash; // The expected hash checksum of the asset private String dataMeta; // The json metadata for this commit (optional) private String message; // Commit message private Long expectHead; // Optimistic locking with head (optional) private Long mergeCommit; // Specify sub_parent for merge op (optional) } Revert: Roll back to an old version and move the HEAD reference (versions that are no longer referenced can be deleted). @Data public class RevertOp { private Long checkoutCommit; // Set the path head to a specific commit private Long deleteCommit; // Delete a specific commit (optional) } Fork: Create a new version based on a specific commit. 
@Data public class ForkOp { private Long commitId; private String commitMsg; } Pre-signed URLs We use the pre-signed URL mechanism provided by Minio to accelerate file upload and download, allowing the frontend to interact with it directly. Taking file upload as an example, the overall process is as follows: Pre-generate a SignedURL based on the bucket and key. Generate a unique token to cache the bucket and key corresponding to this SignedURL. The frontend uses the SignedURL to upload files to Minio. Check the upload status using the token and delete duplicate files. sequenceDiagram Browser-&gt;&gt;Service: getUploadToken() Service-&gt;&gt;Service: minioCli.getPresignedObjectUrl() Service-&gt;&gt;Browser: Token + SignedURL Browser-&gt;&gt;Minio: File Browser-&gt;&gt;Service: Token Service-&gt;&gt;Minio: minioCli.statObject() "},{"slug":"multitenancy-based-on-pbac","title":"Implementing Multi-Tenancy Isolation with PBAC","tags":["SystemDesign","BackendDev","PBAC"],"content":"After completing the design of the resource directory, the next step is to implement the multi-tenancy feature. Basic Concepts Organizational Structure A clear organizational structure is a prerequisite for efficient collaboration. By defining the responsibilities and power scopes of each team, confusion and conflicts can be avoided, thereby increasing overall work efficiency. An organization typically consists of two types of entities: teams and members. Team: Based on the relationship of responsibilities and duties, different teams can be nested within each other, forming a tree or DAG (Directed Acyclic Graph). Member: Each member can belong to one or more teams. Access Control Model Permission control is the cornerstone of ensuring organizational data security. 
Currently, there are three common access control models: Role-Based Access Control (RBAC) stateDiagram direction LR User --&gt; Role Role --&gt; Engine state Engine { PredefinedRules } Engine --&gt; Allow Engine --&gt; Deny Authorization steps: Grant permissions to pre-defined roles Associate teams or members in the organization with specific roles Authentication steps: Get the list of roles owned by the visitor Determine whether the role has permission to access the resource Its advantages are that it is clear and easy to understand, and the authorization process is simple and efficient. The disadvantages are that it is coarse-grained and may lead to a combinatorial explosion of roles. Attribute-Based Access Control (ABAC) stateDiagram direction LR UserAttrs --&gt; Engine ResourceAttrs --&gt; Engine EnvContext --&gt; Engine state Engine { PredefinedRules } Engine --&gt; Allow Engine --&gt; Deny Authorization steps: Predefine the required attributes for authorization User attributes (role, department ...) Resource attributes (type, owner ...) Environment context (time, location, device ...) Implement an authorization engine around predefined attributes Maintain authorization rules in the authorization engine as needed Authentication steps: Get user attributes, resource attributes and environment context The authorization engine finds the relevant authorization rules based on the attributes Determine whether the user has access permissions Its advantages are that the authorization granularity is fine, and flexible authorization strategies can be implemented. The disadvantages are that the system is complex and difficult to understand, and the maintenance cost is high. 
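The RBAC flow above boils down to two lookups, which a minimal sketch makes concrete (role, user, and action names are illustrative):

```java
import java.util.Map;
import java.util.Set;

// Minimal RBAC sketch: authorization populates the role -> permission table;
// authentication looks up the visitor's roles, then each role's permissions.
public class RbacEngine {
    private final Map<String, Set<String>> rolePerms; // role -> allowed actions
    private final Map<String, Set<String>> userRoles; // user -> granted roles

    public RbacEngine(Map<String, Set<String>> rolePerms,
                      Map<String, Set<String>> userRoles) {
        this.rolePerms = rolePerms;
        this.userRoles = userRoles;
    }

    public boolean allows(String user, String action) {
        for (String role : userRoles.getOrDefault(user, Set.of())) {
            if (rolePerms.getOrDefault(role, Set.of()).contains(action)) {
                return true;
            }
        }
        return false;
    }
}
```

The fine-grained ABAC and PBAC models replace the fixed role table with rules evaluated over user, resource, and environment attributes.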
Policy-Based Access Control (PBAC) stateDiagram direction LR state Policy { UserAttrs ResourceAttrs EnvContext Rules } Policy --&gt; Engine Engine --&gt; Allow Engine --&gt; Deny PBAC is very similar to ABAC, with the only difference being that the responsibility for maintaining authorization rules is transferred to the authorizer: The authorization engine provides a language (usually JSON) for describing authorization rules The authorizer implements authorization by writing rules This design avoids coupling between the authorization engine and the specific rules while retaining the advantages of ABAC, and reduces the system maintenance cost. System Modeling Core features: Each user has their own storage space They can invite other users to join the current space as members, and a user has different member identities in different spaces Space administrators can create teams and roles within the space and assign specific responsibilities to members Space administrators can create Keys (AccessKey + SecretKey) to achieve login-free access Implement flexible authorization strategies Administrators can allocate roles to teams, members, and Keys independently Teams and Keys can inherit access permissions from roles Members can inherit access permissions from teams The expiration date of permissions can be specified erDiagram User ||--|| Space : own User ||--o{ Member : is Space ||--o{ Role : has Space ||--o{ Team : has Space ||--o{ Member : has Space ||--o{ Key : has Role }o--o{ Link : refer Team }o--o{ Link : refer Member }o--o{ Link : refer Key }o--o{ Link : refer Basic Information user and space are the most basic entities of the entire multi-tenancy module, and they cannot be changed once created. 
CREATE TABLE IF NOT EXISTS `user` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `uuid` varchar(32) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;User ID&#39;, `name` varchar(50) NOT NULL COMMENT &#39;Nick Name&#39;, `mobile` varchar(50) NOT NULL DEFAULT &#39;&#39; COMMENT &#39;Phone Number&#39;, `email` varchar(100) NOT NULL DEFAULT &#39;&#39; COMMENT &#39;Email&#39;, ... UNIQUE KEY `uniq_uuid` (`uuid`), KEY `idx_email` (`email`) ) ENGINE=InnoDB COMMENT=&#39;Basic user information.&#39;; CREATE TABLE IF NOT EXISTS `space` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `space_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Space unique identifier character&#39;, `name` varchar(50) NOT NULL COMMENT &#39;Space Name&#39;, `bucket` varchar(50) NOT NULL COMMENT &#39;Space Bucket&#39;, `props` json NOT NULL DEFAULT (JSON_OBJECT()) COMMENT &#39;Space properties&#39;, ... UNIQUE KEY `uniq_space_id` (`space_id`) ) ENGINE=InnoDB COMMENT=&#39;Workspace for resource isolation.&#39;; Organizational Structure Principals that may exist within a space: unit_role, unit_team, unit_member, unit_key. There are 4 types of permission propagation relationships which can be represented by unit_link: Team -&gt; Member Role -&gt; Team Role -&gt; Member Role -&gt; Key CREATE TABLE IF NOT EXISTS `unit_role` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `space_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Space ID&#39;, `role_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Role ID&#39;, `role_name` varchar(100) NOT NULL COMMENT &#39;Role Name&#39;, `role_index` double NOT NULL DEFAULT 0 COMMENT &#39;Sort&#39;, ... 
UNIQUE KEY `uniq_role_id` (`role_id`), KEY `idx_space_id` (`space_id`) ) ENGINE=InnoDB COMMENT=&#39;Authorized roles in a certain workspace.&#39;; CREATE TABLE IF NOT EXISTS `unit_team` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `space_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Space ID&#39;, `team_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Team ID&#39;, `parent_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Parent Team ID&#39;, `team_name` varchar(50) NOT NULL COMMENT &#39;Team Name&#39;, `team_index` double NOT NULL DEFAULT 0 COMMENT &#39;Sort&#39;, ... UNIQUE KEY `uniq_team_id` (`team_id`), UNIQUE KEY `uniq_team_name` (`parent_id`,`team_name`), -- The uniqueness of parent_id is guaranteed by uniq_team_id KEY `idx_space_id` (`space_id`) ) ENGINE=InnoDB COMMENT=&#39;Organization information in a certain workspace.&#39;; CREATE TABLE IF NOT EXISTS `unit_member` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `user_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;User UUID&#39;, `space_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Space ID&#39;, `member_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Member ID&#39;, `member_name` varchar(255) NOT NULL COMMENT &#39;Member Name&#39;, `member_index` double NOT NULL DEFAULT 0 COMMENT &#39;Sort&#39;, `job_number` varchar(60) NOT NULL DEFAULT &#39;&#39; COMMENT &#39;Job Number&#39;, `position` varchar(255) NOT NULL DEFAULT &#39;&#39; COMMENT &#39;Position&#39;, ... 
UNIQUE KEY `uniq_member_id` (`member_id`), KEY `idx_user_id` (`user_id`), KEY `idx_mobile` (`mobile`), KEY `idx_email` (`email`) ) ENGINE=InnoDB COMMENT=&#39;Member information in a certain workspace.&#39;; CREATE TABLE IF NOT EXISTS `unit_key` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `space_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Space ID&#39;, `key_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Key ID&#39;, `key_name` varchar(100) NOT NULL COMMENT &#39;Key Name&#39;, `key_index` double NOT NULL DEFAULT 0 COMMENT &#39;Sort&#39;, `access_key` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Access Key&#39;, `secret_key` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Secret Key&#39;, ... UNIQUE KEY `uniq_key_id` (`key_id`), UNIQUE KEY `uniq_access_key` (`access_key`), KEY `idx_space_id` (`space_id`) ) ENGINE=InnoDB COMMENT=&#39;Access keys in a certain workspace.&#39;; CREATE TABLE IF NOT EXISTS `unit_link` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `space_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Space ID&#39;, `link_index` double NOT NULL DEFAULT 0 COMMENT &#39;Sort&#39;, `link_type` tinyint unsigned NOT NULL COMMENT &#39;Link Type(0: Team-Member, 1: Role-Team, 2: Role-Member, 3:Role-Key)&#39;, `main_unit_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Main Unit(0: Team, 1: Role, 2: Role, 3:Role)&#39;, `sub_unit_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Sub Unit(0: Member, 1: Team, 2: Member, 3:Key)&#39;, ... UNIQUE KEY `uniq_link_pair` (`main_unit_id`,`sub_unit_id`), KEY `idx_space_id` (`space_id`) ) ENGINE=InnoDB COMMENT=&#39;Permission delegation link.&#39;; Authorization Policy Our policy in the PBAC model is defined with reference to AWS IAM and simplified: Rule Effect (Allow / Deny) Authorized Subject (Role, Team, ...) Access Actions (Modify, Delete, ...) 
Access Resources Condition Expression CREATE TABLE IF NOT EXISTS `policy` ( `id` bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY COMMENT &#39;PK&#39;, `space_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Space ID&#39;, `policy_id` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Policy ID&#39;, `principal_type` tinyint unsigned NOT NULL COMMENT &#39;Principal Type(0: Role, 1: Team, 2: Member, 3: Key, 4: User)&#39;, `resource_type` tinyint unsigned NOT NULL COMMENT &#39;Resource Type&#39;, `principal` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Principal ID&#39;, `resource` varchar(50) COLLATE utf8mb4_bin NOT NULL COMMENT &#39;Resource ID&#39;, `effect` tinyint unsigned NOT NULL COMMENT &#39;Policy effect(0: Deny, 1: Allow)&#39;, `action` json NOT NULL COMMENT &#39;Policy action array&#39;, `condition` json NOT NULL COMMENT &#39;Policy condition object&#39;, ... UNIQUE KEY `uniq_policy_id` (`policy_id`), KEY `idx_space_id` (`space_id`) ) ENGINE=InnoDB COMMENT=&#39;Authorization policy.&#39;; Access Actions The following are the predefined action types in the system: @Getter @RequiredArgsConstructor @Accessors(fluent = true) public enum PolicyAction { AnyAdmAction(&quot;adm:*&quot;), AnyResAction(&quot;res:*&quot;), AnyOrgAction(&quot;org:*&quot;), AdmAddTeam(&quot;adm:addTeam&quot;), AdmRemoveTeam(&quot;adm:removeTeam&quot;), AdmAddRole(&quot;adm:addRole&quot;), AdmRemoveRole(&quot;adm:removeRole&quot;), AdmAddMember(&quot;adm:addMember&quot;), AdmRemoveMember(&quot;adm:removeMember&quot;), AdmAddKey(&quot;adm:addKey&quot;), AdmRemoveKey(&quot;adm:removeKey&quot;), AdmGrantTeam(&quot;adm:grantTeam&quot;), AdmGrantRole(&quot;adm:grantRole&quot;), AdmGrantPolicy(&quot;adm:grantPolicy&quot;), ResListDir(&quot;res:listDir&quot;), ResCreateDir(&quot;res:createDir&quot;), ResDeleteDir(&quot;res:deleteDir&quot;), ResForceDelDir(&quot;res:forceDelDir&quot;), ResReadNode(&quot;res:readNode&quot;), ResEditNode(&quot;res:editNode&quot;), 
ResCreateNode(&quot;res:createNode&quot;), ResDeleteNode(&quot;res:deleteNode&quot;), ... ; } Some actions are defined as domain-prefix:*; they can be used to match all actions under that domain at once. When new actions are added under the domain later, policies that use the wildcard automatically include them. Condition Expression The condition expression is a JSON string that can express complex rules: When the expression evaluates to true, the rule takes effect When the expression evaluates to false, the rule does not take effect If complex matching is not needed, simply set it to the constant true. @RequiredArgsConstructor public abstract class PolicyMatcher { private final String expression; public abstract boolean match(PolicyContext ctx); static class TrueMatcher extends PolicyMatcher { TrueMatcher() { super(&quot;true&quot;); } @Override public boolean match(PolicyContext ctx) { return true; } } static PolicyMatcher compile(String expression) { if (&quot;true&quot;.equals(expression)) { return new TrueMatcher(); } ... } } Implementation Details Resource Visibility Resource access permissions should be inheritable: If a user has access permission for a parent node, they automatically have access permission for its child nodes. If a user lacks access permission for a parent node, they automatically lack access permission for its child nodes. However, administrators can also achieve the following effects through specific policies: The parent node is accessible, but some child nodes are not (through a Deny effect) The parent node is not accessible, but some child nodes are (through an Allow effect) flowchart TB root(Root) --- a(A) root(Root) --- b(B) a(A)---a1(A1) a(A)---a2(A2) subgraph aa[&quot; &quot;] a1(A1)---a11(...) a1(A1)---a12(...) end a2(A2)---a21(...) a2(A2)---a22(...) b(B)---b1(B1) b(B)---b2(B2) subgraph bb[&quot; &quot;] b1(B1)---b11(...) b1(B1)---b12(...) end b2(B2)---b21(...) 
b2(B2)---b22(...) classDef group fill:none,stroke-width:2px,stroke-dasharray: 5 5 classDef allow fill:#e0eafd,stroke:#5084fd,stroke-width:3px classDef deny fill:#ffa640,stroke:#f76b66,stroke-width:3px class root,a,a2,a21,a22,b1,b11,b12 allow; class b,a1,a11,a12,b2,b21,b22 deny; class aa,bb group; flowchart LR Allow(Allow):::allow ~~~ Deny(Deny):::deny classDef allow fill:#e0eafd,stroke:#5084fd,stroke-width:3px classDef deny fill:#ffa640,stroke:#f76b66,stroke-width:3px To keep the presentation complete, as long as accessible child nodes exist, the parent node must be visible to the user: When a user retrieves the child nodes of node A, the process is as follows: Check the ACL of node A and find that the user has access permission. Query the child nodes of node A, obtaining nodes A1 and A2. Check the rule tree and find that node A1 has a Deny rule, so node A1 is removed from the result list. Return node A2 to the user. When a user retrieves the child nodes of node B, the process is as follows: Check the ACL of node B and find that the user does not have access permission. Check the rule tree and find that node B1 has an Allow rule, which means node B is accessible. Query the child nodes of node B, obtaining nodes B1 and B2. Return node B1 to the user. ACL and ACT To achieve the above resource visibility, two rule-matching data structures are introduced: ACL only loads policies related to specific resources and determines whether the user has permission to access those resources. 
List&lt;ResourcePath&gt; paths = resourceService.getResourcesById(nodeId); List&lt;Policy&gt; policies = policyService.getPolicies(spaceId, ResourceType.Path); PolicyACL&lt;ResourcePath&gt; ctrl = PolicyACL.attachPolicyToResource(paths, policies, ResourcePath::getNodeId); Preconditions.checkState(ctrl.hasPermission(currentContext(), PolicyAction.ResListDir)); ACT loads all resource policies and determines whether the current user has permission to access the resource. List&lt;ResourcePath&gt; nodes = resourceService.getResources(rootId); List&lt;Policy&gt; policies = policyService.getPolicies(spaceId, ResourceType.Path); PolicyACT&lt;ResourceNode&gt; tree = PolicyACT.buildTreeAndGetRoot(nodes, ResourceNode::new, ResourcePath::getNodeId, ResourcePath::getParentId); tree.attachPolicyToTree(policies); for (ResourcePath node: nodes) { Preconditions.checkState(tree.hasAllowChild(node.getNodeId(), currentContext(), PolicyAction.ResListDir)); Preconditions.checkState(!tree.hasDenyParent(node.getNodeId(), currentContext(), PolicyAction.ResListDir)); } AccessControlList @Data @Accessors(chain = true) @RequiredArgsConstructor public class PolicyCarrier&lt;R&gt; { private static class CarrierChain { CarrierNode allowances; CarrierNode denials; @Override public String toString() { List&lt;PolicyMatcher&gt; allow = null; List&lt;PolicyMatcher&gt; deny = null; if (allowances != null) allowances.visit((allow = new ArrayList&lt;&gt;())::add); if (denials != null) denials.visit((deny = new ArrayList&lt;&gt;())::add); return &quot;(&quot; + &quot;allow:&quot; + allow + &quot;, deny:&quot; + deny + &#39;)&#39;; } } private record CarrierNode(PolicyMatcher matcher, CarrierNode next) { boolean match(PolicyContext context) { CarrierNode node = this; while (node != null &amp;&amp; !node.matcher.match(context)) node = node.next; return node != null; } void visit(Consumer&lt;PolicyMatcher&gt; visitor) { CarrierNode node = this; while (node != null) { visitor.accept(node.matcher); node = 
node.next; } } } R object; Map&lt;PolicyAction, CarrierChain&gt; matchers; public void addMatcher(PolicyAction action, PolicyEffect effect, PolicyMatcher matcher) { if (matchers == null) matchers = new IdentityHashMap&lt;&gt;(0); CarrierChain chain = matchers.computeIfAbsent(action, k -&gt; new CarrierChain()); switch (effect) { case Allow -&gt; chain.allowances = new CarrierNode(matcher, chain.allowances); case Deny -&gt; chain.denials = new CarrierNode(matcher, chain.denials); } } public PolicyEffect matchEffect(PolicyContext context, PolicyAction action) { if (matchers != null) { CarrierChain chain = matchers.get(action); if (chain != null) { // Allow effect has higher priority if (chain.allowances != null &amp;&amp; chain.allowances.match(context)) { return PolicyEffect.Allow; } if (chain.denials != null &amp;&amp; chain.denials.match(context)) { return PolicyEffect.Deny; } } } return null; } } /** * ACL is a node list, indicating the path from the root node to the current node &lt;p/&gt; * Pass the permission information of the parent node to the child node through bottom-up query */ public class PolicyACL&lt;R&gt; extends ArrayList&lt;PolicyCarrier&lt;R&gt;&gt; { public PolicyACL(int capacity) { super(capacity); } public static &lt;T&gt; PolicyACL&lt;T&gt; attachPolicyToResource(List&lt;T&gt; resources, List&lt;Policy&gt; policies, Function&lt;T,String&gt; toId) { PolicyACL&lt;T&gt; acl = new PolicyACL&lt;&gt;(resources.size()); for (T res : resources) { PolicyCarrier&lt;T&gt; attach = new PolicyCarrier&lt;T&gt;().setObject(res); for (Policy policy : policies) { // The resource ID may be repeated, can&#39;t use map here if (policy.getResource().equals(toId.apply(res))) { PolicyMatcher condition = PolicyMatcher.compile(policy.getCondition()); PolicyAction.parse(policy.getAction(), action -&gt; attach.addMatcher(action, policy.getEffect(), condition)); } } acl.add(attach); } return acl; } public void checkAllPermission(PolicyContext context, 
Collection&lt;PolicyAction&gt; actions) { Checker.check(!hasAllPermission(context, actions), PermissionException.NODE_ACCESS_DENIED); } public boolean hasAllPermission(PolicyContext context, Collection&lt;PolicyAction&gt; actions) { for (PolicyAction action : actions) { if (!hasPermission(context, action)) return false; } return !actions.isEmpty(); } public void checkAnyPermission(PolicyContext context, Collection&lt;PolicyAction&gt; actions) { Checker.check(!hasAnyPermission(context, actions), PermissionException.NODE_ACCESS_DENIED); } public boolean hasAnyPermission(PolicyContext context, Collection&lt;PolicyAction&gt; actions) { for (PolicyAction action : actions) { if (hasPermission(context, action)) return true; } return false; } public void checkPermission(PolicyContext context, PolicyAction action) { Checker.check(!hasPermission(context, action), PermissionException.NODE_ACCESS_DENIED); } // Bottom-up query public boolean hasPermission(PolicyContext context, PolicyAction action) { for (int i = size()-1; i &gt;= 0; i--) { PolicyEffect effect = get(i).matchEffect(context, action); if (effect != null) { // Inherit permissions from the nearest parent node that specifies permissions return effect == PolicyEffect.Allow; } } return false; } public R resource() { return isEmpty() ? 
null : get(size()-1).getObject(); } } AccessControlTree @Setter @RequiredArgsConstructor @SuppressWarnings(&quot;unchecked&quot;) public class PolicyNode&lt;T&gt; extends PolicyCarrier&lt;T&gt; { PolicyNode&lt;T&gt; parent; List&lt;PolicyNode&lt;T&gt;&gt; children; public &lt;Node extends PolicyNode&lt;T&gt;&gt; Node getParent() { return (Node) parent; } public &lt;Node extends PolicyNode&lt;T&gt;&gt; List&lt;Node&gt; getChildren() { return (List&lt;Node&gt;) children; } public boolean containsNode(Predicate&lt;T&gt; predicate) { if (predicate.test(getObject())) { return true; } return hasChildren(predicate, false); } public boolean hasChildren(Predicate&lt;T&gt; predicate, boolean direct) { Deque&lt;PolicyNode&lt;T&gt;&gt; stack = new ArrayDeque&lt;&gt;(); stack.push(this); boolean visited = false; while(!stack.isEmpty()) { PolicyNode&lt;T&gt; node = stack.pop(); if (ObjectUtils.isNotEmpty(node.getChildren()) &amp;&amp; (!visited || !direct)) { for (PolicyNode&lt;T&gt; child : node.getChildren()) { if (predicate.test(child.getObject())) { return true; } } stack.addAll(node.getChildren()); } visited = true; } return false; } @Override public String toString() { return &quot;Node(&quot; + &quot;obj=&quot; + getObject() + &quot;, matchers=&quot; + getMatchers() + &#39;)&#39;; } } /** * ACT is a sparse node tree that only contains nodes for which the user has explicitly specified rules&lt;p/&gt; * Pass the permission information of the child node to the parent node through top-down query */ @SuppressWarnings({&quot;unchecked&quot;,&quot;rawtypes&quot;}) public record PolicyACT&lt;Node extends PolicyNode&gt;(Node root, Map&lt;String, Node&gt; lookup) { public static &lt;T, Node extends PolicyNode&lt;T&gt;&gt; PolicyACT&lt;Node&gt; buildTreeAndGetRoot( List&lt;T&gt; resources, Supplier&lt;Node&gt; newNode, Function&lt;T, String&gt; toId, Function&lt;T, String&gt; toParentId) { List&lt;PolicyACT&lt;Node&gt;&gt; trees = buildTreeAndGetRoots(resources, newNode, toId, 
toParentId); Checker.check(trees.size() != 1, &quot;unexpected tree topology&quot;); return trees.get(0); } public static &lt;T, Node extends PolicyNode&lt;T&gt;&gt; List&lt;PolicyACT&lt;Node&gt;&gt; buildTreeAndGetRoots(List&lt;T&gt; resources, Supplier&lt;Node&gt; newNode, Function&lt;T, String&gt; toId, Function&lt;T, String&gt; toParentId) { Map&lt;String, Node&gt; lookup = new HashMap&lt;&gt;(); resources.forEach(x -&gt; lookup.put(toId.apply(x), (Node) newNode.get().setObject(x))); for (Node node : lookup.values()) { Node parent = lookup.get(toParentId.apply(node.getObject())); if (parent != null &amp;&amp; parent != node) { node.setParent(parent); if (parent.getChildren() == null) { parent.setChildren(new ArrayList&lt;&gt;()); } parent.getChildren().add(node); } } List&lt;PolicyACT&lt;Node&gt;&gt; roots = new ArrayList&lt;&gt;(1); for (Node node : lookup.values()) { if (node.getParent() == null) { roots.add(new PolicyACT&lt;&gt;(node, lookup)); } } return roots; } public void attachPolicyToTree(List&lt;Policy&gt; policies) { for (Policy policy : policies) { PolicyMatcher condition = PolicyMatcher.compile(policy.getCondition()); Node node = lookup().get(policy.getResource()); PolicyAction.parse(policy.getAction(), action -&gt; node.addMatcher(action, policy.getEffect(), condition)); } } // Top-down query public boolean hasAllowChild(String nodeId, PolicyContext context, PolicyAction action) { Node node = lookup().get(nodeId); if (node != null) { Queue&lt;Node&gt; queue = new LinkedList&lt;&gt;(); queue.add(node); while (!queue.isEmpty()) { Node n = queue.poll(); PolicyEffect effect = n.matchEffect(context, action); if (effect == PolicyEffect.Allow) { return true; // Return if any child node is specified with Allow } if (ObjectUtils.isNotEmpty(n.getChildren())) { queue.addAll(n.getChildren()); } } } return false; } // Bottom-up query public boolean hasDenyParent(String nodeId, PolicyContext context, PolicyAction action) { Node node = lookup().get(nodeId); 
while (node != null) { PolicyEffect effect = node.matchEffect(context, action); if (effect == PolicyEffect.Deny) { return true; // Return if any parent node is specified with Deny } node = (Node) node.getParent(); } return false; } } "},{"slug":"opening-remarks","title":"Opening remarks - the evolution of front-end technology I have experienced","tags":["FrontendDev"],"content":"As a full-stack developer who knows a little about front-end, here is a brief review of what I have seen and heard in recent years, as the opening of this blog. jQuery: The Old King Upon graduating, I joined an e-commerce company as a web application developer. The company was in its early stages, and everything seemed free-spirited, especially the technology stack. At that time, the architect sneaked in some personal preferences, using the &quot;full-featured online shopping system&quot; he tinkered with in his spare time as the development template for all projects. Sure enough, problems soon arose: during the first online flash sale, the system collapsed at less than 200 TPS... A few months later, the company went under. While the boss suffered losses and I lost my job, the architect bought himself a luxury home in the city center. Checking his recent status, I discovered that the shopping system had been packaged into Software as a Service (SaaS), quietly waiting for the next victim. As one of the victims of this system, I&#39;ll briefly share the frontend technology stack: Since the project&#39;s development language was Java, all HTML pages were dynamically rendered through JSP. Additionally, the following two frontend frameworks were introduced on this basis: jQuery jQuery was released in 2006 as a fast and lightweight JavaScript library designed to simplify client-side scripting for HTML. 
(function($) { // Classic closure var hiddenBox = $( &quot;#banner-message&quot; ); // Event listening + DOM animation $( &quot;#button-container button&quot; ).on( &quot;click&quot;, function( event ) { hiddenBox.show(); }); $.ajax({ // Ajax asynchronous communication url: &quot;/api/getWeather&quot;, data: { zipcode: 97201 }, success: function( result ) { $( &quot;#weather-temp&quot; ).html( &quot;&lt;strong&gt;&quot; + result + &quot;&lt;/strong&gt; degrees&quot; ); } }); })(jQuery); Bootstrap Bootstrap was released in 2011, providing a standardized set of CSS styles based on pre-defined class names. It is designed to create consistent user interfaces across multiple platforms. &lt;!DOCTYPE html&gt; &lt;html&gt; &lt;head&gt; &lt;link rel=&quot;stylesheet&quot; href=&quot;./css/bootstrap.min.css&quot;&gt; &lt;/head&gt; &lt;body&gt; &lt;div class=&quot;container&quot;&gt; &lt;div class=&quot;row&quot;&gt; &lt;button type=&quot;button&quot; class=&quot;btn btn-primary&quot;&gt;Default&lt;/button&gt; &lt;!-- Blue regular-sized button --&gt; &lt;button type=&quot;button&quot; class=&quot;btn btn-success btn-lg&quot;&gt;Large&lt;/button&gt; &lt;!-- Green large-sized button --&gt; &lt;button type=&quot;button&quot; class=&quot;btn btn-warning btn-sm&quot;&gt;Small&lt;/button&gt; &lt;!-- Orange small-sized button --&gt; &lt;/div&gt; &lt;/div&gt; &lt;/body&gt; &lt;/html&gt; The popularity of these two frameworks is closely related to the historical context of that time - browser compatibility. During that period, the HTML5 specification had not yet become widespread, and the support for this standard by most browsers was far from perfect. Developers had to deal not only with the new and somewhat unstable browsers like Firefox and Chrome, but also with the notorious IE due to its high market share on Windows. In this era, both jQuery and Bootstrap not only provided out-of-the-box cross-browser compatibility but also offered numerous polyfill plugins. 
These plugins enabled older browsers like IE to support some HTML5 features. This significantly boosted development efficiency, leading many developers to embrace these frameworks. AngularJS: Origin Of Magic Later on, I joined a logistics company, taking on the role of full-stack development for the internal management system. It was here that I first encountered the concept of frontend-backend separation and began learning about the NPM ecosystem. During this period, my technical skills were relatively modest, but my enthusiasm for learning was soaring. Motivated by the promises from leadership, I utilized the Baidu Maps API to create an &quot;Express Station Boundary Management System.&quot; This marked my first independently completed project in my professional career. Although I eventually left the company, I am sincerely grateful to my leaders and colleagues at the time. I learned a great deal from them, and this experience holds a special place in my career journey. SPA As the HTML5 specification gained widespread support, compatibility ceased to be a critical issue in the frontend domain. Rising expectations for user experience became the new challenge for developers. Among various factors, full-page refreshes had the most significant impact: Each page transition required a network request, and network communication latency affected the loading speed of pages. Resource files such as JS and CSS needed to be reloaded and initialized, further increasing the page rendering time. To fundamentally address this issue, the design pattern known as Single Page Application (SPA) emerged. Its key features include: Dynamic Loading: SPAs initially load HTML, CSS, and JavaScript resources, dynamically updating content as users interact with the application. Client-Side Routing: SPAs use client-side routing to manage navigation within the application. The URL changes dynamically based on user interactions, but the actual page doesn&#39;t reload. 
Performance Improvement: Since SPAs only fetch data needed to update the current view, they typically reduce the amount of data transferred between the client and server, thereby enhancing performance. Applications using the SPA pattern have the following advantages: Enhanced User Experience: SPAs eliminate the delay of full-page reloads, providing a smoother and more engaging user experience. Reduced Server Load: SPAs request only necessary data, reducing the server load and optimizing bandwidth usage. Frontend-Backend Separation: With the frontend and backend integrated only through APIs, specialized teams can develop in parallel, further improving organizational development efficiency. However, in the SPA pattern, the client needs to implement a considerable amount of complex DOM listening and modification logic. Taking the common example of submitting a form: Monitoring user modifications to the form in real time, promptly notifying users of any errors. Disabling the submit button after the user clicks it to prevent duplicate submissions. Providing user feedback upon Ajax response, indicating successful operations, and clearing the form data. 
$(document).ready(function() { // Listen for user input $(&#39;#numberInput&#39;).on(&#39;input&#39;, function() { var inputValue = $(this).val(); if ($.isNumeric(inputValue)) { // Validate the input $(&#39;button[type=&quot;submit&quot;]&#39;).prop(&#39;disabled&#39;, false); } else { $(&#39;button[type=&quot;submit&quot;]&#39;).prop(&#39;disabled&#39;, true); } }); // Listen for form submission $(&#39;#myForm&#39;).submit(function(event) { event.preventDefault(); // Prevent the default full-page form submission var inputValue = $(&#39;#numberInput&#39;).val(); if ($.isNumeric(inputValue)) { $.ajax({ // Initiate the request url: &#39;your_ajax_endpoint&#39;, type: &#39;POST&#39;, data: { number: inputValue }, success: function(response) { $(&#39;#numberInput&#39;).val(&#39;&#39;); // Clear the form alert(&#39;Ajax Request Successful&#39;); }, error: function() { alert(&#39;An error occurred during the Ajax request&#39;); } }); } else { alert(&#39;Please enter a valid number&#39;); } }); }); Note: In a real-world scenario, where the form contains multiple fields and potential interdependencies, the code structure may need to be adapted to handle such complexity. AngularJS In line with the trend of SPA, Google released a front-end JavaScript framework called AngularJS in 2010. AngularJS, with its more declarative and expressive syntax, pioneered a new development paradigm, significantly improving the efficiency of SPA development. AngularJS introduced two magical features: Two-Way Data Binding: AngularJS seamlessly connected models and views. Any changes to the model automatically reflected in the view, and vice versa. Directives: AngularJS extended HTML syntax, allowing the creation of custom and reusable components. These directives enhanced the structure and functionality of applications. Below is a snippet of HTML code that can be directly executed. 
Users do not need to write any JavaScript code; they can achieve interaction between DOM elements through directives: &lt;!DOCTYPE html&gt; &lt;html&gt; &lt;script src=&quot;https://ajax.googleapis.com/ajax/libs/angularjs/1.6.9/angular.min.js&quot;&gt;&lt;/script&gt; &lt;body&gt; &lt;!-- The ng-app directive marks this DOM element as the root of the SPA application --&gt; &lt;form ng-app=&quot;&quot; name=&quot;myForm&quot;&gt; &lt;!-- The ng-model directive defines a two-way-bound field and can specify validation constraints at the same time --&gt; &lt;p&gt;Number : &lt;input type=&quot;number&quot; min=&quot;0&quot; max=&quot;99&quot; ng-model=&quot;numberInput&quot;&gt;&lt;/p&gt; &lt;!-- The ng-show directive watches the form&#39;s validation state and displays errors dynamically --&gt; &lt;span ng-show=&quot;!myForm.$valid&quot;&gt;Invalid number!&lt;/span&gt; &lt;/form&gt; &lt;/body&gt; &lt;/html&gt; Compared to jQuery, AngularJS provides a more concise implementation of DOM binding, and the interaction between elements is exceptionally smooth, significantly reducing repetitive template code. More importantly, AngularJS transforms traditional procedural programming into directive-based declarative programming, pioneering a new paradigm in front-end development. React: Efficiency First Recently, I joined a digital finance company, responsible for managing historical data and gradually transitioning to pure backend development. During a certain project, the company aimed to put a massive volume of dormant transaction details back into use while meeting the needs of manual queries and compliance checks. I proposed an implementation based on HBase, requiring only 6TB of space to store 10 years of data. Due to various reasons, big data operations and frontend resources were not adequately in place, so I had to handle the entire process myself. During this period, I began to explore the React ecosystem and developed a backend management system using AntDesign. React emerged to address the development challenges of increasingly complex user-interaction scenarios, such as data visualization and online collaboration. 
Unlike traditional form applications, these scenarios face two new problems: Complex State Management: State changes are no longer a simple linear process; there is a need to manage dependencies and propagation between multiple change events. Low Efficiency in DOM Node Rendering: A single change event may trigger multiple DOM redraws, and frequent changes can consume a significant amount of browser resources. To address these issues, Facebook introduced the React framework in 2013, characterized by the following features: Unidirectional Data Flow: Follows a unidirectional data flow, ensuring that data changes are handled predictably, avoiding the additional complexity introduced by bidirectional binding. Declarative Syntax: Provides a declarative syntax based on JSX templates. Developers only need to describe the desired result, and React efficiently updates the DOM. Virtual DOM: The virtual DOM is a lightweight replica of the real DOM. Incremental updates can be achieved by comparing the differences between the two, reducing unnecessary rendering. Below is a snippet of HTML code that can be directly executed. 
The implementation might seem somewhat complex, but the flexibility of the syntax is a level above AngularJS: &lt;html&gt; &lt;body&gt; &lt;div id=&quot;root&quot;&gt;&lt;/div&gt; &lt;script src=&quot;https://unpkg.com/babel-standalone@6.26.0/babel.min.js&quot;&gt;&lt;/script&gt; &lt;script src=&quot;https://unpkg.com/react@17.0.2/umd/react.development.js&quot;&gt;&lt;/script&gt; &lt;script src=&quot;https://unpkg.com/react-dom@17.0.2/umd/react-dom.development.js&quot;&gt;&lt;/script&gt; &lt;script type=&quot;text/babel&quot;&gt; // Define a functional component const MyForm = () =&gt; { // Declare state for input validation const [invalid, setInvalid] = React.useState(false) // Validation function triggered on input change; flags non-numeric or out-of-range values const validateInput = (e) =&gt; { const raw = e.target.value const num = Number(raw) setInvalid(raw !== &#39;&#39; &amp;&amp; (Number.isNaN(num) || num &lt; 0 || num &gt; 99)) } // JSX template for the form return ( &lt;form id=&quot;myForm&quot;&gt; &lt;p&gt;Number :&lt;input type=&quot;text&quot; onChange={validateInput}/&gt;&lt;/p&gt; {invalid &amp;&amp; &lt;span&gt;Invalid number!&lt;/span&gt;} &lt;/form&gt; ) } // Render the JSX component into the root div ReactDOM.render(&lt;MyForm/&gt;, document.getElementById(&#39;root&#39;)) &lt;/script&gt; &lt;/body&gt; &lt;/html&gt; Nextjs: Renaissance As time passes, I increasingly feel the importance of keeping records, leading to the idea of building a personal blog. The blog needs to meet the following requirements: Articles are written in Markdown, making it easy to switch front-end technology stacks. The site builds to static resources, which benefits SEO and reduces server costs. After exploring a few solutions, I found Next.js to be a good choice. After some tinkering, this article finally appears before you. 
SSR While SPA can provide a decent user experience, this design pattern itself has some inherent drawbacks: Slow Initial Loading: Since all client-side logic is bundled into a bundle.js file, the initial load consumes a significant amount of time downloading and executing JS code, resulting in a prolonged blank screen. Not SEO-Friendly: All page content is dynamically rendered by the browser, which is not friendly to search engines that only parse static HTML. This affects website search rankings, hindering business promotion. To address this issue, a batch of front-end solutions supporting Server-Side Rendering (SSR) emerged: Pre-rendering operations are performed on the server-side, reducing the rendering time required on the client-side. Unnecessary logic in the bundle.js file is compressed, improving page loading speed. Files returned to the client-side include complete HTML pages, making it easy for search engines to crawl and index content, beneficial for improving website search rankings. SSG Similar to SSR, there is also a feature called Static Site Generation (SSG). SSG generates a set of static HTML, CSS, and JavaScript files during the webpage building process. When a request is received, these files can be directly returned to the user without the need to render the page. In comparison to SSR, SSG has two additional advantages: No need to render pages, reducing server load and providing faster loading times—a cost-effective and efficient solution. Static files can be easily distributed and cached on a CDN, ensuring the site&#39;s availability worldwide and offering faster access for global users. Nextjs In the React ecosystem, the most popular SSR solution is Next.js. This framework simplifies the process of building server-rendered React applications and provides a set of conventions and tools to help developers focus on building application functionality rather than dealing with configuration. 
Next.js supports both SSR and Static Site Generation (SSG) modes, with the main difference being the deployment environment: If a project requires SSR features, it must be deployed in a Node.js environment. If the project only includes SSG features, it can be deployed on any static server, such as Nginx. In SSR mode, you need to implement the getServerSideProps function, which is called every time a user makes a request. import type { InferGetServerSidePropsType, GetServerSideProps } from &#39;next&#39; type Repo = { name: string stargazers_count: number } // Server-side function to fetch the star count of the next.js GitHub repository export const getServerSideProps = (async (context) =&gt; { const res = await fetch(&#39;https://api.github.com/repos/vercel/next.js&#39;) const repo = await res.json() return { props: { repo } } }) satisfies GetServerSideProps&lt;{ repo: Repo }&gt; // Render the page based on the star count of the repository and return it to the client export default function Page({ repo, }: InferGetServerSidePropsType&lt;typeof getServerSideProps&gt;) { return repo.stargazers_count } In the SSG mode, you need to implement the getStaticProps function, which is called only once during the page build process. // Fetch the blog post list export async function getStaticProps() { const res = await fetch(&#39;https://.../posts&#39;) const posts = await res.json() return { props: { posts, }, } } // Generate a static page based on the blog post list export default function Blog({ posts }) { return ( &lt;ul&gt; {posts.map((post) =&gt; ( &lt;li&gt;{post.title}&lt;/li&gt; ))} &lt;/ul&gt; ) } Additionally, Next.js provides the following important features: Code Splitting: It splits the globally unified entry file bundle.js into multiple chunk.js files. Each page only loads the JS code it requires. Incremental Static Regeneration: Allows individual static pages to be regenerated after the website is built. 
This enables updating specific static pages without rebuilding the entire website. Image Loading Optimization: The Image component offers various commonly used features, such as automatically generating multiple thumbnail sizes and preventing layout shifts during loading. The Way To The Future In the process of writing this article, I experimented with generating sample code using ChatGPT, and the results were surprisingly good: the framework code produced by ChatGPT only required minor modifications to run. The AI wave that began earlier this year has had a significant impact on the gaming industry and is gradually spreading to other sectors. The recently released GPT-4 has the capability to convert images into HTML pages. This signifies that low-level programming could be replaced by AI in the not-too-distant future. Many freelancers have already discovered numerous benefits from AI: Independent game developers using AI to replace graphic designers. Entrepreneurs using AI to create project proposals. Content creators using AI to generate articles. As developers, it is essential to acknowledge the impact of AI and explore how to leverage it to improve development efficiency. "},{"slug":"redis-eviction","title":"Redis Cache Eviction Mechanism","tags":["Redis","SystemDesign"],"content":"This article analyzes the cache eviction mechanism of Redis from the source code level and describes the implementation approach using Java at the end of the article for reference. Relevant Configurations To adapt to caching scenarios, Redis supports cache eviction and provides corresponding configurations: maxmemory Sets the upper limit of memory usage, which cannot be set to a capacity less than 1M. The default value of this option is 0, which means no explicit memory limit on 64-bit systems (32-bit builds fall back to an implicit 3GB limit). 
maxmemory-policy Each database in Redis maintains two dictionaries: db.dict: All key-value pairs in the database, also known as the keyspace of the database db.expires: Keys with a lifecycle and their corresponding TTL (time to live), thus also known as the expire set When the maximum memory usage maxmemory is reached, the available strategies for cleaning the cache are: noeviction: Returns an error when the maximum memory is reached, without evicting any data. allkeys-lfu: Evicts the least frequently used (LFU) keys in the entire keyspace (version 4.0 or higher). allkeys-lru: Evicts the least recently used (LRU) keys in the entire keyspace. allkeys-random: Evicts random keys in the entire keyspace. volatile-ttl: Evicts the key with the shortest TTL in the expire set. volatile-lfu: Evicts the least frequently used keys in the expire set (version 4.0 or higher). volatile-lru: Evicts the least recently used (LRU) keys in the expire set. volatile-random: Evicts random keys in the expire set. When the expire set is empty, the volatile-* policies behave the same as noeviction. maxmemory-samples To ensure performance, Redis uses approximate implementations of the LRU and LFU algorithms. When it needs to evict a record, it does not traverse all records but selects a subset through random sampling. The maxmemory-samples option controls the number of samples in this process. Increasing this value raises CPU overhead but brings the algorithm&#39;s behavior closer to exact LRU and LFU. lazyfree-lazy-eviction Cache cleanup requires system calls to free memory, which block the main thread. Deleting a gigantic record (e.g. a list containing millions of entries) can cause performance issues or even freeze the system. The lazy freeing mechanism delegates the release procedure to background threads, thereby improving system performance. Enabling this option may result in exceeding the memory usage limit of maxmemory. 
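For reference, the options above might be combined in a redis.conf fragment like the following (the values are illustrative, not recommendations):

```
# Cap memory usage at 100 MB (0 = no explicit limit on 64-bit builds)
maxmemory 100mb
# Approximate LFU over the whole keyspace (requires Redis 4.0 or higher)
maxmemory-policy allkeys-lfu
# Sample 10 keys per eviction round (default is 5); higher = closer to exact LFU
maxmemory-samples 10
# Hand evicted values to a background thread instead of blocking the main thread
lazyfree-lazy-eviction yes
```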
Cache Eviction Mechanism A complete cache eviction mechanism needs to address two issues: Determining which records to evict — Eviction Strategy Deleting the evicted records — Deletion Strategy Eviction Strategy The memory available for caching is limited. When space is insufficient, data that will not be accessed again should be evicted. Therefore, eviction algorithms are designed around the principle of temporal locality: if data is being accessed, it is likely to be accessed again in the near future. To adapt to the read-heavy character of caching scenarios, hash tables are commonly used to implement caches. When implementing specific cache eviction policies, additional bookkeeping structures need to be introduced. Let&#39;s review the three most common cache eviction strategies. FIFO (First In, First Out) The data that entered the cache earlier is more likely not to be accessed again. Therefore, we should first evict the cached record that has been in memory the longest. This strategy can be implemented using a queue: flowchart TB latest --&gt;|cache| FIFO subgraph FIFO[&quot;Queue&quot;] direction LR e1[&quot;Entry&quot;] e2[&quot;Entry&quot;] e3[&quot;Entry&quot;] oldest[&quot;oldest&quot;] end oldest --x|evict| X(((&quot; &quot;))) Pros: Simple implementation, suitable for linear access scenarios. Cons: Cannot adapt to specific access hotspots, poor cache hit rate. Bookkeeping Overhead: Time O(1), Space O(N) LRU (Least Recently Used) After a cached record is accessed, it is highly likely to be accessed again in the near future. So we can keep the latest access time for each record, and the data with the oldest access time should be evicted first. 
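This timestamp-based variant can be sketched in Java. The sketch below is purely illustrative (the class name `TimestampLru` is made up, and it is not how Redis implements LRU): each entry carries its last-access time, and eviction scans for the stalest entry.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative timestamp-based LRU cache (not Redis's implementation):
// O(1) bookkeeping per access, O(N) scan to find the eviction victim.
class TimestampLru<K, V> {
    private record Entry<T>(T value, long lastAccess) {}

    private final Map<K, Entry<V>> map = new HashMap<>();
    private final int capacity;
    private long clock = 0; // logical clock: avoids a system call per access

    TimestampLru(int capacity) { this.capacity = capacity; }

    V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null) return null;
        map.put(key, new Entry<>(e.value(), ++clock)); // refresh access time
        return e.value();
    }

    void put(K key, V value) {
        if (!map.containsKey(key) && map.size() >= capacity) {
            K oldest = null;
            long oldestTime = Long.MAX_VALUE;
            for (Map.Entry<K, Entry<V>> en : map.entrySet()) {
                if (en.getValue().lastAccess() < oldestTime) { // stalest entry wins
                    oldestTime = en.getValue().lastAccess();
                    oldest = en.getKey();
                }
            }
            map.remove(oldest); // evict the record with the oldest access time
        }
        map.put(key, new Entry<>(value, ++clock));
    }

    boolean contains(K key) { return map.containsKey(key); }
}
```

For example, with capacity 2, after put(a), put(b), get(a), a subsequent put(c) evicts b, because a's access time was refreshed by the get.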
An alternative implementation of LRU keeps the access order, usually as a linked list: flowchart LR latest --&gt;|cache| e1 subgraph LRU[&quot;Linked List&quot;] direction LR e1[&quot;Entry&quot;] e2[&quot;Entry&quot;] e3[&quot;Entry&quot;] e1 --&gt; e2 e2 --&gt; e3 end e3 ---x |evict| oldest When a record is accessed, it is moved to the head of the linked list by adjusting pointers: %%{init: {&quot;flowchart&quot;: {&quot;curve&quot;: &quot;linear&quot; }} }%% flowchart LR subgraph LRU[&quot;Linked List&quot;] direction LR A --&gt; B B --&gt; C C --&gt; D D --&gt; E end subgraph X[&quot; &quot;] direction TB latest --&gt; A oldest --&gt; E end style X fill:none,stroke:none block-beta columns 5 space:2 down&lt;[&quot;hit E again&quot;]&gt;(down) space:2 %%{init: {&quot;flowchart&quot;: {&quot;curve&quot;: &quot;linear&quot; }} }%% flowchart LR subgraph LRU[&quot;Linked List&quot;] direction LR A --&gt; B B --&gt; C C --&gt; D D -.-x E E --&gt; A end subgraph X[&quot; &quot;] direction TB latest --&gt; E oldest --&gt; D end style X fill:none,stroke:none Pros: High cache hit rate, suitable for scenarios where access has locality characteristics. Cons: Higher bookkeeping overhead, with extra pointer updates on every access. Bookkeeping Overhead: Time O(1), Space O(N) LRU Improvement The original LRU algorithm caches any data accessed even once recently, thus failing to distinguish well between cold data and hot data. This means that some cold data may also enter the cache, pushing out the hot data. To reduce the impact of sporadic accesses, a later improvement, the LRU-K algorithm, makes the following enhancements: Adding a history queue on the basis of LRU bookkeeping. When the access count is less than K, the data is recorded in the history queue. When the access count is greater than or equal to K, the record is moved from the history queue into the LRU cache. 
When the history queue is full, a FIFO or LRU strategy can be used for eviction. The larger the K value, the higher the cache hit rate, but the worse the adaptability: a large number of accesses is required to eliminate expired hot records. Taking various factors into account, the commonly used algorithm in practice is LRU-2: flowchart LR latest --&gt;|enqueue| FIFO oldest --x|dequeue| #(((&quot; &quot;))) subgraph FIFO[&quot;History Queue&quot;] direction TB A[&quot;&amp;nbsp;&amp;nbsp;A&amp;nbsp;&amp;nbsp;&quot;] B[&quot;&amp;nbsp;&amp;nbsp;B&amp;nbsp;&amp;nbsp;&quot;] C[&quot;&amp;nbsp;&amp;nbsp;C&amp;nbsp;&amp;nbsp;&quot;] D[&quot;&amp;nbsp;&amp;nbsp;D&amp;nbsp;&amp;nbsp;&quot;] oldest[&quot;oldest&quot;] end subgraph LRU[&quot;LRU Cache&quot;] direction LR X[&quot;&amp;nbsp;&amp;nbsp;X&amp;nbsp;&amp;nbsp;&quot;] Y[&quot;&amp;nbsp;&amp;nbsp;Y&amp;nbsp;&amp;nbsp;&quot;] Z[&quot;&amp;nbsp;&amp;nbsp;Z&amp;nbsp;&amp;nbsp;&quot;] X --&gt; Y Y --&gt; Z C --&gt;|hit C again| X end Pros: Reduces the impact of sporadic accesses on cache hit rate. Cons: Requires additional bookkeeping overhead. Bookkeeping Overhead: Time $O(1)$, Space $O(N+M)$ LFU (Least Frequently Used) The more frequently data has been accessed recently, the more likely it is to be accessed again. We can record the access frequency of each cache record over a recent period of time, and data with low access frequency will be evicted first. 
A simple way to implement LFU is to set an access counter for each record and put them into a min-heap: flowchart TB subgraph Heap[&quot;&amp;nbsp; Min-Heap&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;] direction TB A[&quot;&amp;nbsp;&amp;nbsp;2&amp;nbsp;&amp;nbsp;&quot;] B[&quot;&amp;nbsp;&amp;nbsp;3&amp;nbsp;&amp;nbsp;&quot;] C[&quot;&amp;nbsp;&amp;nbsp;5&amp;nbsp;&amp;nbsp;&quot;] D[&quot;&amp;nbsp;&amp;nbsp;3&amp;nbsp;&amp;nbsp;&quot;] E[&quot;&amp;nbsp;&amp;nbsp;3&amp;nbsp;&amp;nbsp;&quot;] A --- B A --- C B --- D B --- E end X[&quot;&amp;nbsp;&amp;nbsp;1&amp;nbsp;&amp;nbsp;&quot;] X --&gt;|replace top| A To ensure adaptability, the counter needs to decay over time so that expired hot data can be eliminated promptly: block-beta columns 8 block:1:2 columns 4 space:2 A1[&quot;A:3&quot;] space space:4 space B1[&quot;B:4&quot;] space C1[&quot;C:6&quot;] space:4 D1[&quot;D:5&quot;] space E1[&quot;E:5&quot;] space end decay&lt;[&quot;decay twice&quot;]&gt;(right) block:2:2 columns 4 space:2 A2[&quot;A:1&quot;] space space:4 space B2[&quot;B:2&quot;] space C2[&quot;C:4&quot;] space:4 D2[&quot;D:3&quot;] space E2[&quot;E:3&quot;] space end hit&lt;[&quot;hit A twice&quot;]&gt;(right) block:3:2 columns 4 space:2 A3[&quot;B:2&quot;] space space:4 space B3[&quot;E:3&quot;] space C3[&quot;C:4&quot;] space:4 D3[&quot;D:3&quot;] space E3[&quot;A:3&quot;] space end A1 --- B1 A1 --- C1 B1 --- D1 B1 --- E1 A2 --- B2 A2 --- C2 B2 --- D2 B2 --- E2 A3 --- B3 A3 --- C3 B3 --- D3 B3 --- E3 Deletion Strategy Common deletion strategies can be divided into the following types: Instant Deletion Every time a new record is added, immediately search for and eliminate expired records. Pros: Most memory-saving. Cons: Freeing memory affects write efficiency. 
Lazy Deletion: Two counters are set in the cache: one counts the number of cache accesses, and the other counts the number of eliminable records. After every N accesses, or when the current number of eliminable records exceeds M, a batch deletion is triggered (M and N can be adjusted). Pros: Minimal impact on normal cache operations, batch deletion reduces maintenance overhead. Cons: Causes memory space waste, and occasional deletion operations may cause fluctuations in access latency. Asynchronous Deletion: Set up an independent timer thread to trigger batch deletion at fixed intervals. Pros: Transparent to normal cache operations, no additional performance overhead. Cons: Requires a dedicated maintenance thread, and thread-safety issues must be considered. Implementation in Redis Redis implements two elimination strategies: LRU and LFU. To save space, Redis does not use the bookkeeping structures described above to implement LRU or LFU. Instead, it uses a 24-bit space in robj to record access information: #define LRU_BITS 24 typedef struct redisObject { ... unsigned lru:LRU_BITS; /* LRU time (relative to the global lru_clock time) or * LFU data (8 bits record access frequency, 16 bits record access time). */ } robj; Redis updates robj.lru when a record is hit: robj *lookupKey(redisDb *db, robj *key, int flags) { // ... // Depending on maxmemory_policy, choose a different update strategy if (server.maxmemory_policy &amp; MAXMEMORY_FLAG_LFU) { updateLFU(val); } else { val-&gt;lru = LRU_CLOCK(); } } The key to updating LFU and LRU lies in the updateLFU function and the LRU_CLOCK macro. Updating LRU Time When using the LRU algorithm, robj.lru records the timestamp of the last access, which can be used to identify records that have not been accessed for a long time. 
To reduce system calls, Redis sets a global clock server.lruclock and updates it by a background task: #define LRU_CLOCK_MAX ((1&lt;&lt;LRU_BITS)-1) /* Max value of obj-&gt;lru */ #define LRU_CLOCK_RESOLUTION 1000 /* Clock precision in milliseconds */ /** * The update frequency of server.lruclock is 1000/server.hz * If this frequency is higher than the clock precision of LRU, use server.lruclock directly * to avoid calling getLRUClock() incurring additional overhead */ #define LRU_CLOCK() ((1000/server.hz &lt;= LRU_CLOCK_RESOLUTION) ? server.lruclock : getLRUClock()) unsigned int getLRUClock(void) { return (mstime()/LRU_CLOCK_RESOLUTION) &amp; LRU_CLOCK_MAX; } The calculation of LRU time is as follows: unsigned long long estimateObjectIdleTime(robj *o) { unsigned long long lruclock = LRU_CLOCK(); if (lruclock &gt;= o-&gt;lru) { return (lruclock - o-&gt;lru) * LRU_CLOCK_RESOLUTION; } else { // Handle the case of LRU time overflow return (lruclock + (LRU_CLOCK_MAX - o-&gt;lru)) * LRU_CLOCK_RESOLUTION; } } When LRU_CLOCK_RESOLUTION is 1000ms, the maximum LRU duration that robj.lru can record is 194 days (0xFFFFFF / 3600 / 24). Updating LFU Counter When using the LFU algorithm, robj.lru is divided into two parts: the first 16 bits record the last access time, and the last 8 bits are used as a counter. 
void updateLFU(robj *val) { unsigned long counter = LFUDecrAndReturn(val); // Counter decay counter = LFULogIncr(counter); // Counter increment val-&gt;lru = (LFUGetTimeInMinutes()&lt;&lt;8) | counter; // Update time } Updating Access Time The first 16 bits are used to save the last access time: /** * Get the UNIX minute timestamp, keeping only the lowest 16 bits * Used to represent the last decrement time (LDT) */ unsigned long LFUGetTimeInMinutes(void) { return (server.unixtime/60) &amp; 65535; } Incrementing Access Counter The last 8 bits are a logarithmic counter, which stores the logarithm of the access frequency: #define LFU_INIT_VAL 5 // Logarithmic increment counter, with a maximum value of 255 uint8_t LFULogIncr(uint8_t counter) { if (counter == 255) return 255; double r = (double)rand()/RAND_MAX; double baseval = counter - LFU_INIT_VAL; if (baseval &lt; 0) baseval = 0; double p = 1.0/(baseval*server.lfu_log_factor+1); if (r &lt; p) counter++; return counter; } When server.lfu_log_factor = 10, the growth function of p = 1/((counter-LFU_INIT_VAL)*server.lfu_log_factor+1) is as follows: --- config: themeVariables: xyChart: plotColorPalette: &quot;#555&quot; --- xychart-beta title &quot;1/((x-5)*10+1)&quot; x-axis 4 --&gt; 20 y-axis 0 --&gt; 1 line [0, 1, 0.09091, 0.04762, 0.03226, 0.02439, 0.01961, 0.01639, 0.01408, 0.01235, 0.01099, 0.0099, 0.00901, 0.00826, 0.00763, 0.00709] The random floating-point number r generated by rand() follows a uniform distribution between 0 and 1. As the counter increases, the probability of successful self-increment decreases rapidly. 
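As a rough check on this behaviour, the counter can be simulated outside Redis. The Java sketch below mirrors the constants of the C code above; the class name `LfuCounterSim` and the fixed random seed are assumptions of this sketch, not part of Redis:

```java
import java.util.Random;

// Simulation of Redis's probabilistic logarithmic counter (mirrors LFULogIncr).
class LfuCounterSim {
    static final int LFU_INIT_VAL = 5;
    static final Random RNG = new Random(42); // fixed seed, arbitrary choice

    static int logIncr(int counter, int lfuLogFactor) {
        if (counter == 255) return 255;           // counter saturates at 8 bits
        double r = RNG.nextDouble();              // uniform in [0, 1)
        double baseval = Math.max(0, counter - LFU_INIT_VAL);
        double p = 1.0 / (baseval * lfuLogFactor + 1);
        if (r < p) counter++;                     // increment with probability p
        return counter;
    }

    // Counter value after `hits` accesses, starting from LFU_INIT_VAL
    static int counterAfter(long hits, int lfuLogFactor) {
        int counter = LFU_INIT_VAL;
        for (long i = 0; i < hits; i++) counter = logIncr(counter, lfuLogFactor);
        return counter;
    }
}
```

With lfu_log_factor = 10, counterAfter(1000, 10) typically lands somewhere around 18, while a million hits are enough to saturate the counter at 255, consistent with the saturation table that follows.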
The following table shows the number of hits required for the counter to saturate (255) under different lfu_log_factor settings: +--------+------------+------------+------------+------------+------------+ | factor | 100 hits | 1000 hits | 100K hits | 1M hits | 10M hits | +--------+------------+------------+------------+------------+------------+ | 0 | 104 | 255 | 255 | 255 | 255 | +--------+------------+------------+------------+------------+------------+ | 1 | 18 | 49 | 255 | 255 | 255 | +--------+------------+------------+------------+------------+------------+ | 10 | 10 | 18 | 142 | 255 | 255 | +--------+------------+------------+------------+------------+------------+ | 100 | 8 | 11 | 49 | 143 | 255 | +--------+------------+------------+------------+------------+------------+ Decaying Access Count Similarly, to ensure that expired hot data can be eliminated in time, Redis uses the following decay function: // Calculate the time elapsed since the last decay, in minutes public long LFUTimeElapsed(long ldt) { long now = LFUGetTimeInMinutes(); if (now &gt;= ldt) return now - ldt; return 65535 - ldt + now; } /** * Decay function, returns the LFU count after decay based on the LDT timestamp * Does not update the counter */ public long LFUDecrAndReturn(CacheEntry o) { long ldt = o.ttl() &gt;&gt;&gt; 8; long counter = o.ttl() &amp; 255; /** * Decay factor server.lfu_decay_time controls the decay rate of the counter * The access count decreases by 1 every server.lfu_decay_time minutes * Default value is 1 */ long num_periods = server.lfu_decay_time != 0 ? LFUTimeElapsed(ldt) / server.lfu_decay_time : 0; if (num_periods != 0) { counter = (num_periods &gt; counter) ? 0 : counter - num_periods; } return counter; } With 16 bits, the maximum number of minutes that can be stored is about 45 days, so the LDT timestamp resets every 45 days. Perform Eviction Whenever a client executes a command that generates new data, Redis checks whether the memory usage exceeds maxmemory. 
If it does, it tries to evict data according to maxmemory_policy: // The main method for Redis to handle commands. Before executing the command, various checks are performed, including handling OOM situations: public int processCommand(Client c) { // ... // When maxmemory is set, try to free memory (evict) if necessary if (server.maxmemory != 0 &amp;&amp; !server.lua_timedout) { boolean out_of_memory = (performEvictions() == EVICT_FAIL); // ... // If freeing memory fails and the command to be executed does not allow OOM (usually write commands) if (out_of_memory &amp;&amp; reject_cmd_on_oom) { rejectCommand(c, shared.oomerr); // Return OOM to the client return C_OK; } } } The actual deletion is performed by the performEvictions function: public int performEvictions() { // Loop to try to free up enough memory while (mem_freed &lt; mem_tofree) { // ... if ((server.maxmemory_policy &amp; (MAXMEMORY_FLAG_LRU | MAXMEMORY_FLAG_LFU)) != 0 || server.maxmemory_policy == MAXMEMORY_VOLATILE_TTL) { /** * Redis uses approximate LRU/LFU algorithms for eviction * Instead of traversing all records when evicting objects, samples of records are taken * EvictionPoolLRU is used to temporarily store sample data that should be evicted first */ EvictionPoolEntry[] pool = EvictionPoolLRU; // Get a bestkey that can be released according to the configured maxmemory-policy while (bestkey == null) { long total_keys = 0; long keys; // Traverse all DB instances for (i = 0; i &lt; server.dbnum; i++) { db = server.db + i; dict = (server.maxmemory_policy &amp; MAXMEMORY_FLAG_ALLKEYS) != 0 ? 
db.dict : db.expires; // Select the sampled set (keyspace or expire set) according to the policy if ((keys = dictSize(dict)) != 0) { // Sample and populate the pool evictionPoolPopulate(i, dict, db.dict, pool); total_keys += keys; } } // Traverse the records in the pool and free up memory for (k = EVPOOL_SIZE - 1; k &gt;= 0; k--) { if (pool[k].key == null) continue; bestdbid = pool[k].dbid; if ((server.maxmemory_policy &amp; MAXMEMORY_FLAG_ALLKEYS) != 0) { de = dictFind(server.db[pool[k].dbid].dict, pool[k].key); } else { de = dictFind(server.db[pool[k].dbid].expires, pool[k].key); } // Remove the record from the pool if (!pool[k].key.equals(pool[k].cached)) { sdsfree(pool[k].key); } pool[k].key = null; pool[k].idle = 0; if (de != null) { // Extract the key of the record bestkey = dictGetKey(de); break; } else { /* Ghost... Iterate again. */ } } } } // If bestkey is finally selected if (bestkey != null) { // If lazyfree-lazy-eviction is configured, try asynchronous deletion if (server.lazyfree_lazy_eviction) { dbAsyncDelete(db, keyobj); } else { dbSyncDelete(db, keyobj); } // ... } else { goto cant_free; /* nothing to free... 
*/ } } } The evictionPoolPopulate function is responsible for sampling: #define EVPOOL_SIZE 16 #define EVPOOL_CACHED_SDS_SIZE 255 struct evictionPoolEntry { unsigned long long idle; /* LRU idle time / LFU inverse frequency (prioritize records with larger values) */ sds key; /* Key involved in eviction selection */ sds cached; /* Cached key name */ int dbid; /* Database ID */ }; // The EvictionPoolLRU array assists eviction operations static struct evictionPoolEntry *EvictionPoolLRU; /** * Perform sampling in the given sampledict set * and record the records that should be evicted in evictionPool */ void evictionPoolPopulate(int dbid, dict *sampledict, dict *keydict, struct evictionPoolEntry *pool) { int j, k, count; dictEntry *samples[server.maxmemory_samples]; // Get maxmemory_samples randomly from sampledict count = dictGetSomeKeys(sampledict, samples, server.maxmemory_samples); // Traverse sample data for (j = 0; j &lt; count; j++) { // Calculate the idle time of the sample based on maxmemory_policy if (server.maxmemory_policy &amp; MAXMEMORY_FLAG_LRU) { idle = estimateObjectIdleTime(o); } else if (server.maxmemory_policy &amp; MAXMEMORY_FLAG_LFU) { idle = 255 - LFUDecrAndReturn(o); } else { // ... 
} k = 0; // Locate the index of the sample in evictionPool based on idle (samples are sorted in ascending order of idle) while (k &lt; EVPOOL_SIZE &amp;&amp; pool[k].key &amp;&amp; pool[k].idle &lt; idle) k++; if (k == 0 &amp;&amp; pool[EVPOOL_SIZE - 1].key != NULL) { // The sample idle time is not long enough, it does not participate in this round of eviction continue; } else if (k &lt; EVPOOL_SIZE &amp;&amp; pool[k].key == NULL) { // The corresponding position for the sample is empty, it can be inserted directly into that position } else { // The corresponding position for the sample is occupied, move other elements to make room for it } // Insert the sample data into its corresponding position k int klen = sdslen(key); if (klen &gt; EVPOOL_CACHED_SDS_SIZE) { pool[k].key = sdsdup(key); } else { // If the key length does not exceed EVPOOL_CACHED_SDS_SIZE, reuse the sds object } pool[k].idle = idle; pool[k].dbid = dbid; } } Java Implementation After understanding the above concepts, let&#39;s attempt to implement a thread-safe eviction strategy in Java. 
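Before diving in, the pool-insertion logic above can be condensed into a runnable sketch: a fixed-size array kept sorted by ascending idle time, which rejects samples that are not idle enough when full, and otherwise discards its least-idle entry to make room (class and method names are illustrative, not Redis's):

```java
public class EvictionPool {
    static final int POOL_SIZE = 16; // mirrors EVPOOL_SIZE
    final String[] keys = new String[POOL_SIZE];
    final long[] idle = new long[POOL_SIZE];

    // Insert a sampled key, keeping entries sorted by ascending idle time.
    // Returns false when the pool is full and the sample is not idle enough.
    boolean offer(String key, long keyIdle) {
        int k = 0;
        while (k < POOL_SIZE && keys[k] != null && idle[k] < keyIdle) k++;
        if (k == 0 && keys[POOL_SIZE - 1] != null) return false;
        if (k < POOL_SIZE && keys[k] == null) {
            // Empty slot: insert directly
        } else if (keys[POOL_SIZE - 1] == null) {
            // Slot occupied but free space at the tail: shift right to make room
            System.arraycopy(keys, k, keys, k + 1, POOL_SIZE - k - 1);
            System.arraycopy(idle, k, idle, k + 1, POOL_SIZE - k - 1);
        } else {
            // Pool full: drop the least-idle entry by shifting left, insert at k - 1
            k--;
            System.arraycopy(keys, 1, keys, 0, k);
            System.arraycopy(idle, 1, idle, 0, k);
        }
        keys[k] = key;
        idle[k] = keyIdle;
        return true;
    }

    // Best eviction candidate: the highest-idle entry (scanned from the tail, as Redis does)
    String best() {
        for (int k = POOL_SIZE - 1; k >= 0; k--) if (keys[k] != null) return keys[k];
        return null;
    }
}
```

The Java implementation that follows replaces this manually sorted array with a bounded priority queue, which achieves the same selection without the explicit shifting.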
Designing the Bookkeeping Structure In a multi-threaded safe cache, it&#39;s crucial to minimize bookkeeping: On one hand, to avoid the overhead of maintaining additional state On the other hand, to reduce boundary cases where the system might end up inconsistent Thus, we can use a counter similar to Redis to track access patterns: /** * Cache Entry */ public abstract class CacheEntry { // CAS Updater private static final AtomicLongFieldUpdater&lt;CacheEntry&gt; TTL_UPDATER = AtomicLongFieldUpdater.newUpdater(CacheEntry.class, &quot;ttl&quot;); // Remaining survival time of cache records (unsigned long integer) private volatile long ttl; protected CacheEntry(long ttl) { this.ttl = ttl; } public long ttl() { return ttl; } // Support concurrent update of TTL public boolean casTTL(long old, long ttl) { return TTL_UPDATER.compareAndSet(this, old, ttl); } } /** * Eviction strategy */ public interface EvictStrategy { // Update the TTL of the cached record void updateTTL(CacheEntry node); // Calculate the TTL of the cached record based on the current timestamp long weightTTL(CacheEntry node, long now); } Determining Eviction Strategies Constrained by this minimal bookkeeping structure, Redis&#39;s immediate-eviction strategy can only avoid large-scale traversal and keep the main thread responsive by sampling. 
On the other hand, in situations where memory restrictions are not so strict, the lazy deletion strategy can be used to reduce the overhead of a single request: public abstract class EvictableCache { EvictStrategy evicting; // Eviction strategy /** * Updates the TTL of the cache entry when reading and writing cache records * @param entry The recently accessed cache entry */ void accessEntry(CacheEntry entry) { evicting.updateTTL(entry); } /** * Bulk eviction of caches * @param evictSamples Cache samples * @param evictNum Maximum number of evictions * @return Records that should be evicted */ Collection&lt;CacheEntry&gt; evictEntries(Iterable&lt;CacheEntry&gt; evictSamples, int evictNum) { // Compare the TTLs of two CacheEntries (evict records with smaller TTLs first) Comparator&lt;CacheEntry&gt; comparator = new Comparator&lt;CacheEntry&gt;() { final long now = System.currentTimeMillis(); public int compare(CacheEntry o1, CacheEntry o2) { long w1 = evicting.weightTTL(o1, now); long w2 = evicting.weightTTL(o2, now); return -Long.compareUnsigned(w1, w2); } }; // Use a max-heap to record the K CacheEntries with the smallest TTL PriorityQueue&lt;CacheEntry&gt; evictPool = new PriorityQueue&lt;&gt;(evictNum, comparator); Iterator&lt;CacheEntry&gt; iterator = evictSamples.iterator(); while (iterator.hasNext()) { CacheEntry entry = iterator.next(); if (evictPool.size() &lt; evictNum) { evictPool.add(entry); } else { // If the TTL of CacheEntry is smaller than the top record // Pop the top record and put the record with smaller TTL into the heap CacheEntry top = evictPool.peek(); if (comparator.compare(entry, top) &gt; 0) { evictPool.poll(); evictPool.add(entry); } } } return evictPool; } } Implementing Eviction Strategies FIFO Strategy /** * FIFO strategy */ public class FirstInFirstOut implements EvictStrategy { // Counter, incremented by 1 for each access operation private final AtomicLong counter = new AtomicLong(0); // Update TTL only on first access public void 
updateTTL(CacheEntry node) { node.casTTL(0, counter.incrementAndGet()); } // Returns the first access sequence number public long weightTTL(CacheEntry node, long now) { return node.ttl(); } } LRU Strategy /** * LRU-2 strategy */ public class LeastRecentlyUsed implements EvictStrategy { // Logical clock, incremented by 1 for each access operation private final AtomicLong clock = new AtomicLong(0); /** * Updates LRU time */ public void updateTTL(CacheEntry node) { long old = node.ttl(); long tick = clock.incrementAndGet(); long flag = old == 0 ? Long.MIN_VALUE: 0; // flag = Long.MIN_VALUE means put into History Queue // flag = 0 means put into LRU Cache long ttl = (tick &amp; Long.MAX_VALUE) | flag; while ((old &amp; Long.MAX_VALUE) &lt; tick &amp;&amp; ! node.casTTL(old, ttl)) { old = node.ttl(); ttl = tick &amp; Long.MAX_VALUE; // CAS failed indicates a second access } } /** * Calculates TTL based on LRU time */ public long weightTTL(CacheEntry node, long now) { long ttl = node.ttl(); return -1L - ttl; } } LFU Strategy /** * LFU-AgeDecay Strategy */ public class LeastFrequentlyUsed implements EvictStrategy { private static final int TIMESTAMP_BITS = 40; // 40 bits for recording access timestamps (to ensure no overflow for 34 years) private static final int FREQUENCY_BITS = 24; // 24 bits as a logarithmic counter (overflow of the counter can be ignored) private final long ERA = System.currentTimeMillis(); // Starting time (records the timestamp relative to this value) private final double LOG_FACTOR = 1; // Logarithmic factor private final TimeUnit DECAY_UNIT = TimeUnit.MILLISECONDS; // Unit of the stored timestamps (elapsed milliseconds are converted to minutes when decaying) /** * Update LFU counter and access time * Unlike Redis, the counter is not decayed during update */ public void updateTTL(CacheEntry node) { final long now = System.currentTimeMillis(); long old = node.ttl(); long timestamp = old &gt;&gt;&gt; FREQUENCY_BITS; long frequency = old &amp; (~0L &gt;&gt;&gt; TIMESTAMP_BITS); // Calculate access time long elapsed = Math.min(~0L 
&gt;&gt;&gt; FREQUENCY_BITS, now - ERA); while (timestamp &lt; elapsed) { // Increase access counter double rand = ThreadLocalRandom.current().nextDouble(); if (1./(frequency * LOG_FACTOR + 1) &gt; rand) { frequency++; frequency &amp;= (~0L &gt;&gt;&gt; TIMESTAMP_BITS); } // Update TTL long ttl = elapsed &lt;&lt; FREQUENCY_BITS | frequency &amp; (~0L &gt;&gt;&gt; TIMESTAMP_BITS); if (node.casTTL(old, ttl)) { break; } old = node.ttl(); timestamp = old &gt;&gt;&gt; FREQUENCY_BITS; frequency = old &amp; (~0L &gt;&gt;&gt; TIMESTAMP_BITS); } } /** * Return the decayed LFU counter */ public long weightTTL(CacheEntry node, long now) { long ttl = node.ttl(); long timestamp = ttl &gt;&gt;&gt; FREQUENCY_BITS; long frequency = ttl &amp; (~0L &gt;&gt;&gt; TIMESTAMP_BITS); long decay = DECAY_UNIT.toMinutes(Math.max(now - ERA, timestamp) - timestamp); return frequency - decay; } } Redis Statistics With HyperLogLog Statistical features are a ubiquitous requirement in various applications. Consider the following scenario: To determine whether a particular feature should be retained in the next iteration, the product team requires statistics on the number of unique visitors (UV) for a page before and after its release as a decision-making basis. Now we need to choose an appropriate Redis data structure to implement statistical functions. 
Statistics In Redis Aggregated Statistics To accomplish the task of statistics, the most straightforward approach is to use a SET to store the user IDs visiting a page on a specific day, then perform statistical operations such as set difference and intersection to obtain the desired results: # UV on 2020-01-01 SADD page:uv:20200101 &quot;Alice&quot; &quot;Bob&quot; &quot;Tom&quot; &quot;Jerry&quot; # UV on 2020-01-02 SADD page:uv:20200102 &quot;Alice&quot; &quot;Bob&quot; &quot;Jerry&quot; &quot;Nancy&quot; # New users on 2020-01-02 SDIFFSTORE page:new:20200102 page:uv:20200102 page:uv:20200101 # Number of new users on 2020-01-02 SCARD page:new:20200102 # Retained users on 2020-01-02 SINTERSTORE page:rem:20200102 page:uv:20200102 page:uv:20200101 # Number of retained users on 2020-01-02 SCARD page:rem:20200102 Pros Intuitive and easy to understand operations, can reuse existing data sets Retains user visit details for more granular statistics Cons High memory consumption; for example, recording 100 million users, each with an ID length less than 44 bytes (using embstr encoding), would require at least 6GB of memory High computational complexity for SUNION, SINTER, and SDIFF operations, which may cause Redis instances to block under large data volumes; optimization options include: Selecting a slave node from the cluster dedicated to aggregation calculations Reading data into the client and performing aggregation statistics on the client side Binary Statistics When user IDs are consecutive integers, binary statistics can be implemented using BITMAP: # UV on 2020-01-01 SETBIT page:uv:20200101 0 1 # &quot;Alice&quot; SETBIT page:uv:20200101 1 1 # &quot;Bob&quot; SETBIT page:uv:20200101 2 1 # &quot;Tom&quot; SETBIT page:uv:20200101 3 1 # &quot;Jerry&quot; # UV on 2020-01-02 SETBIT page:uv:20200102 0 1 # &quot;Alice&quot; SETBIT page:uv:20200102 1 1 # &quot;Bob&quot; SETBIT page:uv:20200102 3 1 # &quot;Jerry&quot; SETBIT page:uv:20200102 4 1 # &quot;Nancy&quot; # New 
users on 2020-01-02 BITOP NOT page:not:20200101 page:uv:20200101 BITOP AND page:new:20200102 page:uv:20200102 page:not:20200101 # Number of new users on 2020-01-02 BITCOUNT page:new:20200102 # Retained users on 2020-01-02 BITOP AND page:rem:20200102 page:uv:20200102 page:uv:20200101 # Number of retained users on 2020-01-02 BITCOUNT page:rem:20200102 Pros Low memory consumption; only 12MB memory required for recording 100 million users Fast statistics; computers efficiently handle bitwise XOR operations Cons Requirement for specific data types; only handles integer sets Cardinality Statistics Both previous methods provide accurate statistical results but encounter issues as the collection grows: Linear increase in storage memory required as the statistical set grows Increased cost of determining whether a newly added element exists in the set as the set grows Consider the following scenario: Product teams might only care about UV increments, in which case, the ultimate result desired is the number of unique users in the access set, not which users are included in the access set For merely counting the number of unique elements in a set without caring about the content of the set, we term it cardinality counting. For this specific statistical scenario, Redis provides support for cardinality statistics with the HyperLogLog type: # UV on 2020-01-01 PFADD page:uv:20200101 &quot;Alice&quot; &quot;Bob&quot; &quot;Tom&quot; &quot;Jerry&quot; PFCOUNT page:uv:20200101 # UV on 2020-01-02 PFADD page:uv:20200102 &quot;Alice&quot; &quot;Bob&quot; &quot;Tom&quot; &quot;Jerry&quot; &quot;Nancy&quot; PFCOUNT page:uv:20200102 # Total UV on 2020-01-01 and 2020-01-02 PFMERGE page:uv:union page:uv:20200101 page:uv:20200102 PFCOUNT page:uv:union Pros HyperLogLog requires a fixed amount of space for counting cardinality. It only needs 12KB memory to estimate the cardinality of nearly $2^{64}$ elements. Cons HyperLogLog&#39;s statistics are probabilistic, resulting in certain errors. 
It&#39;s not suitable for precise statistical scenarios. HyperLogLog Analysis Probability Estimation HyperLogLog is a probability-based statistical method. How can we understand this? Let&#39;s conduct an experiment: continuously flipping a fair two-sided coin until it lands heads. Representing tails as 0 and heads as 1, the experiment&#39;s results can be represented as a binary string: +-+ 1st flip (heads) |1| +-+ +--+ 2nd flip (heads) |01| +--+ +---+ 3rd flip (heads) |001| +---+ +---------+ kth flip (heads) |000...001| (total of k-1 zeros) +---------+ Since the probability of getting heads each time is $\frac{1}{2}$, the experiment&#39;s probability of ending at the kth flip is $(\frac{1}{2})^k$ (the probability that the first 1 appears in the kth position of the binary string). After conducting n experiments, let&#39;s denote the number of flips in each experiment as $k_1, k_2,\cdots,k_n$, with the maximum value being $k_{max}$. In an ideal scenario, $k_{max} = \log_2{n}$, and conversely, we can estimate the total number of experiments $n = 2^{k_{max}}$ using $k_{max}$. Handling Extreme Cases In practical experiments, extreme cases inevitably arise, such as the very first experiment requiring 10 flips (nine tails followed by a heads). If we estimate using the previous formula, we would erroneously conclude that about 1000 experiments ($2^{10} = 1024$) have been conducted, which clearly doesn&#39;t match reality. To improve estimation accuracy, we can conduct group experiments simultaneously using m coins. Then calculate the average of these m groups&#39; maximum values $\hat{k}_{max} = \frac{1}{m}\sum_{i=1}^{m}{k_{max}^{(i)}}$ (where $k_{max}^{(i)}$ is the maximum of group $i$), which provides a more accurate estimate of the actual experiment count $\hat{n}=2^{\hat{k}_{max}}$. 
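As a toy numeric illustration of this averaging (not Redis's actual estimator, which adds further bias corrections), the grouped estimate can be computed as:

```java
public class GroupEstimate {
    // n ≈ 2^(mean of per-group k_max values)
    static double estimate(int[] kmax) {
        double sum = 0;
        for (int k : kmax) sum += k;
        return Math.pow(2, sum / kmax.length);
    }

    public static void main(String[] args) {
        // A single unlucky run (k = 10) alone suggests 2^10 = 1024 experiments...
        System.out.println(estimate(new int[]{10}));          // 1024.0
        // ...but averaged with three typical groups, the outlier is dampened (2^4.5 ≈ 22.6)
        System.out.println(estimate(new int[]{10, 3, 2, 3}));
    }
}
```

Averaging across groups makes one extreme run pull the estimate up only by its share of the mean, rather than dominating it outright.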
Cardinality Counting Based on the previous analysis, we can summarize the following experience: The maximum position of the first 1 in the binary string, $k_{max}$, can be used to estimate the actual experiment count $n$ HyperLogLog adopts this idea to count the number of distinct elements in a set: Map each element in the set to a fixed-length binary string with a hash function Improve accuracy using group statistics by distributing binary strings into m different buckets: The first $log_2{m}$ bits of the binary string determine the bucket the element belongs to The remaining bits of the binary string identify the position of the first 1, denoted as $k$; each bucket only stores the maximum value $k_{max}$ When estimating the number of elements in the set, use the formula $\\hat{n}=2^{\\hat{k}_{max}}$. Here&#39;s an example: Using a HyperLogLog implementation with an 8-bit output hash function and grouping statistics into 4 buckets, we count the UVs of users Alice, Bob, Tom, Jerry, and Nancy: Binary String Bucket Calculate k | | | V V V +---------+ hash(&quot;Alice&quot;) =&gt; |01|101000| =&gt; bucket=1, k=1 +---------+ Group Statistics k_max +---------+ hash(&quot;Bob&quot;) =&gt; |11|010010| =&gt; bucket=3, k=2 +----------+----------+----------+----------+ +---------+ | bucket_0 | bucket_1 | bucket_2 | bucket_3 | +---------+ ==&gt; +----------+----------+----------+----------+ hash(&quot;Tom&quot;) =&gt; |10|001000| =&gt; bucket=2, k=3 | k_max= 1 | k_max= 2 | k_max= 3 | k_max= 2 | +---------+ +----------+----------+----------+----------+ +---------+ hash(&quot;Jerry&quot;) =&gt; |00|111010| =&gt; bucket=0, k=1 +---------+ +---------+ hash(&quot;Nancy&quot;) =&gt; |01|010001| =&gt; bucket=1, k=2 +---------+ After group counting, we estimate the set&#39;s cardinality using the formula $2^{\\hat{k}_{max}}= 2^{(\\frac{1+2+3+2}{4})} = 4$. 
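The bucket/rank split in this toy 8-bit, 4-bucket setup can be sketched as follows (helper names are mine; Redis's real implementation uses 64-bit hashes and 16384 buckets, as the next section explains):

```java
public class ToyHLL {
    static final int HASH_BITS = 8;
    static final int BUCKET_BITS = 2; // 2^2 = 4 buckets
    static final int REST_BITS = HASH_BITS - BUCKET_BITS;

    // The top BUCKET_BITS of the hash select the bucket
    static int bucket(int hash) {
        return hash >>> REST_BITS;
    }

    // k: position (from the left) of the first 1 in the remaining bits; 0 if all zero
    static int rank(int hash) {
        int rest = hash & ((1 << REST_BITS) - 1);
        for (int k = 1; k <= REST_BITS; k++) {
            if ((rest & (1 << (REST_BITS - k))) != 0) return k;
        }
        return 0;
    }
}
```

For hash("Alice") = 01101000 this yields bucket 1 and k = 1, matching the diagram above.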
Error Analysis In Redis&#39;s implementation, for a given input string, we first obtain a 64-bit hash value: The first 14 bits locate the bucket (16384 buckets in total) The remaining 50 bits represent the binary string corresponding to the element (used to update the maximum value $k_{max}$ of the first appearance of 1) With a 64-bit output hash function, there&#39;s practically no limit on the cardinality of the countable set. The standard error calculation formula for HyperLogLog is $\frac{1.04}{\sqrt{m}}$ ($m$ being the number of buckets). Using this, Redis&#39;s implementation yields a standard error of $0.81\%$. The following graph illustrates the relationship between statistical error and cardinality: The red and green lines represent two different datasets The x-axis represents the actual cardinality of the set The y-axis represents the relative error (percentage) Analyzing this graph leads to several conclusions: Statistical error is independent of the data&#39;s distribution characteristics The smaller the set&#39;s cardinality, the smaller the error (higher precision with small cardinalities) The larger the set&#39;s cardinality, the larger the error (saving resources with large cardinalities) Redis Persistence As an in-memory database, Redis still provides persistence mechanisms for two primary purposes: Safety: Ensuring data is not lost in the event of process crashes. Backup: Facilitating data migration and quick recovery. Redis provides two main persistence mechanisms: RDB Snapshot: Complete state of the database at a certain point in time, storing key-value pairs. AOF Log: Operations that change the state of the database, storing commands. RDB Snapshot There are two ways to generate RDB snapshots: Regularly by the service process. Manually by executing the SAVE or BGSAVE commands. 
Regular Generation Users can control the automatic generation of RDB snapshots by setting save points: save 900 1 # At least 1 key change in the last 15 minutes save 300 10 # At least 10 key changes in the last 5 minutes save 60 10000 # At least 10000 key changes in the last 1 minute struct saveparam { time_t seconds; // Number of seconds int changes; // Number of changes }; struct redisServer { // ... struct saveparam *saveparams; /* RDB save point array */ int saveparamslen; /* Number of save points */ long long dirty; /* Number of changes since the last snapshot */ time_t lastsave; /* UNIX timestamp of the last snapshot taken */ } +---------------+ | redisServer | +---------------+ +---------------+---------------+---------------+ | saveparams | -&gt; | saveparams[0] | saveparams[1] | saveparams[2] | +---------------+ +---------------+---------------+---------------+ | saveparamslen | | seconds | seconds | seconds | | 3 | | 900 | 300 | 60 | +---------------+ +---------------+---------------+---------------+ | dirty | | changes | changes | changes | | 120 | | 1 | 10 | 10000 | +---------------+ +---------------+---------------+---------------+ | lastsave | | 1378270800 | +---------------+ Automatic saving process: Every time a database modification command is executed, the dirty counter records the number of changes caused by that command. Redis&#39;s periodic task serverCron periodically checks if the save point conditions are met. int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) { // ... for (j = 0; j &lt; server.saveparamslen; j++) { struct saveparam *sp = server.saveparams+j; if (server.dirty &gt;= sp-&gt;changes &amp;&amp; // Check if the number of changes is enough server.unixtime-server.lastsave &gt; sp-&gt;seconds) // Check the latest snapshot time { // If the current state meets the savepoint settings, print the log and start executing BGSAVE serverLog(LL_NOTICE,&quot;%d changes in %d seconds. 
Saving...&quot;, sp-&gt;changes, (int)sp-&gt;seconds); // ... // Trigger BGSAVE rdbSaveBackground(server.rdb_filename,rsiptr); break; } } } Manual Backup To avoid performance fluctuations during peak traffic, the automatic snapshot generation is often disabled in production environments. To ensure data safety, operations personnel use scheduled scripts to execute the BGSAVE command to back up Redis data when the system is idle. int rdbSaveBackground(char *filename, rdbSaveInfo *rsi) { // ... if ((childpid = redisFork(CHILD_TYPE_RDB)) == 0) { // Fork child process /* The child process starts generating RDB snapshots... */ int retval = rdbSave(filename,rsi); // ... } else { /* The main process returns directly without blocking */ serverLog(LL_NOTICE,&quot;Background saving started by pid %d&quot;,childpid); updateDictResizePolicy(); // During the snapshot generation, prohibit rehash operations on dict // ... return C_OK; } } The RDB file is generated by the child process. Thanks to the operating system&#39;s copy-on-write optimization, the memory of the parent and child processes is logically independent. Therefore, any modification made by the main process after the fork will not be included in the RDB file, which ensures the consistency of the state recorded in the RDB. RDB Files The RDB snapshot is a binary file: # RDB files for n databases +-------+------------+-------+-----+-------+-----+-----------+ | REDIS | db_version | db[0] | ... | db[n] | EOF | check_sum | +-------+------------+-------+-----+-------+-----+-----------+ # Each database contains arbitrary key-value pairs +-------+ +----------+---+------------+-----+------------+ | db[0] | =&gt; | SELECTDB | 0 | kv_pair[0] | ... 
| kv_pair[n] | +-------+ +----------+---+------------+-----+------------+ # Key-value pairs, with constant TYPE indicating the encoding type of the value +---------+ +------+-----+-------+ | kv_pair | =&gt; | TYPE | key | value | +---------+ +------+-----+-------+ # Key-value pairs with expiration time, with constant EXPIRETIME_MS followed by an 8-byte timestamp +------------------+ +---------------+--------------+------+-----+-------+ | kv_pair_with_ttl | =&gt; | EXPIRETIME_MS | ms_timestamp | TYPE | key | value | +------------------+ +---------------+--------------+------+-----+-------+ The RDB snapshot stores the complete state of the database in compact format, making it suitable for data backup: Convenient for transmission over the network to remote data centers for disaster recovery. Using the RESTORE command to load the RDB snapshot can initialize data or perform emergency rollbacks. AOF Log The process of generating RDB snapshots is time-consuming and cannot be performed frequently with BGSAVE. However, if state changes are not persisted in time, a process crash could result in a loss of significant amounts of unpersisted data. To avoid the overhead of full backups, Redis supports persisting state changes to the AOF log incrementally, reducing pressure on disk I/O. As the AOF log is written by the main thread, the flushing strategy significantly affects Redis&#39;s performance. The following configuration options control this behavior: appendonly no # Whether to enable AOF (disabled by default) # Flushing policy # always: Flush immediately after each change # everysec: Flush once per second # no: Let the OS decide when to flush appendfsync everysec struct redisServer { // ... 
int aof_enabled; /* AOF switch */ int aof_state; /* AOF state (on, off, waiting for rewrite)*/ int aof_fsync; /* fsync policy */ sds aof_buf; /* AOF buffer */ time_t aof_flush_postponed_start; /* AOF delayed flush UNIX timestamp */ } Append Commands Whenever a command is successfully executed, it is written to the AOF cache through the following call chain: processCommand -&gt; call -&gt; propagate -&gt; feedAppendOnlyFile: void feedAppendOnlyFile(struct redisCommand *cmd, int dictid, robj **argv, int argc) { // Append the command to the end of the buffer and write it to the AOF file before returning the result to the client if (server.aof_state == AOF_ON) server.aof_buf = sdscatlen(server.aof_buf,buf,sdslen(buf)); // If a child thread is performing AOF rewrite, it will record the newly added modifications to a new AOF log during this period if (server.aof_child_pid != -1) aofRewriteBufferAppend((unsigned char*)buf,sdslen(buf)); } Log Flushing Before the serverCron event loop ends, flushAppendOnlyFile is called to write the commands in the buffer to the AOF log file: int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) { // ... // AOF delayed flush: Execute fsync once per cron loop if (server.aof_flush_postponed_start) flushAppendOnlyFile(0); } void flushAppendOnlyFile(int force) { ssize_t nwritten; int sync_in_progress = 0; if (sdslen(server.aof_buf) == 0) { // Return directly if the buffer is empty // ... return; } // Write the commands to the AOF file, but not yet flushed nwritten = aofWrite(server.aof_fd,server.aof_buf,sdslen(server.aof_buf)); server.aof_flush_postponed_start = 0; // Write completed, reset delayed flush timestamp to avoid triggering again // ... 
if (server.aof_fsync == AOF_FSYNC_ALWAYS) { // If the flushing policy is always, then fsync immediately redis_fsync(server.aof_fd); server.aof_fsync_offset = server.aof_current_size; server.aof_last_fsync = server.unixtime; } else if ((server.aof_fsync == AOF_FSYNC_EVERYSEC &amp;&amp; server.unixtime &gt; server.aof_last_fsync)) { // If the flushing policy is everysec, fsync is done asynchronously by a background process if (!sync_in_progress) { aof_background_fsync(server.aof_fd); server.aof_fsync_offset = server.aof_current_size; } server.aof_last_fsync = server.unixtime; } } It&#39;s worth noting that if an error occurs during writing to the AOF file and the persistence strategy is always, the Redis process will exit directly. Log Rewriting As commands are continuously received, the AOF file grows larger, leading to the following issues: File systems have limitations on file size, unable to store excessively large files. During failure recovery, executing commands from the AOF log one by one can be very slow if the log file is too large. One significant reason for this problem is the existence of redundant commands: # Executing commands 127.0.0.1:6379&gt; INCR counter (integer) 1 127.0.0.1:6379&gt; INCR counter (integer) 2 127.0.0.1:6379&gt; INCR counter (integer) 3 # Corresponding AOF log *2\\r\\n$6\\r\\nSELECT\\r\\n$1\\r\\n0\\r\\n *2\\r\\n$4\\r\\nINCR\\r\\n$7\\r\\ncounter\\r\\n *2\\r\\n$4\\r\\nINCR\\r\\n$7\\r\\ncounter\\r\\n *2\\r\\n$4\\r\\nINCR\\r\\n$7\\r\\ncounter\\r\\n Redis provides a rewrite mechanism which significantly reduces unnecessary redundant commands: # Rewrite the log and output it to a new file 127.0.0.1:6379&gt; BGREWRITEAOF # After rewriting, 3 INCR commands reduce to 1 SET command *2\\r\\n$6\\r\\nSELECT\\r\\n$1\\r\\n0\\r\\n *3\\r\\n$3\\r\\nSET\\r\\n$7\\r\\ncounter\\r\\n$1\\r\\n3 In addition to manually executing the BGREWRITEAOF command, Redis also supports automatic triggering of AOF rewriting. 
The following configuration options can control this behavior: # Rewrite strategy no-appendfsync-on-rewrite no # Disable fsync during AOF rewriting auto-aof-rewrite-percentage 100 # Trigger AOF rewriting when the growth percentage exceeds this value auto-aof-rewrite-min-size 64mb # Trigger AOF rewriting when the log file size exceeds this value struct redisServer { // ... int aof_no_fsync_on_rewrite; /* Disable fsync during AOF rewriting */ int aof_rewrite_perc; /* Growth percentage to trigger AOF rewriting */ off_t aof_rewrite_min_size; /* Minimum size to trigger AOF rewriting */ int aof_rewrite_scheduled; /* Indicates whether a rewrite operation is waiting for BGSAVE to complete */ list *aof_rewrite_buf_blocks; /* AOF rewrite buffer */ } Periodic task serverCron periodically checks whether the conditions for rewriting are met: int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) { /* Delayed rewrite: if a BGREWRITEAOF command arrives while a background save is running, it is delayed until BGSAVE completes to avoid contention for disk I/O resources */ if (!hasActiveChildProcess() &amp;&amp; // No child processes performing background operations, indicating BGSAVE has completed server.aof_rewrite_scheduled) // There is a BGREWRITEAOF command waiting to be executed { rewriteAppendOnlyFileBackground(); } // ... if (server.aof_state == AOF_ON &amp;&amp; server.aof_rewrite_perc &amp;&amp; server.aof_current_size &gt; server.aof_rewrite_min_size) // Check if the log size meets the criteria { // Check if the growth percentage meets the criteria long long base = server.aof_rewrite_base_size ?
server.aof_rewrite_base_size : 1; long long growth = (server.aof_current_size*100/base) - 100; if (growth &gt;= server.aof_rewrite_perc) { // If the current state meets the rewrite conditions, log and start BGREWRITEAOF serverLog(LL_NOTICE,&quot;Starting automatic rewriting of AOF on %lld%% growth&quot;,growth); rewriteAppendOnlyFileBackground(); } } } int rewriteAppendOnlyFileBackground(void) { // ... if ((childpid = redisFork(CHILD_TYPE_AOF)) == 0) { /* Child process responsible for rewriting AOF log */ char tmpfile[256]; if (rewriteAppendOnlyFile(tmpfile) == C_OK) { // ... } } else { /* Main process returns without blocking */ serverLog(LL_NOTICE, &quot;Background append only file rewriting started by pid %d&quot;,childpid); updateDictResizePolicy(); return C_OK; } } During rewriting, the main thread continues to serve normally, and database state changes still occur. However, the AOF rewritten by the child process will not include these changes. Therefore, these new commands are appended to both the AOF buffer server.aof_buf and the rewrite buffer server.aof_rewrite_buf_blocks simultaneously. Once the child process completes the rewrite, the rewrite buffer is appended to the rewritten AOF log. Furthermore, to avoid contention with the rewriting process for disk I/O, you can prevent the main process from calling fsync on the AOF log during rewriting by setting aof_no_fsync_on_rewrite. Comparison RDB Snapshot Pros: Compact file structure, space-saving, easy to transfer, quick recovery. Cons: Snapshot generation overhead is related only to the database size; when the database is large, snapshot generation is time-consuming, and frequent snapshots are not feasible. AOF Log Pros: Records changes with fine granularity, minimal pressure on disk I/O per write, allows frequent persistence, extremely low probability of data loss.
Cons: Slow recovery speed; log recording overhead is related to update frequency, frequent updates can lead to increased disk I/O pressure. "},{"slug":"redis-replication","title":"Redis Replication","tags":["Redis"],"content":"To ensure service availability, Redis provides replication mechanism to maintain consistent data state across multiple processes. Observing Replication with tcpdump Redis supports a master-slave replication architecture, which is simplified to a single SLAVEOF command. Let&#39;s use this command to analyze the replication mechanism of Redis master and slave. Start two services on the local machine using redis-server, and then observe the interaction between master and slave using tcpdump: redis-server --port 6379 --requirepass 123456 # Start master redis-server --port 6380 --masterauth 123456 # Start slave tcpdump -t -i lo0 host localhost and port 6379 | awk -F &#39;]&#39; &#39;{print $1&quot;]&quot;$3}&#39; # Establish a synchronous connection on localhost:6380 to localhost:6379 and enter Full-ReSync phase localhost.59297 &gt; localhost.6379: Flags [S] localhost.6379 &gt; localhost.59297: Flags [S.] localhost.59297 &gt; localhost.6379: Flags [P.] &quot;PING&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;NOAUTH Authentication required.&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;AUTH 123456&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;OK&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF listening-port 6380&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;OK&quot;: localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF capa eof&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;OK&quot;: localhost.59297 &gt; localhost.6379: Flags [P.] &quot;PSYNC ? -1&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;FULLRESYNC 8efb6ca4edf1258c05a5ced43b0c73fe4deb1908 1&quot; localhost.6379 &gt; localhost.59297: Flags [P.] 
[|RESP: localhost.6379 &gt; localhost.59297: Flags [P.] &quot;REDIS0007M-z^Iredis-ver^F3.2.11M-z&quot; [|RESP # After Full-ReSync enter the Propagation phase localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;1&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;1&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;PING&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;15&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;15&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;SELECT&quot; &quot;0&quot; &quot;SET&quot; &quot;KEY&quot; &quot;VALUE&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;85&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;85&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;SET&quot; &quot;KEY2&quot; &quot;VALUE2&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;MSET&quot; &quot;KEY3&quot; &quot;VALUE3&quot; &quot;KEY4&quot; &quot;VALUE4&quot; &quot;KEY5&quot; &quot;VALUE5&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;256&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;256&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;PING&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;270&quot; localhost.59297 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;270&quot; # Execute DEBUG SLEEP 60 on localhost:6380 to simulate network interruption localhost.6379 &gt; localhost.59297: Flags [P.] &quot;PING&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;SET&quot; &quot;KEY6&quot; &quot;VALUE6&quot; localhost.6379 &gt; localhost.59297: Flags [P.] 
&quot;SET&quot; &quot;KEY7&quot; &quot;VALUE7&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;PING&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;MSET&quot; &quot;KEY8&quot; &quot;VALUE8&quot; &quot;KEY9&quot; &quot;VALUE9&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;PING&quot; localhost.6379 &gt; localhost.59297: Flags [P.] &quot;PING&quot; localhost.59297 &gt; localhost.6379: Flags [.] localhost.59297 &gt; localhost.6379: Flags [R.] # After the old connection is disconnected, new connection establised and enter the Partial-ReSync stage. localhost.59313 &gt; localhost.6379: Flags [S] localhost.6379 &gt; localhost.59313: Flags [S.] localhost.59313 &gt; localhost.6379: Flags [P.] &quot;PING&quot; localhost.6379 &gt; localhost.59313: Flags [P.] &quot;NOAUTH Authentication required.&quot; localhost.59313 &gt; localhost.6379: Flags [P.] &quot;AUTH 123456&quot; localhost.6379 &gt; localhost.59313: Flags [P.] &quot;OK&quot; localhost.59313 &gt; localhost.6379: Flags [P.] &quot;REPLCONF listening-port 6380&quot; localhost.6379 &gt; localhost.59313: Flags [P.] &quot;OK&quot; localhost.59313 &gt; localhost.6379: Flags [P.] &quot;REPLCONF capa eof&quot; localhost.6379 &gt; localhost.59313: Flags [P.] &quot;OK&quot; localhost.59313 &gt; localhost.6379: Flags [P.] &quot;PSYNC 8efb6ca4edf1258c05a5ced43b0c73fe4deb1908 271&quot; localhost.6379 &gt; localhost.59313: Flags [P.] &quot;CONTINUE&quot; localhost.6379 &gt; localhost.59313: Flags [P.] &quot;PING&quot; &quot;PING&quot; &quot;SET&quot; &quot;KEY6&quot; &quot;VALUE6&quot; &quot;PING&quot; &quot;SET&quot; &quot;KEY7&quot; &quot;VALUE7&quot; &quot;PING&quot; &quot;MSET&quot; &quot;KEY8&quot; &quot;VALUE8&quot; &quot;KEY9&quot; &quot;VALUE9&quot; &quot;PING&quot; &quot;PING&quot; localhost.59313 &gt; localhost.6379: Flags [P.] &quot;REPLCONF&quot; &quot;ACK&quot; &quot;519&quot; localhost.59313 &gt; localhost.6379: Flags [P.] 
&quot;REPLCONF&quot; &quot;ACK&quot; &quot;519&quot; localhost.6379 &gt; localhost.59313: Flags [P.] &quot;PING&quot; localhost.59313 &gt; localhost.6379: Flags [P.]: &quot;REPLCONF&quot; &quot;ACK&quot; &quot;533&quot; localhost.59313 &gt; localhost.6379: Flags [P.]: &quot;REPLCONF&quot; &quot;ACK&quot; &quot;533&quot; The replication process can be divided into 3 stages: Full-ReSync Command-Propagate Partical-ReSync +----------------------+ +---------------------+ | redisServer (master) | | redisServer (slave) | | localhost:6379 | | localhost:6380 | +----------------------+ +---------------------+ | slaves | | master | +----------------------+ +---------------------+ | | +----------------+ +-------------+ | redisClient[?] | | redisClient | +----------------+ +-------------+ | ^ &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; PING &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; | | Step 1 : Check socket and master status | &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; PONG / NOAUTH &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; | | | &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; AUTH &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; | | Step 2 : Authentication | &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; OK &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; | | | &lt;&lt;&lt;&lt; REPLCONF listening-port [port] &lt;&lt;&lt;&lt;&lt; | | Step 3 : Send slave port Full-ReSync &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; OK &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; | | | &lt;&lt;&lt;&lt;&lt;&lt; REPLCONF capa [eof|psync2] &lt;&lt;&lt;&lt;&lt;&lt;&lt; | | Step 4 : Check command compatibility | &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; OK &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; | | | 
&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; PSYNC ? -1 &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; | | | &gt;&gt;&gt;&gt;&gt;&gt; FULLRESYNC [replid] [offset] &gt;&gt;&gt;&gt;&gt; Step 6 : Execute full sync | V | BGSAVE | V v &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; RDB Snapshot &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; ^ &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; REPLCONF ACK [offset] &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; | &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; COMMAND 1 &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; | &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; COMMAND 2 &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; | &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; REPLCONF ACK [offset+?] &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; Heartbeat check &amp; Command propagation Command-Propagation &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; PING &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; | &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; COMMAND 3 &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; | &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; REPLCONF ACK [offset+?] &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; | &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; REPLCONF ACK [offset+?] 
&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; v &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; PING &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; ^ ========================================= | ====== The Same With Full-ReSync ======== | ========================================= | | Partial-ReSync &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; PSYNC [replid] [offset] &lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt; Partial sync after reconnection | | | &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; CONTINUE &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; | &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; COMMAND N &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; v &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; COMMAND ... &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; PSYNC Command The original Redis command for synchronization was SYNC. Each time a slave reconnected, it would generate, transmit, and load the entire RDB snapshot, which consumed a significant amount of machine resources and network bandwidth. To address this issue, later versions of Redis added the PSYNC command, which supports the following two synchronization modes: Full-ReSync The slave connects to the master for the first time. The state difference between the master and slave is too large. Partial-ReSync A network jitter causes the synchronization connection to be disconnected and reconnected. The sentinel mechanism causes a change in the master node.
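The choice between the two modes can be condensed into a small sketch. This is a simplified, illustrative Python model of the master-side decision (the real logic lives in masterTryPartialResynchronization, examined later); the dictionary keys mirror the redisServer fields.

```python
def psync_response(slave_replid, slave_offset, master):
    # Partial resync requires a matching replication ID...
    if slave_replid != master["replid"]:
        return "FULLRESYNC"
    # ...and an offset that still falls inside the replication backlog
    lo = master["repl_backlog_off"]
    hi = lo + master["repl_backlog_histlen"]
    if slave_offset < lo or slave_offset > hi:
        return "FULLRESYNC"
    return "CONTINUE"

master = {"replid": "8efb6ca4", "repl_backlog_off": 1, "repl_backlog_histlen": 500}
print(psync_response("8efb6ca4", 271, master))  # reconnect inside the backlog -> CONTINUE
print(psync_response("?", -1, master))          # first-time sync (PSYNC ? -1) -> FULLRESYNC
```

This matches the tcpdump capture above: the first connection sends PSYNC ? -1 and receives FULLRESYNC, while the reconnection sends its saved replid and offset 271 and receives CONTINUE.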
Data Structure Let&#39;s take a look at the data structures related to PSYNC in redisServer: struct redisServer { /* * Node ID and replication offset * * If the current node is a master * server.replid is server.runid * * If the current node was originally a master and was converted to a slave node * server.replid and server.master_repl_offset are overwritten with the synchronization information of the new master * * If the current node was originally a slave and was promoted to a master node * server.replid2 and server.second_replid_offset will record the synchronization information when the current node was a slave */ char runid[CONFIG_RUN_ID_SIZE+1]; /* Current node&#39;s runtime ID (changes each time it restarts) */ char replid[CONFIG_RUN_ID_SIZE+1]; /* The runid of the current master node */ char replid2[CONFIG_RUN_ID_SIZE+1]; /* The runid of the master node that the current master node was connected to when it was a slave node */ long long master_repl_offset; /* Replication offset of the current master node */ long long second_replid_offset; /* Replication offset of the current master node when it was a slave node */ /* * Replication backlog buffer * * The master maintains a single global server.repl_backlog, which is shared by all slave nodes * To reduce memory usage, server.repl_backlog is only created on demand when a slave node is present */ char *repl_backlog; /* Replication backlog buffer (circular buffer)*/ long long repl_backlog_size; /* Backlog size */ long long repl_backlog_histlen; /* Backlog data length */ long long repl_backlog_idx; /* Tail of the backlog buffer (writable position)*/ long long repl_backlog_off; /* Synchronization offset (master offset) corresponding to the first byte of the backlog buffer */ } RunID Whether master or slave, each Redis server generates a 40-character hexadecimal string as its runid when it starts up: When a slave requests synchronization for the first time, it saves the server.runid returned by the master to
server.replid When a slave requests synchronization again, it sends the previously saved server.replid to the master: If this ID is not consistent with the master&#39;s current server.runid, a full resynchronization must be performed If this ID is consistent with the master&#39;s current server.runid, a partial synchronization operation can be attempted Replication Offset Both the master and slave maintain a replication offset (in bytes), which can be used to determine whether the state of the master and slave is consistent: The master increases its replication offset by N after transmitting N bytes of data to the slave When the slave receives N bytes of data from the master, it increases its replication offset by N When the master receives the offset in the REPLCONF ACK, it can use it to determine whether any data sent to the slave has been lost and resend the lost data. Backlog Buffer The master maintains a fixed-length backlog queue: The master puts commands into this queue when it transmits them to the slave, so the latest commands are retained in the buffer When a slave issues a synchronization request and the data after the slave&#39;s offset still exists in the backlog buffer, the master will perform partial synchronization Synchronization Process Slave Perspective After receiving the SLAVEOF command, the slave calls replicaofCommand to start master-slave synchronization: void replicaofCommand(client *c) { // ... if (!strcasecmp(c-&gt;argv[1]-&gt;ptr,&quot;no&quot;) &amp;&amp; !strcasecmp(c-&gt;argv[2]-&gt;ptr,&quot;one&quot;)) { if (server.masterhost) { // If the received command is SLAVEOF NO ONE, then disconnect from the master-slave synchronization // ...
} } else { if (c-&gt;flags &amp; CLIENT_SLAVE) { return; // If the client is already a slave node, then reject this command } if (server.masterhost &amp;&amp; !strcasecmp(server.masterhost,c-&gt;argv[1]-&gt;ptr) &amp;&amp; server.masterport == port) { return; // If the master node specified in SLAVEOF has already been connected to, then return directly } // If no master node has been connected yet, establish a TCP connection according to masterhost and masterport // And register the listener function syncWithMaster } } void syncWithMaster(connection *conn) { // Send the PING command to the master node if (server.repl_state == REPL_STATE_CONNECTING) { server.repl_state = REPL_STATE_RECEIVE_PONG; err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,&quot;PING&quot;,NULL); // Send the PING command // ... } // Listen to the master&#39;s response to the PING command if (server.repl_state == REPL_STATE_RECEIVE_PONG) { if (err[0] != &#39;+&#39; &amp;&amp; strncmp(err,&quot;-NOAUTH&quot;,7) != 0 &amp;&amp; strncmp(err,&quot;-NOPERM&quot;,7) != 0 &amp;&amp; strncmp(err,&quot;-ERR operation not permitted&quot;,28) != 0) { goto error; } server.repl_state = REPL_STATE_SEND_AUTH; // Only handle the master&#39;s response values of PONG, NOAUTH, NOPERM } // According to the master&#39;s response value to PING, determine whether authorization is required if (server.repl_state == REPL_STATE_SEND_AUTH) { if (server.masteruser &amp;&amp; server.masterauth) { err = sendSynchronousCommand(SYNC_CMD_WRITE,conn,&quot;AUTH&quot;, server.masteruser,server.masterauth,NULL); // Send the AUTH command // ... 
server.repl_state = REPL_STATE_RECEIVE_AUTH; } else { // If the server.masteruser and server.masterauth authorization information is not set, skip AUTH server.repl_state = REPL_STATE_SEND_PORT; } } // Omit the following steps: // Use the REPLCONF listening-port command to inform the slave&#39;s port to the master // Use the REPLCONF ip-address command to inform the slave&#39;s IP to the master // Use the REPLCONF capa eof / capa psync2 command to inform the slave&#39;s compatibility (supported features) to the master // Start sending PSYNC command if (server.repl_state == REPL_STATE_SEND_PSYNC) { if (slaveTryPartialResynchronization(conn,0) == PSYNC_WRITE_ERROR) { goto write_error; } server.repl_state = REPL_STATE_RECEIVE_PSYNC; return; } // Read the response to the PSYNC command psync_result = slaveTryPartialResynchronization(conn,1); // If the response is CONTINUE, skip full synchronization if (psync_result == PSYNC_CONTINUE) return; // If the return value is PSYNC_FULLRESYNC or PSYNC_NOT_SUPPORTED // Start performing full synchronization, register readSyncBulkPayload to listen for RDB file download if (connSetReadHandler(conn, readSyncBulkPayload) == C_ERR) { // ... goto error; } server.repl_state = REPL_STATE_TRANSFER; // ... } int slaveTryPartialResynchronization(connection *conn, int read_reply) { if (!read_reply) { if (server.cached_master) { // server.cached_master not empty, try to perform partial synchronization psync_replid = server.cached_master-&gt;replid; } else { psync_replid = &quot;?&quot;; // server.cached_master is empty, only full synchronization can be performed } //Initiate PSYNC command reply = sendSynchronousCommand(SYNC_CMD_WRITE,conn,&quot;PSYNC&quot;,psync_replid,psync_offset,NULL); // ... return PSYNC_WAIT_REPLY; } reply = sendSynchronousCommand(SYNC_CMD_READ,conn,NULL); // Read PSYNC response // If the master responds to FULLRESYNC, full synchronization will be performed directly if (!strncmp(reply,&quot;+FULLRESYNC&quot;,11)) { // ... 
return PSYNC_FULLRESYNC; } // If master responds with CONTINUE, try to perform partial synchronization if (!strncmp(reply,&quot;+CONTINUE&quot;,9)) { // ... return PSYNC_CONTINUE; } // master is temporarily unable to process the PSYNC command —&gt; PSYNC_TRY_LATER // master does not support PSYNC command -&gt; PSYNC_NOT_SUPPORTED } Master Perspective After receiving the PSYNC command, the master calls syncCommand to start the synchronization process: void syncCommand(client *c) { // Received the PSYNC command sent by the slave if (!strcasecmp(c-&gt;argv[0]-&gt;ptr,&quot;psync&quot;)) { if (masterTryPartialResynchronization(c) == C_OK) { return; // No need for full synchronization, return directly } } // If the code runs to this point, it means that partial synchronization fails and full synchronization is required // The master will execute the BGSAVE command to generate a snapshot and transmit it to the slave // There are two ways to synchronize the RDB snapshot: // Disk-backed: Generate RDB snapshot files on disk and then transmit them to the slave // Diskless: Write RDB snapshot data directly to the slave socket } int masterTryPartialResynchronization(client *c) { long long psync_offset; // The latest synchronization offset of this slave char *master_replid; // The runid of the master corresponding to the slave&#39;s synchronization offset /* * The following conditions can avoid full synchronization: * 1. The master of the slave&#39;s last synchronization is the current instance (network jitter) * 2. 
The slave and the current node were originally slave nodes of the same master, and the current node&#39;s synchronization offset second_replid_offset is larger (maintenance restart, failover) */ if (strcasecmp(master_replid, server.replid) &amp;&amp; (strcasecmp(master_replid, server.replid2) ||psync_offset &gt; server.second_replid_offset)) { goto need_full_resync; // Does not meet PSYNC conditions, need full synchronization } /* * The following conditions can only perform full synchronization: * 1. The master has not initialized the backlog buffer * 2. The slave&#39;s synchronization offset is behind the backlog buffer */ if (!server.repl_backlog || psync_offset &lt; server.repl_backlog_off || psync_offset &gt; (server.repl_backlog_off + server.repl_backlog_histlen)) { goto need_full_resync; // Perform full synchronization } // If the code runs to this point, it means that partial synchronization can be performed listAddNodeTail(server.slaves,c); // Return different CONTINUE responses according to whether the client is compatible with PSYNC2 if (c-&gt;slave_capa &amp; SLAVE_CAPA_PSYNC2) { buflen = snprintf(buf,sizeof(buf),&quot;+CONTINUE %s\\r\\n&quot;, server.replid); } else { buflen = snprintf(buf,sizeof(buf),&quot;+CONTINUE\\r\\n&quot;); } // The CONTINUE command is followed by the content of server.repl_backlog psync_len = addReplyReplicationBacklog(c,psync_offset); // ... } Heartbeats &amp; Command Propagation Redis executes the timed task replicationCron once a second, which includes heartbeats between the master and slave. 
It can be found that the heartbeat frequencies of the master and slave are inconsistent: void replicationCron(void) { // The slave sends the REPLCONF ACK command to the master regularly if (server.masterhost &amp;&amp; server.master &amp;&amp; !(server.master-&gt;flags &amp; CLIENT_PRE_PSYNC)) { addReplyArrayLen(c,3); addReplyBulkCString(c,&quot;REPLCONF&quot;); addReplyBulkCString(c,&quot;ACK&quot;); addReplyBulkLongLong(c,c-&gt;reploff); } // The master sends the PING command to the slave regularly if ((replication_cron_loops % server.repl_ping_slave_period) == 0 &amp;&amp; listLength(server.slaves)) { robj *ping_argv[1]; ping_argv[0] = createStringObject(&quot;PING&quot;,4); replicationFeedSlaves(server.slaves, server.slaveseldb, ping_argv, 1); decrRefCount(ping_argv[0]); } } When the master calls the call function to execute the command passed by the client, it will propagate the command to the slave and write it to the replication backlog at the same time: void call(client *c, int flags) { // ... if (flags &amp; CMD_CALL_PROPAGATE &amp;&amp; (c-&gt;flags &amp; CLIENT_PREVENT_PROP) != CLIENT_PREVENT_PROP) { // Does the current command need to be propagated? if (propagate_flags != PROPAGATE_NONE &amp;&amp; !(c-&gt;cmd-&gt;flags &amp; CMD_MODULE)) propagate(c-&gt;cmd,c-&gt;db-&gt;id,c-&gt;argv,c-&gt;argc,propagate_flags); } } void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc, int flags) { // ... 
if (flags &amp; PROPAGATE_REPL) replicationFeedSlaves(server.slaves,dbid,argv,argc); } void replicationFeedSlaves(list *slaves, int dictid, robj **argv, int argc) { // If the current node does not have slave nodes or replication backlog, return immediately if (server.repl_backlog == NULL &amp;&amp; listLength(slaves) == 0) return; // Write commands to repl_backlog in batches if (server.repl_backlog) { char aux[LONG_STR_SIZE+3]; // Command buffer for serializing Redis commands /* Write the number of commands in the current batch */ aux[0] = &#39;*&#39;; len = ll2string(aux+1,sizeof(aux)-1,argc); aux[len+1] = &#39;\\r&#39;; aux[len+2] = &#39;\\n&#39;; feedReplicationBacklog(aux,len+3); /* Iterate over the commands and write them to repl_backlog after serialization */ for (j = 0; j &lt; argc; j++) { long objlen = stringObjectLen(argv[j]); aux[0] = &#39;$&#39;; len = ll2string(aux+1,sizeof(aux)-1,objlen); aux[len+1] = &#39;\\r&#39;; aux[len+2] = &#39;\\n&#39;; feedReplicationBacklog(aux,len+3); feedReplicationBacklogWithObject(argv[j]); feedReplicationBacklog(aux+len+1,2); } } // Propagate commands in batches to all clients corresponding to slaves listRewind(slaves,&amp;li); while((ln = listNext(&amp;li))) { client *slave = ln-&gt;value; /* Write the number of commands in the current batch */ addReplyArrayLen(slave,argc); /* Iterate over the commands and propagate them to the slave node */ for (j = 0; j &lt; argc; j++) addReplyBulk(slave,argv[j]); } } Relevant Configuration slave-serve-stale-data How the slave responds to client requests when the master-slave connection is broken or synchronization is not complete: yes: Respond to commands normally, but do not guarantee data quality no: Refuse to respond to commands and return SYNC with master in progress repl-diskless-sync How the master transfers the RDB snapshot to the slave when performing full synchronization: no: Generate RDB snapshot files on disk first and then transfer them (low bandwidth network) yes: Write 
the RDB snapshot directly to the slave&#39;s socket (low speed disk + high bandwidth network) repl-ping-slave-period The interval at which the master sends the PING heartbeat to the slave, the default is 10 seconds. repl-backlog-size The size of the replication backlog buffer, the default value is 1MB. Since all commands will be backlogged here after the master-slave connection is broken, if this value is too small, the PSYNC command will not be able to perform partial synchronization. If the master needs to execute a large number of write commands, or the slave takes a long time to reconnect successfully, you need to estimate it based on the actual situation. min-slaves-to-write &amp; min-slaves-max-lag When the following conditions are not met, the master will refuse to write commands until they are restored: min-slaves-to-write or more slave nodes connected to the current master are healthy At least min-slaves-to-write of the healthy slave nodes have a replication lag of less than min-slaves-max-lag seconds With these two options enabled, write commands are more likely to be replicated to min-slaves-to-write slave nodes, reducing the likelihood of command loss. "},{"slug":"redis-sentinel","title":"Redis Sentinel High Availability","tags":["Redis"],"content":"Redis Sentinel is a distributed monitoring system for Redis that can automatically perform failover when the master node is down and forward request traffic to healthy slave nodes. The Sentinel mechanism is an important part of Redis high availability. 
A highly available Sentinel cluster monitors the health of master-slave replication and performs automatic disaster recovery: flowchart LR subgraph ss[&quot; &quot;] s1([Sentinel]) ---|keepalive| s2([Sentinel]) s2([Sentinel]) -- &quot;keepalive&quot; --- s3([Sentinel]) end slave((Slave)) ---|replicate&lt;br/&gt;&lt;br/&gt;| master((Master)) master &lt;---&gt; client{{Client}} ss -- &quot;monitor&lt;br/&gt;&lt;br/&gt;&quot; --- master ss -- &quot;monitor&lt;br/&gt;&lt;br/&gt;&quot; --- slave ss -- &quot;notify&lt;br/&gt;&lt;br/&gt;&quot; --&gt; client style ss fill:none,stroke-width:2px,stroke-dasharray: 5 5 linkStyle 4,5,6 stroke-dasharray: 5 5 The Sentinel cluster is deployed in a distributed manner, which ensures: No single point of failure in the monitoring system, so the disaster recovery mechanism itself cannot easily fail A master switchover must be agreed upon by multiple Sentinel nodes, avoiding misjudgment To ensure the high availability of Redis services, the sentinel mechanism provides the following functions: Monitoring: Real-time monitoring of the health status of master and slave nodes Notification: Use the event API to immediately inform listeners of abnormal situations in the service instance Automatic Failover: After the master node fails, select a new master from the slaves Service Discovery: The client obtains master instance information through the Sentinel cluster, and can be informed of master changes in a timely manner when automatic failover occurs Configurations Relevant Commands The configuration for establishing a sentinel cluster is relatively simple: sentinel monitor &lt;master-name&gt; &lt;ip&gt; &lt;port&gt; &lt;quorum&gt; Configure the master node to be monitored (slave nodes need not be configured, since sentinel discovers them automatically): master-name is used to distinguish between different master nodes and will be used for master discovery quorum is the minimum number of sentinel nodes required to
initiate failover Before the failover, a leader node needs to be elected to perform the master switch. In order to reach a consensus, a majority (more than half) of the nodes must participate in this process. Assuming that the current Sentinel cluster has a total of m nodes, when quorum is set to n (n &le; m): If n nodes simultaneously judge that the current master is offline, one of the Sentinel nodes will try to initiate a failover The actual execution of the failover requires a leader election, so the failover can only be started when more than m/2 of the Sentinel nodes in the cluster are available In short, quorum only affects the failure detection process and controls the timing of initiating failover, but it cannot determine whether the failover will actually be executed. Therefore, there should be at least 3 sentinel instances. Otherwise, once a Sentinel node fails, even if quorum is set to 1, failover cannot be started. sentinel down-after-milliseconds &lt;master-name&gt; &lt;milliseconds&gt; When a Redis node cannot respond normally for more than this time (does not respond to PING requests or returns an error code), sentinel will consider it offline sentinel parallel-syncs &lt;master-name&gt; &lt;numslaves&gt; After a slave is promoted to the new master, the number of remaining slaves allowed to reconnect to the new master at the same time Reconnecting causes slave nodes to batch-synchronize data from the master, which briefly pauses those slaves. If there are a total of m slave nodes, and parallel-syncs is set to n, failover will re-synchronize the slaves in roughly m/n batches. The smaller the value, the longer the failover takes, but the less impact on clients accessing the slaves.
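The two thresholds described above (quorum for offline detection, majority for leader election) and the parallel-syncs batching can be modeled with a small sketch. This is illustrative Python, not sentinel code; all names are invented for the example.

```python
import math

def can_start_failover(total_sentinels, available_sentinels, quorum, down_votes):
    # quorum only controls when a failover is *triggered*; the leader
    # election that actually executes it still needs a strict majority
    objectively_down = down_votes >= quorum
    majority = total_sentinels // 2 + 1
    return objectively_down and available_sentinels >= majority

def sync_batches(num_slaves, parallel_syncs):
    # slaves reconnect to the new master in ceil(m/n) batches
    return math.ceil(num_slaves / parallel_syncs)

print(can_start_failover(3, 3, 2, 2))  # quorum met, majority available -> True
print(can_start_failover(3, 1, 1, 1))  # quorum met but no majority     -> False
print(sync_batches(4, 2))              # 4 slaves, 2 at a time -> 2 batches
```

The second call shows why at least 3 sentinels are recommended: with only one surviving node, even quorum = 1 cannot start a failover, because the leader election still lacks a majority.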
sentinel failover-timeout &lt;master-name&gt; &lt;milliseconds&gt; Retry interval for failover, with a default value of 3 minutes, which affects: The time interval for Sentinel to retry after initiating a failover (2 * failover-timeout) The time allowed for Sentinel to redirect stale slaves to the new master The time allowed to cancel an ongoing failover process The time failover waits for slaves to finish reconnecting to the new master Configuration Discovery The sentinel configuration does not need to list the other sentinel nodes or the slave nodes. This is because the sentinel mechanism supports configuration discovery: each sentinel node can obtain this information through the monitored master node: Sentinel Discovery flowchart LR subgraph ss[&quot; &quot;] s1([Sentinel]) s2([Sentinel]) s3([Sentinel]) end s1 ---|SUBSCRIBE| topic[&quot;__sentinel__:hello&quot;] s2 ---&gt;|PUBLISH sentinel-ip:port| topic s3 ---|SUBSCRIBE| topic topic -.- master((Master)) style ss fill:none,stroke:none Sentinel nodes communicate with each other through Redis&#39;s publish/subscribe mechanism. After the connection between a sentinel and the master is established: Sentinel publishes a message to the __sentinel__:hello channel on the master, announcing its own IP and port to other nodes. Sentinel subscribes to the __sentinel__:hello channel to observe connection information published by other sentinels. After performing these pub/sub operations on the master, sentinel nodes become aware of each other&#39;s IP addresses and ports. Therefore, scaling the sentinel cluster is straightforward: simply start a new sentinel node and let the automatic discovery mechanism handle the rest.
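The discovery flow above can be sketched with a toy in-memory pub/sub. The `HelloChannel` and `Sentinel` classes are hypothetical stand-ins; real sentinels re-announce themselves on the hello channel every couple of seconds, which is modeled here as one extra announcement round:

```python
class HelloChannel:
    """Toy stand-in for the __sentinel__:hello channel on the master."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, sentinel):
        self.subscribers.append(sentinel)

    def publish(self, message):
        for s in self.subscribers:
            s.on_message(message)

class Sentinel:
    def __init__(self, addr, channel):
        self.addr = addr
        self.peers = set()
        channel.subscribe(self)      # SUBSCRIBE __sentinel__:hello
        channel.publish(self.addr)   # PUBLISH own ip:port

    def on_message(self, addr):
        if addr != self.addr:        # ignore own announcements
            self.peers.add(addr)

channel = HelloChannel()
s1 = Sentinel("10.0.0.1:26379", channel)
s2 = Sentinel("10.0.0.2:26379", channel)
s3 = Sentinel("10.0.0.3:26379", channel)  # scaling = just start a new node

# Periodic re-announcement: one extra round lets late joiners be
# discovered by everyone, and vice versa.
for s in (s1, s2, s3):
    channel.publish(s.addr)

assert s3.peers == {"10.0.0.1:26379", "10.0.0.2:26379"}
assert "10.0.0.3:26379" in s1.peers
```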
Slave Discovery flowchart TB subgraph sd[&quot; &quot;] master --&gt;|SLAVE1,2| s s([Sentinel]) --&gt;|INFO| master((Master)) subgraph ms[&quot; &quot;] direction TB s1((Slave)) s2((Slave)) end master -.- s1 master -.- s2 end style sd fill:none,stroke:none style ms fill:none,stroke:none A sentinel node obtains slave information by sending the INFO command to the master. The sentinel then establishes a connection to each slave and continuously monitors them over these connections. At the same time, the sentinel also retrieves the following information by sending the INFO command to the slaves: run_id: The run ID of the slave. slave_priority: The priority of the slave. slave_repl_offset: The replication offset of the slave. Node Offline A sentinel never forgets nodes it has seen, whether they are sentinels or slaves. When it&#39;s necessary to take a node offline from the cluster, the SENTINEL RESET command is used: When taking a sentinel node offline, first stop that node&#39;s process, then execute SENTINEL RESET * on the remaining sentinel nodes to update the cluster information. When taking a slave node offline, first stop that slave&#39;s process, then execute SENTINEL RESET &lt;master-name&gt; on all sentinel nodes to update the monitoring list. Status Monitoring To ensure the availability of the cluster master, the sentinel cluster periodically sends PING commands to the master and slave nodes: If the response is +PONG, -LOADING, or -MASTERDOWN, the node is considered healthy. If the response is any other value, or there is no response, the node is considered unhealthy. If a node remains unhealthy for more than down-after-milliseconds, it is considered offline. There is also a special case: if a master node identifies itself as a slave in its INFO response, the sentinel will also consider that node offline.
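The health classification above can be sketched as a simplified Python model. The function names are made up; real Sentinel tracks this state per connection:

```python
# PING replies that count as healthy, per the rules above.
HEALTHY = {"+PONG", "-LOADING", "-MASTERDOWN"}

def update_last_ok(last_ok_ms, now_ms, reply):
    """Refresh the last-healthy timestamp when the PING reply is acceptable."""
    return now_ms if reply in HEALTHY else last_ok_ms

def is_down(last_ok_ms, now_ms, down_after_ms):
    """A node is considered offline once it has been unhealthy for longer
    than down-after-milliseconds."""
    return now_ms - last_ok_ms > down_after_ms

last_ok = update_last_ok(0, 1_000, "+PONG")        # healthy reply at t=1s
assert not is_down(last_ok, 20_000, 30_000)        # still within the window
last_ok = update_last_ok(last_ok, 25_000, "-ERR")  # error reply does not refresh
assert is_down(last_ok, 40_000, 30_000)            # 39s without a healthy reply
```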
To reduce false positives, the sentinel cluster divides offline detection into two stages: Subjective Down SDOWN: a single sentinel instance considers the node offline. Objective Down ODOWN: a sentinel sends the SENTINEL is-master-down-by-addr command to the other sentinels and discovers that at least quorum sentinel instances consider the node offline. Only master nodes are marked as ODOWN, which triggers a failover. Slave and sentinel nodes are only marked as SDOWN. Failover The failover process is designed as an asynchronous state machine with the following main steps: void sentinelFailoverStateMachine(sentinelRedisInstance *ri) { serverAssert(ri-&gt;flags &amp; SRI_MASTER); if (!(ri-&gt;flags &amp; SRI_FAILOVER_IN_PROGRESS)) return; switch(ri-&gt;failover_state) { // Elect leader case SENTINEL_FAILOVER_STATE_WAIT_START: sentinelFailoverWaitStart(ri); break; // Select a candidate node from the slaves of the offline master case SENTINEL_FAILOVER_STATE_SELECT_SLAVE: sentinelFailoverSelectSlave(ri); break; // Send SLAVEOF NO ONE command to the selected slave to make it a master case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE: sentinelFailoverSendSlaveOfNoOne(ri); break; // Check if the new master node is ready using the INFO command case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION: sentinelFailoverWaitPromotion(ri); break; // Send SLAVEOF command to the remaining slave nodes to point to the new master case SENTINEL_FAILOVER_STATE_RECONF_SLAVES: sentinelFailoverReconfNextSlave(ri); break; } } Leader Election A failover is triggered when a master node is marked as ODOWN. To ensure eventual convergence to a consistent state, each modification to the master-slave configuration is associated with a globally unique, monotonically increasing version number called the configuration epoch. Changes with smaller epochs are overridden by changes with larger epochs, ensuring distributed consistency under concurrent modifications.
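The epoch rule above (larger epochs override smaller ones) can be sketched as a minimal Python model; the `MasterConfig` class and its fields are hypothetical, not Redis source:

```python
class MasterConfig:
    """Configuration changes carry an epoch; only strictly newer epochs win."""
    def __init__(self):
        self.epoch = 0
        self.addr = "10.0.0.1:6379"

    def apply(self, epoch, addr):
        if epoch > self.epoch:        # stale epochs are simply discarded
            self.epoch, self.addr = epoch, addr
            return True
        return False

cfg = MasterConfig()
assert cfg.apply(2, "10.0.0.2:6379")      # failover decided in epoch 2 wins
assert not cfg.apply(1, "10.0.0.3:6379")  # late change from epoch 1 is ignored
assert cfg.addr == "10.0.0.2:6379"
```

However the messages are reordered or delayed, every node that sees the epoch-2 change ends up with the same configuration, which is the convergence property the text describes.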
To avoid concurrent, conflicting failovers, the sentinel cluster elects, for each epoch, a single leader sentinel that is responsible for implementing the configuration change. block-beta columns 3 block:hint:3 space space space title{{&quot;epoch++&quot;}} space space space end block:epoch:3 space start space space space space space x end block:ss1:1 space s1(&quot;epoch = 1 leader = B&quot;) space end block:ss2:1 space s2(&quot;epoch = 3 leader = A&quot;) space end block:ss3:1 space s3(&quot;epoch = 10 leader = C&quot;) space end s1A([&quot;Sentinel A&quot;]) s2A([&quot;Sentinel A&quot;]) s3A([&quot;Sentinel A&quot;]) s1B([&quot;Sentinel B&quot;]) s2B([&quot;Sentinel B&quot;]) s3B([&quot;Sentinel B&quot;]) s1C([&quot;Sentinel C&quot;]) s2C([&quot;Sentinel C&quot;]) s3C([&quot;Sentinel C&quot;]) start(((&quot; &quot;))) --&gt; x(&quot; &quot;) classDef hidden fill:none,stroke:none classDef leader stroke-width:4px,stroke-dasharray: 10 5 class x,epoch,hint,ss1,ss2,ss3 hidden class s1B,s2A,s3C leader Election is completed through the command SENTINEL IS-MASTER-DOWN-BY-ADDR &lt;ip&gt; &lt;port&gt; &lt;current-epoch&gt; &lt;runid&gt;: char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) { // If the requested epoch for voting is greater than the known one, update the local epoch if (req_epoch &gt; sentinel.current_epoch) { sentinel.current_epoch = req_epoch; sentinelFlushConfig(); sentinelEvent(LL_WARNING,&quot;+new-epoch&quot;,master,&quot;%llu&quot;, (unsigned long long) sentinel.current_epoch); } // If the requested epoch for voting is greater than the current leader&#39;s and does not exceed the current epoch if (master-&gt;leader_epoch &lt; req_epoch &amp;&amp; sentinel.current_epoch &lt;= req_epoch) { // According to the FCFS principle, vote for this sentinel in the current epoch sdsfree(master-&gt;leader); master-&gt;leader = sdsnew(req_runid); master-&gt;leader_epoch = sentinel.current_epoch; sentinelFlushConfig();
sentinelEvent(LL_WARNING,&quot;+vote-for-leader&quot;,master,&quot;%s %llu&quot;, master-&gt;leader, (unsigned long long) master-&gt;leader_epoch); // If this is a voting request from another sentinel, update the failover start time // to avoid unnecessary voting by this instance within the failover timeout if (strcasecmp(master-&gt;leader,sentinel.myid)) master-&gt;failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC; } // Requests with an epoch less than sentinel.current_epoch will be ignored // Update leader information *leader_epoch = master-&gt;leader_epoch; return master-&gt;leader ? sdsnew(master-&gt;leader) : NULL; } This election process is a simplified version of the Raft protocol. Choose Slave To ensure that the new master has the freshest state, the leader will: Exclude all nodes in the subjectively down state (node health). Exclude nodes that have not responded to the INFO command issued by the leader within the last 5 seconds (communication is normal). Exclude nodes that have been disconnected from the original master for longer than down-after-milliseconds * 10 (replica data is reasonably fresh). Finally, the remaining nodes are sorted by slave_priority, slave_repl_offset, and run_id. The node with the highest priority (the lowest slave_priority value), the largest replication offset, and the smallest run ID is selected as the new master.
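Assuming that a lower slave_priority number means a more preferred replica (as with Redis replica-priority), the final sort over the surviving candidates can be sketched as:

```python
def select_new_master(candidates):
    """Pick the promotion candidate after the exclusion filters have run:
    lowest slave_priority value first, then the largest replication
    offset, then the smallest run ID as the tie-breaker."""
    return min(candidates,
               key=lambda s: (s["priority"], -s["offset"], s["run_id"]))

slaves = [
    {"run_id": "b", "priority": 100, "offset": 500},
    {"run_id": "a", "priority": 100, "offset": 800},  # most up to date
    {"run_id": "c", "priority": 1,   "offset": 100},  # operator-preferred
]
# Priority dominates: "c" wins despite its smaller offset.
assert select_new_master(slaves)["run_id"] == "c"
```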
Promote Master First, call sentinelFailoverSendSlaveOfNoOne to promote the candidate node to master: void sentinelFailoverSendSlaveOfNoOne(sentinelRedisInstance *ri) { int retval; // Keep trying until the failover timeout if the candidate node is unavailable if (ri-&gt;promoted_slave-&gt;link-&gt;disconnected) { if (mstime() - ri-&gt;failover_state_change_time &gt; ri-&gt;failover_timeout) { sentinelEvent(LL_WARNING,&quot;-failover-abort-slave-timeout&quot;,ri,&quot;%@&quot;); sentinelAbortFailover(ri); } return; } // Send SLAVEOF NO ONE command and wait for it to become master retval = sentinelSendSlaveOf(ri-&gt;promoted_slave,NULL,0); if (retval != C_OK) return; sentinelEvent(LL_NOTICE, &quot;+failover-state-wait-promotion&quot;, ri-&gt;promoted_slave,&quot;%@&quot;); ri-&gt;failover_state = SENTINEL_FAILOVER_STATE_WAIT_PROMOTION; ri-&gt;failover_state_change_time = mstime(); } Then, call sentinelFailoverReconfNextSlave to make the remaining slaves replicate the new master node: void sentinelFailoverReconfNextSlave(sentinelRedisInstance *master) { // ...
// Batch adjust slave nodes, ensuring that the number of concurrent reconfigurations does not exceed the parallel-syncs configuration di = dictGetIterator(master-&gt;slaves); while(in_progress &lt; master-&gt;parallel_syncs &amp;&amp; (de = dictNext(di)) != NULL) { sentinelRedisInstance *slave = dictGetVal(de); int retval; // Skip nodes that have already been adjusted if (slave-&gt;flags &amp; (SRI_PROMOTED|SRI_RECONF_DONE)) continue; // If a slave fails to complete the configuration change for a long time, it is still considered completed // Sentinels will detect configuration anomalies and fix them in subsequent processes if ((slave-&gt;flags &amp; SRI_RECONF_SENT) &amp;&amp; (mstime() - slave-&gt;slave_reconf_sent_time) &gt; SENTINEL_SLAVE_RECONF_TIMEOUT) { sentinelEvent(LL_NOTICE,&quot;-slave-reconf-sent-timeout&quot;,slave,&quot;%@&quot;); slave-&gt;flags &amp;= ~SRI_RECONF_SENT; slave-&gt;flags |= SRI_RECONF_DONE; } // Skip nodes that have already been sent commands or are offline if (slave-&gt;flags &amp; (SRI_RECONF_SENT|SRI_RECONF_INPROG)) continue; if (slave-&gt;link-&gt;disconnected) continue; // Send SLAVEOF to make it replicate the new master retval = sentinelSendSlaveOf(slave, master-&gt;promoted_slave-&gt;addr-&gt;ip, master-&gt;promoted_slave-&gt;addr-&gt;port); if (retval == C_OK) { slave-&gt;flags |= SRI_RECONF_SENT; slave-&gt;slave_reconf_sent_time = mstime(); sentinelEvent(LL_NOTICE,&quot;+slave-reconf-sent&quot;,slave,&quot;%@&quot;); in_progress++; } } // Check if all slave nodes have completed the configuration change sentinelFailoverDetectEnd(master); } When the offline master comes back online, the sentinel nodes detect that its configuration has become stale and demote it, making it a slave of the new master. This also means that any data on this node that was not synchronized to the new master is permanently lost.
To reduce data loss, you can use the parameters min-replicas-to-write and min-replicas-max-lag to prevent clients from writing data to a master that has lost its slaves. Event API Redis Sentinel provides an event subscription mechanism that allows clients to receive notifications about various events occurring within the Sentinel cluster. These events can be used to monitor the health and status of the cluster, and to trigger actions based on specific events. Events usually consist of the following parts (the part after @ is optional): &lt;instance-type&gt; &lt;name&gt; &lt;ip&gt; &lt;port&gt; @ &lt;master-name&gt; &lt;master-ip&gt; &lt;master-port&gt; Some of the available events are: +switch-master: A new master node is elected. The message payload format is &lt;master-name&gt; &lt;oldip&gt; &lt;oldport&gt; &lt;newip&gt; &lt;newport&gt;. +sdown: A node enters the subjectively down state. This means that the Sentinel node has not received a valid response from the node for a certain period of time. -sdown: A node exits the subjectively down state. This means that the Sentinel node has received a response from the node, indicating that it is back online. +odown: A node enters the objectively down state. This means that enough Sentinel nodes (at least quorum) agree that the node is down. -odown: A node exits the objectively down state. +tilt: The Sentinel enters TILT mode. This protection mode is triggered when Sentinel detects that its own timing is unreliable, for example because the system clock jumped or the event loop stalled. -tilt: The Sentinel exits TILT mode, after its timing has been back to normal for long enough. +reset-master: The monitoring information for a master node is reset. This typically happens after a failover or manual configuration changes. +failover-detected: A failover is detected. This could be initiated by Sentinel or by manually promoting a slave node to master.
+failover-end: The failover process is complete and all slave nodes have been reconfigured to replicate from the new master. +failover-end-for-timeout: The failover terminated due to timeout, meaning that not all slave nodes were reconfigured to the new master within the specified timeout period. To subscribe to all these events, the PSUBSCRIBE * command can be used. JedisSentinelPool To deepen understanding, let&#39;s analyze the source code of JedisSentinelPool in jedis-3.3.0 to observe how the Event API is used. During initialization, JedisSentinelPool calls the initSentinels function to obtain master information: private HostAndPort initSentinels(Set&lt;String&gt; sentinels, final String masterName) { HostAndPort master = null; // Iterating through sentinel information and establishing connections for (String sentinel : sentinels) { final HostAndPort hap = HostAndPort.parseString(sentinel); Jedis jedis = null; try { jedis = new Jedis(hap.getHost(), hap.getPort(), sentinelConnectionTimeout, sentinelSoTimeout); // ... // Sending get-master-addr-by-name command to obtain master node List&lt;String&gt; masterAddr = jedis.sentinelGetMasterAddrByName(masterName); if (masterAddr == null || masterAddr.size() != 2) { log.warn(&quot;Can not get master addr, master name: {}. Sentinel: {}&quot;, masterName, hap); continue; } // Exiting after obtaining master node information master = toHostAndPort(masterAddr); break; } catch (JedisException e) { log.warn( &quot;Cannot get master address from sentinel running @ {}. Reason: {}. Trying next one.&quot;, hap, e); } finally { if (jedis != null) { jedis.close(); } } } if (master == null) { // Unable to obtain master information, an exception will be thrown here // ...
} // Starting listener thread to monitor all sentinels to promptly detect cluster changes for (String sentinel : sentinels) { final HostAndPort hap = HostAndPort.parseString(sentinel); MasterListener masterListener = new MasterListener(masterName, hap.getHost(), hap.getPort()); masterListener.setDaemon(true); masterListeners.add(masterListener); masterListener.start(); } return master; } The MasterListener class listens for master node changes via the Event API and reinitializes the connection pool: class MasterListener extends Thread { protected String masterName; protected String host; protected int port; protected long subscribeRetryWaitTimeMillis = 5000; protected volatile Jedis j; protected AtomicBoolean running = new AtomicBoolean(false); public MasterListener(String masterName, String host, int port) { super(String.format(&quot;MasterListener-%s-[%s:%d]&quot;, masterName, host, port)); this.masterName = masterName; this.host = host; this.port = port; } @Override public void run() { running.set(true); while (running.get()) { try { // Establishing connection with sentinel j = new Jedis(host, port, sentinelConnectionTimeout, sentinelSoTimeout); // ... // Getting master information again List&lt;String&gt; masterAddr = j.sentinelGetMasterAddrByName(masterName); if (masterAddr == null || masterAddr.size() != 2) { log.warn(&quot;Can not get master addr, master name: {}. 
Sentinel: {}:{}.&quot;, masterName, host, port); } else { // Reinitialize connection pool if master changes initPool(toHostAndPort(masterAddr)); } // Listening for +switch-master event to detect master node changes j.subscribe(new JedisPubSub() { @Override public void onMessage(String channel, String message) { // Master node has changed String[] switchMasterMsg = message.split(&quot; &quot;); if (switchMasterMsg.length &gt; 3) { // Only process information related to the current master-name if (masterName.equals(switchMasterMsg[0])) { // Reinitialize connection pool if master changes initPool(toHostAndPort(Arrays.asList(switchMasterMsg[3], switchMasterMsg[4]))); } } else { log.error( &quot;Invalid message received on Sentinel {}:{} on channel +switch-master: {}&quot;, host, port, message); } } }, &quot;+switch-master&quot;); } catch (JedisException e) { if (running.get()) { // Retry after connection loss, wait for 5s log.error(&quot;Lost connection to Sentinel at {}:{}. Sleeping 5000ms and retrying.&quot;, host, port, e); try { Thread.sleep(subscribeRetryWaitTimeMillis); } catch (InterruptedException e1) { log.error(&quot;Sleep interrupted: &quot;, e1); } } else { log.debug(&quot;Unsubscribing from Sentinel at {}:{}&quot;, host, port); } } finally { if (j != null) { j.close(); } } } } public void shutdown() { try { log.debug(&quot;Shutting down listener on {}:{}&quot;, host, port); running.set(false); // This isn&#39;t good, the Jedis object is not thread safe if (j != null) { j.disconnect(); } } catch (Exception e) { log.error(&quot;Caught exception while shutting down: &quot;, e); } } } "},{"slug":"redis-data-structures-and-object-encoding","title":"Redis Data Structures and Object Encoding","tags":["Redis"],"content":"Redis provides out-of-the-box support for commonly used data structures. This article delves into the implementation of these data structures. 
Data Types Redis offers support for various data structures: string: Represents strings (can store strings, integers, bitmaps). list: Represents lists (can be used as arrays, stacks, double-ended queues, blocking queues). hash: Represents hash tables. set: Represents sets. zset: Represents sorted sets. To optimize performance, Redis authors provide different implementations for each data structure to adapt to specific application scenarios. Taking the string as an example, its underlying implementation can be categorized into 3 types: int, embstr, raw: 127.0.0.1:6379&gt; SET counter 1 OK 127.0.0.1:6379&gt; OBJECT ENCODING counter &quot;int&quot; 127.0.0.1:6379&gt; SET name &quot;Tom&quot; OK 127.0.0.1:6379&gt; OBJECT ENCODING name &quot;embstr&quot; 127.0.0.1:6379&gt; SETBIT bits 1 1 (integer) 0 127.0.0.1:6379&gt; OBJECT ENCODING bits &quot;raw&quot; These specific underlying implementations in Redis are referred to as encoding. Let&#39;s check out these encoding implementations one by one. string All keys in Redis are strings, implemented through a data structure called Simple Dynamic Strings (SDS). typedef char *sds; // SDS string pointer, points to sdshdr.buf struct sdshdr? { // SDS header, [?] can be 8, 16, 32, 64 uint?_t len; // Used space, actual length of the string uint?_t alloc; // Allocated space, excluding &#39;\\0&#39; unsigned char flags; // Type tag, indicates the actual types of len and alloc, accessible through sds[-1] char buf[]; // Character array, saves &#39;\\0&#39;-terminated strings, consistent with the representation of strings in traditional C language }; Memory layout: +-------+---------+-----------+-------+ | len | alloc | flags | buf | +-------+---------+-----------+-------+ ^--sds[-1] ^--sds Advantages over traditional C strings: Efficiency: Records used space, achieving $O(1)$ for obtaining string length. Safety: Records free space, avoiding buffer overflow issues. 
Memory-friendly: By recording space information, space can be pre-allocated to reduce memory re-allocation operations. Binary safety: String content can be non-ASCII encoded, allowing any data to be encoded as binary strings. Compatibility with C strings: Some parts of the C standard library code can be reused, avoiding redundant code. list One of the underlying implementations of lists in Redis is a doubly linked list, which supports sequential access and provides efficient element addition and deletion. typedef struct listNode { struct listNode *prev; // Previous node struct listNode *next; // Next node void *value; // Node value } listNode; typedef struct list { listNode *head; // Head node listNode *tail; // Tail node unsigned long len; // Length of the list void *(*dup) (void *ptr); // Node value duplication function void (*free) (void *ptr); // Node value freeing function int (*match) (void *ptr, void *key); // Node value comparison function } list; Function pointers are used here for dynamic binding at runtime. Different dup, free, and match functions are specified according to the value type to achieve polymorphism. This data structure has the following characteristics: Obtaining the length of the list is $O(1)$. Supports both forward and backward traversal, with $O(1)$ access to the head and tail nodes. No sentinel nodes are used; when the list is empty, both head and tail are NULL. Polymorphism is achieved through function pointers, enabling data structure reuse. dict Redis uses a dictionary to store key-value pairs, and one of its underlying implementations is a hash table.
typedef struct dictEntry { void* key; // Key union { // Value, can be a pointer, signed long integer, unsigned long integer, or double-precision floating point void *val; uint64_t u64; int64_t s64; double d; } v; struct dictEntry *next; } dictEntry; typedef struct dictht { dictEntry **table; // Hash table array, each element in the array is a singly linked list unsigned long size; // Size of the hash table array unsigned long sizemask; // Hash mask used for index calculation unsigned long used; // Number of existing nodes } dictht; typedef struct dictType { unsigned int (*hashFunction) (const void *key); // Hash function used for calculating hash values int (*keyCompare)(void *privdata, const void *key1, const void *key2); // Key comparison function void *(*keyDup)(void *privdata, const void *key); // Key duplication function void *(*valDup)(void *privdata, const void *obj); // Value duplication function void (*keyDestructor)(void *privdata, void *key); // Key destruction function void (*valDestructor)(void *privdata, void *obj); // Value destruction function } dictType; typedef struct dict { dictType *type; // Type functions for achieving polymorphism void *privdata; // Private data for achieving polymorphism dictht ht[2]; // Hash table; dict uses ht[0] as the hash table, and ht[1] is used for rehashing int rehashidx; // Rehash index, -1 when no rehash is in progress } dict; This data structure has the following characteristics: Uses murmurhash2 as the hash function, with a time complexity of $O(1)$. Collisions are resolved by chaining, with new elements added to the head of the linked list.
Each rehash operation is completed in 3 steps: Allocate space for dict.ht[1], with its size being a power of 2 ($2^n$) Rehash all key-value pairs from dict.ht[0] to dict.ht[1] Free the space of dict.ht[0] and replace it with dict.ht[1] Details of Rehashing Amortized overhead Step 2 is completed gradually over multiple operations, spreading the work of rehashing key-value pairs evenly across each addition, deletion, lookup, and update on the dictionary. During this process, dict.rehashidx records the index in dict.ht[0].table that has already been rehashed: Each time a rehash step is performed, the dict.rehashidx counter is incremented. When rehashing is completed, dict.rehashidx is set back to -1. Triggering conditions Calculate the current load factor: load_factor = ht[0].used / ht[0].size Shrinking: When load_factor &lt; 0.1, rehashing is executed to reclaim idle space. Expanding: When BGSAVE or BGREWRITEAOF commands are not executing and load_factor &gt;= 1, rehashing is performed. When BGSAVE or BGREWRITEAOF commands are executing and load_factor &gt;= 5, rehashing is performed. Many operating systems adopt copy-on-write: parent and child processes share the same memory pages until one of them writes, at which point the affected page is actually copied to the child process to ensure data isolation. Redis performs the BGSAVE and BGREWRITEAOF commands in a child process. During this time, the server raises the load_factor threshold to avoid unnecessary memory writes (and thus page copies) while the child process exists, saving memory. skiplist Skiplist is an ordered data structure that achieves fast access by maintaining multiple levels of pointers. It is a typical space-for-time trade-off. Its search efficiency is close to that of an AVL or red-black tree, but with lower maintenance cost and a simpler implementation.
typedef struct zskiplistNode { sds ele; // Member object double score; // Score struct zskiplistNode *backward; // Backward pointer struct zskiplistLevel { struct zskiplistNode *forward; // Forward pointer unsigned long span; // Span, the distance between the current node and the forward node } level[]; } zskiplistNode; typedef struct zskiplist { struct zskiplistNode *header, *tail;// Head and tail pointers unsigned long length; // Length int level; // Maximum level } zskiplist; This data structure has the following characteristics: Average search time is $O(\log N)$, worst-case search time is $O(N)$, and it supports range search. Each time a node is created, the program generates a random level between 1 and 32 according to a power law. During a search, the span values visited along the way are accumulated to obtain the rank of the target node in the list. intset An ordered integer set with compact memory usage. The time complexity of adding an element is $O(N)$. typedef struct intset { uint32_t encoding; // Encoding method, indicating the actual type of elements uint32_t length; // Number of elements int8_t contents[]; // Element array; the actual element type can be int16_t, int32_t, or int64_t } intset; This data structure has the following characteristics: Elements in the array are arranged in ascending order, with a binary search time complexity of $O(\log N)$. When a newly added element overflows the current encoding type, the set must be upgraded: Expand the array according to the type of the new element. Convert all existing elements to the new type. Add the new element to the array. ziplist Ziplist, designed to minimize memory footprint, is a sequential data structure stored in a single contiguous memory block. A ziplist can contain any number of entry nodes, each holding a byte array or an integer.
Redis does not explicitly define a ziplist data structure but provides a description structure zlentry for data manipulation. typedef struct zlentry { unsigned int prevrawlensize;// Number of bytes used to record the previous entry length unsigned int prevrawlen; // Length of the previous entry unsigned int lensize; // Number of bytes used to record the type/length of the current entry (variable length: 1 byte/5 bytes) unsigned int len; // Actual number of bytes used to store data unsigned int headersize; // prevrawlensize + lensize unsigned char encoding; // Used to indicate the actual encoding type of the entry data unsigned char *p; // Points to the beginning of the entry } zlentry; The actual memory layout is as follows: +----------+---------+---------+--------+-----+--------+--------+ | zlbytes | zltail | zllen | entry1 | ... | entryN | zlend | +----------+---------+---------+--------+-----+--------+--------+ &lt;--------------------------- zlbytes ---------------------------&gt; ^--zltail &lt;------- zllen -------&gt; zlbytes: Number of bytes occupied by the ziplist (u_int32) zltail: Tail offset, used to locate the tail entry for reverse traversal (u_int32) zllen: The number of entries in the ziplist. When it equals 65535, the actual number must be obtained by traversing the whole list (u_int16) entryX: List entries, with variable length zlend: End-of-list marker, with magic value 0xFF (u_int8) The memory layout of each entry: +-------------------+----------+---------+ | prev_entry_length | encoding | content | +-------------------+----------+---------+ prev_entry_length: Length of the previous node, used to calculate the starting address of the previous node from the address of the current node (variable length: 1 byte/5 bytes). encoding: Type and length of the data saved by the node (variable length: 1 byte/2 bytes/5 bytes).
content: Data associated with the node; it can store an integer or a byte array This data structure has the following characteristics: Compact layout: a single contiguous block of memory with no fragmentation, though updates may trigger memory realloc and copying, with an average time complexity of $O(N)$. Reverse traversal: the list can be traversed from the tail towards the head via zltail and prev_entry_length. Cascading updates: updating one entry may change the number of bytes required for the prev_entry_length and encoding fields of the next entry, causing a chain reaction; the worst-case time complexity for updates is $O(N^2)$. quicklist In earlier versions of Redis, lists had two underlying implementations: ziplist: used when the length or number of elements in the list object was small linkedlist: used when the length or number of elements in the list object was large Each has its pros and cons: ziplist has compact memory and high access efficiency, but updates are slow and may cause a large amount of memory copying. linkedlist modifies nodes efficiently, but requires additional memory overhead and may produce a large amount of memory fragmentation. To combine the advantages of both, Redis changed the underlying implementation of lists to quicklist after version 3.2. Quicklist is a combination of linkedlist and ziplist: it contains multiple nodes with non-contiguous memory, but each node is a ziplist.
typedef struct quicklistNode { struct quicklistNode *prev; // Previous ziplist struct quicklistNode *next; // Next ziplist unsigned char *zl; // Data pointer, pointing to a ziplist structure or a quicklistLZF structure unsigned int sz; // Memory size occupied by the ziplist (uncompressed) unsigned int count : 16; // Number of records in the ziplist unsigned int encoding : 2; // Encoding method, 1 indicates a raw ziplist, 2 indicates LZF-compressed (quicklistLZF) unsigned int container : 2; // Container type, 2 indicates the node holds a ziplist unsigned int recompress : 1; // Temporary decompression, 1 indicates temporary decompression for access unsigned int attempted_compress : 1; // Test field unsigned int extra : 10; // Reserved space } quicklistNode; typedef struct quicklistLZF { unsigned int sz; // Compressed data length char compressed[]; // Compressed data } quicklistLZF; typedef struct quicklist { quicklistNode *head; // List head quicklistNode *tail; // List tail unsigned long count; // Total number of records unsigned long len; // Number of ziplists int fill : 16; // Ziplist length limit, the length (record number/memory occupancy) of each ziplist node cannot exceed this value unsigned int compress : 16; // Compression depth, the number of ziplist nodes at both ends of the quicklist that are left uncompressed; 0 indicates that no ziplist nodes are compressed } quicklist; This data structure has the following characteristics: Combining the advantages of linkedlist and ziplist, there is no need to switch between the two structures. When access frequency to the middle part is very low (e.g. a queue), the data in the middle can be compressed to reduce memory usage. robj To implement dynamic encoding, Redis built an object system: Redis can determine whether a specific command can be performed on an object based on its type. In addition, this system implements memory sharing through reference counting and records the access time of objects, providing a basis for optimizing memory reclamation strategies.
typedef struct redisObject { unsigned type:4; // Type, the logical type of the current object, for example: set unsigned encoding:4; // Encoding, the underlying implementation data structure, for example: intset / ziplist unsigned lru:24; /* LRU time (relative to the global lru_clock time) or * LFU data (8 bits record access frequency, 16 bits record access time). */ int refcount; // Reference count void *ptr; // Data pointer, pointing to specific data structures } robj; This data structure has the following characteristics: Objects of the same type in Redis can use different underlying implementations, optimizing object usage efficiency in different application scenarios. For string objects holding integer values, Redis can reduce memory copying by sharing objects through reference counting. The object system records the access time of objects, allowing the LRU algorithm to preferentially reclaim rarely used objects. Data Encodings string The encoding types for strings can be: int (OBJ_ENCODING_INT): Long integer type raw (OBJ_ENCODING_RAW): SDS string embstr (OBJ_ENCODING_EMBSTR): Embedded string (strings with encoded lengths less than 44 bytes) 127.0.0.1:6379&gt; SET str &quot;1234567890 1234567890 1234567890 1234567890&quot; OK 127.0.0.1:6379&gt; STRLEN str (integer) 43 127.0.0.1:6379&gt; OBJECT ENCODING str &quot;embstr&quot; 127.0.0.1:6379&gt; APPEND str _ (integer) 44 127.0.0.1:6379&gt; OBJECT ENCODING str &quot;raw&quot; The embstr encoding avoids unnecessary memory allocations for short strings. In the words of the Redis author: REDIS_ENCODING_EMBSTR_SIZE_LIMIT set to 44. The new value is the limit for the robj + SDS header + string + null-term to stay inside the 64 bytes Jemalloc arena in 64 bits systems. 
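The 44-byte limit quoted above falls straight out of the arena arithmetic; a minimal sketch of that math (constant names are mine, the sizes follow the quote):

```java
// Why the embstr cutoff is 44 bytes on 64-bit builds: the robj header,
// the sdshdr8 header, the string bytes, and the null terminator must
// all fit inside one 64-byte jemalloc arena.
public class EmbstrLimit {
    static final int ARENA = 64;    // jemalloc size class
    static final int ROBJ = 16;     // redisObject: type+encoding+lru bits, refcount, ptr
    static final int SDSHDR8 = 3;   // len + alloc + flags, one byte each
    static final int NUL = 1;       // trailing '\0'

    static int maxEmbstrLen() {
        return ARENA - ROBJ - SDSHDR8 - NUL;   // 64 - 16 - 3 - 1 = 44
    }
}
```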
The comparison of the memory layouts reveals: embstr is a complete contiguous memory block, requiring only one memory allocation raw uses non-contiguous memory, requiring two memory allocations &lt;------------------------------------------ Jemalloc arena (64 bytes) ----------------------------------------------&gt; +-------------------------------------------------------------------------------+---------------------+--------------+ | redisObject (16 bytes) | sdshdr8 (3 bytes) | 45 bytes | +--------------------+---------------------------------+-------+----------+-----+-----+-------+-------+---------+----+ | type(REDIS_STRING) | encoding(REDIS_ENCODING_EMBSTR) | lru | refcount | ptr | len | alloc | flags | buf | \\0 | +--------------------+---------------------------------+-------+----------+-----+-----+-------+-------+---------+----+ +--------------------+ | redisObject | +--------------------+ | type | | REDIS_STRING | +--------------------+ | encoding | | REDIS_ENCODING_RAW | +--------------------+ +---------+ | ptr | ---&gt; | sdshdr? | +--------------------+ +---------+ | len | +---------+ | alloc | +---------+ | flags | +---------++---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | buf || T | h | e | r | e | | i | s | | n | o | | c | e | r | t | a |...| +---------++---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ list The default encoding type for lists is quicklist (OBJ_ENCODING_QUICKLIST) list-max-ziplist-size: Length limit of the ziplist on each quicklist node list-compress-depth: Number of nodes at the ends of the quicklist that are not compressed hash Hashes have encoding types of ziplist (OBJ_ENCODING_ZIPLIST) and hashtable (OBJ_ENCODING_HT). 
The specific encoding used depends on the following two options: hash-max-ziplist-value: uses ziplist encoding when both key and value lengths are less than this value (default is 64) hash-max-ziplist-entries: uses ziplist encoding when the number of elements in the hash is less than this value (default is 512) When the key or value length exceeds 64: 127.0.0.1:6379&gt; HSET table x &#39;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&#39; (integer) 0 127.0.0.1:6379&gt; OBJECT ENCODING table &quot;ziplist&quot; 127.0.0.1:6379&gt; HSET table x &#39;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&#39; (integer) 0 127.0.0.1:6379&gt; OBJECT ENCODING table &quot;hashtable&quot; 127.0.0.1:6379&gt; DEL table (integer) 1 127.0.0.1:6379&gt; HSET table xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx &#39;x&#39; (integer) 1 127.0.0.1:6379&gt; OBJECT ENCODING table &quot;ziplist&quot; 127.0.0.1:6379&gt; HSET table xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx &#39;x&#39; (integer) 1 127.0.0.1:6379&gt; OBJECT ENCODING table &quot;hashtable&quot; When the number of elements exceeds 512: 127.0.0.1:6379&gt; EVAL &quot;for i=1,512 do 
redis.call(&#39;HSET&#39;, KEYS[1], i, i) end&quot; 1 numbers (nil) 127.0.0.1:6379&gt; HLEN numbers (integer) 512 127.0.0.1:6379&gt; OBJECT ENCODING numbers &quot;ziplist&quot; 127.0.0.1:6379&gt; DEL numbers (integer) 1 127.0.0.1:6379&gt; EVAL &quot;for i=1,513 do redis.call(&#39;HSET&#39;, KEYS[1], i, i) end&quot; 1 numbers (nil) 127.0.0.1:6379&gt; HLEN numbers (integer) 513 127.0.0.1:6379&gt; OBJECT ENCODING numbers &quot;hashtable&quot; set The encoding types for sets are intset (OBJ_ENCODING_INTSET) and hashtable (OBJ_ENCODING_HT). The specific encoding used depends on the following two conditions: intset encoding is considered only when all elements in the set are integers set-max-intset-entries: uses intset encoding when the number of elements is less than this value (default is 512) Cases including non-integer elements: 127.0.0.1:6379&gt; SADD set 1 2 (integer) 2 127.0.0.1:6379&gt; OBJECT ENCODING set &quot;intset&quot; 127.0.0.1:6379&gt; SADD set &quot;ABC&quot; (integer) 1 127.0.0.1:6379&gt; OBJECT ENCODING set &quot;hashtable&quot; Cases where the number of elements exceeds 512: 127.0.0.1:6379&gt; EVAL &quot;for i=1,512 do redis.call(&#39;SADD&#39;, KEYS[1], i, i) end&quot; 1 numbers (nil) 127.0.0.1:6379&gt; SCARD numbers (integer) 512 127.0.0.1:6379&gt; OBJECT ENCODING numbers &quot;intset&quot; 127.0.0.1:6379&gt; DEL numbers (integer) 1 127.0.0.1:6379&gt; EVAL &quot;for i=1,513 do redis.call(&#39;SADD&#39;, KEYS[1], i, i) end&quot; 1 numbers (nil) 127.0.0.1:6379&gt; SCARD numbers (integer) 513 127.0.0.1:6379&gt; OBJECT ENCODING numbers &quot;hashtable&quot; zset The encoding types for sorted sets are ziplist (OBJ_ENCODING_ZIPLIST) and skiplist (OBJ_ENCODING_SKIPLIST). When using ziplist encoding, each set element is saved using two adjacent entry nodes: the first node saves the member value (member), and the second node saves the element&#39;s score value (score). 
Entries are sorted in ascending order based on the score: +----------------------+ | redisObject | +----------------------+ | type | | REDIS_ZSET | +----------------------+ | encoding | | OBJ_ENCODING_ZIPLIST | +----------------------+ +----------+----------+---------+--------------------+-------------------+-----+-----------------------+--------------------+-------+ | ptr | ---&gt; | zlbytes | zltail | zllen | entry 1 (member 1) | entry 2 (score 1) | ... | entry 2N-1 (member N) | entry 2N (score N) | zlend | +----------------------+ +----------+----------+---------+--------------------+-------------------+-----+-----------------------+--------------------+-------+ &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; score increase &gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt;&gt; When implemented using skiplist, a data structure named zset is used: typedef struct zset { dict *dict; // Maintains the mapping of member to score, used to look up the score of a given member zskiplist *zsl; // Saves all set elements sorted by score, supports range operations } zset; // dict and zsl will share members and scores +----------------------+ +--------+ +------------+ +---------+ | redisObject | +--&gt;| dictht | | StringObj | -&gt; | long | +----------------------+ +-------+ | +--------+ +------------+ +---------+ | type | +--&gt;| dict | | | table | --&gt; | StringObj | -&gt; | long | | REDIS_ZSET | | +-------+ | +--------+ +------------+ +---------+ +----------------------+ | | ht[0] | --+ | StringObj | -&gt; | long | | encoding | +--------+ | +-------+ +-----+ +------------+ +---------+ | OBJ_ENCODING_SKIPLIST | | zset | | | L32 | -&gt; NULL +----------------------+ +--------+ | +-----+ | ptr | ---&gt; | dict | --+ | ... 
| +----------------------+ +--------+ +--------+ +-----+ +-----------+ +-----------+ | zsl | ---&gt; | header | --&gt; | L4 | -&gt; | L4 | ------------------&gt; | L4 | -&gt; NULL +--------+ +--------+ +-----+ +-----------+ +-----------+ | tail | | L3 | -&gt; | L3 | ------------------&gt; | L3 | -&gt; NULL +--------+ +-----+ +-----------+ +-----------+ +-----------+ | level | | L2 | -&gt; | L2 | -&gt; | L2 | -&gt; | L2 | -&gt; NULL +--------+ +-----+ +-----------+ +-----------+ +-----------+ | length | | L1 | -&gt; | L1 | -&gt; | L1 | -&gt; | L1 | -&gt; NULL +--------+ +-----+ +-----------+ +-----------+ +-----------+ NULL &lt;- | BW | &lt;- | BW | &lt;- | BW | +-----------+ +-----------+ +-----------+ | StringObj | | StringObj | | StringObj | +-----------+ +-----------+ +-----------+ | long | | long | | long | +-----------+ +-----------+ +-----------+ The specific encoding used for zset depends on the following two options: zset-max-ziplist-value: uses ziplist encoding when every member&#39;s length is less than this value (default is 64) zset-max-ziplist-entries: uses ziplist encoding when the number of elements in the zset is less than this value (default is 128) Overall Structure of Redis Each database is a redisDb structure: typedef struct redisDb { dict *dict; /* Key space of the database */ dict *expires; /* Set of keys with an associated expire */ dict *blocking_keys; /* Keys with clients waiting for data (BLPOP)*/ dict *ready_keys; /* Blocked keys that received a PUSH */ dict *watched_keys; /* Keys that are WATCHED for MULTI/EXEC CAS */ int id; /* Database ID */ long long avg_ttl; /* Average TTL, just for stats */ unsigned long expires_cursor; /* Cursor to iterate over expiring keys */ list *defrag_later; /* List of key names to attempt to defrag one by one in the background. */ } redisDb; All databases in Redis are stored in the redisServer.db array, and redisServer.dbnum stores the number of databases. 
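The split between the `dict` key space and the `expires` dict can be sketched with a toy lookup (hypothetical class, not the Redis source); a key whose TTL has elapsed is treated as missing and deleted lazily on access:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a redisDb lookup (not the Redis source): the key space
// lives in `dict`, while `expires` only holds keys that carry a TTL.
public class RedisDbDemo {
    final Map<String, String> dict = new HashMap<>();   // key space
    final Map<String, Long> expires = new HashMap<>();  // key -> expire time (ms)

    void set(String key, String value, Long expireAtMs) {
        dict.put(key, value);
        if (expireAtMs != null) expires.put(key, expireAtMs);
    }

    String get(String key, long nowMs) {
        Long deadline = expires.get(key);
        if (deadline != null && deadline <= nowMs) {
            dict.remove(key);      // expired: delete lazily on access
            expires.remove(key);
            return null;
        }
        return dict.get(key);
    }
}
```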
A simplified memory layout is approximately as follows: +-------------+ | redisServer | +-------------+ +------------+------+-------------+ | db | -&gt; | redisDb[0] | .... | redisDb[15] | +-------------+ +------------+------+-------------+ | dbnum | | | 16 | | +-------------+ | +---------+ +------------+ +-&gt;| redisDb | +-&gt; | ListObject | +---------+ +------------+ | +------------+ | dict | -&gt; | StringObj | --+ +---------+ +------------+ +------------+ | expires | | StringObj | ----&gt; | HashObject | +---------+ +------------+ +------------+ | | StringObj | --+ | +------------+ | +------------+ | +-&gt; | StringObj | | +------------+ | | +------------+ +-------------+ +----&gt; | StringObj | -&gt; | long | +------------+ +-------------+ | StringObj | -&gt; | long | +------------+ +-------------+ "},{"slug":"consistency-with-raft","title":"Raft Protocol","tags":["SystemDesign","DistributedSystem"],"content":"The Raft protocol is a distributed consensus protocol used to maintain consistency of replicated logs. It ensures consistency in cases of node failures or network partitions through leader election, log replication, and safety mechanisms. Paxos Issues The description of the Paxos algorithm is very academic and lacks many details, making it difficult to apply directly in engineering. Most distributed algorithms used in practical engineering are variants of Paxos, and verifying the correctness of these algorithms is also a difficult problem. For example, the last section of the previous article introduces an engineering model that applies the Paxos algorithm. This model has obvious write performance bottlenecks: Using a multi-master architecture, the probability of write conflicts is high Each update operation requires at least two rounds of network communication, resulting in high communication overhead If you want to improve the performance of this model, you still need to make further adjustments in many details. 
The final algorithm ends up very different from the original version of Paxos. To solve the above problems, another consensus algorithm emerged: Raft, which is both high-performance and easy to understand. To study the algorithm, a fully functional Raft protocol was implemented in Java: rafting. The code is faithful to the original paper, which may help you understand this protocol better. Basic Concepts The Raft algorithm is based on the RSM (Replicated State Machine) model and is essentially an algorithm for managing log replication. A Raft cluster uses a single-leader architecture: a unique Leader process in the cluster is responsible for managing log replication. Its responsibilities include: Accept requests sent by clients Synchronize log records to other processes Inform other processes when they can commit logs Replicated State Machine The essence of the replicated state machine is: Paxos + WAL. Each process maintains a state machine and uses a log to store the instructions it needs to execute. If two state machines execute the same instructions in the same order, then the two processes will eventually converge to the same state. Therefore, if the logs of all processes are guaranteed to be consistent, the state of each process must also be consistent. Term To reduce unnecessary network communication, the order in which logs are appended is determined solely by the unique Leader in the cluster, without negotiating with other nodes. The communication overhead is reduced from a minimum of two rounds to a fixed single round, which greatly improves the performance of the algorithm. For availability reasons, after the current Leader goes offline, the cluster needs to elect a new Leader from the surviving nodes. This process is called leader election. Each election generates a new term number, which increases monotonically. If a new Leader is elected, this term number accompanies that Leader until it goes offline. 
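The replicated-state-machine property described above ("same instructions, same order, same final state") can be sketched with a deliberately trivial state machine (hypothetical names, not code from the rafting repo):

```java
import java.util.List;

// The replicated-state-machine property: any two state machines that
// apply the same log in the same order end in the same state.
public class RsmDemo {
    // a trivial state machine: the state is one integer register
    static int apply(int state, String command) {
        String[] parts = command.split(" ");        // e.g. "ADD 5"
        int arg = Integer.parseInt(parts[1]);
        return parts[0].equals("ADD") ? state + arg : state * arg;
    }

    // replaying the same log is deterministic
    static int replay(List<String> log) {
        int state = 0;
        for (String cmd : log) state = apply(state, cmd);
        return state;
    }
}
```

Consistency of the log is therefore sufficient for consistency of the state, which is why Raft concentrates entirely on replicating the log.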
Each participant process maintains a current_term to represent the latest known term. Processes exchange this value with each other to detect changes in leadership. /** * Basic information */ public abstract class RaftMember implements RaftParticipant { // These two properties need to be persisted before responding to RPC protected final long currentTerm; // The latest known term (initialized as 0) protected final ID lastCandidate; // The candidate who received the last vote protected RaftMember(long term, ID candidate) { this.currentTerm = term; this.lastCandidate = candidate; stableStorage().persist(currentTerm, lastCandidate); } /** * @see RaftParticipant#currentTerm() */ @Override public long currentTerm() { return currentTerm; } /** * @see RaftParticipant#votedFor() */ @Override public ID votedFor() { return lastCandidate; } } Log The log is a core concept of Raft. Raft guarantees that logs are continuous and consistent, and can eventually be committed by all processes in the order of their log indexes. Each log entry contains: Term: the term of the Leader that generated this record Index: its sequence in the log Command: executable state machine instructions Once a command in the log may be safely executed by the state machine, we call this record committed. Raft ensures that committed records are never lost. Roles Each process in the Raft cluster can only assume one of the following roles at a time: Leader: sends heartbeats, manages log replication and commitment Follower: passively responds to requests sent by other nodes Candidate: actively initiates and participates in elections Raft processes communicate with each other using RPC. 
Implementing the most basic consensus algorithm only requires two types of RPC: RequestVote: used to elect a Leader AppendEntries: replicate logs and send heartbeats /** * RPC interface * */ public interface RaftService { /** * Replicate logs + send heartbeats (called by the leader) * @param term leader term * @param leaderId unique identifier of the leader in the cluster * @param prevLogIndex index of the log entry immediately preceding the new one * @param prevLogTerm term of prevLogIndex * @param entries log entries (empty when sending heartbeats) * @param leaderCommit index of the log entry already committed by the leader * @return true if the follower&#39;s log contains a log entry matching prevLogIndex and prevLogTerm * */ Async&lt;RaftResponse&gt; appendEntries( long term, ID leaderId, long prevLogIndex, long prevLogTerm, Entry[] entries, long leaderCommit) throws Exception; /** * Leader election (called by the candidate) * @param term candidate term * @param candidateId unique identifier of the candidate in the cluster * @param lastLogIndex index of the last log entry of the candidate * @param lastLogTerm term of the last log entry of the candidate * @return true when receiving a vote of approval * */ Async&lt;RaftResponse&gt; requestVote( long term, ID candidateId, long lastLogIndex, long lastLogTerm) throws Exception; } Algorithm Process Based on the single-leader model, Raft decomposes the consistency problem into 3 independent subproblems: Leader Election: automatically elect a new leader after the Leader process fails Log Replication: the Leader ensures that the logs of other nodes are consistent with its own Safety: the Leader ensures that the order and content of the commands executed by the state machines are consistent For easier understanding, the following explanation is accompanied by animations. Election A heartbeat timeout mechanism triggers leader election: Nodes start as Followers by default. 
If a Follower does not receive heartbeat information from the Leader within a timeout period, it transitions to Candidate and initiates RequestVote requests to other nodes. Once a Candidate receives votes from a majority, it becomes the Leader and starts sending AppendEntries requests to other nodes to maintain its leadership. Once the Leader stops sending heartbeats, the heartbeat timeout mechanism of the Followers is triggered again, starting a new round of elections. Replication Only the Leader provides services to the outside world. When a client communicates with the Leader, each request contains a command that can be executed by the state machine. Upon receiving a command, the Leader converts it into a corresponding log entry and appends it to its local log. It then calls AppendEntries to replicate this log entry to the logs of other nodes. When the log entry has been replicated to a majority of nodes, the Leader commits the command contained in it for execution by the state machine, and finally informs the client of the execution result. Network Partition The majority mechanism handles network partitions: After a network partition occurs, multiple Leaders may appear simultaneously in the cluster. The replication mechanism ensures that at most one Leader can provide services normally. If logs cannot be replicated to a majority of nodes, the Leader refuses to commit these logs. When the network partition heals, the cluster automatically returns to a consistent state. Safety Guarantee Leader Election Ensure that the new Leader has all committed logs: Each Follower checks the Candidate&#39;s log index when voting and refuses to vote for a Candidate with an incomplete log. If more than half of the Followers vote in favor, the Candidate must contain all potentially committed logs. Log Committing The Leader only actively commits logs generated during its own term. 
If a record was created by the current Leader, then once it has been replicated to a majority of nodes, the Leader can commit this record together with all preceding records. If a record was created by a previous Leader, it can only be committed indirectly: it becomes committed once a record created by the current Leader is committed after it. Summary The essence of consensus algorithms is a trade-off between consistency and availability. Raft&#39;s advantages Simplified log management with the single-leader architecture. All logs flow from the Leader to the other nodes, without negotiation. Other nodes only need to record and apply the log content sent by the Leader, optimizing the original two-phase request into a single RPC call. Raft&#39;s disadvantages High requirements for log continuity. To simplify log management, Raft does not allow gaps in logs, limiting concurrency. In some scenarios, it is necessary to decouple unrelated business with a Multi-Raft mode to increase system concurrency. 
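The commit rules above can be sketched as a single helper (an assumed function, not code from the rafting repo): the leader may advance its commit index to the largest index that is replicated on a majority and whose entry carries the leader's current term; older-term entries then commit along with it.

```java
// Sketch of the Raft commit rule (assumed helper, not from the
// rafting repo): advance commitIndex to the largest N replicated on
// a majority AND whose entry term equals the leader's current term.
public class CommitRule {
    // matchIndex[i] = highest log index known replicated on peer i;
    // entryTerms[k] = term of the entry at log index k+1
    static long newCommitIndex(long[] matchIndex, long[] entryTerms,
                               long currentTerm, long oldCommit) {
        int clusterSize = matchIndex.length + 1;          // peers + the leader
        for (long n = entryTerms.length; n > oldCommit; n--) {
            int replicas = 1;                             // the leader itself
            for (long m : matchIndex) if (m >= n) replicas++;
            boolean majority = replicas * 2 > clusterSize;
            boolean ownTerm = entryTerms[(int) n - 1] == currentTerm;
            if (majority && ownTerm) return n;            // commits n and everything before it
        }
        return oldCommit;                                 // nothing new to commit
    }
}
```

Note how a majority-replicated entry from an older term alone never advances the commit index, which is exactly the "only commit logs of its own term" rule.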
In order to ensure the substitutability of stateful processes, maintaining the consistency of these replicas becomes a crucial issue. Take a distributed coordination service that provides locking as an example. This type of service must meet two requirements: High availability: a failure of this service makes downstream services unavailable Strong consistency: downstream services observing inconsistent states leads to lock failures According to the CAP theorem for distributed databases, it is difficult to achieve Consistency, Availability, and Partition tolerance at the same time: When network communication is normal, a distributed database can guarantee both C and A When a network partition occurs (processes cannot communicate normally), the database must make a trade-off between C and A: AP system: guarantee availability, sacrifice consistency. Each node can still provide external services, but the database as a whole may be in an inconsistent state CP system: guarantee consistency, sacrifice availability. Nodes refuse to provide external services, but the database always maintains a consistent state Within the CAP framework, we seem to be in a dilemma: is a CA system impossible to implement? Before answering this question, it is necessary to point out a problem with the CAP theorem: it only considers 3 factors in its trade-off, ignoring other metrics such as performance, implementation complexity, and the number of machines. For example, a popular view holds that ZooKeeper is a CP system. However, in a ZooKeeper cluster, as long as a majority of the nodes can communicate normally, the cluster can still provide external services. This means ZooKeeper remains both available and consistent in the event of a network partition. To ensure that a system is available and consistent in the event of a network partition, some additional costs need to be paid. 
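The majority mechanism behind the ZooKeeper claim above rests on a simple counting argument, sketched here (hypothetical helper names): any two majorities of the same cluster must share at least one node, so at most one side of a partition can keep making progress.

```java
// Why majority quorums reconcile availability with consistency under
// partition: two majorities of the same cluster always intersect.
public class QuorumDemo {
    // a partition side can serve only if it holds a strict majority
    static boolean isMajority(int alive, int clusterSize) {
        return alive * 2 > clusterSize;
    }

    // any two quorums of the given sizes share at least one node
    static boolean quorumsIntersect(int q1, int q2, int clusterSize) {
        return q1 + q2 > clusterSize;
    }
}
```

In a 3-node cluster split 2 vs 1, only the 2-node side is a majority, so exactly one side keeps serving and it necessarily overlaps any quorum that committed earlier state.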
Taking ZK as an example, the price it pays is more machine resources (3 or more nodes must be deployed) and higher system complexity (using consensus algorithms). Consensus The term consensus refers to reaching an agreement on something. The consensus problem in distributed systems can be described as follows: There is a set of processes in the system that can propose proposals. One or more processes propose a proposal, and then a proposal is chosen from them as the final result through a consensus algorithm. block-beta columns 1 block:MS columns 3 space C(((&quot;Client&quot;))) space space:3 L1[(&quot;Leader&quot;)] block:SYNC columns 1 S1&lt;[&quot;&amp;nbsp;x = 1&quot;]&gt;(right) S2&lt;[&quot;x = 2&amp;nbsp;&quot;]&gt;(left) end L2[(&quot;Leader&quot;)] end C -- &quot;write&lt;br/&gt;[x=1]&quot; --&gt; L1 C -- &quot;write&lt;br/&gt;[x=2]&quot; --&gt; L2 style SYNC fill:none,stroke:none block-beta block:U columns 3 block:Proposal columns 1 block:P1[&quot;Proposal 1&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;] space PS1[&quot;x=1;x=2;&quot;] end block:P2[&quot;Proposal 2&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;] space PS2[&quot;x=2;x=1;&quot;] end end C&lt;[&quot;&amp;nbspChosen&quot;]&gt;(right) block:Consensus columns 1 block:V[&quot;Consensus&quot;] Consensus end block:W Result[&quot;x = ?&quot;] end end end style P1 fill:none,stroke:none style P2 fill:none,stroke:none style U fill:none,stroke:none style V fill:none,stroke:none style W fill:none,stroke:none Take above scenario as example: it illustrates a multi-master distributed database where multiple nodes are allowed to receive write operation requests, and they exchange modification 
instructions to ensure data consistency. At some point, two clients simultaneously initiate change operations on the same record, so two different update sequences (proposals) may exist at the same time. To ensure consistency of the data replicas, the two databases must reach a consensus on the update sequence. Fault Tolerance Due to the uncertainty of the operating environment, system failures are inevitable. These failures can be roughly divided into two categories: Non-Byzantine faults Processes communicate through unreliable networks Processes may freeze, crash, and restart during operation, but they will not send incorrect messages Messages may be lost, duplicated, or reordered during transmission, but they will not be corrupted or tampered with Byzantine faults Malicious processes may appear in the system and deliberately send incorrect messages to other processes This can cause other processes to behave abnormally, leading to the failure of the entire system Accordingly, consensus algorithms fall into two categories: Byzantine fault tolerance is mainly applied in the blockchain field, using high computational overhead to eliminate the possibility of malicious processes. It is powerful but not suitable for providing high-performance consistency guarantees Non-Byzantine fault tolerance mainly targets data management services. It reduces fault-tolerance overhead by introducing additional security assumptions, and can provide high-performance consistency guarantees Our discussion is limited to non-Byzantine fault-tolerant consensus algorithms. 
This type of algorithm can provide the following guarantees for the system in an uncertain environment: Validity: proposals must be proposed by some process and cannot appear out of thin air Integrity: each process can only vote once and cannot change the result Consistency: the final decisions of all processes must be consistent The core of fault-tolerant consensus algorithms can be summarized as follows: Safety: all processes converge to a consistent and legitimate state Fault Tolerance: the system can continue to operate even if a small number of processes crash Paxos Algorithm The Paxos algorithm is one of the most classic consensus protocols. Basic Concepts Proposal: each proposal [n,v] consists of two parts: n: the order in which the proposal was made v: the value of the proposal Roles: each process can play one or more of the following roles at the same time: Proposer: proactively initiates proposals Acceptor: votes on proposals Learner: passively accepts voting results Proposal Status: each proposal can be in one of three states: Proposed: proposed by a Proposer Accepted: accepted by an Acceptor Chosen: accepted by a majority of Acceptors Communication Model: processes communicate through message passing, and messages are one-way. A receiving process may choose not to respond. The network is unreliable, and processes must handle message loss and reordering themselves. block-beta columns 5 space:2 A1((&quot;Acceptor&quot;)) space:2 P1((&quot;Proposer&quot;)) space:4 space:2 A2((&quot;Acceptor&quot;)) space L((&quot;Learner&quot;)) P2((&quot;Proposer&quot;)) space:4 space:2 A3((&quot;Acceptor&quot;)) space:2 P1 ----&gt; A1 P2 ---&gt; A1 P1 ----&gt; A2 P2 ---&gt; A2 P1 ---&gt; A3 P2 ---&gt; A3 A1 ---&gt; L A2 ---&gt; L A3 ---&gt; L Voting The basic idea of Paxos is to vote on multiple proposals and select one as the final consensus; once a proposal is selected, the result will not change. 
The roles involved in the voting are the Proposer (actively initiates requests) and the Acceptor (passively responds to requests), while the Learner only cares about the voting results and does not participate in the voting process itself. Each round of voting in Paxos consists of two phases: Generate Proposal (Prepare) -&gt; Vote on Proposal (Accept) To support the voting process, each Acceptor needs to maintain two pieces of local state: $n^{\\text{max_prepare}}$: the largest proposal number among the Prepare requests it has responded to $[n^{\\text{max_accept}}, v&#39;]$: the highest-numbered proposal it has accepted through Accept requests To better understand the voting process, let&#39;s describe it with a concrete scenario: When the cluster starts, it needs to elect a Leader process. Two processes named Alice and Bob are running for election. In the roles of Proposer A / B, they each propose a proposal to elect themselves as Leader, and send it to the 3 Acceptors X / Y / Z in the cluster for voting. Prepare phase Before proposing a proposal [n,v], the Proposer must ensure that the number n is not occupied by another proposal. Therefore, the Proposer sends a prepare request with number n to multiple Acceptors to lock this number. First, the two Proposers obtain available proposal numbers through some means (such as maintaining a global counter, or asking all Acceptors); A and B obtain the numbers 2 and 3 respectively. They then use these numbers to issue prepare requests to the Acceptor cluster. After an Acceptor receives a prepare request with number n: If $n &gt; n^{\\text{max_prepare}}$, it updates $n^{\\text{max_prepare}} = n$ and returns the highest-numbered proposal it has accepted, $[n^{\\text{max_accept}}, v&#39;]$ Otherwise, it refuses to respond to this request X and Y received A&#39;s and B&#39;s requests one after another, and B&#39;s number is larger than A&#39;s, so both A and B received responses. 
Z first received B&#39;s request and then received A&#39;s request, so it only responded to B&#39;s request and ignored A&#39;s request. Only when the Proposer receives responses from a majority of Acceptors in the prepare phase can it initiate an accept request. In the prepare phase, A and B both received responses from a majority, so both can enter the accept phase. Accept phase After locking the number n, the Proposer needs to generate a new proposal $[n, v^{\\text{new}}]$ before initiating an accept request. The generation of the proposal value $v^{\\text{new}}$ must follow these rules: All accepted proposals returned in the prepare responses form a set $S^{\\text{accept}}$ If $S^{\\text{accept}}=\\emptyset$ (no known proposals exist), $v^{\\text{new}}$ can be specified as any value If $S^{\\text{accept}}\\ne\\emptyset$, then find the known proposal with the largest number $[n^{\\text{max}}, v&#39;] \\in S^{\\text{accept}}$ and let $v^{\\text{new}}= v&#39;$ In the responses to the prepare phase, A and B did not receive any known proposals, so they both used their own process ID as $v^{\\text{new}}$, generated new proposals, and initiated accept requests. When an Acceptor receives an accept request for a proposal $[n^{\\text{new}},v^{\\text{new}}]$: If $n^{\\text{new}} \\ge n^{\\text{max_prepare}}$, accept the proposal and update $[n^{\\text{max_accept}},v&#39;]=[n^{\\text{new}},v^{\\text{new}}]$ Otherwise, refuse to accept this proposal Since X and Y have already responded to B&#39;s prepare request, they consider A&#39;s proposal to be expired and refuse its accept request. Since the largest-numbered prepare requests known to X, Y, and Z were all issued by B, B&#39;s accept request is successfully passed. In the end, process Alice was defeated and process Bob was elected as the new leader. Tell the Learner How does the Learner learn about the voting results?
There are two common ways: After each proposal is passed, the Acceptor actively notifies the Learner Pros: High real-time performance, high efficiency when the number of Learners is low. Cons: High network complexity, and the notification messages from the Acceptor may also be lost, requiring a retry mechanism. The Learner actively polls the Acceptor for information on passed proposals Pros: High reliability, able to handle situations where the Acceptor crashes or messages are lost. Cons: Poor real-time performance, and the polling frequency needs to be controlled carefully. Safety Once more than half of the Acceptors have accepted a proposal, the proposal has been selected, and the selected proposal value $v^{\\text{chosen}}$ will not change in the future Since A&#39;s accept request did not pass before, a new round of voting was initiated. Although A&#39;s accept request was passed this time, the selected proposal value is the previously selected Bob. Fault tolerance As long as more than half of the Acceptors are alive, the voting mechanism remains functional and the system is still available Assuming that one of the Acceptors crashed in the previous accept phase, the remaining two Acceptors in the system can still continue the subsequent voting process, and the system is still available. Corner cases In the accept phase, X and Y passed and selected the proposal value Bob. Due to message loss, Z did not pass the proposal. Later, the X and Y processes crashed, and only Z was left in the system to function normally. At this time, A initiated a new round of voting. Due to the majority restriction, A&#39;s new proposal cannot pass, so a proposal with the value Alice cannot be selected and the system state will not be changed. In the accept phase, X passed the proposal value Bob. Due to message loss, Y and Z did not pass the proposal.
Later, the X process crashed, and only Y and Z were left in the system to function normally. At this time, A initiated a new round of voting. Due to the majority restriction, the previous proposal value Bob was not selected, and the system selected the value Alice proposed by A. Safety Proof The safety of the Paxos algorithm guarantees: Validity: The proposals involved in the decision must come from a Proposer Integrity: A proposal with number n can only be proposed by a Proposer once, and can only be voted on by the same Acceptor once Consistency: Once the proposal $[n,v^{\\text{chosen}}]$ is selected, all subsequent proposals will contain $v^{\\text{chosen}}$ The consistency of Paxos is achieved by constraining the generation of proposals. The following is a simple proof: Assumption The first selected proposal is $[m_0,v_0]$, so there must be a majority Acceptor set $S_0$ that has passed the accept request with number $m_0$. The next proposed proposal is $[m_1,v_1]$, so there must be a majority Acceptor set $S_1$ that has responded to the prepare request with number $m_1$. Proof: $v_1=v_0$ By the majority principle, the intersection of $S_0$ and $S_1$ must be non-empty, so there must be at least one Acceptor in $S_1$ that has passed the proposal $[m_0,v_0]$. According to the conditions for a successful prepare, the number of every proposal passed by the Acceptors in $S_1$ must be less than $m_1$. Let the maximum number among all proposals passed by the Acceptors in $S_1$ be $x$; clearly $m_0 \\le x &lt; m_1$. Since $x$ must belong to a proposal made before $m_1$, and $m_1$ is the next proposal after $m_0$, $x$ must be $m_0$. Therefore, among the prepare responses there must be at least one Acceptor that has passed the $m_0$ proposal and returns $[m_0,v_0]$, and by the proposal-generation rule the Proposer must take $v_1 = v_0$.
According to mathematical induction, $v_{n+1} = v_{n}$ can also be proved. Application Idea In practical engineering applications, the original Paxos algorithm is not used directly, but we can still try this algorithm to implement a strongly consistent distributed database. block-beta columns 4 block:WR[&quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Write Op&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&quot;]:2 columns 4 C1(((&quot;Client&quot;))) C2(((&quot;Client&quot;))) C3(((&quot;Client&quot;))) space:6 LOG1&gt;&quot;&amp;nbsp;&amp;nbsp;&amp;nbsp;Log&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;] space DB1[(&quot;Database&quot;)] space:6 DISK[[&quot;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Disk&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;br/&gt;&amp;nbsp;&quot;]] end block:RE[&quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Read Op&quot;]:2 columns 3 space C4(((&quot;Client&quot;))) space DB2[(&quot;Database&quot;)] space LOG2&gt;&quot;&amp;nbsp;&amp;nbsp;&amp;nbsp;Log&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;] end C1 -- &quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;1. append-log&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot; --&gt; LOG1 C2 --&gt; LOG1 C3 --&gt; LOG1 LOG1 -- &quot;3. apply-change&lt;br/&gt;&amp;nbsp;&lt;br/&gt;&amp;nbsp;&lt;br/&gt;&amp;nbsp&quot; --&gt; DB1 LOG1 -- &quot;2. 
fsync-log&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot; --&gt; DISK DB1 -- &quot;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;4. fsync-db&quot; --&gt; DISK DB2 -- &quot;1. read-db &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;br/&gt;&amp;nbsp;&lt;br/&gt;&amp;nbsp;&lt;br/&gt;&quot; --&gt; C4 LOG2 -- &quot; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;2. merge-change &lt;br/&gt;&amp;nbsp;&lt;br/&gt;&amp;nbsp;&lt;br/&gt;&quot; --&gt; C4 In the database field, Write-Ahead Logging (WAL) is a common way to improve database performance: any command that modifies the state of the database is first written to a sequential log and flushed to disk, and then applied to the database by an asynchronous thread, in the order in which the commands were written. When reading data, the state in the database is merged with the state in the WAL to ensure that the data returned to the client is the latest. This mechanism provides high write performance while ensuring data integrity: even if the process crashes, data will not be lost. The entries in the WAL are ordered, and each entry contains one or a group of atomic change commands.
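The WAL flow described above can be sketched in a few lines (a toy model; the file path and command format are illustrative, not taken from any particular database):

```python
import json
import os

class WriteAheadLog:
    """Append-only command log: changes are durable before they are applied."""

    def __init__(self, path):
        self.path = path

    def append(self, command):
        # 1. Append the change command to the sequential log...
        with open(self.path, "a") as f:
            f.write(json.dumps(command) + "\n")
            f.flush()
            os.fsync(f.fileno())  # 2. ...and flush it to disk before acknowledging

    def entries(self):
        # 3. An asynchronous thread (or crash recovery) reads entries in order
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

path = "/tmp/demo.wal"
if os.path.exists(path):
    os.remove(path)

wal = WriteAheadLog(path)
wal.append({"op": "set", "key": "x", "value": 1})
wal.append({"op": "incr", "key": "x"})

# Deterministic replay of the ordered log rebuilds the database state.
state = {}
for cmd in wal.entries():
    if cmd["op"] == "set":
        state[cmd["key"]] = cmd["value"]
    elif cmd["op"] == "incr":
        state[cmd["key"]] += 1
print(state)  # {'x': 2}
```

Replaying the same ordered log always yields the same state; a read merges this replayed state with the database state so that the latest data is returned.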
According to the conclusion of the previous article, as long as all databases execute the same commands in the same order, the states of these databases can be kept consistent. In other words, by using the Paxos algorithm to ensure that the content of the local WAL copies of multiple databases is completely consistent, the databases can eventually converge to a consistent state. Suppose there is a KV database similar to Redis, where users can send instructions to the database to store and retrieve key-value pairs. This database supports a multi-master architecture and uses Paxos + WAL to guarantee data consistency: graph LR C1((&quot;Client&quot;)) subgraph X1[&quot;Leader 1&quot;] AX1[&quot;Acceptor&quot;] PX[&quot;Proposer&quot;] end subgraph Y1[&quot;Leader 2&quot;] AY1[&quot;Acceptor&quot;] end subgraph Z1[&quot;Leader 3&quot;] AZ1[&quot;Acceptor&quot;] end C1 ---&gt;|&lt;br/&gt;&lt;br/&gt;request| PX X1 ~~~ Y1 X1 ~~~ Z1 PX &lt;---&gt;|&lt;br/&gt;&lt;br/&gt;prepare / accept| AX1 PX &lt;---&gt;|&lt;br/&gt;&lt;br/&gt;prepare / accept| AY1 PX &lt;---&gt;|&lt;br/&gt;&lt;br/&gt;prepare / accept| AZ1 block-beta block:x then&lt;[&quot;2PC Paxos&quot;]&gt;(down) end style x display:none %%{init: {&quot;flowchart&quot;: {&quot;rankSpacing&quot;: 5, &quot;nodeSpacing&quot;: 5 }} }%% flowchart TB subgraph Fin[&quot; &quot;] direction LR C1((&quot;Client&quot;)) LOG&gt;&quot;&amp;nbsp;&amp;nbsp;WAL&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;] subgraph X1[&quot;Leader 1&quot;] AX1[&quot;Acceptor&quot;] LX1[&quot;Learner&quot;] end end subgraph Rest direction TB subgraph Y1[&quot;Leader 2&quot;] direction LR AY1[&quot;Acceptor&quot;] LY1[&quot;Learner&quot;] end subgraph Z1[&quot;Leader 3&quot;] direction LR AZ1[&quot;Acceptor&quot;] LZ1[&quot;Learner&quot;] end end LX1 ---&gt;|&lt;br/&gt;&lt;br/&gt;response| C1 LX1 ---&gt;|&lt;br/&gt;&lt;br/&gt;chosen| LOG AX1 ---&gt;|&lt;br/&gt;&lt;br/&gt;accepted| LX1 AY1 ---&gt;|&lt;br/&gt;&lt;br/&gt;accepted| LY1 AZ1
---&gt;|&lt;br/&gt;&lt;br/&gt;accepted| LZ1 Fin ~~~ Rest classDef hidden fill:none class Fin,Rest hidden When a database X receives a request: Propose a proposal: As the Proposer, X writes a record to a specific location in the WAL (Write-Ahead Log) to propose a change. This proposal is then submitted to the Acceptor group consisting of X, Y, and Z for voting. Acceptor voting: Each Acceptor in the group votes on the proposal. If a majority of Acceptors (at least two out of three) approve the proposal, they move to the next step. Learner notification: Once a majority of Acceptors have approved the proposal, they inform the Learner group (also consisting of X, Y, and Z) of the outcome. Learner application: Upon receiving the proposal approval from the Acceptor group, the Learner(s) recognize that the proposal has been selected and the corresponding WAL location is available for writing. The Learner(s) then append the proposed command to their local WAL (the WAL of the Y and Z nodes is omitted in the diagram) and return a response to the client. Retry on rejection: If the proposal is not approved by a majority of Acceptors (e.g., due to network issues or conflicts with other proposals), X initiates another round of voting until the proposal is accepted or the client request times out.
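The acceptor-side rules from the prepare and accept phases above can be sketched as follows (a single-decree toy model under stated assumptions: networking, retries, and Learner notification are omitted, and all names are illustrative):

```python
class Acceptor:
    def __init__(self):
        self.max_prepare = -1   # n^max_prepare: highest prepare number responded to
        self.accepted = None    # [n^max_accept, v']: highest-numbered accepted proposal

    def on_prepare(self, n):
        # Respond only if n is larger than any prepare number seen so far,
        # returning the highest-numbered proposal already accepted (or None).
        if n > self.max_prepare:
            self.max_prepare = n
            return ("promise", self.accepted)
        return None  # refuse to respond

    def on_accept(self, n, v):
        # Accept unless a higher-numbered prepare request has been answered.
        if n >= self.max_prepare:
            self.max_prepare = n
            self.accepted = (n, v)
            return ("accepted", n, v)
        return None  # reject

def choose_value(promises, own_value):
    # Proposer rule: adopt the value of the highest-numbered accepted proposal
    # from the prepare responses, or use its own value if there is none.
    known = [p[1] for p in promises if p is not None and p[1] is not None]
    return max(known)[1] if known else own_value

# The election scenario from the text: A prepares with n=2, B with n=3.
X, Y, Z = Acceptor(), Acceptor(), Acceptor()
for a in (X, Y, Z):
    a.on_prepare(2)   # A's prepare
    a.on_prepare(3)   # B's prepare arrives later with a larger number

assert all(a.on_accept(3, "Bob") for a in (X, Y, Z))        # B's accept passes
assert not any(a.on_accept(2, "Alice") for a in (X, Y, Z))  # A's stale accept is rejected

# If A retries with a larger number, it must adopt the already-chosen value.
promises = [a.on_prepare(4) for a in (X, Y, Z)]
print(choose_value(promises, "Alice"))  # Bob
```

The `n >= max_prepare` test in `on_accept` and the `choose_value` rule mirror the accept-phase conditions described in the text; together they are what keeps a chosen value from ever changing.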
Here&#39;s a visual representation of the process from the WAL&#39;s perspective: %%{init: {&quot;flowchart&quot;: {&quot;diagramPadding&quot;: 50 }} }%% block-beta columns 1 CLI(((&quot;Client&quot;))) block:cluster[&quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Multi-Leader-Cluster&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&quot;]:1 DB1[(&quot;Leader 1&quot;)] DB2[(&quot;Leader 2&quot;)] DB3[(&quot;Leader 3&quot;)] end block:paxos[&quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Paxos 2PC&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&quot;]:1 columns 5 block:pa[&quot;proposal A&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot;] columns 1 space PC1[&quot;prepare/accept&quot;] C1&lt;[&quot;chosen&quot;]&gt;(down) end space block:pb[&quot;proposal B&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot;] columns 1 space PC2[&quot;prepare/accept&quot;] C2&lt;[&quot;chosen&quot;]&gt;(down) end space block:pc[&quot;proposal C&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot;] columns 1 space PC3[&quot;prepare/accept&quot;] C3&lt;[&quot;chosen&quot;]&gt;(down) end end block:wal[&quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;WAL&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&amp;emsp;&quot;]:1 columns 5 log1&gt;&quot;&amp;nbsp;index = 1&amp;nbsp;&quot;] o1&lt;[&quot; &quot;]&gt;(right) log2&gt;&quot;&amp;nbsp;index = 2&amp;nbsp;&quot;] o2&lt;[&quot; &quot;]&gt;(right) log3&gt;&quot;&amp;nbsp;index = 3&amp;nbsp;&quot;] 
cmd1[[&quot;&amp;nbsp;&amp;nbsp;&amp;nbsp;x=1&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;]] space cmd2[[&quot;&amp;nbsp;&amp;nbsp;&amp;nbsp;x++&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;]] space cmd3[[&quot;&amp;nbsp;&amp;nbsp;&amp;nbsp;x-=2&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot;]] end block:rsm:1 columns 3 space sm&lt;[&quot;State Machine&quot;]&gt;(down) space space res[&quot;x = 0&quot;] space end DB1 ---&gt; pc DB2 ---&gt; pa DB3 ---&gt; pb CLI -- &quot;x-=2&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot; --&gt; DB1 CLI -- &quot;x=1&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot; --&gt; DB2 CLI -- &quot;x++&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot; --&gt; DB3 style res display:none style rsm display:none Distributed Consistency With Asynchronous Replication This article is a book note from &quot;Designing Data-Intensive Applications&quot;. Replication Consistency In modern database systems, replication mechanisms are almost ubiquitous. This design approach brings at least two benefits to the system: Disaster Recovery with Multiple Replicas: As long as one data replica is available, data can be recovered Horizontal Scaling of Read Performance: By distributing data across different machines, the same data can be accessed simultaneously from multiple nodes Ensuring the data consistency of multiple replicas is a challenge. The simplest way to implement this is a synchronous replication mechanism (sync-replication): ensure that write operations succeed on all replicas before responding to the client. However, this method usually means poor write performance, so it is rarely used. In contrast, there is the asynchronous replication mechanism (async-replication): the client can be responded to after the write operation has succeeded on some replicas, and the database will asynchronously synchronize the changes to the remaining replicas.
Its advantage is high write performance, but the consistency of data replicas cannot be guaranteed. Taking the most basic master-slave replication architecture as an example, although the master and slave libraries will eventually reach a consistent state, there is a time delay in the synchronization of the master-slave state. This delay is called replication lag. During this period, two conflicting data replicas may exist at the same time. If applications that rely on this data do not take preventive measures, it will eventually lead to abnormal system behavior. In distributed databases, the processes that maintain the replica state are divided into two categories: leader / master: Processes that can handle both read and write requests follower / slave: Processes that can only handle read requests Based on the above definitions, common replication architectures can be divided into the following three categories: block-beta columns 9 space A[&quot;Master-Slave&quot;] space:2 B[&quot;Single-Leader&quot;] space:2 C[&quot;Multi-Leader&quot;] space block-beta columns 3 block:MS:1 columns 3 space C1(((&quot;Client&quot;))) space space:3 M1[(&quot;Master&quot;)] space S1[(&quot;Slave&quot;)] end block:SL:1 columns 3 space C2(((&quot;Client&quot;))) space space:3 F21[(&quot;Follower&quot;)] space F22[(&quot;Follower&quot;)] space:3 space L2[(&quot;&amp;nbsp;Leader&amp;nbsp;&quot;)] space end block:ML:1 columns 3 space C3(((&quot;Client&quot;))) space space:3 L31[(&quot;Leader&quot;)] space L32[(&quot;Leader&quot;)] block:BL1[&quot;WAL&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot;] columns 1 R11[&quot;&amp;nbsp;x=1&amp;nbsp;&quot;] R12[&quot;&amp;nbsp;x=x+1&amp;nbsp;&quot;] end space block:BL2[&quot;WAL&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot;] columns 1 R21[&quot;&amp;nbsp;x=x+1&amp;nbsp;&quot;] R22[&quot;x=1&quot;] end end C1 --
&quot;1.write&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;br/&gt;&lt;br/&gt;&quot; --&gt; M1 C1 -- &quot;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;2.read&lt;br/&gt;&lt;br/&gt;&quot; --&gt; S1 M1 -- &quot;&lt;br/&gt;&lt;br/&gt;sync&quot; --&gt; S1 C2 -. &quot;1.read&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;br/&gt;&lt;br/&gt;&quot; .-&gt; F21 C2 -- &quot;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;2.read&lt;br/&gt;&lt;br/&gt;&quot; --&gt; F22 L2 -- &quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;sync&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;br/&gt;fast&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;&quot; --&gt; F21 L2 -- &quot;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;sync&lt;br/&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;slow&quot; --&gt; F22 C3 -. 
&quot;write&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp; [x=1]&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot; .-&gt; L31 C3 -- &quot;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;write &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp&amp;nbsp;&amp;nbsp;&amp;nbsp;[x=x+1]&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&quot; --&gt; L32 L31 ----&gt; L32 L32 -- &quot;&lt;br/&gt;&lt;br/&gt;sync&quot; --&gt; L31 Different architectures face different consistency issues Master-Slave &nbsp;&nbsp;-&gt;&nbsp;&nbsp; Read-after-write consistency Phenomena: The client modifies the data, but the change has not yet been synchronized to the slave; when the data is read from the slave at this time, the result obtained is the unmodified value. Solution: Make write operations immediately visible to subsequent read operations. According to the characteristics of the application, force some functions to only access the master library to ensure that the read and write order is consistent Maintain the user modification timestamp and the slave modification timestamp, and decide whether to read from the slave based on the timestamps Single-Leader &nbsp;&nbsp;-&gt;&nbsp;&nbsp; Monotonic read consistency Phenomena: The client reads the same record multiple times, but the requests are routed to different slaves, so it may read old data. Solution: A data version read later must be newer than the version read before. Ensure that requests from the same user are only routed to the same slave, so that the read order is consistent. Multi-Leader / Leaderless &nbsp;&nbsp;-&gt;&nbsp;&nbsp; Causal consistency Phenomena: The client modifies data multiple times and the requests are routed to different masters, and there is a causal relationship between the data (e.g.
question and answer records). You may read data in a chaotic order, or you may modify data that does not yet exist (e.g. due to network latency between leaders). Solution: The results of write operations must be read in the order in which they are executed. Write operations with causal relationships are executed on the same master, and the write order is guaranteed to be consistent. Conflict Resolution In large-scale Internet applications, multi-data-center deployments are becoming increasingly popular, with the following advantages: Geographically close to users, fast access speed High availability, since a single data center outage or network problem will not lead to unavailability When the system needs to be deployed to multiple data centers, the multi-leader architecture will inevitably be used, which brings the following problems: The same data may be modified concurrently by two data centers, resulting in write conflicts Some features of the database do not support the multi-leader architecture well, such as auto-increment primary keys and triggers Write Conflicts Writes under the single-leader architecture are sequential, so modifications to the same data can achieve a final consistent result on each replica. Writes within each leader under the multi-leader architecture are also ordered, but write operations between different leaders are unordered, so modifications to the same data are applied in different orders, which may eventually lead to inconsistent replica states.
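The divergence just described is easy to reproduce with the two commands from the diagrams above ($x=1$ and $x=x+1$): two leaders receive the same writes in different orders and never converge without conflict resolution. A minimal illustration:

```python
def apply(state, op):
    # Each command is deterministic, but the result depends on the apply order.
    if op == "x=1":
        state["x"] = 1
    elif op == "x=x+1":
        state["x"] = state.get("x", 0) + 1

leader_1, leader_2 = {}, {}
for op in ["x=1", "x=x+1"]:   # order in which leader 1 received the writes
    apply(leader_1, op)
for op in ["x=x+1", "x=1"]:   # order in which leader 2 received the writes
    apply(leader_2, op)

print(leader_1["x"], leader_2["x"])  # 2 1  -- the replicas have diverged
```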
To ensure that all replicas converge to the same state, conflict resolution is necessary: Associate each write operation with a unique ID and select the result with the highest priority Associate each replica with a unique ID and select the result corresponding to the replica with the highest priority Merge two conflicting data into the same data Save all conflicting data for subsequent operations to resolve According to the timing of conflict resolution, they can be divided into: Resolve at write: Inject conflict resolution logic code into the database, which the database will call when a conflict occurs. For example: MySQL MGR achieves consistency by maintaining a globally consistent Binlog. Resolve at read: When there is conflicting data, the application will obtain this conflicting data and automatically or manually resolve these conflicts. For example: Dynamo NRW guarantees that the latest data is read by adjusting the number of read and write replicas. Consistency Models As a developer, one of the important questions we care about is: what level of consistency guarantee does the database itself provide us with? In order to support concurrent operations, databases introduce the concept of transactions to avoid data inconsistencies that lead to abnormal behavior. The transaction model of the database has an important concept: the transaction isolation level

| Isolation Level | Read Uncommitted | Read Committed | Repeatable Read | Serializable |
| --- | --- | --- | --- | --- |
| Dirty Read | ✔ | ✘ | ✘ | ✘ |
| Unrepeatable Read | ✔ | ✔ | ✘ | ✘ |
| Phantom Read | ✔ | ✔ | ✘ | ✘ |
| Write Skew | ✔ | ✔ | ✔ | ✘ |

Under different transaction levels, developers can obtain different levels of consistency guarantees from the database. The higher the isolation level, the stronger the consistency provided, and at the same time, the greater the performance overhead. One of the benefits of this model is that it allows us to trade off between consistency and performance and make a choice that suits our application scenario.
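The first conflict-resolution strategy above (associate each write with a unique ID and keep the highest priority) is essentially last-write-wins. A hedged sketch, assuming the ID is a (timestamp, replica_id) pair so that ties break deterministically:

```python
def lww_merge(a, b):
    """Pick one of two conflicting versions of the same key.

    Each version is (value, (timestamp, replica_id)); the version with the
    higher ID wins, and replica_id breaks timestamp ties, so every replica
    resolves the conflict to the same result regardless of arrival order.
    """
    return a if a[1] >= b[1] else b

v1 = ("x=1",   (17, "replica-2"))
v2 = ("x=x+1", (17, "replica-1"))  # concurrent write with the same timestamp

print(lww_merge(v1, v2))  # ('x=1', (17, 'replica-2'))
print(lww_merge(v2, v1))  # ('x=1', (17, 'replica-2'))  -- order-independent
```

Note that last-write-wins silently discards the losing write, which is why the other strategies (merging, or saving all conflicting versions for later resolution) also exist.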
In distributed scenarios, we face more complex consistency issues. To facilitate the following discussion, let&#39;s first introduce a few consistency models. Eventual Consistency At a certain point in time, the states of the replicas in the database system may be inconsistent. The consistency problems we have seen above are all solved by the application layer, and the database itself only provides the following guarantee: after an arbitrarily long period of time, all replicas in the database can eventually converge to the same state. This very weak consistency guarantee is what we often call eventual consistency: Pros: This weak consistency guarantee makes the system design more flexible, so that higher performance can be achieved. For example: asynchronous replication strategies are used between replicas, and special reconciliation systems are designed to resolve data conflicts offline... Cons: When eventual consistency is involved in system design, the application layer needs to pay close attention to the impact of replication lag on the system. It is also necessary to design the system according to the consistency guarantee required by the business, which effectively shifts extra work onto application developers. In addition, some problems are only exposed under network errors or high concurrency, which makes them difficult to test. Linearizability The transaction mechanism of a database is itself a fault-tolerant protocol that provides data safety guarantees for applications built on transactions.
In order to hide the complexity from the application layer, transactions provide the following abstract guarantees to the application: Atomicity: The data in the database is complete, and the transaction execution is complete (no need to worry about process crashes during execution) Isolation: The database will not be modified concurrently, and transactions will not affect each other (no need to worry about race conditions affecting the execution results) Durability: The storage of the database is reliable, and the changes of the transaction will not be lost (no need to worry about data loss caused by storage failures) The abstract guarantees provided by the transaction mechanism liberate application developers from complex error handling and allow them to focus on business logic. This not only improves development efficiency but also reduces the probability of bugs, making the system more stable and easier to test. In an ideal situation, we hope that distributed databases can provide us with a stronger, transaction-like consistency guarantee: Global write-after-read consistency: The system only exposes one copy of the data to the outside world, and there is no problem of multiple versions of data existing at the same time All modification operations are atomic, and the data read each time is the latest Global monotonic read consistency: Once a write operation succeeds, the result will be visible to all subsequent read operations, and old data will not be read This cross-process global strong consistency guarantee is called Linearizability. Next, we will introduce this model in detail through some specific cases.
First, we assume that $x$ is an entry in the database: In a key-value database, $x$ is a key In a relational database, $x$ is a row In a document database, $x$ is a document The linearizability model defines three basic operations: $\\text{read}(x) \\Rightarrow v$: The client reads the value $v$ corresponding to $x$ from the database $\\text{write}(x, v) \\Rightarrow r$: The client writes the value $v$ to $x$ in the database and returns the operation result $r$ $\\text{cas}(x, v_1, v_2) \\Rightarrow r$: The client uses the CAS (compare-and-set) operation to modify the value of $x$ from $v_1$ to $v_2$ and returns the operation result $r$ Linearizability is cross-process and can be used as the basis for implementing the following distributed application scenarios: Distributed locks and election Use CAS operations to implement locks, and the node that obtains the lock is the leader Uniqueness constraints Use CAS operations to obtain the lock corresponding to a value.
If the lock is obtained successfully, then the value is unique; otherwise it is not Temporal dependencies between multiple channels After process A successfully modifies the data and notifies process B, process B can definitely obtain the modification result of process A A scenario that satisfies linearizability There are 3 clients in the picture, among which clients A and B read $x$, and client C writes $x$: When B reads $x$ for the first time, C is performing a write operation At this time B reads the value 0 (C&#39;s write has not been committed yet) When A reads $x$ for the second time, C is performing a write operation At this time A reads the value 1 (C&#39;s write has been committed) When B reads $x$ for the second time Since A previously read $x$ as 1, B will read the value 1 (global monotonic read consistency) A scenario that violates linearizability There are 4 clients A, B, C, and D in the picture, which perform read and write operations concurrently. The lines in the picture indicate the time points when the transaction is committed and the read operation actually occurs. There is a behavior that violates linearizability in the picture: B reads 2 after A reads 4 From the perspective of client B alone, monotonic read consistency is not violated, but globally it is: the result of B&#39;s later read request lags behind the result of A&#39;s earlier read request A practical application scenario [Thumbnail Image Generator] The diagram shows a multi-replica distributed file storage called FileStorage, which is used to store user photo data. The backend needs to generate thumbnails to speed up web preview: When a user uploads or modifies a photo, the WebServer stores the original-sized user image in FileStorage. The image ID is asynchronously notified to ImageResizer via MQ.
ImageResizer retrieves the data from FileStorage and generates a thumbnail based on the image ID provided by MQ. During step B, FileStorage performs replica replication while the MQ message is being delivered. If FileStorage does not meet linearizability, ImageResizer may not be able to read the image (violating global write-after-read) or may read an old image (violating global monotonic read). This can lead to processing failures, or even the generation of incorrect thumbnails, leaving the entire system in an inconsistent state. Implementations The simplest way to implement linearizability semantics is to use only one replica of the data, but this makes the system not fault-tolerant. To improve the fault tolerance of the system, a multi-replica architecture is the only choice. The following discusses the different cases according to the multi-replica architecture used. Single-leader Without snapshot isolation (for example, MySQL&#39;s MVCC), the following two strategies together can satisfy linearizability: Read and write data from the leader (only access the leader&#39;s replica data, and avoid being affected by inconsistent replica data of other followers) Use a synchronous replication strategy (asynchronous replication cannot guarantee that follower replicas will eventually be consistent with the leader) Risk points: Multiple leaders may appear during a split-brain (multiple writable replicas are exposed to the outside world at the same time, and data inconsistency will eventually occur), which may violate linearizability The choice of a new leader replica during automatic failover (if a replica with incomplete data is selected as the new leader, it is equivalent to data loss), which may violate linearizability Multi-leader It allows multiple nodes to write at the same time, and asynchronous replication needs to be supported, which may cause write conflicts.
Therefore multiple replicas must be exposed to resolve conflicts, so multi-leader architectures cannot possibly satisfy linearizability.

Consensus algorithms

Consensus algorithms cover the functionality of single-leader replication and additionally have mechanisms to prevent split brain and stale replicas, so they naturally satisfy linearizability.

Performance trade-offs

Although linearizability is a powerful consistency guarantee, such strong consistency models are not widely used in practice. For example, the memory model of modern computers does not guarantee linearizability: to improve performance, modern CPUs use a multi-level cache architecture. When a CPU needs to access and modify data in RAM, it first modifies the cache and then asynchronously flushes the modification to the actual RAM (a multi-replica, asynchronous-replication mechanism).

Sacrificing consistency for better performance is even more common in database systems: to guarantee strong consistency, the linearizable model pays a heavy performance price.

Causal Consistency

The order in which events occur encodes their causality.

A scenario that violates causal consistency [Doctor's schedule]

Each hospital keeps an on-call shift schedule to ensure that at least one doctor is on duty to handle emergencies. If an on-call doctor feels unwell on the day of duty, they can request early leave in the scheduling system; the system checks the current number of on-call doctors and decides whether to grant the request. One day there are only two on-call doctors in the hospital, Alice and Bob, and both happen to feel unwell and request early leave at the same time.
The following may happen:
- The system starts two concurrent transactions, initiated by Alice and Bob respectively.
- Both transactions query the number of on-call doctors and find that it is 2 (currently_on_call = 2).
- Both transactions then update the on-call records, setting Alice and Bob to off-call status, and commit successfully.
- In the end there are 0 on-call doctors in the hospital, and the emergency patients R.I.P.

This example violates causal consistency: the write in each transaction depends on its read. Alice's transaction commits first, which invalidates the result Bob's transaction has read; however, Bob's transaction does not detect the stale read and commits anyway, so the system ends up violating the scheduling constraint. This kind of inconsistency caused by concurrent read-write transactions is called write skew. Note that such concurrency is not necessarily caused by human behavior; transactions stretched out by network latency can also indirectly trigger the problem.

Model Comparison

Let's first review two definitions related to order:
- Total order (linear order): in a set, any two elements are comparable.
- Partial order: in a set, only some pairs of elements are comparable.

These two orders correspond to two consistency models:
- Linearizability: the system exposes only a single copy of the data, and all operations execute serially on that copy (no concurrent operations), so any two operations have a definite before-after order.
- Causal consistency: operations with causal relationships are ordered, but concurrent operations have no causal relationship and therefore no before-after order.
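The write-skew interleaving from the doctor example can be reproduced in a few lines. The sketch below is illustrative (the data model and function names are assumptions, not a real scheduling system); `True` means the doctor is currently on call:

```python
doctors = {"Alice": True, "Bob": True}

def read_phase():
    # Each transaction first checks the invariant against a snapshot.
    currently_on_call = sum(doctors.values())
    return currently_on_call >= 2  # leave allowed only if >= 2 on call

def write_phase(name, allowed):
    # By commit time the earlier check may already be stale.
    if allowed:
        doctors[name] = False

# The dangerous interleaving: both transactions read before either writes.
alice_ok = read_phase()
bob_ok = read_phase()
write_phase("Alice", alice_ok)
write_phase("Bob", bob_ok)

assert sum(doctors.values()) == 0  # constraint violated: nobody is on call
```

Both checks pass because neither transaction sees the other's uncommitted write, yet the combined effect breaks the invariant that each check individually enforced.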
The linearizability model is simpler and easier to understand, and it can handle causal problems involving time-order dependencies across multiple channels. However, implementing linearizability carries a high performance cost: operations must wait for one another, and in a high-latency network the probability of unavailability rises. The causal consistency model is more abstract and harder to grasp, but it suffices for most application scenarios: it still meets the bar of eventual consistency, is insensitive to network latency, and remains available in the face of network failures.

A major difference between causal consistency and linearizability is that the former allows concurrent access to unrelated data:
- Linearizability has only a single timeline.
- Causal consistency is a tree with multiple branches (compare Git's branching model).

Sequence Number Generation

Causality is itself an ordering problem, so once the order is known, causal relationships can be derived from it. Before discussing the causal consistency model, we need a suitable way to represent causal order so that we can analyze causal dependencies. In practice we cannot record every dependency, as that would incur huge overhead. A feasible method is to assign each operation a sequence number representing its order: sequence numbers take little space and form a total order.

Common large-scale sequence number generation methods include:
- Timestamps: use a high-precision timestamp as the sequence number.
- Manual planning: partition the available numbers among the generators by modulo and deploy multiple generation services (for example, two nodes can hand out odd and even numbers respectively).
- Batch generation: the generator allocates numbers in batches, handing each node a contiguous range at a time.

The problem is that none of these schemes can guarantee global order:
- System clocks drift and clocks on different nodes may not be synchronized, so timestamps may not reflect the order of operations.
- If load is uneven across service nodes, an old sequence number may be assigned to a new operation, so order cannot be guaranteed.

Thus we need a sequence number generation mechanism that does guarantee global order.

Lamport timestamps

Let's first introduce Lamport timestamps, a method of generating causally ordered sequence numbers from logical clocks. Each process maintains two pieces of information:
- $\texttt{ID}$: a globally unique, immutable process identifier.
- $\texttt{Counter}$: a monotonically increasing integer counter with an initial value of 0.

All interactions in the system are encapsulated as events, and each event is associated with a globally unique sequence number $(\texttt{C},\texttt{ID})$ according to the following rules:
- When a process generates an event, it first increments the counter to obtain a locally unique sequence number $\texttt{C}^{latest} = \texttt{++Counter}$, then combines it with the process identifier into $(\texttt{C}^{latest},\texttt{ID})$ to represent the event's position in the order.
- When a process receives an event from another process, it updates its local counter: $\texttt{Counter} = \max(\texttt{C}^{other}, \texttt{Counter})+1$.

These sequence numbers satisfy the following total order: the larger $\texttt{C}$ has the higher priority; when $\texttt{C}$ is equal, the larger $\texttt{ID}$ has the higher priority. By associating such a sequence number with every operation, an indirect total order is established over all operations, so any two operations are comparable.
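The two rules above can be sketched directly; this is a minimal single-process illustration (class and method names are my own, not a standard API):

```python
class LamportClock:
    """A minimal sketch of the Lamport timestamp rules described above."""

    def __init__(self, pid):
        self.pid = pid    # globally unique, immutable process ID
        self.counter = 0  # monotonically increasing counter, initially 0

    def local_event(self):
        # Generating an event: increment first, then stamp (C, ID).
        self.counter += 1
        return (self.counter, self.pid)

    def receive(self, remote_counter):
        # Receiving an event: jump past the sender's counter.
        self.counter = max(remote_counter, self.counter) + 1
        return (self.counter, self.pid)

# The total order compares C first and breaks ties by ID, which is
# exactly Python's lexicographic tuple comparison on (C, ID).
a, b = LamportClock("A"), LamportClock("B")
e1 = a.local_event()   # stamped (1, 'A')
e2 = b.receive(e1[0])  # stamped (2, 'B'): causally after e1
assert e1 < e2
```

Representing the stamp as a `(C, ID)` tuple means Python's built-in ordering coincides with the total order described in the text, so stamps can be sorted or compared directly.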
However, this scheme has its drawbacks.

First, the order of operations is only known after the operations have been initiated, so data conflicts cannot be reacted to immediately. For example: two clients simultaneously initiate conflicting operations on two different nodes (e.g., creating an account with the same name). The system resolves the conflict automatically by keeping the value with the larger sequence number, invalidating the operation with the smaller one; yet from that client's perspective, the operation appeared to succeed. Conflicts that cannot be resolved in real time like this may lead to consistency issues.

Second, although the final sequence is totally ordered, the sequence known at any given moment is incomplete, and unknown sequence numbers may later be inserted into it. For example: process A has generated the maximum sequence number $(1,\texttt{A})$ and process B has generated the maximum sequence number $(5,\texttt{B})$. If process A now receives an event $(2,\texttt{C})$ from process C, it may subsequently generate an event with sequence number $(4,\texttt{A})$, which is smaller than B's known maximum. To ensure the safety of operations (irreversibility), we need the currently known sequence to remain unchanged.

Total Order Broadcast

Total order broadcast (TOB), also known as atomic broadcast, is a communication protocol that ensures all nodes in a distributed system receive messages in the same order, so every node processes messages in the same sequence, regardless of when or where they were received. TOB has two main properties:
- Reliability: once one node receives a message, all other nodes eventually receive the same message; messages are neither lost nor duplicated.
- Total order: all nodes receive messages in the same order. If node A receives message M1 before message M2, then node B also receives M1 before M2. Once a message is sent, its position in the order is fixed.
Nodes may not insert messages into an existing message sequence; they can only append to it. Total order broadcast can therefore be seen as logging: all nodes asynchronously record a globally consistent order of events. When failures occur, a retry mechanism is needed to maintain the two properties above.

TOB is a fundamental building block of many distributed systems and is used to implement a variety of features:
- Consistent replication: database write operations are treated as messages; as long as all nodes receive them in the same order, the consistency of all replicas is ensured.
- Serialized transactions: transaction operations are treated as messages; if every node processes transactions in the same order, every node reaches a consistent state.
- Distributed locks: each lock acquisition request is recorded in an ordered log, and the order of lock acquisition is determined by the order of the requests.
- CAS operations, taking CAS(username, A, B) (modify a username) as an example:
  - Send an assert(username = A) message.
  - Listen for log entries about username; upon receiving the first one: if it is the assert we sent ourselves, complete the modification by appending a commit(username = B) message; if it is an assert or commit from another node, the modification fails.
- Globally consistent reads:
  - Use the position of messages in the log to determine when a read occurred: send a message, wait for it to come back, then perform the read (etcd).
  - If the log system can report the position of the latest log entry, wait until the log has been applied up to that position before reading (ZooKeeper).
  - Read data only from replicas to which writes are replicated synchronously.

Combining the last two features is equivalent to achieving linearizability.

Implementations

In a single-leader architecture, only the leader node is responsible for accepting write requests.
This means that all write requests are naturally ordered, and all replicas can be kept consistent simply by replicating the leader's writes. The single-leader architecture thus inherently delivers messages in a total order; once the reliable-transmission problem is solved, total order broadcast is achieved.

In conventional single-leader architectures, a leader node must be designated manually at startup. Once that node fails, the entire system becomes unavailable until manual intervention designates a new leader, which severely hurts availability. To achieve automatic failover, the system itself needs to support leader election:
- When the leader fails, a new leader is elected from the healthy followers to continue providing service.
- During the election, precautions must be taken against split brain, so that multiple simultaneous leaders do not undermine consistency.

Such election scenarios inevitably involve distributed consensus algorithms, which we will discuss in a separate section later.
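Returning to the CAS-over-log feature listed earlier: the protocol can be approximated in a single process, with a plain Python list standing in for the total order broadcast channel. The `cas` function, the `store` dict, and the message shapes are all illustrative assumptions, not a real protocol implementation:

```python
log = []  # stands in for the totally ordered broadcast log

def cas(key, expected, new, store, node):
    # Step 1: broadcast an assert message; its position in the log
    # fixes this attempt's place in the total order.
    log.append(("assert", key, node))
    pos = len(log) - 1
    # Step 2: scan entries about `key`; the first one decides the outcome.
    for i, (kind, k, sender) in enumerate(log):
        if k != key:
            continue
        if i == pos and sender == node and store.get(key) == expected:
            log.append(("commit", key, node))  # our assert won: commit
            store[key] = new
            return True
        return False  # another node's assert/commit was ordered first

store = {"username": "A"}
assert cas("username", "A", "B", store, node=1)      # first attempt wins
assert store["username"] == "B"
assert not cas("username", "A", "C", store, node=2)  # ordered later: fails
```

Because every attempt is forced through the single log, at most one assert can be first for a given key, which is what makes the compare-and-set decision unambiguous across nodes.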