Error and Time Complexity of Different Approximate Optimization Algorithms

Notation: \(\mathcal{E}_{\mathrm{app}}\) = approximation error, \(\mathcal{E}_{\mathrm{est}}\) = estimation error, \(\mathcal{E}_{\mathrm{opt}}\) = optimization error, \(T\) = computation time.
\(\ell(f(x), y)\) is the loss function; \(n\) is the sample size, \(d\) the number of features, and \(\rho\) a predefined maximum tolerable optimization error.
The optimization algorithm minimizes the expected risk:
\(E(f)=\int \ell(f(x), y)\, dP(x, y)=\mathbb{E}[\ell(f(x), y)]\), i.e. it seeks the optimal predictor \(f^{*}(x)=\arg\min_{\hat{y}} \mathbb{E}[\ell(\hat{y}, y) \mid x]\).
In practice, learning selects from a function family \(F\), based on the training data, the empirical-risk minimizer \(f_{n}=\arg\min_{f \in F} E_{n}(f)\), where \(E_{n}(f)=\frac{1}{n}\sum_{i=1}^{n}\ell(f(x_{i}), y_{i})\) is the empirical risk. Let \(f^{*}\) denote the best possible function overall (which may lie outside \(F\)), and let \(f_{F}^{*}=\arg\min_{f \in F} E(f)\) denote the best function within \(F\). Then
\(\mathbb{E}\left[E(f_{n})-E(f^{*})\right]=\mathbb{E}\left[E(f_{F}^{*})-E(f^{*})\right]+\mathbb{E}\left[E(f_{n})-E(f_{F}^{*})\right]=\mathcal{E}_{\mathrm{app}}+\mathcal{E}_{\mathrm{est}}\)
where \(\mathcal{E}_{\mathrm{app}}\) is the approximation error (the gap between the best function in \(F\) and the overall optimum) and \(\mathcal{E}_{\mathrm{est}}\) is the estimation error (the gap between the function obtained from finite training data and the best function in \(F\)).
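As a quick illustration, here is a minimal Python sketch (hypothetical data distribution and predictor; all names are illustrative) contrasting the empirical risk \(E_n(f)\), a sample average, with the expected risk \(E(f)\) that it approximates:

```python
import numpy as np

# Hypothetical setup: squared loss, data model y = 2x + Gaussian noise,
# and one fixed candidate predictor f from some family F.
rng = np.random.default_rng(0)

def loss(y_hat, y):                      # l(f(x), y) = (f(x) - y)^2
    return (y_hat - y) ** 2

def sample(n):                           # draw n pairs from P(x, y)
    x = rng.normal(size=n)
    return x, 2.0 * x + rng.normal(scale=0.5, size=n)

f = lambda x: 1.8 * x                    # a candidate predictor in F

x, y = sample(1_000)
print("E_n(f) =", loss(f(x), y).mean())  # empirical risk on n = 1000 samples

x, y = sample(1_000_000)
print("E(f)  ~", loss(f(x), y).mean())   # Monte Carlo estimate of expected risk
```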
Computing the exact minimizer \(f_{n}\) can be expensive. But since \(E_{n}(f)\) is itself only an approximation of \(E(f)\), we do not need the exact minimizer, and we can stop the optimization before it fully converges.
Suppose the approximate solution \(\hat{f}_{n}\) we obtain satisfies \(E_{n}(\hat{f}_{n})<E_{n}(f_{n})+\rho\). Then:
\(\mathbb{E}\left[E(\hat{f}_{n})-E(f^{*})\right]=\mathbb{E}\left[E(f_{F}^{*})-E(f^{*})\right]+\mathbb{E}\left[E(f_{n})-E(f_{F}^{*})\right]+\mathbb{E}\left[E(\hat{f}_{n})-E(f_{n})\right]=\mathcal{E}_{\mathrm{app}}+\mathcal{E}_{\mathrm{est}}+\mathcal{E}_{\mathrm{opt}}\)
with an additional optimization error term \(\mathcal{E}_{\mathrm{opt}}\).
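A minimal sketch of this early-stopping rule, assuming a least-squares empirical risk and using NumPy's exact least-squares solver as a stand-in for the true minimizer \(f_n\) (sizes and names are illustrative):

```python
import numpy as np

# Assumed setup: E_n(w) = (1/n) * ||Xw - y||^2, minimized by gradient
# descent that stops as soon as E_n(w) < E_n(w_n) + rho instead of
# running to full convergence.
rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
E_n = lambda w: np.mean((X @ w - y) ** 2)

w_n = np.linalg.lstsq(X, y, rcond=None)[0]       # exact minimizer f_n
rho = 1e-3                                       # tolerated optimization error
L = 2.0 * np.linalg.eigvalsh(X.T @ X).max() / n  # Lipschitz constant of the gradient

w, steps = np.zeros(d), 0
while E_n(w) >= E_n(w_n) + rho:                  # stop once within rho of optimal
    w -= (1.0 / L) * (2.0 / n) * X.T @ (X @ w - y)
    steps += 1
print(steps, E_n(w) - E_n(w_n))                  # O(kappa * log(1/rho)) iterations
```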

  • Bound under general conditions:
    \( \mathcal{E}_{\mathrm{app}}+\mathcal{E}_{\mathrm{est}} \leq c\left(\mathcal{E}_{\mathrm{app}}+\left(\frac{d}{n} \log \frac{n}{d}\right)^{\alpha}\right) \quad \text { for } \frac{1}{2} \leq \alpha \leq 1\)
    Here \(\alpha\) is the estimation-rate exponent: \(\alpha=\frac{1}{2}\) in the worst case, with faster rates up to \(\alpha=1\) under favorable conditions (a property of the problem, not the gradient-descent step size). The total error is therefore:
    \( \mathcal{E}=\mathcal{E}_{\mathrm{app}}+\mathcal{E}_{\mathrm{est}}+\mathcal{E}_{\mathrm{opt}}=\mathbb{E}\left[E\left(\hat{f}_{n}\right)-E\left(f^{*}\right)\right] \leq c\left(\mathcal{E}_{\mathrm{app}}+\left(\frac{d}{n} \log \frac{n}{d}\right)^{\alpha}+\rho\right)\)
    To reach \(\mathcal{E} \leq c(\mathcal{E}_{\mathrm{app}}+\varepsilon)\), one chooses \(n\) and \(\rho\) so that both \(\left(\frac{d}{n} \log \frac{n}{d}\right)^{\alpha} \sim \varepsilon\) and \(\rho \sim \varepsilon\); this is how the last column of the table below is derived.
  • Complexity of the gradient-descent variants (GD = batch gradient descent, 2GD = second-order gradient descent, SGD = stochastic gradient descent, 2SGD = second-order SGD; \(\kappa\) is the condition number and \(\nu\) a problem-dependent constant):
    \( \begin{array}{lcccc}
    \hline
    \text { Algorithm } & \text { Cost of one iteration } & \text { Iterations to reach } \rho & \text { Time to reach accuracy } \rho & \text { Time to reach } \mathcal{E} \leq c\left(\mathcal{E}_{\text {app }}+\varepsilon\right) \\
    \hline
    \text { GD } & \mathcal{O}(n d) & \mathcal{O}\left(\kappa \log \frac{1}{\rho}\right) & \mathcal{O}\left(n d \kappa \log \frac{1}{\rho}\right) & \mathcal{O}\left(\frac{d^{2} \kappa}{\varepsilon^{1 / \alpha}} \log ^{2} \frac{1}{\varepsilon}\right) \\
    \text { 2GD } & \mathcal{O}\left(d^{2}+n d\right) & \mathcal{O}\left(\log \log \frac{1}{\rho}\right) & \mathcal{O}\left(\left(d^{2}+n d\right) \log \log \frac{1}{\rho}\right) & \mathcal{O}\left(\frac{d^{2}}{\varepsilon^{1 / \alpha}} \log \frac{1}{\varepsilon} \log \log \frac{1}{\varepsilon}\right) \\
    \text { SGD } & \mathcal{O}(d) & \frac{\nu \kappa^{2}}{\rho}+o\left(\frac{1}{\rho}\right) & \mathcal{O}\left(\frac{d \nu \kappa^{2}}{\rho}\right) & \mathcal{O}\left(\frac{d \nu \kappa^{2}}{\varepsilon}\right) \\
    \text { 2SGD } & \mathcal{O}\left(d^{2}\right) & \frac{\nu}{\rho}+o\left(\frac{1}{\rho}\right) & \mathcal{O}\left(\frac{d^{2} \nu}{\rho}\right) & \mathcal{O}\left(\frac{d^{2} \nu}{\varepsilon}\right) \\
    \hline
    \end{array}\)
    Conclusions:
    The SGD and 2SGD results do not depend on the estimation rate \( \alpha\).
    Second-order algorithms bring little asymptotic improvement in \( \varepsilon\).
    Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite being the worst optimization algorithms, as the sketch below illustrates.
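To make the first two table columns concrete, here is a minimal Python sketch (assumed least-squares setup; sizes and step sizes are illustrative, not tuned): one GD iteration touches all \(n\) samples and costs \(\mathcal{O}(nd)\), while one SGD iteration touches a single sample and costs \(\mathcal{O}(d)\), so for the same computation budget SGD performs \(n\) times more updates:

```python
import numpy as np

# Assumed setup: least-squares empirical risk. Both solvers get the same
# computation budget of two passes over the data, i.e. 2 GD iterations
# (each a full O(nd) pass) versus 2n single-sample O(d) SGD updates.
rng = np.random.default_rng(0)
n, d = 10_000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
E_n = lambda w: np.mean((X @ w - y) ** 2)  # empirical risk

# Batch gradient descent: 2 iterations, each touching all n samples.
w = np.zeros(d)
for _ in range(2):
    w -= 0.25 * (2.0 / n) * X.T @ (X @ w - y)
print("GD  after 2 passes:", E_n(w))

# SGD: 2n single-sample updates with a 1/t step size (the schedule
# behind the nu * kappa^2 / rho iteration count in the table).
w = np.zeros(d)
for t in range(1, 2 * n + 1):
    i = rng.integers(n)
    w -= (0.5 / (t + 25)) * 2.0 * (X[i] @ w - y[i]) * X[i]
print("SGD after 2 passes:", E_n(w))
```

With the same computation budget, SGD's many cheap updates typically land far closer to the minimum of \(E_n\) than two exact gradient steps, which is why the stochastic rows of the table win once \(n\) is large.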