Deep Learning: The Role of the Activation Function and How to Choose One

II. Choosing an Activation Function: Why Does ReLU Win?

1. Overcoming the vanishing gradient

ReLU's piecewise-linear form effectively mitigates the vanishing-gradient problem: its derivative is exactly 1 for any positive input, so gradients pass through active units unattenuated. By contrast, the sigmoid's derivative never exceeds 0.25, and multiplying such factors layer after layer shrinks the gradient exponentially with depth.

http://neuralnetworksanddeeplearning.com/chap5.html
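A minimal numerical sketch of this point (the depth-10 chain and the pre-activation value 0.5 are arbitrary illustrative choices): backpropagation multiplies one activation derivative per layer, so sigmoid's bounded derivative shrinks the signal exponentially with depth, while ReLU's derivative of 1 preserves it.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)          # never exceeds 0.25

def relu_grad(x):
    return float(x > 0)           # exactly 1 for any positive input

# Backprop through a chain of layers multiplies one such factor per layer.
# Assume (for illustration) every layer sees the same positive pre-activation.
depth, z = 10, 0.5
sig_factor = sigmoid_grad(z) ** depth   # shrinks exponentially with depth
relu_factor = relu_grad(z) ** depth     # stays exactly 1.0
print(sig_factor, relu_factor)
```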

2. Sparsity of the network (Occam's razor)

ReLU drives some neurons' outputs to exactly zero, which makes the network's activations sparse and helps alleviate overfitting.
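To make the sparsity claim concrete, here is a small sketch (the zero-mean Gaussian pre-activations are an illustrative assumption): roughly half of the units end up outputting exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# Pre-activations drawn from a zero-mean Gaussian: about half are negative.
z = rng.standard_normal(10_000)
a = np.maximum(0.0, z)              # ReLU
sparsity = np.mean(a == 0.0)        # fraction of neurons silenced
print(f"{sparsity:.1%} of activations are exactly zero")
```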

4. Computational savings

ReLU is cheap to compute: it only needs to check whether the input is greater than zero, with no exponential operations (which sigmoid and tanh require).
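Side by side, the two forward passes look like this (a sketch; only the relative per-element operation counts matter here): sigmoid needs one exponential per element, ReLU only a comparison.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # one exponential per element

def relu(x):
    return np.maximum(0.0, x)         # one comparison per element

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))     # negatives clipped to zero, positives passed through
print(sigmoid(x))
```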

[Supplement 1] Universal approximation theorem

Universal approximation theorem: for a neural network with a single hidden layer, if the activation function is monotonically increasing, bounded, non-constant, and continuous, then there always exists a single-hidden-layer network with a finite number N of neurons that can approximate any given continuous function (Borel-measurable function) arbitrarily well.
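The theorem itself is non-constructive, but the intuition can be sketched by hand: build a staircase out of steep sigmoids, one hidden unit per grid cell, with output weights equal to the increments of the target function across each cell. (The target sin(2πx), the grid size N, and the steepness are arbitrary choices for this demo.)

```python
import numpy as np

def sigmoid(x):
    # clip to avoid overflow warnings at extreme steepness
    return 1.0 / (1.0 + np.exp(-np.clip(x, -60.0, 60.0)))

f = lambda x: np.sin(2 * np.pi * x)       # target continuous function on [0, 1]

N = 200                                   # finite number of hidden units
knots = np.linspace(0.0, 1.0, N + 1)
centers = (knots[:-1] + knots[1:]) / 2    # each unit switches on mid-cell
weights = np.diff(f(knots))               # output weight = increment of f per cell
steepness = 5000.0                        # steep sigmoid approximates a step

def net(x):
    # single hidden layer: bias f(0) plus a weighted sum of N sigmoid units
    hidden = sigmoid(steepness * (np.asarray(x)[:, None] - centers))
    return f(knots[0]) + hidden @ weights

xs = np.linspace(0.0, 1.0, 1000)
max_err = np.max(np.abs(net(xs) - f(xs)))
print(f"max |net - f| = {max_err:.4f}")   # shrinks as N grows
```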

References

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition?

Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks.

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.


Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function.

https://zhuanlan.zhihu.com/p/25110450

https://zhuanlan.zhihu.com/p/22561439

https://www.zhihu.com/question/29021768

Neural Networks and Deep Learning - A visual proof that neural nets can compute any function
http://neuralnetworksanddeeplearning.com/chap4.html

Neural Networks and Deep Learning - Why are deep neural networks hard to train?
http://neuralnetworksanddeeplearning.com/chap5.html#discussion_why