In a neural network, the activation function is responsible for transforming the summed weighted input of a node into the node's activation, or output, for that input. ReLU is a piecewise linear function that outputs the input directly if it is positive and outputs zero otherwise. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance. In this article we look at ReLU in detail, organized into the following parts:

1. Limitations of the Sigmoid and Tanh activation functions

2. ReLU (Rectified Linear Activation Function)

3. How to implement ReLU

4. Advantages of ReLU

5. Tips for using ReLU

1. Limitations of the Sigmoid and Tanh activation functions

A neural network is made up of layers of nodes and learns to map input samples to outputs. For a given node, the inputs are multiplied by the node's weights and summed together. This value is called the node's summed activation. The summed activation is then transformed by an activation function, which defines the specific output, or "activation", of the node.

The simplest activation function is linear activation, which applies no transformation at all. A network consisting only of linear activation functions is easy to train but cannot learn complex mapping functions. Linear activation functions are still used in the output layer of networks that predict a quantity (for example, regression problems).

Nonlinear activation functions are preferred because they allow nodes to learn more complex structure in the data. Two widely used nonlinear activation functions are the sigmoid function and the hyperbolic tangent activation function.

The sigmoid activation function, also called the logistic function, has traditionally been a very popular activation function for neural networks. It transforms its input into a value between 0.0 and 1.0: inputs much larger than 1.0 are mapped to values close to 1.0, and inputs much smaller than 0.0 are snapped to values close to 0.0. Over all possible inputs, the function traces an S-shape from 0 through 0.5 up to 1.0. For a long time, until the early 1990s, it was the default activation function for neural networks.

The hyperbolic tangent function, tanh for short, is a nonlinear activation function with a similar shape whose outputs lie between -1.0 and 1.0. Through the late 1990s and early 2000s, tanh was preferred over the sigmoid activation function because models using it were easier to train and often gave better predictive performance.
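To make the two shapes concrete, here is a minimal sketch of both functions using only the Python standard library (the sample inputs are chosen purely for illustration):

from math import exp, tanh

# logistic (sigmoid) function: squashes any real input into the range (0.0, 1.0)
def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# print a few sample inputs to show the S-shaped, saturating behavior
for x in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print('x=%.1f sigmoid=%.5f tanh=%.5f' % (x, sigmoid(x), tanh(x)))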

A common problem with both the sigmoid and tanh functions is that they saturate: large values snap to 1.0 and small values snap to -1 or 0. In addition, each function is only really sensitive to changes around the midpoint of its input.

This limited sensitivity and saturation occur regardless of whether the summed activation provided as input to the node contains useful information. Once saturated, it becomes difficult for the learning algorithm to keep adjusting the weights to improve the performance of the model.

Finally, even as hardware capability improved, very deep neural networks trained on GPUs with the sigmoid and tanh activation functions remained hard to train. These nonlinear activation functions prevent useful gradient information from flowing through large networks. The error is propagated back through the network and used to update the weights, but the amount of error shrinks dramatically with each additional layer it passes through. This is the so-called vanishing gradient problem, and it effectively prevents deep (many-layered) networks from learning.
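A simplified sketch of why this happens with the sigmoid: its derivative is at most 0.25, so even in the best case the gradient factor contributed by a stack of sigmoid layers shrinks geometrically. This toy loop ignores the weights and treats each layer as a single sigmoid unit:

from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# the derivative of the sigmoid is s * (1 - s), which peaks at 0.25 when x = 0
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

gradient = 1.0
for layer in range(1, 21):
    gradient *= sigmoid_derivative(0.0)  # best-case derivative at the midpoint
print('gradient factor after 20 sigmoid layers: %.2e' % gradient)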

Although the use of nonlinear activation functions allows neural networks to learn complex mapping functions, they effectively prevented learning algorithms from working with deep networks. In the late 2000s and early 2010s, workarounds were found using alternative network types such as Boltzmann machines and layer-wise training or unsupervised pre-training.

2. ReLU (Rectified Linear Activation Function)

To train deep neural networks, an activation function is needed that looks and acts like a linear function but is in fact nonlinear, allowing complex relationships in the data to be learned. The function must also remain sensitive to its summed activation input and avoid easy saturation.

This is where ReLU comes in; its adoption can be considered one of the few milestones of the deep learning revolution. The ReLU activation function is a simple calculation: if the input is greater than 0, it returns the value provided as input directly; if the input is 0 or less, it returns the value 0.

We can describe this with a simple if-statement, as follows:

if input > 0:
    return input
else:
    return 0

For values greater than zero, this function is linear, which means it has many of the desirable properties of a linear activation function when training a neural network with backpropagation. Yet it is a nonlinear function, because negative values are always output as zero. Since the rectified function is linear over half of the input domain and nonlinear over the other half, it is referred to as a piecewise linear function.

3. How to implement ReLU

We can easily implement the ReLU activation function in Python:

# rectified linear function
def rectified(x):
    return max(0.0, x)

We expect any positive value to be returned unchanged, while an input of 0.0 or any negative value is returned as 0.0.

Here are some examples of inputs and outputs of the rectified linear activation function:

# demonstrate the rectified linear function

# rectified linear function
def rectified(x):
    return max(0.0, x)

# demonstrate with a positive input
x = 1.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
x = 1000.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
# demonstrate with a zero input
x = 0.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
# demonstrate with a negative input
x = -1.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
x = -1000.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))

The output is as follows:

rectified(1.0) is 1.0
rectified(1000.0) is 1000.0
rectified(0.0) is 0.0
rectified(-1.0) is 0.0
rectified(-1000.0) is 0.0
We can get an idea of the relationship between the inputs and outputs of the function by plotting a series of inputs and the calculated outputs. The example below generates a series of integers from -10 to 10, calculates the rectified linear activation for each input, and then plots the result.

# plot inputs and outputs
from matplotlib import pyplot

# rectified linear function
def rectified(x):
    return max(0.0, x)

# define a series of inputs
series_in = [x for x in range(-10, 11)]
# calculate outputs for our inputs
series_out = [rectified(x) for x in series_in]
# line plot of raw inputs to rectified outputs
pyplot.plot(series_in, series_out)
pyplot.show()

Running this example creates a plot showing that all negative and zero inputs are snapped to 0.0, whereas positive inputs are returned as-is.

The derivative of the ReLU function is its slope: the slope for negative values is 0.0 and the slope for positive values is 1.0.

Traditionally, the field of neural networks avoided any activation function that was not completely differentiable, and ReLU is a piecewise function. Technically, we cannot calculate the derivative of ReLU when the input is exactly 0.0; however, we can assume it is 0.
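As a small sketch of this convention, the slope can be coded directly, treating an input of exactly 0.0 as having a slope of 0.0:

# derivative (slope) of the rectified linear function
def rectified_derivative(x):
    # the derivative is undefined at exactly 0.0; here we assume it is 0.0
    return 1.0 if x > 0.0 else 0.0

for x in [-1000.0, -1.0, 0.0, 1.0, 1000.0]:
    print('rectified_derivative(%.1f) is %.1f' % (x, rectified_derivative(x)))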

4. Advantages of ReLU

4.1. Computational simplicity

The tanh and sigmoid activation functions require computing an exponential, whereas ReLU only requires max(), so it is simpler and cheaper to compute.
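A rough way to see this is a micro-benchmark of the two expressions; the exact numbers depend entirely on the interpreter and hardware, but max() is typically faster:

import timeit

# time one million evaluations of each expression (results vary by machine)
relu_time = timeit.timeit('max(0.0, 1.5)', number=1000000)
sigmoid_time = timeit.timeit('1.0 / (1.0 + exp(-1.5))', setup='from math import exp', number=1000000)
print('relu:    %.3f seconds' % relu_time)
print('sigmoid: %.3f seconds' % sigmoid_time)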

4.2. Sparse representation

An important benefit of ReLU is that it can output a true zero value. This differs from the tanh and sigmoid activation functions, which learn to approximate a zero output, producing values very close to zero but never a true zero. This means that negative inputs can produce true zero outputs, allowing the activations of hidden layers in a neural network to contain one or more true zero values. This is called a sparse representation and is a desirable property, as it can accelerate learning and simplify the model.
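A minimal sketch of this difference, counting the exact zeros produced by each function on some random inputs (the input range and sample size here are arbitrary choices for illustration):

import random
from math import tanh

def rectified(x):
    return max(0.0, x)

# draw some random inputs and count how many outputs are exactly zero
random.seed(1)
inputs = [random.uniform(-3.0, 3.0) for _ in range(1000)]
relu_zeros = sum(1 for x in inputs if rectified(x) == 0.0)
tanh_zeros = sum(1 for x in inputs if tanh(x) == 0.0)
print('relu produced %d true zeros out of %d inputs' % (relu_zeros, len(inputs)))
print('tanh produced %d true zeros out of %d inputs' % (tanh_zeros, len(inputs)))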

4.3. Linear behavior

ReLU looks and behaves largely like a linear function, and in general a neural network is easier to optimize when its behavior is linear or close to linear.

The key to this property is that networks trained with this activation function almost completely avoid the vanishing gradient problem, because the gradients remain proportional to the node activations.

4.4. Training deep networks

The emergence of ReLU, together with improved hardware, made it possible to successfully train deep multi-layer networks with a nonlinear activation function using backpropagation.

5. Tips for using ReLU

5.1. Use ReLU as the default activation function

For a long time, the default activation function was the sigmoid activation function. Later, tanh became the default. For modern deep learning neural networks, the default activation function is ReLU.

5.2. Use ReLU with MLPs and CNNs, but not RNNs

ReLU can be used with most types of neural networks. It is typically used as the activation function for multilayer perceptrons and convolutional neural networks, and this has been confirmed by many papers. Traditionally, LSTMs use the tanh activation function for the cell state and the sigmoid activation function for node outputs; ReLU is usually not appropriate for RNN-type networks.
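As an illustration (assuming TensorFlow/Keras is installed; the layer sizes and input shape are arbitrary), a small MLP using ReLU in its hidden layers might look like this:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# two hidden layers with relu, sigmoid output for a binary classification task
model = Sequential([
    Dense(64, activation='relu', input_shape=(20,)),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.summary()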

5.3. Try a smaller bias input value

The bias is an input on the node with a fixed value; it has the effect of shifting the activation function, and traditionally the bias input value is set to 1.0. When using ReLU in a network, consider setting the bias to a small value, such as 0.1.
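For example, a sketch of how this might be configured in Keras (assuming TensorFlow/Keras; the layer size is an arbitrary placeholder):

from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import Constant

# start the biases of a relu layer at a small positive value instead of 0.0
hidden = Dense(64, activation='relu', bias_initializer=Constant(0.1))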

5.4. Use "He Weight Initialization"

Before training a neural network, its weights must be initialized to small random values. When ReLU is used in a network and the weights are initialized to small random values centered on zero, half of the units in the network will output a zero value by default. There are many heuristic methods for initializing the weights of a neural network, but no single optimal scheme. Kaiming He's paper pointed out that Xavier initialization and other schemes are not well suited to ReLU; it proposed a small modification of Xavier initialization, known as He Weight Initialization, that makes it suitable for ReLU.
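A minimal NumPy sketch of the idea, drawing weights from a zero-mean Gaussian with standard deviation sqrt(2 / fan_in) (the layer sizes are arbitrary; Keras exposes the same scheme as the 'he_normal' and 'he_uniform' initializers):

import numpy as np

def he_init(fan_in, fan_out, seed=1):
    # weights drawn from a zero-mean Gaussian with std sqrt(2 / fan_in)
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

weights = he_init(100, 64)
print('empirical std %.3f, expected about %.3f' % (weights.std(), np.sqrt(2.0 / 100)))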

5.5. Scale input data

It is good practice to scale input data before feeding it to a neural network. This may involve standardizing variables to have zero mean and unit variance, or normalizing each value to the range 0 to 1. Without data scaling, on many problems the weights of the neural network can grow large, making the network unstable and increasing generalization error. This good practice of scaling inputs applies whether or not ReLU is used in the network.
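Both transformations take only a few lines of NumPy (the sample data here is made up purely for illustration):

import numpy as np

data = np.array([[10.0, 200.0],
                 [12.0, 180.0],
                 [ 8.0, 220.0]])

# normalization: rescale each column into the range 0 to 1
normalized = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# standardization: rescale each column to zero mean and unit variance
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

print(normalized)
print(standardized)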

5.6. Use a weight penalty

The output of ReLU is unbounded in the positive domain, which means the output can in some cases continue to grow. It can therefore be a good idea to use some form of weight regularization, such as the L1 or L2 vector norm. This is a good approach both for promoting a sparse representation of the model (for example, with L1 regularization) and for reducing generalization error.
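A sketch of how such penalties can be attached to a ReLU layer in Keras (assuming TensorFlow/Keras; the penalty coefficients here are arbitrary placeholders):

from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2

# an l1 penalty encourages sparse weights; an l2 penalty discourages large weights
sparse_hidden = Dense(64, activation='relu', kernel_regularizer=l1(0.001))
bounded_hidden = Dense(64, activation='relu', kernel_regularizer=l2(0.001))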
