In neural networks, the activation function converts the weighted input of a node into that node's output, or activation. ReLU
is a piecewise linear function: if the input is positive, it is output directly; otherwise, the output is zero. It has become the default activation function for many types of neural networks because models that use it are easier to train and often achieve better performance. In this article, we look at ReLU in detail, covering the following topics:

1. Limitations of the Sigmoid and Tanh activation functions

2. ReLU (Rectified Linear Activation Function)

3. How to implement ReLU

4. Advantages of ReLU

5. Tips for using ReLU

1. Limitations of the Sigmoid and Tanh activation functions

A neural network consists of layers of nodes and learns to map input samples to outputs. For a given node, the inputs are multiplied by the node's weights and summed together. This value is called the node's summed activation. An activation function then transforms the summed activation to define the node's specific output, or "activation".

The simplest activation function is linear activation, which applies no transformation at all. A network made up only of linear activation functions is easy to train but cannot learn complex mapping functions. Linear activation functions are still used in the output layer of networks that predict a quantity (for example, regression problems).

Nonlinear activation functions are preferable because they allow nodes to learn more complex structure in the data. Two widely used nonlinear activation functions are the sigmoid function and the hyperbolic tangent function.

The sigmoid activation function, also known as the logistic function, has traditionally been a very popular activation function for neural networks. The function converts its input to a value between 0.0 and 1.0: inputs much larger than zero are squashed to values close to 1.0, and likewise inputs much smaller than zero are squashed to values close to 0.0. Across all possible inputs, the function traces an S-shape from 0 through 0.5 to 1.0. For a long time, until the early 1990s, it was the default activation function for neural networks.

The hyperbolic tangent function, tanh for short, is a nonlinear activation function of similar shape whose output ranges between -1.0 and 1.0. In the late 1990s and early 2000s, tanh was preferred over the sigmoid activation function because models using tanh were easier to train and often had better predictive performance.
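For illustration, both functions can be computed with Python's standard library; a minimal sketch:

```python
# Minimal sketch of the two classic activation functions using the
# standard-library math module.
import math

def sigmoid(x):
    # logistic function: squashes any input into (0.0, 1.0)
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))      # 0.5, the midpoint
print(sigmoid(10.0))     # close to 1.0
print(sigmoid(-10.0))    # close to 0.0
print(math.tanh(10.0))   # close to 1.0
print(math.tanh(-10.0))  # close to -1.0
```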

A common problem with the sigmoid and tanh functions is that they saturate. Large values snap to 1.0 and small values snap to -1 or 0. Furthermore, each function is only really sensitive to changes around the midpoint of its input.

This limited sensitivity and the saturation happen regardless of whether the summed activation provided as input contains useful information or not. Once saturated, it becomes difficult for the learning algorithm to keep adjusting the weights to improve the performance of the model.

Finally, as hardware capability improved and very deep neural networks could be trained on GPUs, networks using the sigmoid and tanh activation functions proved hard to train. In large networks, these nonlinear activation functions cannot pass on useful gradient information. Error is propagated back through the network and used to update the weights, but the amount of error decreases dramatically with each additional layer it travels through. This is the so-called vanishing gradient problem, and it effectively prevents deep (multi-layer) networks from learning.
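The shrinking of the error signal can be sketched numerically. The example below is illustrative and assumes the best case, where every unit sits at the sigmoid's point of maximum slope:

```python
# Sketch: why sigmoid gradients vanish in deep networks.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid derivative is at most 0.25 (at x = 0), so chaining it across
# layers shrinks the backpropagated error geometrically.
grad = 1.0
for layer in range(10):
    grad *= sigmoid_grad(0.0)  # best case: the maximum slope of 0.25
print(grad)  # 0.25 ** 10, roughly 9.5e-07 after only 10 layers
```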

So although nonlinear activation functions allow neural networks to learn complex mapping functions, they effectively prevented learning algorithms from working with deep networks. In the late 2000s and early 2010s, workarounds were found by using alternative network types such as Boltzmann machines, together with layer-wise training or unsupervised pre-training.

2. ReLU (Rectified Linear Activation Function)

To train deep neural networks, we need an activation function that looks and acts like a linear function, but is in fact nonlinear, allowing complex relationships in the data to be learned. The function must also provide more sensitivity to the summed activation input and avoid easy saturation.

Hence ReLU, whose adoption can be considered one of the few milestones of the deep learning revolution. The ReLU activation function is a simple calculation: if the input is greater than 0, it returns the value provided as input directly; if the input is 0 or less, it returns the value 0.

We can describe this with a simple if-statement, as follows:

```python
if input > 0:
    return input
else:
    return 0
```

For values greater than zero, the function is linear, which means it has many of the desirable properties of a linear activation function when training a neural network with backpropagation. Yet it is a nonlinear function, because negative values are always output as zero. Because the rectified function is linear over half of the input domain and nonlinear over the other half, it is called a piecewise linear function.

3. How to implement ReLU

We can easily implement the ReLU activation function in Python:

```python
# rectified linear function
def rectified(x):
    return max(0.0, x)
```
We expect any positive value to be returned unchanged, and an input of 0.0 or a negative value to be returned as 0.0.

Here are a few examples of inputs and outputs of the rectified linear activation function:

```python
# demonstrate the rectified linear function

# rectified linear function
def rectified(x):
    return max(0.0, x)

# demonstrate with a positive input
x = 1.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
x = 1000.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
# demonstrate with a zero input
x = 0.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
# demonstrate with a negative input
x = -1.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
x = -1000.0
print('rectified(%.1f) is %.1f' % (x, rectified(x)))
```
The output is as follows:

```
rectified(1.0) is 1.0
rectified(1000.0) is 1000.0
rectified(0.0) is 0.0
rectified(-1.0) is 0.0
rectified(-1000.0) is 0.0
```
We can plot a series of inputs against the calculated outputs to see the shape of the relationship between the function's inputs and outputs. The following example generates a series of integers from -10 to 10, calculates the rectified linear activation for each input, and then plots the result:

```python
# plot inputs and outputs
from matplotlib import pyplot

# rectified linear function
def rectified(x):
    return max(0.0, x)

# define a series of inputs
series_in = [x for x in range(-10, 11)]
# calculate outputs for our inputs
series_out = [rectified(x) for x in series_in]
# line plot of raw inputs to rectified outputs
pyplot.plot(series_in, series_out)
pyplot.show()
```
Running this example creates a line plot showing that all negative and zero inputs are mapped to 0.0, while positive inputs are returned as-is:

The derivative of the ReLU function is its slope: 0.0 for negative values and 1.0 for positive values.

Traditionally, the field of neural networks avoided any activation function that was not completely differentiable, and ReLU is a piecewise function. Technically, we cannot calculate the derivative of ReLU when the input is exactly 0.0; in practice, however, we can simply assume it is 0.
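This convention is easy to state in code; a minimal sketch of the derivative described above:

```python
# Sketch of the ReLU derivative, with the x == 0 case conventionally
# set to 0.0.
def rectified_derivative(x):
    return 1.0 if x > 0.0 else 0.0

print(rectified_derivative(5.0))   # 1.0
print(rectified_derivative(-5.0))  # 0.0
print(rectified_derivative(0.0))   # 0.0 by convention
```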

4. Advantages of ReLU

4.1. Computational simplicity

The tanh and sigmoid activation functions require an exponential calculation, while ReLU needs only max(), so it is simpler and cheaper to compute.
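As a rough illustration only (absolute timings vary by machine, and a micro-benchmark like this is just a sketch), the two operations can be compared with the standard-library timeit module:

```python
# Rough timing sketch: ReLU needs only a comparison, while sigmoid
# needs an exponential. Numbers depend entirely on the machine.
import math
import timeit

relu_time = timeit.timeit('max(0.0, 1.5)', number=100_000)
sigmoid_time = timeit.timeit('1.0 / (1.0 + math.exp(-1.5))',
                             globals={'math': math}, number=100_000)
print(relu_time, sigmoid_time)
```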

4.2. Sparse representation

An important benefit of ReLU is that it can output a true zero value. This differs from the tanh and sigmoid activation functions, which learn to approximate a zero output with values very close to zero, but never a true zero. This means that negative inputs can output true zero values, allowing the activations of hidden layers in a neural network to contain one or more true zeros. This is called a sparse representation and is a desirable property, because it can accelerate learning and simplify the model.
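A small sketch makes the difference concrete: counting exact zeros produced by ReLU versus tanh on the same random inputs (the seed and sample size are illustrative choices):

```python
# Sketch: fraction of exactly-zero activations under ReLU vs tanh for
# the same random inputs, using only the standard library.
import math
import random

random.seed(1)
inputs = [random.uniform(-3.0, 3.0) for _ in range(1000)]

relu_zeros = sum(1 for x in inputs if max(0.0, x) == 0.0)
tanh_zeros = sum(1 for x in inputs if math.tanh(x) == 0.0)

print(relu_zeros)  # roughly half of the inputs: a sparse representation
print(tanh_zeros)  # tanh only approaches zero, it never outputs a true zero
```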

4.3. Linear behavior

ReLU acts mostly like a linear function and is, generally speaking, easier to optimize, because a neural network whose behavior is linear or close to linear is easier to train.

The key to this property is that networks trained with this activation function almost completely avoid the vanishing gradient problem, because the gradients remain proportional to the node activations.
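A small sketch of this property: for an active unit the local ReLU slope is exactly 1.0, so chaining it across many layers does not shrink the backpropagated gradient (contrast with sigmoid's maximum slope of 0.25):

```python
# Sketch: the ReLU derivative chained through many layers stays 1.0
# for active units, so the gradient is not shrunk layer by layer.
def relu_grad(x):
    return 1.0 if x > 0.0 else 0.0

grad = 1.0
for layer in range(10):
    grad *= relu_grad(2.0)  # an active (positive-input) unit
print(grad)  # still 1.0 after 10 layers
```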

4.4. Training deep networks

The emergence of ReLU made it possible to exploit improvements in hardware and successfully train deep multi-layer networks with a nonlinear activation function using backpropagation.

5. Tips for using ReLU

5.1. Use ReLU as the default activation function

For a long time, the default activation was the sigmoid activation function. Later, it was tanh. For modern deep learning neural networks, the default activation function is ReLU.

5.2. Use ReLU with MLPs and CNNs, but not RNNs

ReLU can be used with most types of neural networks. It is usually used as the activation function for multilayer perceptrons and convolutional neural networks, which has been confirmed by many papers. Traditionally, LSTMs use the tanh activation function for the cell state and the sigmoid activation function for node output; ReLU is usually not appropriate for RNN-type networks.

5.3. Try a smaller bias input value

The bias is an input on a node with a fixed value, and it shifts the activation function; traditionally the bias input value is set to 1.0. When using ReLU in a network, consider setting the bias to a small value, such as 0.1.
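A minimal sketch of this tip, assuming a single NumPy layer (the layer sizes and the 0.1 value are illustrative):

```python
# Sketch: a small positive bias means ReLU units start slightly active
# rather than exactly at zero.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_units = 4, 8
weights = rng.standard_normal((n_inputs, n_units)) * 0.01
biases = np.full(n_units, 0.1)  # small positive bias instead of 0.0 or 1.0

x = np.zeros(n_inputs)
activations = np.maximum(0.0, x @ weights + biases)
print(activations)  # every unit starts at 0.1, i.e. slightly active
```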

5.4. Use "He Weight Initialization"

Before training a neural network, its weights must be initialized to small random values. When using ReLU in a network whose weights are initialized to small random values centered on zero, by default half the units in the network will output a zero value. There are many heuristic methods for initializing the weights of a neural network, but there is no single best weight initialization scheme. Kaiming He's paper points out that Xavier initialization and other schemes are not suitable for ReLU; it makes a small change to Xavier initialization so that it suits ReLU and proposes He Weight Initialization, which is better matched to ReLU.
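A sketch of He initialization in NumPy, assuming zero-mean Gaussian weights with standard deviation sqrt(2 / fan_in) (the layer sizes here are illustrative):

```python
# Sketch of He weight initialization: zero-mean Gaussian weights scaled
# by sqrt(2 / fan_in), as proposed for ReLU networks.
import numpy as np

def he_init(fan_in, fan_out, seed=0):
    rng = np.random.default_rng(seed)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

w = he_init(256, 128)
print(w.std())  # close to sqrt(2 / 256), roughly 0.088
```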

5.5. Scale input data

It is good practice to scale the input data before using it with a neural network. This may involve standardizing the variables to have zero mean and unit variance, or normalizing each value to the range 0 to 1. Without data scaling, on many problems the weights of the neural network can grow large, making the network unstable and increasing generalization error. This good practice of scaling inputs applies whether or not ReLU is used in the network.
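Both options can be sketched with NumPy (the data values are illustrative):

```python
# Sketch of the two scaling options: standardization (zero mean, unit
# variance) and min-max normalization to [0, 1].
import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

standardized = (data - data.mean()) / data.std()
normalized = (data - data.min()) / (data.max() - data.min())

print(standardized.mean(), standardized.std())  # 0.0 and 1.0
print(normalized.min(), normalized.max())       # 0.0 and 1.0
```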

5.6. Use weight penalties

The output of ReLU is unbounded in the positive domain, meaning the output can in some cases continue to grow. It can therefore be a good idea to use some form of weight regularization, such as the L1 or L2 vector norm.
This is a good approach both for promoting a sparse representation of the model (for example, with L1 regularization) and for reducing generalization error.
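A sketch of how the two penalty terms are computed (the weight values and the lambda coefficient are illustrative; in practice the penalty is added to the training loss):

```python
# Sketch: L1 and L2 penalty terms on a weight vector.
import numpy as np

weights = np.array([0.5, -1.2, 0.0, 3.4])
lam = 0.01  # illustrative regularization strength

l1_penalty = lam * np.sum(np.abs(weights))   # encourages sparse weights
l2_penalty = lam * np.sum(weights ** 2)      # discourages large weights

print(l1_penalty)  # 0.01 * (0.5 + 1.2 + 0.0 + 3.4) = 0.051
print(l2_penalty)  # 0.01 * (0.25 + 1.44 + 0.0 + 11.56) = 0.1325
```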