Tuesday, February 13, 2018

Neural Network Notes

Some miscellaneous neural network terms. I hope to expand on all of them later.

Deep Learning

"Taking complex raw data and creating higher-order features automatically in order to make a simpler classification (or regression) output is a hallmark of deep learning... The best way to take advantage of this power is to match the input data to the appropriate deep network architecture." [1]

Architecture of ANNs

"We can use an arbitrary number of neurons to define a layer and there are no rules about how big or small this number can be. However, how complex of a problem we can model is directly correlated to how many neurons are in the hidden layers of our networks. This might push you to begin with a large number of neurons from the start but these neurons come with a cost... There are also cases in which a larger model will sometimes converge easier because it will simply 'memorize' the training data." [1]

"Hidden layers are concerned with extracting progressively higher-order features from the raw data."

"A more continuous distribution of input data is generally best modelled with a ReLU activation function. Optionally, we'd suggest using the tanh activation function (if the network isn't very deep) in the event that the ReLU did not achieve good results (with the caveat that there could be other hyperparameter-related issues with the network." [1]

"Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all…One hidden layer is sufficient for the large majority of problems." (StackOverflow)

In the output layer for binary classification "we'd use a sigmoid output layer with a single neuron to give us a real value in the range of 0.0 to 1.0". [1]
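
As a rough sketch of what that output layer looks like in code (Keras-style API; the input size and hidden layer width here are made up for illustration, not from [1]):

# Minimal binary classifier sketch: single sigmoid output neuron giving a value in [0.0, 1.0]
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(32, activation='relu', input_shape=(20,)),  # illustrative hidden layer
    layers.Dense(1, activation='sigmoid')                    # single neuron, output in 0.0..1.0
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])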

"the best way to build a neural network model: Cause it to overfit and then regularize it to death... Regularization works by adding an extra term to the normal gradient computed."

The right model

"If we have a mutliclass modeling problem yet we only care about the best score across these classes, we'd use a softmax output layer with an arg-max() function to get the highest score of all the classes. The softmax output layer gives us a probability distribution over all the classes" [1]

"If we want to get multiple classifications per output (eg person + car), we do not want softmax as an output layer. Instead, we'd use the sigmoid output layer with n number of neurons, giving us a probability distribution (0.0 to 1.0) for every class independently". [1]

"In certain architectures of deep networks, reconstruction loss functions help the network extract features more effectively when paired with the appropriate activation function. An example of this would be using the multiclass cross-entropy as a loss function in a layer with a softmax activation function for classification output." [1] However, note that cross-entropy is not symmetric so might wrongly favour certain values over others even if they're equally wrong [see this StackOverflow answer].

Optimizations

"We define a hyperparameter as any configuration setting that is free to be chosen by the user that might affect [sic] performance." [1] For example layer size, activation functions, loss functions, epochs etc.

"And so on, until we've exhausted the training inputs, which is said to complete an epoch of training."
http://neuralnetworksanddeeplearning.com/chap1.html

"First-order optimization algorithms calculate the Jacobian matrix... Second-order algorithms calculate the derivative of the Jacobian (ie, the derivative of a matrix of derivatives) by approximating the Hessian." [1]

"A major difference in first- and second-order methods is that second-order methods converge in fewer steps yet take more computation per step". [1]

"The 'vanilla' version of SGD uses gradient directly, and this can be problematic because gradient can be nearly zero fir any parameter. This causes SGD to take tiny steps in some cases, and steps that are too big for situations in which the gradient is too large. To alleviate these issues, we can use the technique such as:

"AdaGrad is monotonically decreasing and never increases the learning rate" [1]

Autoencoders

"An autoencoder is trained to reproduce its own input data."[1]

"The key difference to note between a multilayer perceptron network diagram (from earlier chapters) and an autoencoder diagram is the output layer in an autoencoder has the same number of units as the input layer does." [1]

"Autoencoders are good at powering anomaly detection systems."

Mini-batches

If the cost function is C, then "stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient ∇C by computing ∇Cx for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient ∇C, and this helps speed up gradient descent, and thus learning."

This "works by randomly picking out a small number mm of randomly chosen training inputs. We'll label those random training inputs X1,X2,…,Xm and refer to them as a mini-batch. Provided the sample size m is large enough we expect that the average value of the ∇CXj will be roughly equal to the average over all  ∇Cx" (from neuralnetworksanddeeplearning.com).

Boltzmann machines

Defined as "a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off". [1]

"The main difference between RBMs and the more general class of autoencoders is in how they calculate the gradients." [1]

Regularization

"Dropout and DropConnect mute parts of the input to each layer such that the neural network learns other portions".

ReLU

ReLUs (rectified linear units) "are the current state of the art because they have proven to work in many situations. Because the gradient of a ReLU is either zero or constant, it is possible to rein in the vanishing/exploding gradient issue. ReLU activation functions have shown [sic] to train better in practice than sigmoid functions". [1]
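
In code, both the function and its gradient are trivial, and the gradient really is either zero or a constant (NumPy sketch):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 0 for x < 0 and a constant 1 for x > 0
    return (x > 0).astype(float)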

Leaky ReLUs

"Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function being zero when x < 0, a leaky ReLU will instead have a small negative slope (of 0.01, or so)" (from here).
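
The leaky variant only changes the negative half, swapping the hard zero for a small slope so some gradient always flows (NumPy sketch; the 0.01 slope follows the quote above):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # a small negative slope for x < 0 keeps a non-zero gradient flowing,
    # so the unit cannot get permanently stuck outputting zero
    return np.where(x > 0, x, alpha * x)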

[1] Deep Learning - a practitioner's guide.
