... corporate and government research funding dries up for a while, until about the mid-1980s ...
... deep learning, in a sense, "begins" ...
# Generate a random linear regression problem
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
n = 25
b0, b1, e = 4, 1, .2 * np.random.randn(n)
x = np.random.randn(n)
y = b0 + b1*x + e

# Plot the single variable (x) and response (y)
plt.figure(figsize=(6,2))
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.ylim(0, 7)
plt.show()
from IPython.display import Image

Image(url='https://i.stack.imgur.com/ddJFC.png', width=500)
Image(url='http://ww2.tnstate.edu/ganter/BIO-311-Ch12-Eq5a.gif', width=500)
Image(url='https://i.stack.imgur.com/bmg5Z.png', width=500)
#Image(url='https://www.cs.toronto.edu/~frossard/post/linear_regression/sgd.gif')
Image(url='https://alykhantejani.github.io/images/gradient_descent_line_graph.gif')
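The animation above fits a line by gradient descent. A minimal sketch of that loop, reusing the same synthetic data setup as earlier (the learning rate and iteration count here are illustrative choices, not from the animation):

```python
import numpy as np

# Same synthetic data as above
np.random.seed(1)
n = 25
b0, b1, e = 4, 1, .2 * np.random.randn(n)
x = np.random.randn(n)
y = b0 + b1*x + e

# Gradient descent on mean squared error for y ~ w0 + w1*x
w0, w1, lr = 0.0, 0.0, 0.1
for _ in range(500):
    pred = w0 + w1 * x
    err = pred - y
    w0 -= lr * 2 * err.mean()          # d(MSE)/d(w0)
    w1 -= lr * 2 * (err * x).mean()    # d(MSE)/d(w1)

print(round(w0, 2), round(w1, 2))      # should land close to the true (4, 1)
```

Each step nudges the intercept and slope downhill on the squared-error surface, which is exactly what the animated line is doing frame by frame.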
Linear regression can be viewed as a (very simple) neural network:
Image('media/network_architectures/Slide2.jpg', width=img_width)
Image('media/network_architectures/Slide3.jpg', width=img_width)
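In code, the correspondence is direct: a single "neuron" with an identity activation computes exactly a linear regression prediction. A tiny sketch (the weights here are made-up values for illustration, not fit to data):

```python
import numpy as np

def linear_neuron(x, w, b):
    """A single neuron with identity activation -- exactly a linear regression prediction."""
    return np.dot(x, w) + b

# Hypothetical weights, just to show the computation
x = np.array([1.5, -2.0])
w = np.array([0.5, 0.25])
b = 4.0
print(linear_neuron(x, w, b))  # 4 + 0.5*1.5 + 0.25*(-2.0) = 4.25
```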
Logistic regression can also be expressed this way; we just need to change the activation function:
Image('media/network_architectures/Slide4.jpg', width=img_width)
Image('media/network_architectures/Slide5.jpg', width=img_width)
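Concretely, the only change from the linear neuron is wrapping the same weighted sum in a sigmoid (weights again hypothetical, for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_neuron(x, w, b):
    """Same linear unit as before, but passed through a sigmoid activation."""
    return sigmoid(np.dot(x, w) + b)

# Hypothetical weights, just to show the computation
x = np.array([1.5, -2.0])
w = np.array([0.5, 0.25])
b = 4.0
p = logistic_neuron(x, w, b)   # a probability in (0, 1)
```

The sigmoid squashes the linear output into (0, 1), which is what makes it a probability model rather than a regression.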
Adding neurons is like adding extra copies of the original model; in this case, two logistic regressions:
Image('media/network_architectures/Slide6.jpg', width=img_width)
Image('media/network_architectures/Slide7.jpg', width=img_width)
With the right parameterization, a single sigmoid neuron can do relatively complex things: Sigmoid -> Step Function
Image('media/nn1.png', width=img_width)
Image('media/nn2.png', width=img_width)
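The sigmoid-to-step trick is easy to see numerically: scaling up the input weight sharpens the sigmoid until it is essentially a step. A quick sketch (the scale factors are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigmoid(w * x) approaches a step at x = 0 as the weight w grows
x = np.linspace(-1, 1, 5)          # [-1, -0.5, 0, 0.5, 1]
soft = sigmoid(2 * x)              # gentle slope
hard = sigmoid(30 * x)             # nearly a step function
print(np.round(hard, 3))           # [0.  0.  0.5 1.  1. ]
```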
More neurons means more "folds":
Image('media/nn3.png', width=img_width)
Image('media/nn4.png', width=img_width)
This type of reasoning is how ANNs have been shown to approximate any continuous function (the universal approximation theorem).
Image('media/nn5.png', width=500)
Takeaway: through many "towers" like this, an ANN can approximate any function of 2 inputs (i.e., any 3D surface).
ReLU activations are much more common in deep networks than sigmoid activations:
Image('media/relu.png', width=500)
Other activation functions worth mentioning: Leaky ReLU, softplus, tanh
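These activations are all one-liners; a quick plain-NumPy sketch of each, for reference:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, a=0.01):
    # small slope a for negative inputs instead of flat zero
    return np.where(z > 0, z, a * z)

def softplus(z):
    # smooth approximation of relu: log(1 + e^z)
    return np.log1p(np.exp(z))

# tanh is available directly as np.tanh

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))        # [0. 0. 2.]
print(leaky_relu(z))  # [-0.02  0.    2.  ]
print(softplus(z))    # approx [0.127 0.693 2.127]
```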
If single-layer networks can approximate any function arbitrarily well, why do more layers lead to better-performing models on many common tasks?
At the moment, nobody has a "complete" answer to this question, though some theory on the subject is out there.
Here are a few decent reasons, though:
Image(url='media/partition_3neuron.png', width=800)
Image(url='http://rinuboney.github.io/img/AI_system_parts.png')
aka Decision Surfaces
Different common models allow for a variety of ways to divide the input region into spaces where the predicted value will differ:
Image(url='http://scikit-learn.org/stable/_images/sphx_glr_plot_classifier_comparison_001.png')
Every neuron added to a single-layer network with 2 inputs gives you one more line with which to split the input space.
Consider a 3-Neuron, 1-Layer network:
Image(url='media/231layernn.png', width=500)
3 neurons mean 3 separating lines within input space:
Image(url='media/partition_3neuron.png', width=500)
b = [-5, -5, -5]
w1 = [[1, -1, 0], [1, 1, -3]]
w2 = [.1, .1, .1]
network_fn = lambda: get_one_layer_network(b, w1, w2)
y = get_network_response_surface(X, network_fn)
plot_network_response_surface(v, y)
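The helper `get_one_layer_network` isn't shown in this excerpt. Assuming it builds the function computed by a 1-hidden-layer sigmoid network from biases `b`, input weights `w1`, and output weights `w2`, a plausible reconstruction looks like this (the exact conventions of the real helper may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def get_one_layer_network(b, w1, w2):
    """Hypothetical reconstruction: returns f(X) for a 1-hidden-layer sigmoid
    network. X is (n_points, 2); w1 is (2, n_neurons); b, w2 are (n_neurons,)."""
    b, w1, w2 = np.asarray(b), np.asarray(w1), np.asarray(w2)
    def f(X):
        h = sigmoid(X @ w1 + b)   # hidden activations, one column per neuron
        return h @ w2             # weighted sum of the hidden units
    return f

# The 3-neuron example from above
b = [-5, -5, -5]
w1 = [[1, -1, 0], [1, 1, -3]]
w2 = [.1, .1, .1]
f = get_one_layer_network(b, w1, w2)
print(f(np.array([[0.0, 0.0]])))   # near 0 at the origin: all three units are "off"
```

Each hidden unit turns "on" across one line in the input plane, which is where the three separating lines in the plot come from.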
What about this network (same number of neurons, one extra layer)?
Image(url='media/2211layernn.png', width=500)
First, to understand what this does, it's easier to look at a 1-layer, 2-neuron network ...
Image(url='media/221layernn.png', width=500)
b = [0, 0]
w1 = [[1, 0], [0, 1]]
w2 = [.1, .1]
network_fn = lambda: get_one_layer_network(b, w1, w2)
y = get_network_response_surface(X, network_fn)
plot_network_response_surface(v, y)
OK, so back to the original question: what does adding a single neuron in a second layer do?
The two-layer network we're talking about:
Image(url='media/2211layernn.png', width=500)
b1 = [0, 0]
b2 = [-15]
w1 = [[1, 0], [0, 1]]
w2 = [1, 1]
w3 = [1]
network_fn = lambda: get_two_layer_network(b1, b2, w1, w2, w3)
y = get_network_response_surface(X, network_fn)
plot_network_response_surface(v, y)
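As with the one-layer helper, `get_two_layer_network` isn't defined in this excerpt. Assuming it wires the layers up in the obvious way (two sigmoid neurons, then one sigmoid neuron, then a linear output weight), one plausible sketch; the real helper's scaling conventions may differ, so the surface it produces may not match the plot exactly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def get_two_layer_network(b1, b2, w1, w2, w3):
    """Hypothetical reconstruction: a 2-input network with a 2-neuron sigmoid
    layer, a 1-neuron sigmoid layer, and a linear output weight."""
    b1, b2 = np.asarray(b1), np.asarray(b2)
    w1, w2, w3 = np.asarray(w1), np.asarray(w2), np.asarray(w3)
    def f(X):
        h1 = sigmoid(X @ w1 + b1)    # first hidden layer (2 neurons)
        h2 = sigmoid(h1 @ w2 + b2)   # second hidden layer (1 neuron)
        return h2 * w3               # linear output
    return f

b1, b2 = [0, 0], [-15]
w1 = [[1, 0], [0, 1]]
w2 = [1, 1]
w3 = [1]
f = get_two_layer_network(b1, b2, w1, w2, w3)
```

The key structural point survives any convention differences: the second-layer neuron takes the first layer's outputs as its inputs, so it can carve up regions that the first layer has already created.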
Carrying this out to larger numbers of neurons and layers, it has been shown [1] that bounds exist on the number of possible response regions per parameter used in a network.
Given:
$l$ = number of layers
$d$ = number of inputs
$k$ = number of neurons ($\geq d$)
Multi-Layer NN response regions per parameter = $ \Omega(\lfloor\frac{k}{d}\rfloor^{(l - 1)} \frac{k^{(d-2)}}{l})$
Single-Layer NN response regions per parameter = $ O(l^{(d - 1)}k^{(d - 1)}) $
[1] Pascanu, Montúfar, and Bengio, "On the number of response regions of deep feed forward networks with piece-wise linear activations"
Here are some examples of real-life networks, how much data they require, and what it takes to train them...
well one of them at least:
Image(url='http://www.image-net.org/challenges/LSVRC/2014/ILSVRC2012_val_00042692.png')
Accuracy vs. network complexity in ImageNet LSVRC competitions (blob size is # of parameters):
Image(url='media/network_arch_over_time.svg', width=700)
When does deep learning apply to a problem?
Image(url='media/deep_learning_matrix.jpg', width=500)
Determining Quality of Product Descriptions
This was a small but interesting study on applying some ideas from deep learning to normal business problems:
Image(url='media/quid_results.png')
Possibly the most interesting applications of deep learning for small businesses lie in transfer learning.
Another lies in the by-products ("exhaust") of the tooling built around deep learning, which makes it easier to build models that actually match your problem ...
Breakfast Reading:
Best Book (free too):
The prettiest presentations ever: