Original: https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
Author: Justin Johnson
This tutorial introduces the fundamental concepts of PyTorch through self-contained examples.
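At its core, PyTorch provides two main features: an n-dimensional Tensor, similar to a numpy array but able to run on GPUs, and automatic differentiation for building and training neural networks.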
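We will use a fully connected ReLU network as our running example. The network has a single hidden layer and is trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.
Before introducing PyTorch, we will first implement the network using numpy.
Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it knows nothing about computation graphs, deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations: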
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
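One way to gain confidence in a hand-written backward pass like the one above is to compare it against finite differences. The sketch below is not part of the original tutorial: it uses a much smaller network, and the sizes, the seed, and the helper names loss_fn, analytic_grads, and numerical_grad are assumptions chosen for illustration. It checks the analytic gradients of w1 and w2 against a central-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny sizes so the finite-difference loop stays fast (illustrative values).
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


def analytic_grads(w1, w2):
    """Same backward pass as in the training loop above."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2


def numerical_grad(f, w, eps=1e-6):
    """Central-difference estimate of df/dw, perturbing one entry at a time."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


grad_w1, grad_w2 = analytic_grads(w1, w2)
num_w1 = numerical_grad(lambda: loss_fn(w1, w2), w1)
num_w2 = numerical_grad(lambda: loss_fn(w1, w2), w2)

err_w1 = np.max(np.abs(grad_w1 - num_w1))
err_w2 = np.max(np.abs(grad_w2 - num_w2))
print("max abs error:", err_w1, err_w2)
```

If the two errors are tiny relative to the gradient magnitudes, the manual backward pass is very likely correct. The check can be misleading when a hidden pre-activation sits almost exactly at zero, because ReLU is not differentiable there.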
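Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won't be enough for modern deep learning.
Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they are also useful as a generic tool for scientific computing.
Unlike numpy, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply need to place it on the right device, for example by passing device=torch.device("cuda:0") when creating it.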
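A minimal sketch of device placement, assuming only that torch is installed: it falls back to the CPU when no CUDA device is visible, so it runs on any machine. The tensor names a, b, and c are illustrative.

```python
import torch

# Fall back to the CPU when no CUDA device is visible.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device ...
a = torch.randn(3, 4, device=device)

# ... or move an existing CPU tensor there with .to().
b = torch.ones(4, 2).to(device)

# Both operands live on the same device, so the product does too.
c = a.mm(b)
print(c.device, c.shape)
```

Operands of an operation such as mm must live on the same device, which is why the examples in this tutorial pass the same device to every tensor they create.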
Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we need to manually implement the forward and backward passes through the network:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
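In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but it can quickly become very cumbersome for large, complex networks.
Thankfully, we can use automatic differentiation to automate the computation of backward passes in neural networks. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network defines a computational graph; nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.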
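This sounds complicated, but it is quite simple in practice. Each Tensor represents a node in a computational graph. If x is a Tensor that has x.requires_grad=True, then x.grad is another Tensor holding the gradient of x with respect to some scalar value.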
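The relationship between requires_grad and .grad can be seen in a few lines. This is a small illustrative sketch (the values and the name s are arbitrary): the scalar is s = sum(x_i ** 2), so its gradient with respect to x is 2 * x.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The scalar we differentiate: s = sum(x_i ** 2)
s = (x ** 2).sum()

# Populates x.grad with ds/dx = 2 * x
s.backward()

print(x.grad)  # tensor([2., 4., 6.])
```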
Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# requires_grad defaults to False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a zero-dimensional Tensor (shape ());
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
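The zero_() calls at the end of the loop matter because backward() adds to .grad instead of overwriting it. A small sketch of that behavior, with an arbitrary two-element tensor and illustrative names first, accumulated, and fresh:

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)

# First backward pass: d/dw sum(3 * w) = [3, 3]
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top.
(3 * w).sum().backward()
accumulated = w.grad.clone()

# Zero in place, then a fresh backward pass gives the plain gradient again.
w.grad.zero_()
(3 * w).sum().backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```

Without the reset, each training step would apply the sum of all gradients computed so far instead of the current one.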
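Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.
In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions. We can then use our new autograd operator by calling its apply method like a function, passing Tensors containing input data.
In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network: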
# -*- coding: utf-8 -*-
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0")  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()
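A hand-written backward function is easy to get subtly wrong, so it can help to verify it numerically. The sketch below is an addition, not part of the original tutorial: it repeats the MyReLU class so that it runs on its own and checks it with torch.autograd.gradcheck. The seed, the input shape, and the 0.1 offset that keeps inputs away from the kink at zero are illustrative assumptions.

```python
import torch


class MyReLU(torch.autograd.Function):
    """Same custom ReLU as above, repeated so this snippet runs on its own."""

    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


torch.manual_seed(0)

# gradcheck compares against finite differences, so it needs float64 inputs.
# Entries are pushed at least 0.1 away from zero, where ReLU has a kink.
raw = torch.randn(4, 5, dtype=torch.double)
inp = torch.where(raw >= 0, raw + 0.1, raw - 0.1).requires_grad_()

ok = torch.autograd.gradcheck(MyReLU.apply, (inp,), eps=1e-6, atol=1e-4)
print("gradcheck passed:", ok)
```

gradcheck raises an error when the analytic and numerical gradients disagree, so a wrong mask in backward would show up here.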
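Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.
When building neural networks we frequently think of arranging the computation into layers, some of which have learnable parameters that will be optimized during learning.
In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.
In PyTorch, the nn package serves this same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of useful loss functions that are commonly used when training neural networks.
In this example we use the nn package to implement our two-layer network: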
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
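To see the internal state that the Modules hold, one can list the parameters of the same Sequential model. This is an illustrative sketch; the names shapes and total are not from the original tutorial.

```python
import torch

D_in, H, D_out = 1000, 100, 10
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Each Linear stores weight with shape (out_features, in_features) and a bias.
# Parameter names are prefixed with the layer's position in the Sequential.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)

total = sum(p.numel() for p in model.parameters())
print("total learnable parameters:", total)
```

The ReLU at position 1 has no learnable parameters, so only positions 0 and 2 appear in the listing.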
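Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (using torch.no_grad() or .data to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers such as AdaGrad, RMSProp, or Adam.
The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.
In this example we will use the nn package to define our model as before, but we will optimize the model using the Adam algorithm provided by the optim package: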
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()
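To connect the optimizer abstraction back to the manual updates used earlier, the sketch below checks that a single step of plain torch.optim.SGD (no momentum) matches the rule p -= lr * p.grad. It uses SGD instead of Adam because Adam's update rule is more involved; the tensor values and the learning rate are arbitrary illustrative choices.

```python
import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([p], lr=0.1)

loss = (p ** 2).sum()  # gradient is 2 * p = [2, 4]
optimizer.zero_grad()
loss.backward()

# What the manual rule p -= lr * p.grad would produce.
expected = (p - 0.1 * p.grad).detach().clone()

optimizer.step()  # updates p in place
print(p.detach(), expected)
```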
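Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing nn.Module and defining a forward method that receives input Tensors and produces output Tensors using other Modules or other autograd operations on Tensors.
In this example we implement our two-layer network as a custom Module subclass: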
# -*- coding: utf-8 -*-
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
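The reason model.parameters() finds the weights of both Linear layers is that assigning a Module as an attribute registers it with the parent. A small sketch, using a hypothetical TinyNet class with made-up sizes, shows this along with the fact that a plain tensor attribute is not registered:

```python
import torch


class TinyNet(torch.nn.Module):  # hypothetical class, for illustration only
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(4, 3)
        self.linear2 = torch.nn.Linear(3, 2)
        # A plain tensor attribute is NOT registered as a parameter.
        self.scratch = torch.zeros(3)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))


net = TinyNet()
names = [name for name, _ in net.named_parameters()]
print(names)

out = net(torch.randn(5, 4))
print(out.shape)
```

Only the weights and biases of linear1 and linear2 appear in the list, so an optimizer built from net.parameters() would never touch scratch.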
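As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.
For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.
We can easily implement this model as a Module subclass: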
# -*- coding: utf-8 -*-
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
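When the same Module is applied several times in one forward pass, autograd sums the gradient contributions from every use. The sketch below shows this on the smallest possible case, a hypothetical 1x1 Linear layer without bias, with the weight and input values chosen so the arithmetic is easy to follow.

```python
import torch

shared = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    shared.weight.fill_(2.0)  # w = 2

x = torch.tensor([[3.0]])

# Apply the same module twice: y = w * (w * x) = w**2 * x = 12
y = shared(shared(x))
y.sum().backward()

# dy/dw = 2 * w * x = 12: contributions from both uses are summed.
print(y.item(), shared.weight.grad.item())
```

Each use alone would contribute w * x = 6 to the gradient; the stored value of 12 is their sum, which is what makes weight sharing in DynamicNet work without any special handling.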
You can browse the above examples here.
Warm-up: numpy
PyTorch: Tensors
PyTorch: Tensors and autograd
PyTorch: Defining new autograd functions
TensorFlow: Static Graphs
PyTorch: nn
PyTorch: optim
PyTorch: Custom nn Modules
PyTorch: Control Flow + Weight Sharing