# Linear Regression
Let's start with a simple example of linear regression. Here we generate random data with an underlying affine relationship between the inputs and outputs, then fit a linear regression to reproduce that relationship.
This example is adapted from https://github.com/hasktorch/hasktorch/tree/master/examples/regression.
In a standard supervised learning setup, the neural network is initialized using a randomized initialization scheme. An iterative optimization is then performed: at each iteration, a batch of data is drawn, model outputs and the loss are computed, and the model parameters are updated using gradients of the loss.
Here is a simple end-to-end example in Hasktorch:
```haskell
module Main where

import Control.Monad (when)
import Torch

groundTruth :: Tensor -> Tensor
groundTruth t = squeezeAll $ matmul t weight + bias
  where
    weight = asTensor ([42.0, 64.0, 96.0] :: [Float])
    bias = full' [1] (3.14 :: Float)

model :: Linear -> Tensor -> Tensor
model state input = squeezeAll $ linear state input

main :: IO ()
main = do
  init <- sample $ LinearSpec {in_features = numFeatures, out_features = 1}
  randGen <- mkGenerator (Device CPU 0) 12345
  (trained, _) <- foldLoop (init, randGen) 2000 $ \(state, randGen) i -> do
    let (input, randGen') = randn' [batchSize, numFeatures] randGen
        (y, y') = (groundTruth input, model state input)
        loss = mseLoss y y'
    when (i `mod` 100 == 0) $ do
      putStrLn $ "Iteration: " ++ show i ++ " | Loss: " ++ show loss
    (state', _) <- runStep state GD loss 5e-3
    pure (state', randGen')
  pure ()
  where
    batchSize = 4
    numFeatures = 3
```
Let's break this down piece by piece:
- The `init` variable is initialized as a `Linear` type (defined in `Torch.NN`) using `sample`, which randomly initializes a `Linear` value. Initialization is discussed in more detail in the following section.
- `Linear` is a built-in algebraic data type (ADT) implementing the `Parameterized` typeclass and representing a fully connected linear layer, equivalent to linear regression when no hidden layers are present.
- `init` is passed into `Torch.Optim.foldLoop` as the state variable. Note that `foldLoop` is just a convenience function defined using `foldM`:

  ```haskell
  foldLoop :: a -> Int -> (a -> Int -> IO a) -> IO a
  foldLoop init count body = foldM body init [1 .. count]
  ```

- At each optimization step, `Torch.Optim.runStep` computes an updated model state given the current state, the optimizer, the loss value, and the learning rate.
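Because `foldLoop` is just `foldM` in disguise, it can be exercised on its own with no tensors involved. This toy sketch (assuming nothing beyond `Control.Monad` from base) threads an integer accumulator through ten iterations the same way the training loop threads its `(state, randGen)` pair:

```haskell
module Main where

import Control.Monad (foldM, when)

foldLoop :: a -> Int -> (a -> Int -> IO a) -> IO a
foldLoop init count body = foldM body init [1 .. count]

main :: IO ()
main = do
  -- Accumulate the sum of the iteration indices 1..10, logging progress
  -- periodically, mirroring the structure of the training loop above.
  total <- foldLoop (0 :: Int) 10 $ \acc i -> do
    when (i `mod` 5 == 0) $ putStrLn $ "Iteration: " ++ show i
    pure (acc + i)
  print total  -- prints 55
```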
Note that the expression of the architecture in the `Torch.NN.linear` function (a single linear layer, or alternatively a neural network with zero hidden layers) does not require an explicit representation of the compute graph; it is simply a composition of tensor ops. Because of the autodiff mechanism described in the previous section, the graph is constructed automatically as pure functional ops are applied, given a context of a set of independent variables.
## Weight Initialization
Random initialization of weights is not a pure function, since two random initializations return different values. Initialization occurs by calling the `Torch.NN.sample` function for an ADT implementing the `Torch.NN.Randomizable` typeclass:

```haskell
class Randomizable spec f | spec -> f where
  sample :: spec -> IO f
```
In a typical (but not required) usage, `f` is an ADT that implements the `Parameterized` typeclass, so that there is a pair of types: a specification type implementing the `spec` input to `sample`, and a type implementing `Parameterized` representing the model state.
For example, a linear fully connected layer is provided by the `Torch.NN` module and defined therein as:

```haskell
data Linear = Linear {weight :: Parameter, bias :: Parameter}
  deriving (Show, Generic)
```

and is typically used with a specification type:

```haskell
data LinearSpec = LinearSpec {in_features :: Int, out_features :: Int}
  deriving (Show, Eq)
```
Putting this together, in untyped tensor usage, the user can implement custom models or layers implementing the `Parameterized` typeclass, built up from other ADTs that also implement `Parameterized`. The shape of the data required for initialization is described by a type implementing `Randomizable`'s `spec` parameter, and the `sample` implementation specifies the default weight initialization.
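As a sketch of this pattern, a hypothetical two-layer model (the `MLP` and `MLPSpec` names are illustrative, not part of `Torch.NN`) pairs a spec type with a `Parameterized` ADT built from `Linear`:

```haskell
{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics (Generic)
import Torch

-- Specification type: the shape information needed to initialize the model.
data MLPSpec = MLPSpec {hiddenSpec :: LinearSpec, outputSpec :: LinearSpec}

-- Model state: built up from ADTs (here Linear) that are already Parameterized.
data MLP = MLP {hidden :: Linear, output :: Linear}
  deriving (Generic, Parameterized)

-- Default initialization: delegate to the sample implementations of the parts.
instance Randomizable MLPSpec MLP where
  sample (MLPSpec h o) = MLP <$> sample h <*> sample o
```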
Note that this initialization approach is specific to untyped tensors. One consequence of using typed tensors is that the information in these `spec` types is reflected in the type itself and is thus not needed.
What if you want to use a custom initialization that differs from the default? You can define an alternative function with the same signature `spec -> IO f` and use the alternative function instead of `sample`.
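For instance, a zero-initialized variant for `Linear` could look like the following sketch (the `zeroInitLinear` name is hypothetical, and the weight shape here assumes the PyTorch-style `[out_features, in_features]` layout):

```haskell
import Torch

-- Alternative to `sample`: the same LinearSpec -> IO Linear shape, but
-- deterministic zero initialization instead of the default random scheme.
zeroInitLinear :: LinearSpec -> IO Linear
zeroInitLinear spec = do
  w <- makeIndependent $ zeros' [out_features spec, in_features spec]
  b <- makeIndependent $ zeros' [out_features spec]
  pure $ Linear {weight = w, bias = b}
```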
## Optimizers
Optimizer implementations are functions that take as input the current parameter values of a model, gradient estimates of the loss with respect to those parameters for a single batch, and a learning rate describing how large a perturbation to make to the parameters in order to reduce the loss. Given those inputs, they output a new set of parameters.
In the simple case of stochastic gradient descent, the new parameters are obtained by subtracting from the current parameters (θ) the gradient of the loss ∇J scaled by the learning rate η:
$$\theta_{i+1} = \theta_i - \eta \nabla J(\theta)$$
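Specialized to plain `Double` lists rather than tensors, this update rule is a one-liner. A minimal sketch (the `sgdStep` name is hypothetical, not a library function):

```haskell
-- θ_{i+1} = θ_i - η ∇J(θ), applied element-wise over the parameter list.
sgdStep :: Double -> [Double] -> [Double] -> [Double]
sgdStep eta params grads = zipWith (\p g -> p - eta * g) params grads

main :: IO ()
main = print (sgdStep 0.5 [1.0, 2.0] [0.5, -1.0])  -- prints [0.75,2.5]
```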
While stochastic gradient descent is a stateless function of the parameters, loss, and gradient, some optimizers have a notion of internal state that is propagated from one step to the next, for example, retaining and updating momentum between steps:
$$\Delta \theta_i = \alpha \Delta \theta_{i-1} - \eta \nabla J(\theta)$$

$$\theta_{i+1} = \theta_i + \Delta \theta_i$$
In this case, the momentum term Δ θᵢ is carried forward as internal state of the optimizer that is propagated to the next step. α is an optimizer parameter which determines a weighting on the momentum term relative to the gradient.
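The same two equations, again specialized to `Double` lists (names hypothetical): the previous Δθ is passed in, and the new Δθ is returned alongside the updated parameters as the state to be threaded into the next step:

```haskell
-- One momentum step: Δθ_i = α Δθ_{i-1} - η ∇J(θ), then θ_{i+1} = θ_i + Δθ_i.
momentumStep
  :: Double                -- α, momentum coefficient
  -> Double                -- η, learning rate
  -> [Double]              -- Δθ_{i-1}, optimizer state from the previous step
  -> [Double]              -- ∇J(θ), current gradients
  -> [Double]              -- θ_i, current parameters
  -> ([Double], [Double])  -- (θ_{i+1}, Δθ_i)
momentumStep alpha eta prevDelta grads params = (params', delta)
  where
    delta   = zipWith (\d g -> alpha * d - eta * g) prevDelta grads
    params' = zipWith (+) params delta

main :: IO ()
main = print (momentumStep 0.5 0.5 [1.0] [0.5] [1.0])  -- prints ([1.25],[0.25])
```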
Implementation of an optimizer consists of defining an ADT describing the optimizer state and a `step` function that implements a single step perturbation given the learning rate, loss gradients, current parameters, and optimizer state.

This function interface is described in the `Torch.Optim.Optimizer` typeclass:

```haskell
class Optimizer o where
  step :: LearningRate -> Gradients -> [Tensor] -> o -> ([Tensor], o)
```
`Gradients` is a newtype wrapper around a list of tensors to make intent explicit: `newtype Gradients = Gradients [Tensor]`.

Hasktorch provides built-in optimizer implementations in `Torch.Optim`. Some illustrative example implementations follow.
Being stateless, stochastic gradient descent is represented by an ADT with a single nullary constructor:

```haskell
data GD = GD
```

and implements the step function as:

```haskell
instance Optimizer GD where
  step lr gradients depParameters dummy = (gd lr gradients depParameters, dummy)
    where
      step p dp = p - (lr * dp)
      gd lr (Gradients gradients) parameters = zipWith step parameters gradients
```
The use of an optimizer was illustrated in the linear regression example using the function `runStep`:

```haskell
(state', _) <- runStep state GD loss 5e-3
```

In this case the new optimizer state returned is ignored (as `_`), since gradient descent does not have any internal state. Under the hood, `runStep` does a little bookkeeping: making independent variables from a model, computing gradients, and passing values to the `step` function. Usually a user can ignore these details and just pass model parameters and the optimizer to `runStep`, an abstracted interface which takes parameter values, the optimizer value, the loss (a tensor), and the learning rate as input and returns updated model and optimizer values:
```haskell
runStep ::
  (Parameterized model, Optimizer optimizer) =>
  model ->
  optimizer ->
  Tensor ->
  LearningRate ->
  IO (model, optimizer)
```