Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
Hasktorch is a library for scientific computing and differentiable programming.
Synopsis
Documentation
What is Hasktorch?
Hasktorch is a Haskell library for scientific computing and
differentiable programming. It leverages libtorch
(the backend
library powering PyTorch) for efficient tensor manipulation and
automatic differentiation, while bringing to bear Haskell's expressive
type system and first-class support for for the functional programming
paradigm.
Goal of this tutorial
The sequence of topics and examples here is loosely based on the PyTorch tutorial Deep Learning with PyTorch: A 60 Minute Blitz by Soumith Chintala.
In this tutorial we will implement a simple machine learning model. Along the way, you will learn to
- create and manipulate tensors
- build computation graphs from tensors and compute gradients
- optimize parameters with respect to an objective function
Usage
The reader is encouraged to follow along with the examples in a GHCi session.
To start, import Torch
:
>>>
import Torch
Tensors
A Tensor
in Hasktorch is multidimensional array with a fixed shape
and element type.
For example, we can initialize a tensor with shape [3, 4]
and filled
with zeros using
>>>
Torch.zeros' [3, 4]
Tensor Float [3,4] [[ 0.0000, 0.0000, 0.0000, 0.0000], [ 0.0000, 0.0000, 0.0000, 0.0000], [ 0.0000, 0.0000, 0.0000, 0.0000]]
We can also initialize a tensor from a Haskell list using
asTensor
:
>>>
asTensor ([[4, 3], [2, 1]] :: [[Float]])
Tensor Float [2,2] [[ 4.0000 , 3.0000 ], [ 2.0000 , 1.0000 ]]
Note that the numerical type of the tensor is inferred from the types of the values in the list.
Scalar values are represented in Hasktorch as tensors with shape []
:
>>>
asTensor 3.5
Tensor Double [] 3.5000
We can get the scalar value back out using asValue
:
>>>
asValue (asTensor 3.5)
3.5
Specifying Tensor parameters
In the previous section we initialized a tensor filled with zeros
using zeros'
(note the prime suffix). Hasktorch functions use a
convention where default versions of functions use a prime suffix. The
unprimed versions of these functions expect an additional parameter
specifying tensor parameters. For example:
zeros :: [Int] -> TensorOptions
-> Tensor
TensorOptions
are typically specified by starting with
defaultOpts
and modifying using one or more of
the following:
withDType
configures the data type of the elementswithDevice
configures on which device the tensor is to be used- others (see
TensorOptions
)
For example, to construct a matrix filled with zeros of dtype Int64
:
>>>
zeros [4, 4] (withDType Int64 defaultOpts)
Tensor Int64 [4,4] [[ 0, 0, 0, 0], [ 0, 0, 0, 0], [ 0, 0, 0, 0], [ 0, 0, 0, 0]]
Tensor factories
Hasktorch comes with many "factory" functions similar to zeros
and
zeros'
useful for initializing common kinds of tensors. For example,
ones
,
full
,
eye
,
and the primed versions of these. See TensorFactories
for a
complete list.
One useful class of factory functions are those suffixed with "-like"
(e.g. onesLike
), which initialize a tensor
with the same dimensions as their argument. For example:
>>>
let x = zeros' [3, 2]
>>>
onesLike x
Tensor Float [3,2] [[ 1.0000 , 1.0000 ], [ 1.0000 , 1.0000 ], [ 1.0000 , 1.0000 ]]
Operations
Most operations are pure functions, similar to Haskell standard library math operations.
Tensors implement the Num
typeclass:
>>>
let x = ones' [4]
>>>
x + x
Tensor Float [4] [ 2.0000 , 2.0000 , 2.0000 , 2.0000 ]
Some operations transform a tensor:
>>>
Torch.relu (asTensor ([-1.0, -0.5, 0.5, 1] :: [Float]))
Tensor Float [4] [ 0.0000, 0.0000, 0.5000 , 1.0000 ]
selectDim
slices out a selection by specifying a dimension and index:
>>>
let x = asTensor [[[1, 2, 3]], [[4, 5, 6]], [[7, 8, 9]], [[10, 11, 12]]]
>>>
shape x
[4,1,3]
>>>
select 2 1 x
Tensor Double [4,1] [[ 2.0000 ], [ 5.0000 ], [ 8.0000 ], [ 11.0000 ]]
>>>
let y = asTensor [1, 2, 3]
>>>
Torch.select 0 1 y
Tensor Double [] 2.0000
Values can be extracted from a tensor using asValue
so long as the
dtype matches the Haskell type:
>>>
let x = asTensor ([2] :: [Int])
>>>
let y = asValue x :: Int
>>>
y
2
Randomness
Create a randomly initialized matrix:
>>>
x <-randIO' [2, 2]
>>>
x
Tensor Float [2,2] [[ 0.2961 , 0.5166 ], [ 0.2517 , 0.6886 ]]
Note that since random initialization returns a different result each time, unlike other tensor constructors, is monadic reflecting the context of an underlying random number generator (RNG) changing state.
Hasktorch includes variations of random tensor initializers in which the
RNG object is threaded explicitly rather than implicitly. See the
Torch.Random
module functions for details and variations. Samplers for
which the RNG is not explicit such as randIO'
example above use the
-IO
suffix.
Automatic Differentiation
Automatic differentiation is achieved through the use of two primary
functions in the Autograd
module,
makeIndependent
and grad
.
Independent Tensors
makeIndependent
is used to instantiate an independent tensor variable
from which a compute graph is constructed for differentiation, while
grad
uses compute graph to compute gradients.
makeIndependent
takes a tensor as input and returns an IO action
which produces an IndependentTensor
:
makeIndependent :: Tensor -> IO IndependentTensor
What is the definition of the IndependentTensor
type produced by the
makeIndependent
action? It’s defined in the Hasktorch library as:
newtype IndependentTensor = IndependentTensor { toDependent :: Tensor } deriving (Show)
Thus IndependentTensor
is simply a wrapper around the underlying
Tensor that is passed in as the argument to makeIndependent
. Building
up computations using ops applied to the toDependent
tensor of an
IndependentTensor
will implicitly construct a compute graph to which
grad
can be applied.
All tensors have an underlying property that can be retrieved using
the requiresGrad
function which indicates whether
they are a differentiable value in a compute graph. [1]
>>>
let x = asTensor [1, 2, 3]
>>>
y <- makeIndependent (asTensor [4, 5, 6])
>>>
let y' = toDependent y
>>>
let z = x + y'
>>>
requiresGrad x
False
>>>
requiresGrad y'
True
>>>
requiresGrad z
True
In summary, tensors that are computations of values derived from tensor
constructors (e.g. ones
, zeros
, fill
, randIO
etc.) outside the
context of a IndependentTensor
are not differentiable. Tensors that
are derived from computations on the toDependent
value of an
IndependentTensor
are differentiable, as the above example
illustrates.
Gradients
Once a computation graph is constructed by applying ops and computing
derived quantities stemming from a toDependent
value of an
IndependentTensor
, a gradient can be taken by using the grad
function specifying in the first argument tensor corresponding to
function value of interest and a list of Independent
tensor variables
that the the derivative is taken with respect to:
grad :: Tensor -> [IndependentTensor] -> [Tensor]
Let’s demonstrate this with a concrete example. We create a tensor and
derive an IndependentTensor
from it:
>>>
x <- makeIndependent (ones' [2, 2])
>>>
let x' = toDependent x
>>>
x'
Tensor Float [2,2] [[ 1.0000 , 1.0000 ], [ 1.0000 , 1.0000 ]]
Now do some computations on the dependent tensor:
>>>
let y = x' + 2
>>>
y
Tensor Float [2,2] [[ 3.0000 , 3.0000 ], [ 3.0000 , 3.0000 ]]
Since y is dependent on the x independent tensor, it is differentiable:
>>>
requiresGrad y
True
Applying more operations:
>>>
let z = y * y * 3
>>>
let out = mean z
>>>
z
Tensor Float [2,2] [[ 27.0000 , 27.0000 ], [ 27.0000 , 27.0000 ]]
Now retrieve the gradient:
>>>
grad out [x]
[Tensor Float [2,2] [[ 4.5000 , 4.5000 ], [ 4.5000 , 4.5000 ]]]
Differentiable Programs (Neural Networks)
From a functional programming perspective, a neural network is represented by data and functions, much like any other functional program. The only distinction that differentiates neural networks from any other functional program is that it implements a small interface surface to support differentiation. Thus, we can consider neural networks to be "differentiable functional programming".
The data in neural networks are the values to be fitted that parameterize the functions which carry out the inference operation and are modified based on gradients of through those functions.
As with a regular Haskell program, this data is represented by an algebraic data type (ADT). The ADT can take on any shape that’s needed to model the domain of interest, allowing a great deal of flexibility and enabling all of Haskell’s strenghts in data modeling - can use sum or product types, nest types, etc. The ADT can implement various typeclasses to take on other functionality.
The core interface that defines capability specific to differentiable
programming is the Parameterized
typeclass:
class Parameterized f where flattenParameters :: f -> [Parameter] default flattenParameters :: (Generic f, Parameterized' (Rep f)) => f -> [Parameter] flattenParameters f = flattenParameters' (from f) replaceOwnParameters :: f -> ParamStream f default replaceOwnParameters :: (Generic f, Parameterized' (Rep f)) => f -> ParamStream f replaceOwnParameters f = to <$> replaceOwnParameters' (from f)
Note Parameter
is simply a type alias for IndependentTensor
in the
context of neural networks (i.e. type Parameter = IndependentTensor
).
The role of flattenParameters
is to unroll any arbitrary ADT
representation of a neural network into a standard flattened
representation consisting a list of IndependentTensor
which is used to
compute gradients.
replaceOwnParameters
is used to update parameters. ParamStream is a
type alias for a State type with state represented by a Parameter
list
and a value parameter corresponding to the ADT defining the model.
type ParamStream a = State [Parameter] a
Note the use of generics. Generics allow the compiler to usually
automatically derive flattenParameters
and replaceOwnParameter
instances without any code if your type is built up on tensors,
containers of tensors, or other types that are built from tensor values
(for example, layer modules provided in Torch.NN
. In many cases, as
you’ll see in the following examples, you will only need to add
instance Parameterized MyNeuralNetwork
(where MyNeuralNetwork
is an ADT definition for your model) and the
compiler will derive implementations for the flattenParameters
and
replaceOwnParameters
.
Linear Regression
Lets start with a simple example of linear regression. Here we generate random data with an underlying affine relationship between the inputs and outputs, then fit a linear regression to reproduce that relationship.
This example is adapted from https://github.com/hasktorch/hasktorch/tree/master/examples/regression.
In a standard supervised learning model, the neural network is initialized using a randomized initialization scheme. An iterative optimization is performed such that at each iteration a batch.
module Main where import Control.Monad (when) import Torch groundTruth :: Tensor -> Tensor groundTruth t = squeezeAll $ matmul t weight + bias where weight = asTensor ([42.0, 64.0, 96.0] :: [Float]) bias = full' [1] (3.14 :: Float) model :: Linear -> Tensor -> Tensor model state input = squeezeAll $ linear state input main :: IO () main = do init <- sample $ LinearSpec{in_features = numFeatures, out_features = 1} randGen <- mkGenerator (Device CPU 0) 12345 (trained, _) <- foldLoop (init, randGen) 2000 $ \(state, randGen) i -> do let (input, randGen') = randn' [batchSize, numFeatures] randGen (y, y') = (groundTruth input, model state input) loss = mseLoss y y' when (i `mod` 100 == 0) $ do putStrLn $ "Iteration: " ++ show i ++ " | Loss: " ++ show loss (newParam, _) <- runStep state GD loss 5e-3 pure (replaceParameters state newParam, randGen') pure () where batchSize = 4 numFeatures = 3
Note the expression of the architecture in the linear
function (a single linear layer, or alternatively a neural network
with zero hidden layers), does not require an explicit representation
of the compute graph, but is simply a composition of tensor
ops. Because of the autodiff mechanism described in the previous
section, the graph is constructed automatically as pure functional ops
are applied, given a context of a set of independent variables.
The init
variable is initialized as a Linear
type (defined in
Torch.NN
) using sample
which randomly initializes a Linear
value.
Linear
is a built-in ADT implementing the Parameterized
typeclass
and representing a fully connected linear layer, equivalent to linear
regression when no hidden layers are present.
init
is passed into the foldLoop
[2] as the
state@
variable.
A new list of Parameter
values is passed back from
runStep
(which calls grad
to retrieve gradients, given
a loss function, learning rate, and optimizer) and the typeclass
function replaceParameters
is used to update the model at each
iteration.
Initialization is discussed in more detail in the following section
Weight Initialization
Random initialization of weights is not a pure function since two
random initializations return different values. Initialization occurs
by calling the sample
function for an ADT (spec
)
implementing the Randomizable
typeclass:
class Randomizable spec f | spec -> f where sample :: spec -> IO f
In a typical (but not required) usage, f
is an ADT that implements the
Parameterized
typeclass, so that there’s a pair of types - a
specification type implementing the spec
input to sample
and a type
implementing Parameterizable
representing the model state.
For example, a linear fully connected layer is provided by the
Torch.NN
module and defined therein as:
data Linear = Linear { weight :: Parameter, bias :: Parameter } deriving (Show, Generic)
and is typically used with a specification type:
data LinearSpec = LinearSpec { in_features :: Int, out_features :: Int } deriving (Show, Eq)
Putting this together, in untyped tensor usage, the user can implement
custom models or layers implementing the Parameterizable
typeclass
built up from other ADTs implementing Parameterizable
. The shape of
the data required for initialization is described by a type implementing
Randomizable
’s spec
parameter, and the sample
implementation
specifies the default weight initialization.
Note this initialization approach is specific to untyped tensors. One
consequence of using typed tensors is that the information in these
spec
types is reflected in the type itself and thus are not needed.
What if you want to use a custom initialization that differs from the
default? You can define an alternative function with the same signature
spec -> IO f
and use the alternative function instead of sample
.
Optimizers
Optimization implementations are functions that take as input the current parameter values of a model, parameter gradient estimates of the loss function at those parameters for a single batch, and a characteristic learning describing how large a perturbation to make to the parameters in order to reduce the loss. Given those inputs, they output a new set of parameters.
In the simple case of stochastic gradient descent, the function to output a new set of parameters is to subtract from the current parameter \(\theta\), the gradient of the loss \(\nabla J\) scaled by the learning rate \(\eta\):
\[\theta_{i+1} = \theta_i - \eta \nabla J(\theta)\]
While stochastic gradient descent is a stateless function of the parameters, loss, and gradient, some optimizers have a notion of internal state that is propagated from one step to the step, for example, retaining and updating momentum between steps:
\[\begin{gathered} \Delta \theta_i = \alpha \Delta \theta_{i-1} - \eta \nabla J(\theta) \\ \theta_{i+1} = \theta_i + \Delta \theta_i \end{gathered}\]
In this case, the momentum term \(\Delta \theta_i\) is carried forward as internal state of the optimizer that is propagated to the next step. \(\alpha\) is an optimizer parameter which determines a weighting on the momentum term relative to the gradient.
Implementation of an optimizer consists of defining an ADT describing
the optimizer state and a step
function that implements a single step
perturbation given the learning rate, loss gradients, current
parameters, and optimizer state.
This function interface is described in the Optimizer
typeclass interface:
class Optimizer o where step :: LearningRate -> Gradients -> [Tensor] -> o -> ([Tensor], o)
Gradients
is a newtype wrapper around a list of tensors to make
intent explicit: newtype Gradients = Gradients [Tensor]
.
Hasktorch provides built-in optimizer implementations in Torch.Optim
.
Some illustrative example implementations follow.
Being stateless, stochastic gradient descent has an ADT that has only one constructor value:
data GD = GD
and implements the step function as:
instance Optimizer GD where step lr gradients depParameters dummy = (gd lr gradients depParameters, dummy) where step p dp = p - (lr * dp) gd lr (Gradients gradients) parameters = zipWith step parameters gradients
The use of an optimizer was illustrated in the linear regression example
using the function runStep
(newParam, _) <- runStep state GD loss 5e-3
In this case the new optimizer state returned is ignored (as _
) since
gradient descent does not have any internal state. Under the hood,
runStep
does a little bookkeeping making independent variables from a
model, computing gradients, and passing values to the step
function.
Usually a user can ignore the details and just pass model parameters and
the optimizer to runStep as an abstracted interface which takes
parameter values, the optimizer value, loss (a tensor), and learning
rate as input and returns new parameters and an updated optimizer value.
runStep :: (Parameterized p, Optimizer o) => p -> o -> Tensor -> LearningRate -> IO ([Parameter], o)
Typed Tensors
Typed tensors provide an alternative API for which tensor characteristics are encoded in the type of the tensor. The Hasktorch library is layered such that [ TODO ]
Using typed tensors increases the expressiveness of program invariants that can be automatically checked by GHC at the cost of needing to be more explicit in writing code and also requiring working with Haskell’s type-level machinery. Type-level Haskell programming has arisen through progressive compiler iteration leading to mechanisms that have a higher degree of complexity compared to value-level Haskell code or other languages such as Idris which were designed with type-level computations from inception. In spite of these compromises, Haskell offers enough capability to express powerful type-level representations of models that is matched by few other languages used for production applications.
Things we can do in typed Hasktorch:
- specify, check, and infer tensor shapes at compile time
- specify, check, and infer tensor data types at compile time
- specify, check, and infer tensor compute devices at compile time
We can encode all Hasktorch tensor shapes on the type level using type-level lists and natural numbers:
type EmptyShape = '[] type OneDimensionalShape (a :: Nat) = '[a] type TwoDimensionalShape (a :: Nat) (b :: Nat) = '[a, b] ...
Tensor data types and compute device types are lifted to the type level
using the DataKinds
language extension:
type BooleanCPUTensor (shape :: [Nat]) = Tensor '(CPU, 0) 'Bool shape type IntCUDATensor (shape :: [Nat]) = Tensor '(CUDA, 1) 'Int64 shape
Devices are represented as tuples consisting of a DeviceType
(here
CPU
for the CPU and CUDA
for a CUDA device, respectively) and a
device id (here the Nat
s 0
and 1
, respectively).
It is a common misconception that specifying tensor properties at compile time is only possible if all tensor properties are constants and are statically known. If this were the case, then we could only write functions over fully specified tensors, say,
boring :: BooleanCPUTensor '[] -> BooleanCPUTensor '[] boring = id
Fortunately, Haskell has the ability to reason about type variables. This feature is called parametric polymorphism. Consider this simple example of a function:
tensorNoOp :: forall (shape :: [Nat]) (dtype :: DType) (device :: (DeviceType, Nat)) . Tensor device dtype shape -> Tensor device dtype shape tensorNoOp = id
Here, shape
, dtype
, and device
are type variables that have been
constrained to be of kind shape, data type, and device, respectively.
The universal quantifier forall
implies that this function is
well-defined for all inhabitants of the types that are compatible with
the type variables and their constraints.
The tensorNoOp
function may seem trivial, and that is because its type
is very strongly constrained: Given any typed tensor, it must return a
tensor of the same shape and with the same data type and on the same
device. Besides by means of the identity, id
, there are not many ways
in which this function can be implemented.
There is a connection between a type signature of a function and
mathematical proofs. We can say that the fact that a function exists is
witnessed by its implementation. The implementation, tensorNoOp = id
,
is the proof of the theorem stated by tensorNoOp
’s type signature. And
here we have a proof that we can run this function on any device and for
any data type and tensor shape.
This is the essence of typed Hasktorch. As soon as the compiler gives its OK, we hold a proof that our program will run without shape or CUDA errors.
- PyTorch users will be familiar with this as the
requires_grad
member variable for the PyTorch tensor type. The Hasktorch mechanism is distinct from PyTorch’s mechanism - by only allowing gradients to be applied in the context of a set ofIndependentTensor
variables, it allows ops to be semantically pure and preserve referential transparency. foldLoop
is a convenience function defined in terms offoldM
asfoldLoop x count block = foldM block x ([1 .. count] :: [a])