Modular Interface #144
Comments
Cannot agree with you more |
Summarizing some points from this week's discussion:
|
I really like the idea of the modular interface. A simple but not so elegant way to get this interface right now is to add functions like the ones in this more complete example:
using Knet
# type defs
struct Dense{T,Tu}
    w::Array{T, 2}
    b::Array{T, 1}
    unit::Tu
end
Dense(dimin, dimout; T = Float32, unit = relu, initfun = xavier) =
    Dense(initfun(T, dimout, dimin), zeros(T, dimout), unit)
params(l::Dense) = Any[l.w, l.b]
(l::Dense)(x) = l.unit.(l.w * mat(x) .+ l.b)
(l::Dense)(w, x) = l.unit.(w[1] * mat(x) .+ w[2]) # needed for grad

struct Conv{T,Tu}
    w::Array{T, 4}
    b::Array{T, 4}
    unit::Tu
    convkargs
end
Conv(xdim, ydim, inc, outc; T = Float32, unit = relu, initfun = xavier, convkargs...) =
    Conv(initfun(T, xdim, ydim, inc, outc), zeros(T, 1, 1, outc, 1), unit, convkargs)
params(l::Conv) = Any[l.w, l.b]
(l::Conv)(x) = l.unit.(conv4(l.w, x; l.convkargs...) .+ l.b)
(l::Conv)(w, x) = l.unit.(conv4(w[1], x; l.convkargs...) .+ w[2])

struct Pooling
    poolkargs
end
Pooling(; kargs...) = Pooling(kargs)
params(l::Pooling) = Any[]
(l::Pooling)(x) = pool(x; l.poolkargs...)
(l::Pooling)(w, x) = l(x)

struct Chain
    layers::Array{Any, 1}
end
Chain(l...) = Chain(collect(l))
params(c::Chain) = [params(l) for l in c.layers]
(c::Chain)(x) = foldl((x, c) -> c(x), x, c.layers)
(c::Chain)(w, x) = foldl((x, c) -> c[2](c[1], x), x, zip(w, c.layers))

struct Model
    chain::Chain
    w::Array{Any, 1}
end
Model(c) = Model(c, params(c))
# utility functions
loss(w, c, x, y) = nll(c(w, x), y)
gradfun = grad(loss)
function trainepoch!(model, data, gradfun, opt)
    for (x, y) in data
        update!(model.w, gradfun(model.w, model.chain, x, y), opt)
    end
end
import Knet.optimizers
optimizers(m::Model, opt) = optimizers(m.w, opt) # use the argument m, not the global model
import Knet.accuracy
accuracy(m::Model, data) = accuracy(m.w, data, m.chain)
# example
include(Knet.dir("data","mnist.jl"))
xtrn, ytrn, xtst, ytst = mnist()
dtrn = minibatch(xtrn, ytrn, 100)
dtst = minibatch(xtst, ytst, 100)
model = Model(Chain(Conv(5, 5, 1, 20), Pooling(), Conv(5, 5, 20, 50), Pooling(),
                    Dense(800, 500), Dense(500, 10, unit = identity)))
opt = optimizers(model, Adam)
@time for _ in 1:5
    trainepoch!(model, dtrn, gradfun, opt)
    println(accuracy(model, dtst))
end
Are you interested in turning this into a pull request, or are many changes planned anyway for Julia v0.7 that would allow a more elegant way to implement this interface? Edit: After @CarloLucibello's comment I wrapped the chain and its parameters into the struct Model to avoid passing around w and the chain. This does not, however, solve the issue of method duplication. |
@denizyuret would AutoGrad allow for a PyTorch-style interface like:
x = Rec(rand(10)) # the parameters of our network
y = sum(x) # our loss' output, still a Rec
backprop!(y)
x.grad # now contains the gradient
If so, we could take PyTorch's (and Flux's) approach and initialize the parameters in the modules as Rec.
(l::Dense)(x) = l.unit.(l.w * mat(x) .+ l.b)
(l::Dense)(w, x) = l.unit.(w[1] * mat(x) .+ w[2]) # needed for grad
and passing both |
whatever ends up happening, please don't replace the current low-level interface with this member-variable approach. I really prefer working with Knet the way it currently works over the alternatives that are around, and would be sad to see that go away |
It seems the proposed approach doesn't replace the current low-level interface but provides a high-level wrapper. Another straightforward way to avoid the method duplication is to define a helper macro w:
@w (l::Dense)(x) = unit.(w * mat(x) .+ b)
which translates to
(l::Dense)(ws, x) = begin
    w, b = ws[l.name]
    unit.(w * mat(x) .+ b)
end |
@CarloLucibello Yes, the method duplication is the non-elegant part. Not having to pass around w and chain is easy to fix (see updated comment above). |
@Evizero You call the |
In case that's not totally clear:
julia> using Flux.Tracker: gradient
julia> gradient((x,y) -> sum(x.*y), [1,2,3], [4,5,6])
([4.0, 5.0, 6.0], [1.0, 2.0, 3.0])
In future this will become the primary interface to Flux's AD (it's the only sensible way to get nested derivatives), but I want to do it in a way that supports both a grad operator and modularity at the same time. |
I'm a bit late to the party, but without knowing about this thread I made my own modular interface for Knet. I cleaned it up and posted it in a repo in case it is of use: https://github.com/davidssmith/KnetLayers/blob/master/src/KnetLayers.jl Here's a preview of my still very dirty implementation:
I've found no significant speed penalty for running Knet using my Layers interface. One perk of the way I wrote it is that you can define the network without knowing anything about the input data. For example you can just write Also, my objects don't carry around their weights. Weight generation and use is done entirely in the |
@denizyuret, are struct methods, e.g. (a::Affine)(w, x) = x*w[1] + w[2], fully supported by AutoGrad? I mean, will AutoGrad lose track of some global weights and cause gradcheck to fail? If so, we should only use normal functions, e.g. f(w, x) = x*w[1] + w[2], and wait until this is fully supported. |
OK, I finally got Knet and AutoGrad to catch up with the times (see the latest master). The problem was that the old AutoGrad could not see into or create structs, and g=grad(f) was a very rigid interface for specifying what the differentiation should be with respect to. Inspired by Flux and after some thinking, I fit the new AutoGrad interface into four functions (Param, differentiate, gradient, and value, described below), which allow Flux-like models if you have some familiarity with function-like objects. Everything should be backward compatible, so all examples with the old interface should still work (regular functions, parameters in first arg, using g=grad(f) etc.). I am really happy with the memory management: all differentiation-related allocation is stored under the return value of
Here is an example notebook in the new style. |
This interface lacks a way to deactivate params (i.e. differentiable weights) for specific parts of a model that is built by combining pre-existing models. We want a deactivate(::MyStruct) or detach(::MyStruct) that temporarily turns its Params into normal, non-differentiable parameters. |
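One way to picture the request above is a per-parameter flag that the training code consults. The following is only a sketch in plain Julia: Param, Layer, freeze!, unfreeze!, and trainable here are hypothetical names written for illustration, not AutoGrad's or Knet's API, and nothing is differentiated.

```julia
# Hypothetical sketch: all names are illustrative assumptions,
# not part of AutoGrad or Knet.

# A parameter that can be switched between "differentiable" and "plain".
mutable struct Param
    value::Vector{Float64}
    trainable::Bool
end
Param(v::Vector{Float64}) = Param(v, true)

# A toy layer that owns two parameters.
struct Layer
    w::Param
    b::Param
end
Layer(n::Int) = Layer(Param(randn(n)), Param(zeros(n)))

params(l::Layer) = [l.w, l.b]

# Mark every parameter of a layer as frozen (or trainable again).
function freeze!(l)
    for p in params(l)
        p.trainable = false
    end
    return l
end

function unfreeze!(l)
    for p in params(l)
        p.trainable = true
    end
    return l
end

# Only these parameters would be handed to the differentiation / update step.
trainable(ps) = filter(p -> p.trainable, ps)

# A model combined from a pre-existing part and a new part.
pretrained = freeze!(Layer(3))
head = Layer(2)
active = trainable(vcat(params(pretrained), params(head)))
```

Here active holds only the two parameters of head, so an update loop over it would leave the pre-existing part untouched until unfreeze! is called.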
Please check out issue #347. |
Knet's current API requires that all model parameters are collected together and passed into the model at once. This design has several issues: it makes code reuse more difficult, leading to hand-coded implementations of standard layers like LSTM in each model. It also makes it impossible to abstract over structure, so that (for example) replacing LSTM with GRU in a model requires significant refactoring rather than being a one-line change. Furthermore, it hampers performance: standard layers like LSTM cannot take advantage of the advanced static optimisations that are possible in other frameworks.
I want to discuss and get feedback on a new design that attempts to solve these problems without compromising Knet's simplicity and flexibility.
AutoGrad Variables
Using a layer shouldn't require knowledge of its internals, so layers need to be able to create and manage their own parameters. In other words, tracked variables need to be part of the autograd API, rather than being hidden inside the grad function. Here's a sketch of what this could look like:
A Variable is essentially the current Rec; it stores at least a value, a gradient, and the operation / inputs that created it. The back! function performs backpropagation and updates the gradients of input variables recursively, so x now stores a gradient that we can use for training.
Layers
A layer is now just a function with some differentiable parameters.
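To make the two ideas above concrete, here is a minimal sketch in plain Julia of a tracked variable with a back! pass and an Affine layer whose parameters are such variables. The implementation, and the helper names matmul, addbias, and total, are assumptions made for illustration; they are not AutoGrad's or Knet's actual API, and only vector inputs are handled.

```julia
# Illustrative sketch only: not AutoGrad's or Knet's implementation.

# A tracked value: stores the value, its gradient, the inputs that created
# it, and a closure that pushes its gradient back into those inputs.
mutable struct Variable
    value::Array{Float64}
    grad::Array{Float64}
    parents::Vector{Variable}
    backward::Function
end
Variable(v::Array{Float64}) = Variable(v, zeros(size(v)), Variable[], () -> nothing)

# W * x for a tracked matrix W and a plain input vector x.
function matmul(W::Variable, x::Vector{Float64})
    out = Variable(W.value * x)
    out.parents = [W]
    out.backward = () -> (W.grad .+= out.grad * x')  # outer product
    return out
end

# Elementwise sum of two tracked vectors.
function addbias(a::Variable, b::Variable)
    out = Variable(a.value .+ b.value)
    out.parents = [a, b]
    out.backward = () -> begin
        a.grad .+= out.grad
        b.grad .+= out.grad
    end
    return out
end

# Reduce a tracked array to a one-element tracked array (a scalar loss).
function total(a::Variable)
    out = Variable([sum(a.value)])
    out.parents = [a]
    out.backward = () -> (a.grad .+= out.grad[1])
    return out
end

# Backpropagate from y: visit nodes in reverse topological order.
function back!(y::Variable)
    order = Variable[]
    seen = Set{Variable}()
    function visit(v)
        v in seen && return
        push!(seen, v)
        foreach(visit, v.parents)
        push!(order, v)
    end
    visit(y)
    fill!(y.grad, 1.0)
    for v in reverse(order)
        v.backward()
    end
    return y
end

# A layer is a callable struct that owns its tracked parameters.
struct Affine
    W::Variable
    b::Variable
end
Affine(nin::Int, nout::Int) = Affine(Variable(randn(nout, nin)), Variable(zeros(nout)))
(a::Affine)(x::Vector{Float64}) = addbias(matmul(a.W, x), a.b)

a = Affine(3, 2)
x = [1.0, 2.0, 3.0]
y = total(a(x))   # still a Variable
back!(y)          # a.W.grad and a.b.grad now hold the gradients of y
```

Since y is the sum of W*x + b, each row of a.W.grad equals x and a.b.grad is all ones, which gives a quick sanity check of the backward pass.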
We can construct an Affine layer and then call it like a function on a normal array. Because the params are Variables, the output will also be a Variable which we can use for backpropagation. This design makes Keras-like abstractions pretty trivial:
This method of defining layers is not at all incompatible with Knet's current approach (via grad); one can freely mix and match plain Julia functions with layers defined in this way.
Training
Training does not need to change significantly; the only real difference is that in addition to the parameters and gradients supplied by grad, you'll have ones internal to the model. We can make it easy to pull these out by defining a common interface like params (e.g. params(a::Affine) = [a.W, a.b]); then you can train these variables in the obvious way.
Having a common interface for collecting parameters also makes it easier to abstract out the training process itself; hopefully in future we'll have generic optimisers which hide this whole process.
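As a rough illustration of the params idea, the sketch below collects parameters from a nested model and applies one plain gradient-descent step. It assumes a simplified Variable that only holds a value and a gradient buffer, with the gradients filled in by hand as a stand-in for a real backward pass; Chain and sgd! are hypothetical names, not existing Knet or Flux functions.

```julia
# Illustrative sketch only: simplified types, no automatic differentiation.

# A parameter with a gradient buffer that a backward pass is assumed to fill.
mutable struct Variable
    value::Array{Float64}
    grad::Array{Float64}
end
Variable(v::Array{Float64}) = Variable(v, zeros(size(v)))

struct Affine
    W::Variable
    b::Variable
end
Affine(nin::Int, nout::Int) = Affine(Variable(randn(nout, nin)), Variable(zeros(nout)))

# A container of layers, in the spirit of a Keras-like sequential model.
struct Chain
    layers::Vector{Any}
end

# The common interface: every layer reports its own parameters,
# and containers concatenate those of their children.
params(a::Affine) = [a.W, a.b]
params(c::Chain) = reduce(vcat, [params(l) for l in c.layers])

# A generic update step that never looks inside the model.
function sgd!(model; lr = 0.1)
    for p in params(model)
        p.value .-= lr .* p.grad
        fill!(p.grad, 0.0)   # reset for the next backward pass
    end
    return model
end

model = Chain(Any[Affine(3, 4), Affine(4, 2)])
ps = params(model)
for p in ps
    fill!(p.grad, 1.0)       # stand-in for gradients from a backward pass
end
before = [copy(p.value) for p in ps]
sgd!(model, lr = 0.5)
```

The point of the design is that sgd! depends only on params, so swapping one layer type for another in the Chain requires no change to the training code.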
Longer Term Plans
If you've followed Flux at all you may notice that the design is very similar. We've been working on things like generic optimisers, and a design like this would enable us to share those between Flux and Knet. More ambitiously, my hope is that we can use static, define-before-run models like Affine inside of Knet models. The static models can then be heavily optimised – e.g. with custom JITed GPU kernels or parallelism – but still seamlessly be used inside very dynamic Knet models.
Hopefully the fact that this only adds to Knet's current interface should be enough evidence that it doesn't sacrifice flexibility. If you're still concerned I recommend checking out PyTorch, which has been very successful at achieving performance, flexibility and modularity with a very similar API. Overall I think it will make it much easier to create, modify and reuse Knet models.
Any thoughts?