JuliaML Basics

JuliaML is a GitHub organization dedicated to building tools for machine learning in Julia. This post serves as an introduction to some of the building blocks which JuliaML provides. The two packages I'll discuss here are:

I'm assuming you have a little background in statistics/machine learning and an objective function such as

1n_i=1nf_i(β)+_j=1pλ_jJ(β_j)\frac{1}{n}\sum\_{i=1}^n f\_i(\beta)+\sum\_{j=1}^p \lambda\_j J(\beta\_j)

is recognizable to you as a "loss" + "penalty".

LossFunctions.jl

An interface for dealing with loss functions. Internally, most losses are defined as distance-based (a function of y^y\hat y-y) or margin-based (a function of yy^y * \hat y) where yy is the target and y^\hat y is a predicted value.

Distance-based losses

Margin-based losses

Example Usage

using LossFunctions

l = L1DistLoss()
value(l, .1, .2)  # |.2 - .1|
deriv(l, .1, .2)  # sign(.2 - .1)

y = randn(100)
yhat = randn(100)

value(l, y, yhat)  # vector of value mapped to y[i], yhat[i]
value(l, y, yhat, AvgMode.Sum())  # sum(value(l, y, yhat)), but better
value(l, y, yhat, AvgModel.Mean()) # mean(value(l, y, yhat)), but better

Naive Gradient Descent Algorithm

Assume we want to use a linear transformation so that our prediction of a vector yy is XβX\beta.

Our loss function looks like

1n_i=1nf(y_i,x_iTβ), \frac{1}{n}\sum\_{i=1}^n f(y\_i, x\_i^T\beta),

with gradient

XT[1ni=1nf(y_i,x_iTβ)]. X^T[\frac{1}{n}\sum_{i=1}^n f'(y\_i, x\_i^T\beta)].

We can implement a naive, inefficient gradient descent algorithm as:

function f(l::Loss, x::Matrix, y::Vector; maxit=20, s=.5)
    n, p = size(x)
    β = zeros(p)
    for i in 1:maxit
        β -= (s / n) * x' * deriv(l, y, x * β)
    end
    β
end

This automatically works with whatever loss function I provide because multiple dispatch is amazing:

# make some fake data
x = randn(1000, 3)
y = x * [1.0, 2.0, 3.0] + randn(1000)
julia> f(L2DistLoss(), x, y)
3-element Array{Float64,1}:
 0.966367
 2.05046
 2.94227

julia> f(L1DistLoss(), x, y)
3-element Array{Float64,1}:
 1.00185
 2.00302
 2.95002

julia> f(HuberLoss(2.), x, y)
3-element Array{Float64,1}:
 0.967642
 2.0485
 2.94429

PenaltyFunctions.jl

An interface for penalty/regularization functions. It follows similar conventions to LossFunctions.

θ = rand(5)  # parameter vector

p = L1Penalty()

value(p, θ)  # sum(abs, θ)
grad(p, θ)   # the gradient, sign.(θ)
prox(p, θ, .1)  # proximal mapping (soft-thresholding) with step size .1

Let's change our gradient descent algorithm to proximal gradient method:

function g(l::Loss, pen::Penalty, x::Matrix, y::Vector; maxit=20, s=.5, λ=.1)
    n, p = size(x)
    β = zeros(p)
    for i in 1:maxit
        β = prox(pen, β - (s / n) * x' * deriv(l, y, x * β), s * λ)
    end
    β
end

We can now do proximal gradient method on arbitrary loss/penalty combinations because multiple dispatch is amazing:

julia> g(L2DistLoss(), L1Penalty(), x, y; λ = .4)
3-element Array{Float64,1}:
 0.789894
 1.80736
 2.81209

julia> g(L1DistLoss(), L1Penalty(), x, y; λ = .4)
3-element Array{Float64,1}:
 0.0
 0.464488
 1.5998

julia> g(HuberLoss(2.), L1Penalty(), x, y; λ = .4)
3-element Array{Float64,1}:
 0.525633
 1.57115
 2.57907

Takeaway

These two packages create a consistent "grammar" for losses and regularization. Julia's multiple dispatch then allows us to write general algorithms like the proximal gradient method example. This paves the way for machine learning experimentation that is unavailable in any other language that I know of.