Basics

Here we show how to use the essential ingredients of the package.

using Distributions
using HiddenMarkovModels
using LinearAlgebra
using Random
using StableRNGs

rng = StableRNG(63);

Model

The package provides a versatile HMM type with three main attributes:

a vector init of state initialization probabilities
a matrix trans of state transition probabilities
a vector dists of observation distributions, one for each state

Any scalar- or vector-valued distribution from Distributions.jl can be used for the last part, as well as Custom distributions.

init = [0.6, 0.4]
trans = [0.7 0.3; 0.2 0.8]
dists = [MvNormal([-0.5, -0.8], I), MvNormal([0.5, 0.8], I)]
hmm = HMM(init, trans, dists)

Hidden Markov Model with:
 - initialization: [0.6, 0.4]
 - transition matrix: [0.7 0.3; 0.2 0.8]
 - observation distributions: [IsoNormal(
dim: 2
μ: [-0.5, -0.8]
Σ: [1.0 0.0; 0.0 1.0]
)
, IsoNormal(
dim: 2
μ: [0.5, 0.8]
Σ: [1.0 0.0; 0.0 1.0]
)
]

Simulation

You can simulate a pair of state and observation sequences with rand by specifying how long you want them to be.

T = 20
state_seq, obs_seq = rand(rng, hmm, T);

The state sequence is a vector of integers.

state_seq[1:3]

3-element Vector{Int64}:
 2
 2
 1

The observation sequence is a vector whose elements have whatever type an observation distribution returns when sampled. Here we chose a multivariate normal distribution, so we get vectors at each time step.

Difference from HMMBase.jl

In the case of multivariate observations, HMMBase.jl works with matrices, whereas HiddenMarkovModels.jl works with vectors of vectors. This allows us to accept more generic observations than just numbers or vectors inside the sequence.

obs_seq[1:3]

3-element Vector{Vector{Float64}}:
 [0.46430432030033214, -0.8538832667367795]
 [0.07113025810845847, 0.35893904118945]
 [-0.8923441428307395, -0.5712530157810823]

In practical applications, the state sequence is not known, which is why we need inference algorithms to gather information about it.

Inference

The Viterbi algorithm (viterbi) returns:

the most likely state sequence $\hat{X}_{1:T} = \underset{X_{1:T}}{\mathrm{argmax}}~\mathbb{P}(X_{1:T} \vert Y_{1:T})$,
the joint loglikelihood $\mathbb{P}(\hat{X}_{1:T}, Y_{1:T})$ (in a vector of size 1).

best_state_seq, best_joint_loglikelihood = viterbi(hmm, obs_seq);
only(best_joint_loglikelihood)

-57.69066363984475

As we can see, the most likely state sequence is very close to the true state sequence, but not necessarily equal.

(state_seq .== best_state_seq)'

1×20 adjoint(::BitVector) with eltype Bool:
 0  0  1  1  1  1  1  1  1  1  1  1  1  0  1  1  1  1  1  1

The forward algorithm (forward) returns:

a matrix of filtered state marginals $\alpha[i, t] = \mathbb{P}(X_t = i | Y_{1:t})$,
the loglikelihood $\mathbb{P}(Y_{1:T})$ of the observation sequence (in a vector of size 1).

filtered_state_marginals, obs_seq_loglikelihood_f = forward(hmm, obs_seq);
only(obs_seq_loglikelihood_f)

-54.214899298272634

At each time $t$, these filtered marginals take only the observations up to time $t$ into account. This is particularly useful to infer the marginal distribution of the last state.

filtered_state_marginals[:, T]

2-element Vector{Float64}:
 0.7717057517405977
 0.2282942482594023

The forward-backward algorithm (forward_backward) returns:

a matrix of smoothed state marginals $\gamma[i, t] = \mathbb{P}(X_t = i | Y_{1:T})$,
the loglikelihood $\mathbb{P}(Y_{1:T})$ of the observation sequence (in a vector of size 1).

smoothed_state_marginals, obs_seq_loglikelihood_fb = forward_backward(hmm, obs_seq);
only(obs_seq_loglikelihood_fb)

-54.214899298272634

At each time $t$, it takes all observations up to time $T$ into account. This is particularly useful during learning. Note that forward and forward-backward only coincide at the last time step.

filtered_state_marginals[:, T - 1] ≈ smoothed_state_marginals[:, T - 1]

false

filtered_state_marginals[:, T] ≈ smoothed_state_marginals[:, T]

true

Finally, we provide a thin wrapper (logdensityof) around the forward algorithm for observation sequence loglikelihoods $\mathbb{P}(Y_{1:T})$.

logdensityof(hmm, obs_seq)

-54.214899298272634

Another function (joint_logdensityof) can compute joint loglikelihoods $\mathbb{P}(X_{1:T}, Y_{1:T})$ which take the states into account.

joint_logdensityof(hmm, obs_seq, state_seq)

-60.170162049564425

For instance, we can check that the output of Viterbi is at least as likely as the true state sequence.

joint_logdensityof(hmm, obs_seq, best_state_seq)

-57.69066363984476

Learning

The Baum-Welch algorithm (baum_welch) is a variant of Expectation-Maximization, designed specifically to estimate HMM parameters. Since it is a local optimization procedure, it requires a starting point that is close enough to the true model.

init_guess = [0.5, 0.5]
trans_guess = [0.6 0.4; 0.3 0.7]
dists_guess = [MvNormal([-0.4, -0.7], I), MvNormal([0.4, 0.7], I)]
hmm_guess = HMM(init_guess, trans_guess, dists_guess);

Let's estimate parameters based on a slightly longer sequence.

_, long_obs_seq = rand(rng, hmm, 200)
hmm_est, loglikelihood_evolution = baum_welch(hmm_guess, long_obs_seq);

An essential guarantee of this algorithm is that the loglikelihood of the observation sequence keeps increasing as the model improves.

first(loglikelihood_evolution), last(loglikelihood_evolution)

(-613.0225125416881, -605.3027257918021)

We can check that the transition matrix estimate has improved.

cat(transition_matrix(hmm_est), transition_matrix(hmm); dims=3)

2×2×2 Array{Float64, 3}:
[:, :, 1] =
 0.676764  0.323236
 0.197387  0.802613

[:, :, 2] =
 0.7  0.3
 0.2  0.8

And so have the estimates for the observation distributions.

map(mean, hcat(obs_distributions(hmm_est), obs_distributions(hmm)))

2×2 Matrix{Vector{Float64}}:
 [-0.256414, -0.563875]  [-0.5, -0.8]
 [0.652218, 0.841332]    [0.5, 0.8]

On the other hand, the initialization is concentrated on one state. This effect can be mitigated by learning from several independent sequences.

hcat(initialization(hmm_est), initialization(hmm))

2×2 Matrix{Float64}:
 1.91765e-19  0.6
 1.0          0.4

Since HMMs are not identifiable up to a permutation of the states, there is no guarantee that state $i$ in the true model will correspond to state $i$ in the estimated model. This is important to keep in mind when testing new models.

Multiple sequences

In many applications, we have access to various observation sequences of different lengths.

nb_seqs = 300
long_obs_seqs = [last(rand(rng, hmm, rand(rng, 100:200))) for k in 1:nb_seqs];
typeof(long_obs_seqs)

Vector{Vector{Vector{Float64}}} (alias for Array{Array{Array{Float64, 1}, 1}, 1})

Every algorithm in the package accepts multiple sequences in a concatenated form. The user must also specify where each sequence ends in the concatenated vector, by passing seq_ends as a keyword argument. Otherwise, the input will be treated as a unique observation sequence, which is mathematically incorrect.

long_obs_seq_concat = reduce(vcat, long_obs_seqs)
typeof(long_obs_seq_concat)

Vector{Vector{Float64}} (alias for Array{Array{Float64, 1}, 1})

seq_ends = cumsum(length.(long_obs_seqs))
seq_ends'

1×300 adjoint(::Vector{Int64}) with eltype Int64:
 174  279  388  539  737  894  1007  …  44408  44557  44737  44855  45032

The outputs of inference algorithms are then concatenated, and the associated loglikelihoods are split by sequence (in a vector of size length(seq_ends)).

best_state_seq_concat, best_joint_loglikelihood_concat = viterbi(
    hmm, long_obs_seq_concat; seq_ends
);

length(best_joint_loglikelihood_concat) == length(seq_ends)

true

length(best_state_seq_concat) == last(seq_ends)

true

The function seq_limits returns the begin and end of a given sequence in the concatenated vector. It can be used to untangle the results.

start2, stop2 = seq_limits(seq_ends, 2)

(175, 279)

best_state_seq_concat[start2:stop2] == first(viterbi(hmm, long_obs_seqs[2]))

true

While inference algorithms can also be run separately on each sequence without changing the results, considering multiple sequences together is nontrivial for Baum-Welch. That is why the package takes care of it automatically.

hmm_est_concat, _ = baum_welch(hmm_guess, long_obs_seq_concat; seq_ends);

Our estimate should be a little better.

cat(transition_matrix(hmm_est_concat), transition_matrix(hmm); dims=3)

2×2×2 Array{Float64, 3}:
[:, :, 1] =
 0.69677   0.30323
 0.207515  0.792485

[:, :, 2] =
 0.7  0.3
 0.2  0.8

map(mean, hcat(obs_distributions(hmm_est_concat), obs_distributions(hmm)))

2×2 Matrix{Vector{Float64}}:
 [-0.503771, -0.796337]  [-0.5, -0.8]
 [0.509728, 0.805274]    [0.5, 0.8]

hcat(initialization(hmm_est_concat), initialization(hmm))

2×2 Matrix{Float64}:
 0.629082  0.6
 0.370918  0.4

This page was generated using Literate.jl.