Intro to Hidden Markov Chains

By Bonolo Molopyane

In a situation where you wish to determine the returns on an investment, one may have all the expertise to do this but without certain information (missing pieces) it would not be possible to derive to a conclusive figure. In practical terms “assume you have the value of all returns of all assets in your portfolio; without the rate at which each asset produces the returns we will not have a true reflection of the portfolio returns at a specific point in time as such we may not provide accurate returns estimate.” A process termed the Hidden Markov model may be used to shed some light in this problem.

The process is split into two components: an observable component and an unobservable or ‘hidden’ component (van Handel, 2008). Nevertheless, from the observable process we can extract information about the “hidden” processes. As such our task is to determine the unobserved process from the observed one.

The Hidden Markov Models (HMM) has two defining properties. (i) It assumes that the observation at time  $t$ was generated by some process whose state ${S}_{t}$ is hidden from the observer and (ii) it assumes the state of this hidden process satisfies the Markov property. Complex as it may seem to some, HMM come naturally to understand once one understands what a Markov Model is. We will look up into these two model components then consider an advanced techniques that help construct these HMMs.

Constructing a HMM

The “Hidden Process”

A process is said to have the Markov property if:

For any $A{\subseteq}S$, any value $n$ and for any time value ${t}_{1}<{t}_{2}<{\dots}<{t}_{n}<{t}_{n+1}$ it is true that

$P\left[{X}_{{t}_{n+1}}{\in}A|{X}_{{t}_{1}}={x}_{{t}_{1}},{X}_{{t}_{2}}={x}_{{t}_{2}},{\dots},{X}_{{t}_{n}}={x}_{{t}_{n}}\right]=P\left[{X}_{{t}_{n+1}}{\in}A| {X}_{{t}_{n}}={x}_{{t}_{n}}\right]$

This means that to determine the next state of the process one can just consider the current the state the process is in and ignore everything that has occurred before as this information is already included in the current state.

We need some properties and definitions that will allow us to help eventually grasp the concept of HMM

1. Time Homogeneity: this occurs when the probability of moving from a to b is independent of time, i.e. it does not matter how far you are in the process; as long as the processes is going to move from a to b in one step the probability will be the same throughout. When a process has this property we say the this process is Time Homogenous and if not ; time non-homogenous
2. Though possible to work with infinite states but in our financial context, it suffices to work with a finite amount of states which are irreducible.
3. Irreducible States: It is possible to move from any one state to another over a certain number of steps.

Let ${p}_{\mathit{ij}}^{(t,t+1)}$ be the transition probability given by $P\left({x}_{t+1}=j|{x}_{t}=i\right)={p}_{\mathit{ij}}$, this is the probability that at time $t+1$ the process will be at state $j$ given that at time $t$ the process is in state $i$.

This probability matrix is such that:

$P=\left(\begin{matrix} {p}_{11} &{p}_{12} &{\dots} &{p}_{1n}\\ {p}_{21} &{p}_{22} &{\dots} &{p}_{2n}\\ \vdots &\vdots &\ddots &\vdots \\ {p}_{n1} &{p}_{n2} &{\dots} &{p}_{\mathit{nn}} \end{matrix}\right)$

n.b: these emission probabilities are the main drivers of where next the process may go. From our time homogeneity assumption we can calculate the probability that the process is in state $j$ after $t$ steps given it started at $i$ we multiply matrix $P$ with itself $t$ times then read off the ${ij}^{th}$ element of ${P}^{n}$

Example:

Let us consider two probability transition matrices each with two transition states, one that is Time-Homogeneous and one that is not.

The non-Time-Homogeneous case

${P}^{t,t+1}=\left(\begin{matrix} 1-\dfrac{2}{100+t}&\dfrac{2}{100+t}\\ \dfrac{1}{2+2t} &1-\dfrac{1}{2+2t}\\ \end{matrix}\right)$

Then ${p}_{00}^{(1,2)}=1-\dfrac{2}{10+1}=0.81$ and ${p}_{00}^{(5,6)}=1-\dfrac{2}{10+5}=0.866$

Here the probability of changing state depends on where you are in time. Contrary to this procedure, a time homogenous matrix gives constant probabilities that is independent of time.

$\left(\begin{matrix} 0.3 &0.7\\ 0.5 &0.5\\ \end{matrix}\right)$

On this case ${p}_{00}^{(1,2)}={p}_{00}^{(5,6)}=0.3$

Determine using the two state time homogeneous probability matrix given above to find ${p}_{01}^{(1,3)}$ and ${p}_{0,1}^{(2,4)}$

$\left(\begin{matrix} 0.3 &0.7\\ 0.5 &0.5\\ \end{matrix}\right)\left(\begin{matrix} 0.3 &0.7\\ 0.5 &0.5\\ \end{matrix}\right)=\left(\begin{matrix} 0.34 &0.55\\ 0.4 &0.6\\ \end{matrix}\right)$

Then ${p}_{01}^{(1,3)}=0.55$

The reader is to prove to themselves that ${p}_{0,1}^{(2,4)}$ is also 0.55

Visual representation of Markov transitions

Keeping our analysis simple let us work over a 3 state process. $S=\{1, 2, 3\}$

The probability of moving from any one state to any other states is given by the probability matrix P which is given by:

$\left(\begin{matrix} 0.22 & 0.33 & 0.45\\ 0.13 & 0 & 0.87\\ 0.40 & 0.11 & 0.49\\ \end{matrix}\right)$

An alternate view of this phenomenon may be given by the following diagram:

According to the diagram we may move to any state or remain in the current state over one transition (time step). This is true for A and C but once the process is at B it has to move in the next transition, this is because the probability of staying at B is 0.

With the Markov chains we generate as we shall see that they will hold information on the unobserved and will ultimately yield a more realistic model which is the whole reason we are looking at the HMMs at the first place.

The observable Process

The hidden states are determined by some properties which we can derive so as to better understand how these hidden states behave.

To derive estimates of the densities of this process we will need to solve numerous systems of equations. Algorithms like the Baum-Welch algorithm and the Viterbi algorithm give extremely accurate estimations but due to their complexity we will stay clear from them for the time being but return to them latter on. Instead we will look into the Kalman Filter as it follows a procedure similarity to that used in the HMM’s derivation thus it will give us an intuitive understanding of how HMM comes about. Kalman filter is a mathematical technique widely used in control systems and avionics to extract a signal from a series of incomplete and noisy measurements.

Taking a different spin on things I will tabulate the differences between the Kalman filter and the HMM methods first.

From the above we get a feeling that finding the estimates is simpler through the Kalman filter but at the same time we observe how the HMM takes estimation to a whole new level. I drew the table to demonstrate if not  allude how imperative it is for one to first be comfortable with the Kalman filter before attempting the finding estimates for the HMM.

Kalman filter

Key point about the Kalman Filter

• It is a means to find the estimates of the process. Filtering comes from its primitive use of reducing or “filtering out” unwanted variables. Which in our case is the estimation error

Filter Estimates

Suppose we have two processes, a state process and observation process given by the following system of linear equations:

${x}_{k+1}={A}_{k+1}{x}_{k}+{v}_{k+1}$

${y}_{k}={C}_{k}{x}_{k}+{w}_{k}$

Here ${A}_{k}$ and ${C}_{k}$ can be matrices or variables and can even be as simple as a constant value.

It is common to assume that ${v}_{k+1}$ and ${w}_{k+1}$ are independently and identically distributed Gaussian or normal distributions with mean 0 and some covariance matrix (ideally diagonal to reflect independence among observations).

The estimates from the Kalman filter equations are derived using fairly advanced methods which require sufficient understanding of multivariate analysis. As such, I will here give the moment equations for the expected value and variance of the hidden process.

The time update equations are

${x}_{k+1|k}={A}_{k+1}{x}_{k|k}$,

$\sum_{k+1|k}={A}_{k+1}\sum_{k|k}{A'}_{k+1}+{Q}_{k+1}$

And the measurements update equations:

${x}_{k+1|k+1}={{x}_{k+1|k}+K}_{k+1}\left(y_{k+1}-{C}_{k+1}{x}_{k+1|k}\right)$,

$\sum_{k+1|k+1}=\sum_{k+1|k}-\sum_{k+1|k}{C'}_{k+1}{\left({C}_{k+1}\sum_{k+1|k}{C'}_{k+1}+{R}_{k+1}\right)}^{-1}{C}_{k+1}\sum_{k+1|k}$

Where ${K}_{k+1}={C'}_{k+1}{\left({C}_{k+1}\sum_{k+1| k}{C'}_{k+1}+{R}_{k+1}\right)}^{-1}$, also $A$ and $Q$ are also appropriately dimensioned matrices

The second set of equations (measurements update) determine the where the mean and covariance of the process will be given the results of the first set of equations (time update).

The Kalman filter gain (${K}_{k+1}$) is used to reflect the significance of the error term between our synthesized model verses the some observed (usually historic) model. If the gain which is a probability between 0 and 1 is small it would mean that estimated model is relatively close reality i.e. a good measure of reality. And if this probability is significantly large, then it may indicate inefficiency in our model and call for re-evaluation with numerous error minimization method available within statistics. The fact is, determining the Kalman gain is just as important as obtaining the estimates of the desired process. The derivation of this filter gain will be provided in an appendix.

Stochastic Dynamic Programming

Eric B. Laber and his colleague Hua Zhou broke down how Dynamic Programming (DP) functions in a simple yet satisfactory and conscience manner. First they divide the problem (the process) into subintervals. Observing that this “sub-problems” are dependent they propose one to solve the sub-processes individually and store the answer in a table and there after using the recorded answers to answer the initial problem.

In any model that contains hidden variables (such as HMMs), the task of determining the sequence of variables is known as decoding (Read, 2011) where the aim is to determine the most likely sequence of the hidden states from the observed sequence (Blunsom, 2004). The Viterbi algorithm has been used to find the single best state sequence for an observed sequence and does so fashion one cannot help but admire.

Viterbi Algorithm

This algorithm follow pretty much a four step process within which at the end the most likely transition process will be derived.

1. Initialisation

$={\pi}_{i}{b}_{i}({o}_{i}),\quad 1{\leq}i{\leq}N,\quad {\Psi}_{1}(0)=0$

Where ${\pi}_{i}$ is the emission probability and ${b}_{i}$ the probability of observing ${o}_{i}$.

The ${\pi}_{i}$ may be given by some density function an exponential distribution with the transition rates as parameters and ${b}_{i}({o}_{i})$ is some probability transition matrix similar to those discussed above though most times it is assumed that matrix is time homogeneous.

1. Recursion

$\delta_{t}(i)=\max_{1{\leq}i{\leq}N}\left[\delta_{t}(i){a}_{ij}\right]{b}_{j},\quad 2{\leq}t{\leq}T,\quad 1{\leq}j{\leq}N$

$\psi_{t}(j)=\arg\max_{1{\leq}i{\leq}N}\left[\delta_{t-1}(i){a}_{ij}\right],\quad 2{\leq}t{\leq}T,\quad 1{\leq}j{\leq}N$

1. Termination

${P}^{*}=\max_{1{\leq}i{\leq}N}\left[\delta_{t}(i)\right]$

${q}_{T}^{*}=\arg\max_{1{\leq}i{\leq}N}\left[\delta_{t}(i)\right]$

Once all transition in the training set the have been accounted for the code will extract the most likely (max probability) event given all the previous events.

1. Optimal state sequence backtracking

${q}_{t}^{*}={\psi }_{t+1}\left({q}_{t+1}^{*}\right),\quad t=T-1,T-2,{\dots},1$

The backtracking allows the best state sequence to be found (Blunsom, 2004) from the results from the recursive steps.

Of interest is: given the algorithms ability to obtain the most likely sequence, there is no easy way of obtaining the second best sequence.

Worked example

Bhar and Hamori use HMM to analyse stock market returns among the G7 countries using monthly ruturns (2004, p. 43).  This came about when the pair realised that the USA exerts a strong influence on other G7 countries but find it interesting that the others does not have little of no impact on the US.

Let the model of the entire G7 monthly return be given by a Two-State Markov model of the form

${y}_{t}-{\mu}_{{S}_{t}}={\text{\o}}_{1}({y}_{t-1}-{\mu}_{{S}_{t-1}})+\varepsilon_{t}$

The returns are spit into times when volatilities high and low. The estimated probability of a country remaining in its current state is given by ${p}_{11}$ when there is low volatility and ${p}_{22}$ if high volatility prevails.

The table below show the empirical findings from Bhar and Hamori in relation to the seven countries. Given the smoothed probabilities and parameter estimates from the Baum-Welch algorithms and Viterbi algorithms we have:

These statistics are not there just to hypothesis models; their main aim is to give more information about the data thus helping decision makers strategize efficiently. Some of the information obtained in this table:

• Japan has the highest propensity to stay in a volatile regime for the longest time and the US is likely to leave volatile periods quicker.
• Italy may obtain more returns on average in a more volatile regime
• The ${\text{\o}}_{1}$ of the USA is the largest and as such US’s returns increase at the highest rate

As an exercise, the reader may take an even closer look to observe what are inferences can be drawn from the data.

Bibliography

Bhar, R., & Hamori, S. (2004). Hidden Markov Models. London: Kluwer Academic Publshers.

Blunsom, P. (2004). Hidden Markov Models.

Read, J. (2011). Hidden Markov Models and Dynamic Programming.

van Handel, R. (2008). Hidden Markov Models: Lecture notes.

0 replies