MixedModels Formula Interface and Categorical Variables

I made some more progress on my Google Summer of Code project MixedModels. The linear mixed models fitting method is now capable of handling non-numeric (i.e., categorical) predictor variables, as well as interaction effects. Moreover, I gave the method a user friendly R-formula-like interface. I will present these new capabilities of the Ruby gem with an example. Then I will briefly describe their implementation.


Data and mathematical model formulation

The data is supplied to the model fitting method LMM#from_formula as a Daru::DataFrame (from the excellent Ruby gem daru). In order to test LMM#from_formula, I have generated a data set of the following form:

> alien_species.head
#<Daru::DataFrame:70197332524760 @name = 1cd9d732-526b-49ae-8cb1-35cd69541c87 @size = 10>
                  Age Aggression   Location    Species 
         0     204.95 877.542420     Asylum      Dalek 
         1      39.88 852.528392  OodSphere WeepingAng 
         2     107.34 388.791416     Asylum      Human 
         3     210.01 170.010124  OodSphere        Ood 
         4     270.22 1078.31219  OodSphere      Dalek 
         5     157.65 164.924992  OodSphere        Ood 
         6     136.15 865.838374  OodSphere WeepingAng 
         7     241.31 1052.36035      Earth      Dalek 
         8      86.84 -8.5725199     Asylum        Ood 
         9      206.7 1070.71900  OodSphere      Dalek 

As we can see, the data set contains two numeric variables Age and Aggression, and two categorical variables Location and Species. These data are available for 100 individuals.

We model the Aggression level of an individual as a linear function of the Age (Aggression decreases with Age), with a different constant added for each Species (i.e. each species has a different base level of aggression). Moreover, we assume that there is a random fluctuation in Aggression due to the Location that an individual is at. Additionally, there is a random fluctuation in how Age affects Aggression at each different Location.

Thus, the Aggression level of an individual of Species $spcs$ who is at the Location $lctn$ can be expressed as:

where $\epsilon$ is a random residual, and the random vector $(b\subscript{lctn,0}, b\subscript{lctn,1})^T$ follows a multivariate normal distribution (the same distribution but different realizations of the random vector for each Location). That is, we have a linear mixed model with fixed effects $\beta\subscript{0}, \beta\subscript{1}, \gamma\subscript{Dalek}, \gamma\subscript{Ood}, \dots$ and random effects $b\subscript{Asylum,0}, b\subscript{Asylum,1}, b\subscript{Earth,0},\dots$.

Model fit

We fit this model in Ruby using MixedModels with:

model_fit = LMM.from_formula(formula: "Aggression ~ Age + Species + (Age | Location)", 
                             data: alien_species)

where the argument formula takes in a String that contains a formula written in the formula language that is used in the R-package lme4 (MixedModels currently supports most of the formula language except some shortcuts). Since lme4 is currently the most commonly used package for linear mixed models, a lot of documentation and tutorials to the formula interface can be found online.

We print some of the results that we have obtained:

puts "REML criterion: \t#{model_fit.dev_optimal}"
puts "Fixed effects:"
puts model_fit.fix_ef
puts "Standard deviation: \t#{Math::sqrt(model_fit.sigma2)}"

Which produces the output:

REML criterion: 	333.71553910151437
Fixed effects:
{"x0"=>1016.2867207023437, "x1"=>-0.06531615342788923, 
"x2"=>-499.69369529020815, "x3"=>-899.569321353576, 
Standard deviation: 	0.9745169802141329

Comparison with R lme4

We fit the same model in R using the package lme4, and print out the estimates for the same quantities as previously:

> mod <- lmer(Aggression ~ Age + Species + (Age | Location), data = alien.species)
> REMLcrit(mod)
[1] 333.7155
> fixef(mod)
        (Intercept)                 Age        SpeciesHuman          SpeciesOod 
      1016.28672021         -0.06531615       -499.69369521       -899.56932076 
> sigma(mod)
[1] 0.9745324

We observe that the parameter estimates from Ruby and R agree up to at least four digits behind the floating point.

A brief comment on the implementation

Categorical predictor variables

If a predictor variable is categorical and no intercept term or other categorical variables are included in the design matrix, then the design matrix must contain a column of zeros and ones for each different level of the categorical variable. If the design matrix includes an intercept term or already contains another set of 0-1-indicators for a categorical variable, then one of the levels of the categorical variable, that we want to add to the model, must be excluded (or other so-called contrasts can be used).

In the current implementation of MixedModels this is handled by the method Daru::DataFrame::create_indicator_vectors_for_categorical_vectors! (defined here). It adds a set of 0-1-valued vectors for each non-numeric vector in the Daru::DataFrame:

> df = Daru::DataFrame.new([(1..7).to_a, ['a','b','b','a','c','d','c']],
                           order: ['int','char']) 
> df.create_indicator_vectors_for_categorical_vectors!
  # => <Daru::DataFrame:70212314363900 @name = 1a2a49d9-35d3-4adf-a993-5266d7d79442 @size = 7>
             int       char char_lvl_b char_lvl_c char_lvl_d 
    0          1          a        0.0        0.0        0.0 
    1          2          b        1.0        0.0        0.0 
    2          3          b        1.0        0.0        0.0 
    3          4          a        0.0        0.0        0.0 
    4          5          c        0.0        1.0        0.0 
    5          6          d        0.0        0.0        1.0 
    6          7          c        0.0        1.0        0.0

(where it didn’t add a vector for level β€œa” of β€œchar”, because it assumes a model with intercept by default)

After the data frame is extended, LMM#from_daru checks which of the specified terms are non-numeric, and replaces them with the names of the 0-1-valued indicator columns (e.g. if a fixed effects term char were defined, LMM#from_daru would replace it with char_lvl_b, char_lvl_c and char_lvl_d).

I will probably end up restructuring the current implementation, in order to better accommodate interaction effects between categorical variables…

Formula interface

LMM#from_formula takes in a String containing a formula specifying the model, for example

"z ~ x + y + x:y + (x | u)".

It transforms this formula into another String, for the above example:

"lmm_formula(:intercept) + lmm_variable(:x) + lmm_variable(:y) + lmm_variable(:x) * lmm_variable(:y) + (lmm_variable(:intercept) + lmm_variable(:x) | lmm_variable(:u)))",

adding intercept terms and wrapping all variables in lmm_variable().

The Ruby expression in the String is evaluated with eval. This evaluation uses a specially defined class LMMFormula (defined here), which overloads the +, * and | operators, in order to combine the variable names into arrays, which can be fed into LMM#from_daru. The class LMMFormula was an idea that I got from Will Levine (wlevine). In particular, the method LMMFormula#to_input_for_lmm_from_daru transforms an LMMFormula object into a number of Array, which have the form required by LMM#from_daru.

Finally, LMM#from_daru constructs the model matrices, vectors and the covariance function Proc, which are passed on to LMM#initialize that performs the actual model fit.

Written on June 17, 2015