Title: | Sparse-Group Boosting |
---|---|
Description: | Sparse-group boosting to be used in conjunction with the 'mboost' package for modeling grouped data. Applicable to all sparse-group lasso type problems where within-group and between-group sparsity is desired. Interprets and visualizes individual variables and groups. |
Authors: | Fabian Obster [aut, cre, cph] |
Maintainer: | Fabian Obster <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.4 |
Built: | 2024-11-16 03:05:09 UTC |
Source: | https://github.com/fabianobster/sgboost |
Creates an mboost formula for fitting a sparse-group boosting model based on boosted Ridge Regression with mixing parameter `alpha`. The formula consists of a group baselearner part with degrees of freedom `1-alpha` and individual baselearners with degrees of freedom `alpha`. Groups should be defined through `group_df`. The corresponding modeling data should not contain categorical variables with more than two categories, as they are then treated as a group only.
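To make the split of degrees of freedom concrete, the following base-R sketch builds a formula string by hand. The helper `sketch_formula()` is hypothetical and is not the package's implementation; it only assumes that each group gets one `bols()` term with `1 - alpha` degrees of freedom and each variable gets one `bols()` term with `alpha` degrees of freedom, as described above.

```r
# Hypothetical helper, not part of sgboost: illustrates how alpha could be
# split between group and individual baselearners in an mboost-style formula.
sketch_formula <- function(alpha, group_df, outcome_name = "y") {
  vars <- as.character(group_df$var_name)
  # One baselearner per group, each with 1 - alpha degrees of freedom
  groups <- split(vars, group_df$group_name)
  group_terms <- vapply(
    groups,
    function(v) paste0("bols(", paste(v, collapse = ", "), ", df = ", 1 - alpha, ")"),
    character(1)
  )
  # One baselearner per variable, each with alpha degrees of freedom
  var_terms <- paste0("bols(", vars, ", df = ", alpha, ")")
  paste(outcome_name, "~", paste(c(group_terms, var_terms), collapse = " + "))
}

group_df <- data.frame(
  group_name = c(1, 1, 1, 2, 2),
  var_name = c("x1", "x2", "x3", "x4", "x5")
)
sketch <- sketch_formula(alpha = 0.3, group_df = group_df)
print(sketch)
```

With `alpha = 0.3`, the group terms carry `df = 0.7` and the individual terms carry `df = 0.3`, so a larger `alpha` shifts weight toward individual variables.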
    create_formula(
      alpha = 0.3,
      group_df = NULL,
      blearner = "bols",
      outcome_name = "y",
      group_name = "group_name",
      var_name = "var_name",
      intercept = FALSE
    )
alpha | Numeric mixing parameter. For `alpha = 0` only group baselearners and for `alpha = 1` only individual baselearners are defined. Default is 0.3. |
group_df | Input data.frame containing variable names with group structure. |
blearner | Type of baselearner. Default is `"bols"`. |
outcome_name | String indicating the name of the dependent variable. Default is `"y"`. |
group_name | Name of the column in `group_df` indicating the group structure of the variables. Default is `"group_name"`. |
var_name | Name of the column in `group_df` containing the variable names to be used as predictors. Default is `"var_name"`. |
intercept | Logical, should an intercept be used? Default is `FALSE`. |
Character containing the formula to be passed to `mboost::mboost()`, yielding the sparse-group boosting model for a given value of the mixing parameter `alpha`.
    library(mboost)
    library(sgboost)
    library(dplyr)
    set.seed(1)
    # Simulate five predictors and standardize them
    df <- data.frame(
      x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
      x4 = rnorm(100), x5 = runif(100)
    )
    df <- df %>%
      mutate_all(function(x) {
        as.numeric(scale(x))
      })
    df$y <- df$x1 + df$x4 + df$x5
    # Map each variable to its group
    group_df <- data.frame(
      group_name = c(1, 1, 1, 2, 2),
      var_name = c("x1", "x2", "x3", "x4", "x5")
    )
    sgb_formula <- create_formula(alpha = 0.3, group_df = group_df)
    sgb_model <- mboost(formula = sgb_formula, data = df)
    summary(sgb_model)
Computes the aggregated coefficients from group and individual baselearners. Also returns the raw coefficients associated with each baselearner.
get_coef(sgb_model)
sgb_model | Model of type `mboost`. |
In a sparse-group boosting model, a variable in a dataset can be selected as an individual variable or as part of a group. Therefore, there can be two associated effect sizes for the same variable. This function aggregates both and returns them in a data.frame.
List of two data.frames: `$raw` with the variables and the raw (regression) coefficients, and `$aggregated` with the aggregated (regression) coefficients.
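The relationship between raw and aggregated coefficients can be sketched in base R on a toy table. The numbers and column names below are made up for illustration and are not output of `get_coef()`; the sketch also assumes that aggregation means summing all contributions per variable.

```r
# Toy raw coefficients (made-up numbers, not output of get_coef())
raw <- data.frame(
  variable = c("x1", "x1", "x4", "x5"),
  baselearner = c("individual", "group", "group", "group"),
  effect = c(0.5, 0.25, 0.75, 0.125),
  stringsAsFactors = FALSE
)

# Assumed aggregation rule: sum all contributions belonging to one variable
aggregated <- aggregate(effect ~ variable, data = raw, FUN = sum)
print(aggregated)
```

Here `x1` appears twice in the raw table, once through its individual baselearner and once through its group, and ends up with a single aggregated effect.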
    library(mboost)
    library(sgboost)
    library(dplyr)
    set.seed(1)
    # Simulate five predictors and standardize them
    df <- data.frame(
      x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
      x4 = rnorm(100), x5 = runif(100)
    )
    df <- df %>%
      mutate_all(function(x) {
        as.numeric(scale(x))
      })
    df$y <- df$x1 + df$x4 + df$x5
    # Map each variable to its group
    group_df <- data.frame(
      group_name = c(1, 1, 1, 2, 2),
      var_name = c("x1", "x2", "x3", "x4", "x5")
    )
    sgb_formula <- create_formula(alpha = 0.3, group_df = group_df)
    sgb_model <- mboost(formula = sgb_formula, data = df)
    sgb_coef <- get_coef(sgb_model)
Computes the aggregated coefficients from group and individual baselearners for each boosting iteration.
get_coef_path(sgb_model)
sgb_model | Model of type `mboost`. |
In a sparse-group boosting model, a variable in a dataset can be selected as an individual variable or as part of a group. Therefore, there can be two associated effect sizes for the same variable. This function aggregates both and returns them in a data.frame for each boosting iteration.
List of two data.frames: `$raw` with the variables and the raw (regression) coefficients, and `$aggregated` with the aggregated (regression) coefficients.
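As a rough illustration of what a coefficient path is, the base-R sketch below accumulates toy per-iteration updates. The numbers and column names are made up, and the sketch assumes that the coefficient of a variable after a given iteration is the running sum of all updates it has received so far; it does not reproduce the output format of `get_coef_path()`.

```r
# Toy per-iteration updates (made-up): which variable moved, and by how much
updates <- data.frame(
  iteration = 1:5,
  variable = c("x1", "x4", "x1", "x5", "x4"),
  update = c(0.25, 0.5, 0.125, 0.25, 0.25),
  stringsAsFactors = FALSE
)

# Assumed rule: the coefficient after an iteration is the running sum of the
# updates that variable has received up to that iteration
updates$coefficient <- ave(updates$update, updates$variable, FUN = cumsum)
print(updates)
```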
    library(mboost)
    library(sgboost)
    library(dplyr)
    set.seed(1)
    # Simulate five predictors and standardize them
    df <- data.frame(
      x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
      x4 = rnorm(100), x5 = runif(100)
    )
    df <- df %>%
      mutate_all(function(x) {
        as.numeric(scale(x))
      })
    df$y <- df$x1 + df$x4 + df$x5
    # Map each variable to its group
    group_df <- data.frame(
      group_name = c(1, 1, 1, 2, 2),
      var_name = c("x1", "x2", "x3", "x4", "x5")
    )
    sgb_formula <- create_formula(alpha = 0.3, group_df = group_df)
    sgb_model <- mboost(formula = sgb_formula, data = df)
    sgb_coef_path <- get_coef_path(sgb_model)
Variable importance is computed as the relative reduction of the loss function attributed to each predictor (groups and individual variables). Returns a list of two data.frames. The first contains the variable importance of each predictor in the sparse-group model. The second contains the aggregated relative importance of all groups vs. all individual variables.
get_varimp(sgb_model)
sgb_model | Model of type `mboost`. |
List of two data.frames. `$raw` contains the names of the variables, the group structure, and the variable importance on both a group and an individual variable basis. `$group_importance` contains the aggregated relative importance of all group baselearners and of all individual variables.
`mboost::varimp()`, which this function uses.
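The two levels of importance described above can be sketched in base R. The loss reductions and column names below are made up for illustration and are not output of `get_varimp()`; the sketch assumes that relative importance is each baselearner's share of the total loss reduction.

```r
# Toy loss reductions per baselearner (made-up numbers)
reduction <- data.frame(
  predictor = c("x1", "x4, x5", "x1, x2, x3"),
  type = c("individual", "group", "group"),
  loss_reduction = c(2, 5, 1),
  stringsAsFactors = FALSE
)

# Relative importance: each baselearner's share of the total reduction
reduction$relative_importance <-
  reduction$loss_reduction / sum(reduction$loss_reduction)

# Aggregated importance of all groups vs. all individual variables
by_type <- aggregate(relative_importance ~ type, data = reduction, FUN = sum)
print(by_type)
```

The per-predictor table corresponds to the first data.frame in spirit, and the by-type summary to the second.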
    library(mboost)
    library(sgboost)
    library(dplyr)
    set.seed(1)
    # Simulate five predictors and standardize them
    df <- data.frame(
      x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
      x4 = rnorm(100), x5 = runif(100)
    )
    df <- df %>%
      mutate_all(function(x) {
        as.numeric(scale(x))
      })
    df$y <- df$x1 + df$x4 + df$x5
    # Map each variable to its group
    group_df <- data.frame(
      group_name = c(1, 1, 1, 2, 2),
      var_name = c("x1", "x2", "x3", "x4", "x5")
    )
    sgb_formula <- as.formula(create_formula(alpha = 0.3, group_df = group_df))
    sgb_model <- mboost(formula = sgb_formula, data = df)
    sgb_varimp <- get_varimp(sgb_model)
Radar or scatter/line plot visualizing the effect sizes relative to the variable importance in a sparse-group boosting model. Also works for a regular mboost model.
    plot_effects(
      sgb_model,
      plot_type = "radar",
      prop = 0,
      n_predictors = 30,
      max_char_length = 5,
      base_size = 8
    )
sgb_model | Model of type `mboost`. |
plot_type | String indicating the type of visualization to use. Default is `"radar"`. |
prop | Numeric value indicating the minimal importance a predictor/baselearner must have to be plotted. Default is 0, meaning all predictors are plotted. Increasing `prop` reduces the number of plotted variables. One can also use `n_predictors`. |
n_predictors | The maximum number of predictors to be plotted. Default is 30. Alternative to `prop`. |
max_char_length | The maximum character length of a predictor name to be printed. Default is 5. For long variable names one may adjust this number. |
base_size |
The |
`ggplot2` object mapping the effect sizes and variable importance.
`get_coef()`, `get_varimp()`, which this function uses.
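How `prop`, `n_predictors`, and `max_char_length` interact can be sketched in base R on a toy importance table. The helper `filter_predictors()` and its data are hypothetical and are not the package's code; the sketch assumes that predictors below `prop` are dropped first, that the remaining ones are ranked by importance and cut to `n_predictors`, and that labels are truncated to `max_char_length` characters.

```r
# Toy importance table (made-up numbers and names)
importance <- data.frame(
  predictor = c("income_total", "age", "education_level", "region"),
  relative_importance = c(0.5, 0.25, 0.2, 0.05),
  stringsAsFactors = FALSE
)

# Hypothetical helper mimicking the documented filtering arguments
filter_predictors <- function(importance, prop = 0, n_predictors = 30,
                              max_char_length = 5) {
  # Drop predictors below the minimal importance
  kept <- importance[importance$relative_importance >= prop, ]
  # Keep at most n_predictors, most important first
  kept <- kept[order(kept$relative_importance, decreasing = TRUE), ]
  kept <- head(kept, n_predictors)
  # Shorten long names for display
  kept$label <- substr(kept$predictor, 1, max_char_length)
  kept
}

shown <- filter_predictors(importance, prop = 0.1, n_predictors = 2)
print(shown)
```

In this toy case `prop = 0.1` removes `region`, and `n_predictors = 2` then keeps only the two most important remaining predictors.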
    library(mboost)
    library(sgboost)
    library(dplyr)
    set.seed(1)
    # Simulate five predictors and standardize them
    df <- data.frame(
      x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
      x4 = rnorm(100), x5 = runif(100)
    )
    df <- df %>%
      mutate_all(function(x) {
        as.numeric(scale(x))
      })
    df$y <- df$x1 + df$x4 + df$x5
    # Map each variable to its group
    group_df <- data.frame(
      group_name = c(1, 1, 1, 2, 2),
      var_name = c("x1", "x2", "x3", "x4", "x5")
    )
    sgb_formula <- as.formula(create_formula(alpha = 0.3, group_df = group_df))
    sgb_model <- mboost(formula = sgb_formula, data = df)
    plot_effects(sgb_model)
Shows how the effect sizes change throughout the boosting iterations in a sparse-group boosting model. Also works for regular mboost models. Color indicates the selection of group or individual variables within a boosting iteration.
plot_path(sgb_model, max_char_length = 5, base_size = 8)
sgb_model | Model of type `mboost`. |
max_char_length | The maximum character length of a predictor name to be printed. Default is 5. For long variable names one may adjust this number. |
base_size |
The |
`ggplot2` object mapping the effect sizes and variable importance.
`get_coef_path()`, which this function uses.
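Because the return value is a ggplot2 object, it can in principle be extended with regular ggplot2 layers. The sketch below assumes the 'sgboost', 'mboost', and 'ggplot2' packages are installed; the title text is illustrative, and standardization is done in base R instead of dplyr.

```r
library(mboost)
library(sgboost)
library(ggplot2)

set.seed(1)
# Simulate five predictors and standardize them with base R
df <- data.frame(
  x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
  x4 = rnorm(100), x5 = runif(100)
)
df[] <- lapply(df, function(x) as.numeric(scale(x)))
df$y <- df$x1 + df$x4 + df$x5

group_df <- data.frame(
  group_name = c(1, 1, 1, 2, 2),
  var_name = c("x1", "x2", "x3", "x4", "x5")
)
sgb_formula <- as.formula(create_formula(alpha = 0.4, group_df = group_df))
sgb_model <- mboost(formula = sgb_formula, data = df)

# Allow longer labels, then add a title like on any other ggplot2 object
path_plot <- plot_path(sgb_model, max_char_length = 10)
path_plot <- path_plot + labs(title = "Coefficient paths (illustrative title)")
```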
    library(mboost)
    library(sgboost)
    library(dplyr)
    set.seed(1)
    # Simulate five predictors and standardize them
    df <- data.frame(
      x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
      x4 = rnorm(100), x5 = runif(100)
    )
    df <- df %>%
      mutate_all(function(x) {
        as.numeric(scale(x))
      })
    df$y <- df$x1 + df$x4 + df$x5
    # Map each variable to its group
    group_df <- data.frame(
      group_name = c(1, 1, 1, 2, 2),
      var_name = c("x1", "x2", "x3", "x4", "x5")
    )
    sgb_formula <- as.formula(create_formula(alpha = 0.4, group_df = group_df))
    sgb_model <- mboost(formula = sgb_formula, data = df)
    plot_path(sgb_model)
Visualizes the variable importance of a sparse-group boosting model. Color indicates if a predictor is an individual variable or a group.
    plot_varimp(
      sgb_model,
      prop = 0,
      n_predictors = 30,
      max_char_length = 15,
      base_size = 8
    )
sgb_model | Model of type `mboost`. |
prop | Numeric value indicating the minimal importance a predictor/baselearner must have to be plotted. Default is 0, meaning all predictors are plotted. Increasing `prop` reduces the number of plotted variables. One can also use `n_predictors`. |
n_predictors | The maximum number of predictors to be plotted. Default is 30. Alternative to `prop`. |
max_char_length | The maximum character length of a predictor name to be printed. Default is 15. For larger groups or long variable names one may adjust this number to differentiate variables from groups. |
base_size |
The |
Note that aggregated group and individual variable importance printed in the legend is based only on the plotted variables and not on all variables that were selected in the sparse-group boosting model.
Object of type `ggplot2`.
`get_varimp()`, which this function uses.
    library(mboost)
    library(sgboost)
    library(dplyr)
    set.seed(1)
    # Simulate five predictors and standardize them
    df <- data.frame(
      x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100),
      x4 = rnorm(100), x5 = runif(100)
    )
    df <- df %>%
      mutate_all(function(x) {
        as.numeric(scale(x))
      })
    df$y <- df$x1 + df$x4 + df$x5
    # Map each variable to its group
    group_df <- data.frame(
      group_name = c(1, 1, 1, 2, 2),
      var_name = c("x1", "x2", "x3", "x4", "x5")
    )
    sgb_formula <- as.formula(create_formula(alpha = 0.3, group_df = group_df))
    sgb_model <- mboost(formula = sgb_formula, data = df)
    sgb_varimp <- plot_varimp(sgb_model)