ContextualThompsonSamplingPolicy implements Thompson Sampling with Linear Payoffs, following Agrawal and Goyal (2013). Thompson Sampling with Linear Payoffs is a contextual Thompson Sampling multi-armed bandit Policy which assumes that the underlying relationship between rewards and contexts is linear. Check the reference for more details.
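Concretely, the policy assumes that the expected reward of arm \(i\) at time \(t\) is linear in its d-dimensional context vector \(b_i(t)\), that is, \(E[r_i(t)] = b_i(t)^{\top}\mu\) for an unknown parameter vector \(\mu\). At each step the policy draws a sample from a multivariate Gaussian posterior over \(\mu\) and plays the arm with the highest sampled expected reward.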

Usage

policy <- ContextualThompsonSamplingPolicy$new(delta, R, epsilon)

Arguments

delta

numeric; 0 < delta <= 1. With probability 1 - delta, ContextualThompsonSamplingPolicy satisfies the theoretical regret bound.

R

numeric; R >= 0. Assume that the residual \(r_i(t) - b_i(t)^{\top}\hat{\mu}\) is R-sub-Gaussian. In this case, \(R^2\) represents the variance of the residuals of the linear model \(b_i(t)^{\top}\hat{\mu}\).

epsilon

numeric; 0 < epsilon < 1. If the total number of trials T is known, epsilon can be chosen as 1/ln(T), as in the sketch below.
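For instance, with a horizon known in advance the constructor arguments might be chosen along the following lines (the values are purely illustrative, not recommendations, and assume the contextual package is attached):

# illustrative parameter choices for a horizon known in advance
horizon <- 1000
delta   <- 0.5              # regret bound then holds with probability 1 - delta
R       <- 0.01             # assumed sub-Gaussian scale of the reward noise
epsilon <- 1 / log(horizon) # suggested choice when the horizon is known
policy  <- ContextualThompsonSamplingPolicy$new(delta, R, epsilon)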

Methods

new(...)

instantiates a new ContextualThompsonSamplingPolicy instance. Arguments are defined in the Arguments section above.

set_parameters(context_params)

initialization of policy parameters, utilising context_params$k (number of arms) and context_params$d (number of context features).
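As a rough illustration of what such an initialisation amounts to in Agrawal and Goyal's algorithm (a minimal sketch, not the package's exact internals; B, mu_hat and f follow the paper's notation):

# sketch: per-policy state for Thompson Sampling with Linear Payoffs
k      <- 5          # number of arms (context_params$k)
d      <- 5          # number of context features (context_params$d)
B      <- diag(d)    # d x d design matrix, initialised to the identity
mu_hat <- rep(0, d)  # current estimate of the parameter vector
f      <- rep(0, d)  # running sum of reward-weighted context vectors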

get_action(t,context)

selects an arm based on self$theta and context, returning the index of the selected arm in action$choice. The context argument consists of a list with context$k (number of arms), context$d (number of features), and the feature matrix context$X with dimensions \(d \times k\).
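Continuing the sketches above, the selection step roughly corresponds to sampling a parameter vector from the Gaussian posterior and taking the arm whose context maximises the sampled expected reward (v is the exploration scale built from R, epsilon and delta in the paper; X stands in for context$X):

# sketch: draw mu_tilde ~ N(mu_hat, v^2 * B^-1) and pick the best arm under it
X        <- matrix(rnorm(d * k), d, k)                   # stand-in d x k feature matrix
v        <- R * sqrt(24 / epsilon * d * log(1 / delta))  # scale from Agrawal & Goyal (2013)
mu_tilde <- MASS::mvrnorm(1, mu = mu_hat, Sigma = v^2 * solve(B))
expected <- t(X) %*% mu_tilde                            # sampled expected reward per arm
choice   <- which.max(expected)                          # index returned as action$choice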

set_reward(t, context, action, reward)

updates parameter list theta in accordance with the current reward$reward, action$choice and the feature matrix context$X with dimensions \(d \times k\). Returns the updated theta.
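The corresponding update step, continuing the same sketch (again following the paper rather than the package's internal field names):

# sketch: fold the observed reward for the chosen arm back into the statistics
reward <- 1                 # stand-in for reward$reward
b      <- X[, choice]       # context vector of the chosen arm
B      <- B + b %*% t(b)    # rank-one update of the design matrix
f      <- f + reward * b    # accumulate the reward-weighted context
mu_hat <- solve(B, f)       # refreshed parameter estimate (theta)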

References

Shipra Agrawal and Navin Goyal. "Thompson Sampling for Contextual Bandits with Linear Payoffs." Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Implementation follows linthompsamp from https://github.com/ntucllab/striatum

See also

Core contextual classes: Bandit, Policy, Simulator, Agent, History, Plot

Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit, OfflineReplayEvaluatorBandit

Policy subclass examples: EpsilonGreedyPolicy, ContextualThompsonSamplingPolicy

Examples

# NOT RUN {
horizon            <- 1000L
simulations        <- 10L

bandit             <- ContextualLogitBandit$new(k = 5, d = 5)

delta              <- 0.5
R                  <- 0.01
epsilon            <- 0.5

policy             <- ContextualThompsonSamplingPolicy$new(delta, R, epsilon)

agent              <- Agent$new(policy, bandit)

simulation         <- Simulator$new(agent, horizon, simulations)
history            <- simulation$run()

plot(history, type = "cumulative", regret = FALSE, rate = TRUE)
# }