R/policy_cmab_ts.R
ContextualThompsonSamplingPolicy.Rd
ContextualThompsonSamplingPolicy
implements Thompson Sampling with Linear
Payoffs, following Agrawal and Goyal (2011).
Thompson Sampling with Linear Payoffs is a contextual Thompson Sampling multi-armed bandit
Policy which assumes that the underlying relationship between rewards and contexts is linear.
See the reference for more details.
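To make the linearity assumption concrete, here is a minimal numeric sketch (illustrative only, not package code): each arm's context is a column of a \(d \times k\) matrix, and its expected reward is the inner product of that column with an unknown coefficient vector. The names X, mu_true and the noise level below are hypothetical.

# Illustrative sketch of the linear payoff assumption (not package code).
set.seed(1)
d <- 5; k <- 5
X        <- matrix(rnorm(d * k), nrow = d, ncol = k)   # one context column per arm
mu_true  <- rnorm(d)                                   # unknown coefficient vector
expected <- as.vector(t(X) %*% mu_true)                # expected reward per arm
rewards  <- expected + rnorm(k, sd = 0.01)             # observed rewards are noisy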
policy <- ContextualThompsonSamplingPolicy$new(delta, R, epsilon)
delta
numeric; 0 < delta <= 1.
With probability 1 - delta, ContextualThompsonSamplingPolicy satisfies the theoretical regret bound.
R
numeric; R >= 0.
Assume that the residual \(r_i(t) - b_i(t)^{\top} \hat{\mu}\) is R-sub-Gaussian. In this case, \(R^2\) represents the variance of the residuals of the linear model \(b_i(t)^{\top} \hat{\mu}\).
epsilon
numeric; 0 < epsilon < 1.
If the total number of trials T is known, we can choose epsilon = 1/ln(T).
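As an illustration of the epsilon guidance above, assuming the horizon is known in advance (variable names are illustrative):

horizon <- 1000L
epsilon <- 1 / log(horizon)   # about 0.145, inside the required (0, 1) interval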
new(...)
instantiates a new ContextualThompsonSamplingPolicy instance. Arguments are as defined in the Arguments section above.
set_parameters(context_params)
initializes the policy parameters, utilising context_params$k (number of arms) and context_params$d (number of context features).
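As an illustration, this is the kind of state such an initialization might set up under the Agrawal and Goyal parameterisation: a \(d \times d\) matrix B, a d-vector f, and a point estimate mu_hat. This is a sketch, not the package's internal code.

# Sketch of a plausible parameter initialization (illustrative, not package internals).
d     <- 5                              # context_params$d
theta <- list(B      = diag(1, d, d),   # prior precision matrix
              f      = rep(0, d),       # running sum of reward-weighted contexts
              mu_hat = rep(0, d))       # current point estimate of the coefficients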
get_action(t, context)
selects an arm based on self$theta and context, returning the index of the selected arm in action$choice. The context argument consists of a list with context$k (number of arms), context$d (number of features), and the feature matrix context$X with dimensions \(d \times k\).
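A minimal numeric sketch of this selection step, assuming the Agrawal and Goyal parameterisation: a coefficient vector is drawn from a multivariate normal centred on the current estimate, and the arm with the highest sampled payoff is chosen. MASS::mvrnorm performs the draw, v follows the scaling used in the reference implementation, and all variable names are illustrative rather than package API.

# Illustrative sketch of the arm-selection step (not package internals).
library(MASS)
d <- 5; k <- 5
X      <- matrix(rnorm(d * k), nrow = d, ncol = k)           # context$X, d x k
B      <- diag(1, d, d)
mu_hat <- rep(0, d)
delta <- 0.5; R <- 0.01; epsilon <- 0.5
v <- R * sqrt(24 / epsilon * d * log(1 / delta))             # sampling scale
mu_tilde <- mvrnorm(1, mu = mu_hat, Sigma = v^2 * solve(B))  # posterior draw
choice   <- which.max(as.vector(t(X) %*% mu_tilde))          # action$choice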
set_reward(t, context, action, reward)
updates parameter list theta in accordance with the current reward$reward, action$choice and the feature matrix context$X with dimensions \(d \times k\). Returns the updated theta.
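A matching sketch of the update step, again assuming the Agrawal and Goyal parameterisation: the chosen arm's context performs a rank-one update of B, the reward-weighted context is added to f, and the point estimate is recomputed. Variable names are illustrative, not package API.

# Illustrative sketch of the reward update (not package internals).
d      <- 5
B      <- diag(1, d, d)
f      <- rep(0, d)
b      <- rnorm(d)                   # context$X[, action$choice]
reward <- 0.7                        # reward$reward
B      <- B + b %*% t(b)             # rank-one precision update
f      <- f + reward * b
mu_hat <- as.vector(solve(B) %*% f)  # updated coefficient estimate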
Shipra Agrawal and Navin Goyal. "Thompson Sampling for Contextual Bandits with Linear Payoffs." Advances in Neural Information Processing Systems 24. 2011.
Implementation follows linthompsamp from https://github.com/ntucllab/striatum
Core contextual classes: Bandit, Policy, Simulator, Agent, History, Plot

Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit, OfflineReplayEvaluatorBandit

Policy subclass examples: EpsilonGreedyPolicy, ContextualThompsonSamplingPolicy
# NOT RUN {
horizon     <- 1000L
simulations <- 10L

bandit <- ContextualLogitBandit$new(k = 5, d = 5)

delta   <- 0.5
R       <- 0.01
epsilon <- 0.5

policy <- ContextualThompsonSamplingPolicy$new(delta, R, epsilon)

agent <- Agent$new(policy, bandit)

simulation <- Simulator$new(agent, horizon, simulations)
history    <- simulation$run()

plot(history, type = "cumulative", regret = FALSE, rate = TRUE)
# }