Bandit for the evaluation of policies with offline data.
The key assumption of the method is that the original logging policy chose arms i.i.d. uniformly at random.
Take care: if the original logging policy does not change over trials, data may be used more efficiently via propensity scoring (Langford et al., 2008; Strehl et al., 2011) and related techniques like doubly robust estimation (Dudik et al., 2011).
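
To make the assumption concrete, the following sketch shows the replay (rejection sampling) idea that this kind of offline evaluation rests on: a logged row is only used when the evaluated policy happens to select the logged arm. It is a simplified illustration rather than the bandit's internal implementation; the log columns (choice, reward) and the choose_arm() function are assumptions made for the example.

# Replay / rejection-sampling sketch (illustration only).
# Assumes a data.frame `log` with columns `choice` and `reward`, produced by a
# uniformly random logging policy, and a function `choose_arm(t)` implementing
# the policy under evaluation.
replay_value <- function(log, choose_arm) {
  total <- 0
  kept  <- 0
  for (t in seq_len(nrow(log))) {
    if (choose_arm(t) == log$choice[t]) {    # accept only rows where the choices match
      total <- total + log$reward[t]
      kept  <- kept + 1
    }
  }
  if (kept > 0) total / kept else NA_real_   # mean reward over the accepted rows
}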
bandit <- OfflinePolicyEvaluatorBandit$new(offline_data, k, d, unique = NULL, shared = NULL, randomize = TRUE)
offline_data
    data.table; offline data source (required)
k
    integer; number of arms (required)
d
    integer; number of contextual features (required)
randomize
    logical; whether to randomize the rows of the data stream per simulation (optional, default: TRUE)
unique
    integer vector; indices of disjoint features (optional)
shared
    integer vector; indices of shared features (optional; see the sketch after this list for an example of setting unique and shared)
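
For illustration, a hypothetical call splitting the context features into per-arm (disjoint) and shared components could look as follows; the column indices and the log_S data.table are assumptions made for the example, not values prescribed by the package.

# Hypothetical instantiation: columns 1 and 2 treated as disjoint (per-arm)
# features, column 3 as a feature shared across all arms.
bandit <- OfflinePolicyEvaluatorBandit$new(offline_data = log_S,
                                           k = 3, d = 3,
                                           unique = c(1, 2),
                                           shared = c(3),
                                           randomize = TRUE)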
new(offline_data, k, d, unique = NULL, shared = NULL, randomize = TRUE)
    Generates and instantiates a new OfflinePolicyEvaluatorBandit instance.
get_context(t)
    argument:
        t: integer, time step t.
    returns:
        A list containing the current d x k dimensional context matrix context$X, the number of arms context$k and the number of features context$d.

get_reward(t, context, action)
    arguments:
        t: integer, time step t.
        context: list containing the current context$X (d x k context matrix), context$k (number of arms) and context$d (number of context features), as set by the bandit.
        action: list containing action$choice, as set by the policy.
    returns:
        A list containing reward$reward and, where computable, reward$optimal (used by "oracle" policies and to calculate regret).

post_initialization()
    Shuffles the rows of the offline data.table before the start of each individual simulation when self$randomize is TRUE (the default).
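
Together, these methods are normally driven by Agent and Simulator. The loop below sketches how they might be exercised directly; it is illustrative only and assumes `bandit` is an instantiated OfflinePolicyEvaluatorBandit and `policy` an instantiated Policy subclass exposing get_action(t, context) and set_reward(t, context, action, reward).

# Illustrative evaluation loop (Agent and Simulator normally handle this).
for (t in seq_len(100L)) {
  context <- bandit$get_context(t)                  # context$X, context$k, context$d
  action  <- policy$get_action(t, context)          # sets action$choice
  reward  <- bandit$get_reward(t, context, action)  # may be NULL when the logged arm differs
  if (!is.null(reward)) {
    policy$set_reward(t, context, action, reward)   # update the policy with reward$reward
  }
}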
Core contextual classes: Bandit, Policy, Simulator, Agent, History, Plot
Bandit subclass examples: BasicBernoulliBandit, ContextualLogitBandit, OfflinePolicyEvaluatorBandit
Policy subclass examples: EpsilonGreedyPolicy, ContextualThompsonSamplingPolicy
## Not run:

## Generate a random policy log and save it.
context_weights     <- matrix(c(0.9, 0.1, 0.1,
                                0.1, 0.9, 0.1,
                                0.1, 0.1, 0.9),
                              nrow = 3, ncol = 3, byrow = TRUE)
horizon             <- 2000L
simulations         <- 1L

bandit              <- ContextualBernoulliBandit$new(weights = context_weights)

# For the generation of random data choose a random policy,
# otherwise rejection sampling will produce biased results.
policy              <- RandomPolicy$new()
agent               <- Agent$new(policy, bandit)

simulation          <- Simulator$new(agent,
                                     horizon      = horizon,
                                     simulations  = simulations,
                                     save_context = TRUE)

random_offline_data <- simulation$run()
random_offline_data$save("log.RData")

## Use the saved log to evaluate policies with OfflinePolicyEvaluatorBandit.
history             <- History$new()
history$load("log.RData")
log_S               <- history$get_data_table()

bandit              <- OfflinePolicyEvaluatorBandit$new(offline_data = log_S, k = 3, d = 3)

agents              <- list(Agent$new(EpsilonGreedyPolicy$new(0.01), bandit),
                            Agent$new(LinUCBDisjointPolicy$new(0.6), bandit))

simulation          <- Simulator$new(agents,
                                     horizon     = horizon,
                                     simulations = simulations,
                                     t_over_sims = TRUE,
                                     do_parallel = FALSE,
                                     reindex     = TRUE)

li_bandit_history   <- simulation$run()

plot(li_bandit_history, regret = FALSE, type = "cumulative", rate = TRUE)

if (file.exists("log.RData")) file.remove("log.RData")

## End(Not run)