
Dynamic Pricing Without Knowing the Demand Function:

Risk Bounds and Near-Optimal Algorithms∗

Omar Besbes†

University of Pennsylvania

Assaf Zeevi‡

Columbia University

Submitted: 11/2006, Revised 6/2007, 12/2007

To appear in Operations Research

Abstract

We consider a single product revenue management problem where, given an initial inventory,

the objective is to dynamically adjust prices over a finite sales horizon to maximize expected

revenues. Realized demand is observed over time, but the underlying functional relationship

between price and mean demand rate that governs these observations (otherwise known as the

demand function or demand curve), is not known. We consider two instances of this problem:

i.) a setting where the demand function is assumed to belong to a known parametric family

with unknown parameter values; and ii.) a setting where the demand function is assumed to

belong to a broad class of functions that need not admit any parametric representation. In each

case we develop policies that learn the demand function “on the fly,” and optimize prices based

on that. The performance of these algorithms is measured in terms of the regret: the revenue

loss relative to the maximal revenues that can be extracted when the demand function is known

prior to the start of the selling season. We derive lower bounds on the regret that hold for any

admissible pricing policy, and then show that our proposed algorithms achieve a regret that is

“close” to this lower bound. The magnitude of the regret can be interpreted as the economic

value of prior knowledge on the demand function; manifested as the revenue loss due to model

uncertainty.

Keywords: Revenue management, pricing, estimation, learning, exploration-exploitation, value

of information, asymptotic analysis.

∗Research partially supported by NSF grant DMI-0447562

†The Wharton School, e-mail: [email protected]

‡Graduate School of Business, e-mail: [email protected]


1 Introduction

Since the deregulation of the airline industry in the 1970’s, revenue management practices that

deal with complex pricing and demand management decisions have become increasingly prevalent

in a variety of industries; see Bitran and Caldentey (2003), Elmaghraby and Keskinocak (2003)

and Talluri and van Ryzin (2005) for a broad overview of the field and recent developments. (The

literature review in Section 2 provides a detailed survey of work directly related to the current

paper.) A critical assumption made in most academic studies of revenue management problems

is that the functional relationship between the mean demand rate and price, often referred to as

the demand function or demand curve, is known to the decision maker. This makes the underlying

problems more tractable and allows one to extract structural insights. At the same time, this

assumption of “full information” endows the decision maker with knowledge that s/he does not

typically possess in practice.

Lack of information concerning the demand model raises several fundamental questions. First,

and foremost, is it possible to quantify the “value” of full information (for example, by measuring

the revenue loss due to imperfect information)? Second, is it possible to achieve anything close to

the maximal revenues in the full information setting by judiciously combining real-time demand

learning and pricing strategies? Finally, how would such strategies exploit prior information, if any,

on the structure of the demand function?

The main objective of this paper is to shed some light on the aforementioned questions. Our

departure point will be a prototypical single product revenue management problem first introduced

and formalized by Gallego and van Ryzin (1994). This formulation models realized demand as a

Poisson process whose intensity at each point in time is determined by a price set by the decision

maker. Given an initial inventory, the objective is to dynamically price the product so as to

maximize expected revenues over a finite selling horizon. In the dynamic optimization problem

considered in Gallego and van Ryzin (1994), the decision maker knows the demand function prior

to the start of the selling season and designs optimal policies based on this information. In the

setting we pursue in this paper it is only possible to observe realized demand over time, and the

demand function itself is not known. To that end, we consider two levels of uncertainty with regard

to the demand model: i.) a nonparametric setting where the demand function is only assumed to

belong to a broad functional class satisfying mild regularity conditions; and ii.) a parametric setting

in which the demand function admits a given parametric structure but the parameter values are

not known.

The absence of perfect prior information concerning the demand model introduces an important

new component into the above dynamic optimization problem, namely, tension between exploration

(demand learning) and exploitation (pricing). The longer one spends learning the demand characteristics,

the less time remains to exploit that knowledge and optimize profits. On the other hand,

less time spent on demand learning leaves more residual uncertainty that could hamper any pricing

strategy in the exploitation phase. One of the main contributions of this paper is to formulate this

dynamic pricing problem under incomplete information, and to pursue an analysis that highlights

the key tradeoffs discussed above, and articulates them in a precise mathematical manner.

To address uncertainty with regard to the demand model, we introduce a family of pricing

policies that learn the demand function “on the fly.” Their performance will be measured in terms

of the revenue loss relative to a full information benchmark that assumes knowledge of the demand

function. We refer to this loss as the regret associated with not knowing the demand function a

priori; the magnitude of the regret quantifies the economic value of prior model information. The

policies we consider are designed with the objective of achieving a “small” regret uniformly over the

relevant class of demand functions (either parametric or nonparametric). This adversarial setting,

where nature is allowed to counter a chosen policy with the “worst” demand function, ensures that

policies exhibit “good” performance irrespective of the true demand model.

The complexity of the problem described above makes it difficult to evaluate the performance

of any reasonable policy, except via numerical experiments. To address this issue, we consider an

asymptotic regime which is characterized by a high volume of sales. More specifically, the initial

level of inventory and the magnitude of demand (“market size”) grow large in proportion to each

other; see Gallego and van Ryzin (1994) and Talluri and van Ryzin (2005) for further examples in

the revenue management literature that adopt this framework. This regime allows us to bound the

magnitude of the regret, and to establish a rather surprising result with regard to our proposed

policies: as the sales volume grows large, the regret eventually shrinks to zero. That is, these

policies achieve (asymptotically) the maximal full information revenues, despite the absence of

prior information regarding the demand function; in that sense, they are asymptotically optimal.

In more detail, the main contributions of this paper are summarized as follows.

i.) We introduce a nonparametric pricing policy (see Algorithm 1) that requires almost no prior

information on the demand function. In settings where the structure of the demand function

is known up to the value/s of certain parameter/s, we develop a parametric pricing policy

based on Maximum Likelihood estimation (see Algorithm 2 and Algorithm 3).

ii.) We establish lower bounds on the regret that hold for any admissible learning and pricing

policy (see Propositions 2 and 4).

iii.) We derive upper bounds on the performance of our nonparametric and parametric pricing

policies (see Propositions 1 and 3). In all cases the proposed policies achieve a regret that

is “not far” from the lower bound described above. In the parametric setting when only one

parameter is unknown, we prove that essentially no admissible pricing policy can achieve a

smaller regret than our proposed method (see Proposition 5).


iv.) Building on ideas from stochastic approximations, we indicate how one can develop more

refined sequential policies, and illustrate this in the nonparametric setting (see Algorithm 4).

Returning to the questions raised earlier in this section, our results shed light on the following

issues. First, despite having only limited (or almost no) prior information, it is possible to construct

joint learning and pricing policies that generate revenues which are “close” to the best achievable

performance with full information. Our results highlight an interesting observation. In the full

information setting, Gallego and van Ryzin (1994) prove that fixed price heuristics lead to near-

optimal revenues, hence the value of dynamic price changes is (at least asymptotically) limited. In

a setting with incomplete information, price changes play a much more pivotal role, as they are

relied upon to resolve uncertainty with regard to the demand function.

The regret bounds described above rigorously quantify the economic value of a priori information

on the demand model. Alternatively, the lost revenues can be viewed as quantifying the “price”

paid for model uncertainty. Finally, our work highlights an important issue related to model

misspecification risk. In particular, if an algorithm is designed under parametric assumptions, it is

prone to such risk as the true demand function may not (and in many cases will not) belong to the

assumed parametric family. Our regret bounds provide a means for quantifying the “price” that

one pays for eliminating this risk via nonparametric approaches; see also the numerical illustration

in Section 6.

The remainder of the paper. The next section reviews related literature. Section 3 introduces

the model and formulates the problem. Section 4 studies the nonparametric setting and Section 5

focuses on cases where the demand function possesses a parametric structure. Section 6 presents

numerical results and discusses some qualitative insights. Section 7 formulates adaptive versions

of the nonparametric algorithm and illustrates their performance. All proofs are collected in two

appendices: Appendix A contains the proofs of the main results; and Appendix B contains proofs

of auxiliary lemmas.

2 Related Literature

Parametric approaches. The majority of revenue management studies that address demand

function uncertainty do so by assuming that one or more parameters characterizing this function are

unknown. The typical approach here follows a dynamic programming formulation with Bayesian

updating, where a prior on the distribution of the unknown parameters is initially postulated.

Recent examples include Aviv and Pazgal (2005), Araman and Caldentey (2005) and Farias and

Van Roy (2006), all of which assume a single parameter is unknown. (See also Lobo and Boyd

(2003) and Carvalho and Puterman (2005).) Scarf (1959) was one of the first papers to use this

Bayesian formulation, though in the context of inventory management.


While the Bayesian approach provides for an attractive stylized analysis of the joint learning

and pricing problem, it suffers from significant shortcomings. Most notably, the objective of the

dynamic optimization problem involves an expectation that is taken relative to a prior distribution

over the unknown parameters. Hence any notion of optimality associated with a Bayesian-based

policy is with respect to that prior. Moreover, the specification of this prior distribution is typically

constrained to so-called conjugate families and is not driven by “real” prior information; the hindering

element here is the computation of the posterior via Bayes rule. The above factors introduce

significant restrictions on the models that are amenable to analysis via Bayesian dynamic programming.

Bertsimas and Perakis (2003) have considered an alternative to this formulation using a least

squares approach in the context of a linear demand model. Our work in the parametric setting

is based on maximum likelihood estimation, and hence is applicable to a wide class of parametric

models; as such, it offers an alternative to Bayesian approaches that circumvents some of their

deficiencies.

Nonparametric approaches. The main difficulty facing nonparametric approaches is loss of

tractability. Most work here has been pursued in relatively simple static settings that do not allow

for learning of the demand function; see, e.g., Ball and Queyranne (2006) and Eren and Maglaras

(2007) for a competitive ratio formulation, and Perakis and Roels (2006) for a minimax regret

formulation. These studies focus almost exclusively on structural insights. The recent paper by

Rusmevichientong et al. (2006) develops a nonparametric approach to a multiproduct static pricing

problem, based on historical data. The formulation does not incorporate inventory constraints

or finite sales horizon considerations. In the context of optimizing seat allocation policies for a

single flight multi-class problem, van Ryzin and McGill (2000) show that one can use stochastic

approximation methods to reach near-optimal capacity protection levels in the long run. The

absence of parametric assumptions in this case is with respect to the distributions of customers’

requests for each class. (See also Huh and Rusmevichientong (2006) for a related study.)

Perhaps the most closely related paper to our current work is that of Lim and Shanthikumar

(2006) who formulate a robust counterpart to the single product revenue management problem of

Gallego and van Ryzin (1994). In that paper the uncertainty arises at the level of the point process

distribution characterizing realized demand, and the authors use a max-min formulation where

nature is adversarial at every point in time. This type of conservative setting effectively precludes

any real-time learning, and moreover does not lend itself to prescriptive solutions.

Related work in other disciplines. The general problem of dynamic optimization with

limited or no information about a response function has also attracted attention in other fields. In

economics, a line of work that traces back to Hannan (1957) studies settings where the decision

maker faces an oblivious opponent. The objective is to minimize the difference between the rewards

accumulated by a given policy, and the rewards accumulated by the best possible single action had


the decision maker known in advance the actions of the adversary; see Foster and Vohra (1999) for

a review of this line of work and its relation to developments in other fields.

A classical formulation of sequential optimization under uncertainty that captures the essence

of the exploration-exploitation tradeoff, is the multiarmed bandit paradigm that dates back to

the work of Robbins (1952). This was originally introduced as a model of clinical trials in the

statistics literature, but has since been used in many other settings; see, e.g., Lai and Robbins

(1985) and references therein. Related studies in the computer science literature include Auer

et al. (2002) who study an adversarial version of a multi-armed bandit problem, and Kleinberg

and Leighton (2003) who provide an analysis of an on-line posted-price auction using these tools.

(See Cesa-Bianchi and Lugosi (2006) for a recent and comprehensive survey.) Our work shares an

important common theme with the streams of literature surveyed above, insofar as it too highlights

exploration-exploitation tradeoffs. On the other hand, our work represents a significant departure

from antecedent literature along three important dimensions that are characteristic of our dynamic

pricing problem: we deal with a constrained dynamic optimization problem (the constraint arising

from the initial inventory level); the action space of the decision maker, namely the feasible price

set, is uncountable; and the action space of the adversary (nature), namely, the class of admissible

demand functions, is also uncountable.

3 Problem Formulation

Model primitives and basic assumptions. We consider a revenue management problem in

which a monopolist sells a single product. The selling horizon is denoted by T > 0, and after

this time sales are discontinued and there is no salvage value for the remaining unsold products.

Demand for the product at any time t ∈ [0, T] is given by a Poisson process with intensity $\lambda_t$, which measures the instantaneous demand rate (in units such as number of products requested per hour, say). Let λ : R+ → R+ denote the demand function; if the price at time t is p(t), then the instantaneous demand rate at time t is given by $\lambda_t = \lambda(p(t))$, and realized demand is a controlled Poisson process with this intensity.

We assume that the set of feasible prices is $[\underline{p}, \overline{p}] \cup p_\infty$, where $0 < \underline{p} < \overline{p} < \infty$ and $p_\infty > 0$ is a price that “turns off” demand (and revenue rate), i.e., $\lambda(p_\infty) = 0$.¹ With regard to the demand function, we assume that λ(·) is non-increasing in the price p, has an inverse denoted by γ(·), and the revenue rate r(λ) := λγ(λ) is concave. These assumptions are quite standard in the revenue management literature, resulting in the term regular affixed to demand functions satisfying these conditions; see, e.g., Talluri and van Ryzin (2005, §7).

¹ The case $p_\infty = \infty$ can be incorporated by assuming $r(\lambda(p_\infty)) := \lim_{p \to p_\infty} r(\lambda(p)) = 0$, which implies that $\lim_{p \to p_\infty} \lambda(p) = 0$.

Let (p(t) : 0 ≤ t ≤ T) denote the price process, which is assumed to have sample paths that are right continuous with left limits taking values in $[\underline{p}, \overline{p}] \cup p_\infty$. Let N(·) be a unit rate Poisson process. The cumulative demand for the product up until time t is then given by $D(t) := N\left(\int_0^t \lambda(p(s))\,ds\right)$. We say that (p(t) : 0 ≤ t ≤ T) is non anticipating if the value of p(t) at each time t ∈ [0, T] is only allowed to depend on past prices {p(s) : s ∈ [0, t)} and demand values {D(s) : s ∈ [0, t)}. (More formally, the price process is adapted to the filtration generated by the past values of the demand and price processes.)
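To make the controlled Poisson construction concrete, the following is a minimal sketch that simulates sales under a piecewise-constant price path. It assumes exponential inter-arrival times at each fixed price (which is what a Poisson process with that intensity produces); the function name `simulate_sales` and the linear demand function in the usage lines are hypothetical choices made only for illustration.

```python
import random

def simulate_sales(price_path, demand_fn, inventory, rng):
    """Simulate sales under a piecewise-constant price path.

    price_path: list of (price, duration) segments applied in order.
    At a fixed price p, inter-arrival times are exponential with rate
    demand_fn(p), so arrivals on the segment form a Poisson process with
    that intensity. Selling stops once inventory runs out, which plays
    the role of switching to the shut-off price.
    Returns (units_sold, revenue).
    """
    sold, revenue = 0, 0.0
    for price, duration in price_path:
        rate = demand_fn(price)
        if rate <= 0.0:
            continue  # no demand at this price
        # Memorylessness lets us restart the arrival clock on each segment.
        t = rng.expovariate(rate)
        while t <= duration and sold < inventory:
            sold += 1
            revenue += price
            t += rng.expovariate(rate)
    return sold, revenue

# Hypothetical demand function (an assumption, not taken from the paper).
def linear_demand(p):
    return max(0.0, 30.0 - 3.0 * p)

rng = random.Random(0)
sold, revenue = simulate_sales([(2.0, 0.5), (5.0, 0.5)], linear_demand, 20, rng)
```

Because the price is only allowed to change at segment boundaries fixed in advance, this path is trivially non anticipating; a learning policy would instead choose later segments based on the counts observed so far.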

Information structure and the economic optimization problem. We assume that the

decision maker does not know the true demand function λ, but is able to continuously observe

realized demand at all time instants starting at time 0 and up until the end of the selling horizon

T . The only information available regarding λ is that it belongs to a class of admissible demand

functions, L; in Section 4, L will be taken to be a nonparametric class of functions, and in Section

5 it will be restricted to a parametric class. Thus, the makeup of the class L summarizes prior

information on the demand model.

We shall use π to denote a pricing policy, which, roughly speaking, maps the above information

structure to a non anticipating price process (p(t) : 0 ≤ t ≤ T ). With some abuse of terminology,

we will use the term “policy” to refer to the price process itself and the algorithm that generates

it, interchangeably. Put

$$N^\pi(t) := N\left(\int_0^t \lambda(p(s))\,ds\right) \quad \text{for } 0 \le t \le T, \tag{1}$$

where $N^\pi(t)$ denotes the cumulative demand up to time t under the policy π.

Let x > 0 denote the inventory level (number of products) at the start of the selling season. A

pricing policy π is said to be admissible if the induced price process satisfies

$$\int_0^T dN^\pi(s) \le x \quad \text{a.s.}, \tag{2}$$

$$p(s) \in [\underline{p}, \overline{p}] \cup p_\infty, \quad 0 \le s \le T. \tag{3}$$

It is important to note that while the decision maker does not know the demand function, knowledge

that λ(p∞) = 0 guarantees that the constraint (2) can be met. Let P denote the set of admissible

pricing policies.

The dynamic optimization problem faced by the decision maker under the information structure

described above is: choose π ∈ P to maximize the total expected revenues

$$J^\pi(x, T; \lambda) := \mathbb{E}\left[\int_0^T p(s)\,dN^\pi(s)\right]. \tag{4}$$

The dependence on λ in the left-hand-side is indicative of the fact that the expectation on the

right-hand-side is taken with respect to the true demand distribution. Since the decision maker

cannot compute the expectation in (4) without knowing the underlying demand function, the above


optimization problem does not seem to be well posed. In a sense one can view the solution of (4) as

being made possible only with the aid of an “oracle” which can compute the quantity Jπ(x, T ; λ)

for any given policy. We will now redefine the decision maker’s objective in a more suitable manner,

using the notion of a full information benchmark.

A full information benchmark. Let us first explain how the analysis of the dynamic optimization

problem described in (4) proceeds when one removes two significant obstacles: lack

of knowledge of the demand function λ prior to the start of the selling season; and stochastic

variability in realized demand. In particular, consider the following full information deterministic

optimization problem, in which the function λ is assumed to be known at time t = 0:

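$$\begin{aligned}
\sup\ & \int_0^T r(\lambda(p(s)))\,ds, \\
\text{s.t. } & \int_0^T \lambda(p(s))\,ds \le x, \\
& p(s) \in [\underline{p}, \overline{p}] \cup p_\infty \ \text{ for all } s \in [0, T].
\end{aligned} \tag{5}$$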

This problem is obtained from (4), and the admissibility conditions (2)-(3), by replacing the random

process characterizing customer purchase requests by its mean rate. For example, if one focuses on

the objective (4), then the deterministic objective in (5) is obtained by substituting “λ(p(s))ds”

for “dNπ(s)” since r(λ(p(s))) = p(s)λ(p(s)). The same parallel can be drawn between the first

constraint of the deterministic problem and (2). Consequently, it is reasonable to refer to (5) as a

full information deterministic relaxation of the original dynamic pricing problem (4).

Let us denote the value of (5) as $J^D(x, T|\lambda)$, where ‘D’ is mnemonic for deterministic and the

choice of notation with respect to λ reflects the fact that the optimization problem is solved “conditioned”

on knowing the true underlying demand function. The value of the full information

deterministic relaxation provides, as one would anticipate, an upper bound on expected revenues

generated by any pricing policy π ∈ P, that is, Jπ(x, T ; λ) ≤ JD(x, T |λ) for all λ ∈ L; this rather

intuitive observation is formalized in Lemma 1 in Appendix A which essentially generalizes Gallego

and van Ryzin (1994, Proposition 2).

The minimax regret objective. As indicated above, for any demand function λ ∈ L, we have

that Jπ(x, T ; λ) ≤ JD(x, T |λ) for all admissible policies π ∈ P. With this in mind, we define the

regret Rπ(x, T ; λ), for any given function λ ∈ L and policy π ∈ P, to be

$$R^\pi(x, T; \lambda) = 1 - \frac{J^\pi(x, T; \lambda)}{J^D(x, T|\lambda)}. \tag{6}$$

The regret measures the percentage loss in performance of any policy π in relation to the benchmark

JD(x, T |λ). By definition, the value of the regret always lies in the interval [0, 1], and the smaller

the regret, the better the performance of a policy π; in the extreme case when the regret is zero,

then the policy π is guaranteed to extract the maximum full information revenues. Since the


decision maker does not know which demand function s/he will face in the class L, it is attractive

to design pricing policies that perform well irrespective of the actual underlying demand function.

In particular, if the decision maker uses a policy π ∈ P, and nature then “picks” the worst possible

demand function for that policy, then the resulting regret would be

$$\sup_{\lambda \in \mathcal{L}} R^\pi(x, T; \lambda). \tag{7}$$

In this game theoretic setting it is now possible to restate the decision maker’s objective, initially

given in (4), as follows: pick π ∈ P to minimize (7). The advantage of this formulation is that

the decision maker’s problem is now well posed: for any π ∈ P and fixed λ ∈ L it is possible, at

least in theory, to compute the numerator on the right hand side in (6), and hence (7). Roughly

speaking, one can attach a worst case λ ∈ L to “each” policy π ∈ P, and subsequently one can

try to “optimize” this by searching for the policy with the best worst-case performance. In other

words, we are interested in characterizing the minimax regret

$$\inf_{\pi \in \mathcal{P}} \sup_{\lambda \in \mathcal{L}} R^\pi(x, T; \lambda). \tag{8}$$

This quantity has an obvious physical interpretation: it measures the monetary value (in normalized

currency units) of knowing the demand function a priori. The issue of course is that, barring

exceedingly simple cases, it is not possible to compute the minimax regret. Our objective in what

follows will be to characterize this quantity by deriving suitable bounds on (8) in cases where the

class L is nonparametric or is restricted to a suitable parametric class of demand functions.
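The inf-sup structure of (8) can be illustrated on a drastically simplified instance. The sketch below is not the problem studied in this paper: it assumes a finite family of demand functions, restricts attention to policies that hold one fixed price until stock-out, and replaces random demand by its mean rate, so that both the policy revenue and the benchmark are deterministic and the benchmark is taken over the same price grid. The names `fluid_revenue` and `minimax_fixed_price` and the two linear demand functions are hypothetical.

```python
def fluid_revenue(price, demand_fn, x, T):
    """Revenue from holding `price` until stock-out in the mean-rate system."""
    rate = demand_fn(price)
    if rate <= 0.0:
        return 0.0
    return price * rate * min(T, x / rate)

def minimax_fixed_price(demand_family, prices, x, T):
    """Return (best_price, worst_case_regret) over fixed-price policies.

    The benchmark for each demand function is the best fixed price on the
    same grid, so every regret value lies in [0, 1].
    """
    benchmarks = [max(fluid_revenue(p, lam, x, T) for p in prices)
                  for lam in demand_family]

    def worst_case(p):
        # Nature picks the demand function that hurts this price the most.
        return max(1.0 - fluid_revenue(p, lam, x, T) / jd
                   for lam, jd in zip(demand_family, benchmarks))

    best = min(prices, key=worst_case)
    return best, worst_case(best)

# Two hypothetical linear demand functions a - b * p (illustration only).
family = [lambda p, a=a, b=b: max(0.0, a - b * p)
          for a, b in [(30.0, 3.0), (24.0, 2.0)]]
prices = [1.0 + 0.1 * i for i in range(81)]
best_price, regret = minimax_fixed_price(family, prices, x=20.0, T=1.0)
```

With a single demand function in the family the worst-case regret of the best grid price is zero; the regret becomes positive only when one price has to hedge against several candidate demand functions, which is the effect that (8) quantifies.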

4 Main Results: The Nonparametric Case

4.1 A nonparametric pricing algorithm

We introduce below a learning and pricing policy defined through two tuning parameters (κ, τ): κ

is a positive integer and τ ∈ (0, T ]. The general structure, which is summarized for convenience in

algorithmic form, is divided into two main stages. A “learning” phase (exploration) of length τ is

first used, in which κ prices are tested. Then a “pricing” phase (exploitation) fixes a “good” price

based on demand observations in the first phase. The intuition underlying the method is discussed

immediately following the description of the method.

Algorithm 1 : π(τ, κ)

Step 1. Initialization:

(a) Set the learning interval to be [0, τ], and the number of prices to experiment with to be κ. Put ∆ = τ/κ.

(b) Divide $[\underline{p}, \overline{p}]$ into κ equally spaced intervals and let $\{p_i, i = 1, \ldots, \kappa\}$ be the left endpoints of these intervals.

Step 2. Learning/experimentation:

(a) On the interval [0, τ] apply $p_i$ from $t_{i-1} = (i-1)\Delta$ to $t_i = i\Delta$, i = 1, 2, ..., κ, as long as inventory is positive. If no more units are in stock, apply $p_\infty$ up until time T and STOP.

(b) Compute

$$\hat{d}(p_i) = \frac{\text{total demand over } [t_{i-1}, t_i)}{\Delta}, \quad i = 1, \ldots, \kappa.$$

Step 3. Optimization:

Compute

$$\hat{p}^u = \arg\max_{1 \le i \le \kappa} \{p_i \hat{d}(p_i)\}, \qquad \hat{p}^c = \arg\min_{1 \le i \le \kappa} \left|\hat{d}(p_i) - x/T\right|, \tag{9}$$

and set

$$\hat{p} = \max\{\hat{p}^c, \hat{p}^u\}. \tag{10}$$

Step 4. Pricing:

On the interval (τ, T] apply $\hat{p}$ as long as inventory is positive, then apply $p_\infty$ for the remaining time.
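The four steps above can be sketched in code as follows. This is a simulation sketch under stated assumptions, not a reference implementation: arrivals are generated with exponential inter-arrival times, the true demand function is passed in only to drive the simulated market (the pricing logic never reads it), and the function name `run_algorithm1`, the linear demand function, and the numerical values in the usage lines are hypothetical.

```python
import random

def run_algorithm1(demand_fn, p_low, p_high, x, T, tau, kappa, rng):
    """Simulate one sample path of the two-phase policy.

    Returns (p_hat, revenue); p_hat is None if stock runs out while learning.
    """
    delta = tau / kappa
    width = (p_high - p_low) / kappa
    prices = [p_low + i * width for i in range(kappa)]  # left endpoints (Step 1b)
    inventory, revenue = x, 0.0

    def sell(price, duration):
        """Apply `price` for `duration`; return the number of units sold."""
        nonlocal inventory, revenue
        rate, sold = demand_fn(price), 0
        if rate <= 0.0:
            return 0
        t = rng.expovariate(rate)
        while t <= duration and inventory > 0:
            inventory -= 1
            sold += 1
            revenue += price
            t += rng.expovariate(rate)
        return sold

    # Step 2: hold each test price for delta time units, record the demand rate.
    d_hat = []
    for p in prices:
        d_hat.append(sell(p, delta) / delta)
        if inventory == 0:
            return None, revenue  # stock-out: shut-off price until T, STOP
    # Step 3: empirical counterparts of the unconstrained and run-out prices.
    i_u = max(range(kappa), key=lambda i: prices[i] * d_hat[i])
    i_c = min(range(kappa), key=lambda i: abs(d_hat[i] - x / T))
    p_hat = max(prices[i_u], prices[i_c])
    # Step 4: apply p_hat on (tau, T] while inventory lasts.
    sell(p_hat, T - tau)
    return p_hat, revenue

# Hypothetical market of size n with linear demand (illustration only).
n = 10_000
p_hat, revenue = run_algorithm1(
    demand_fn=lambda p: n * max(0.0, 30.0 - 3.0 * p),
    p_low=1.0, p_high=9.0, x=20 * n, T=1.0,
    tau=n ** -0.25, kappa=round(n ** 0.25), rng=random.Random(1))
```

The inventory `x` is assumed to be an integer number of units. Returning `None` for the price on a stock-out during learning mirrors the STOP in Step 2(a), since nothing further can be sold in that case.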

Intuition and key underlying ideas. At first, a nonparametric empirical estimate of the

demand function is obtained based on a learning phase of length τ described in Steps 1 and 2. The

intuition underlying Steps 3 and 4 is based on the analysis of the deterministic relaxation (5) whose solution (see Lemma 1 in Appendix A) is given by $p(s) = p^D := \max\{p^u, p^c\}$ for $s \in [0, T']$ and $p(s) = p_\infty$ for $s > T'$, where

$$p^u = \arg\max_{p \in [\underline{p}, \overline{p}]} \{r(\lambda(p))\}, \qquad p^c = \arg\min_{p \in [\underline{p}, \overline{p}]} |\lambda(p) - x/T|,$$

and $T' = \min\{T, x/\lambda(p^D)\}$. Here, the superscripts “u” and “c” stand for unconstrained and constrained, respectively, in reference to whether the inventory constraint in (5) is binding or not. In particular, the deterministic problem (5) can be solved by restricting attention to the two prices $p^u$ and $p^c$. Algorithm 1 hinges on this observation.
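When the demand function is known, the two-price characterization of the deterministic solution is easy to evaluate numerically. The sketch below approximates the unconstrained price, the run-out price, their maximum, and the resulting value of (5) on a price grid; the function name `deterministic_benchmark`, the grid search, and the linear demand function used in the example are assumptions made for illustration.

```python
def deterministic_benchmark(demand_fn, p_low, p_high, x, T, grid_size=10001):
    """Grid approximation of (p_u, p_c, p_D, J_D) for a known demand function."""
    grid = [p_low + (p_high - p_low) * i / (grid_size - 1)
            for i in range(grid_size)]
    # Unconstrained price: maximizes the revenue rate p * lambda(p).
    p_u = max(grid, key=lambda p: p * demand_fn(p))
    # Run-out price: demand rate closest to x / T, so stock lasts the horizon.
    p_c = min(grid, key=lambda p: abs(demand_fn(p) - x / T))
    p_d = max(p_u, p_c)
    rate = demand_fn(p_d)
    # T' = min{T, x / lambda(p_D)}; written to avoid dividing by a zero rate.
    t_stop = T if rate * T <= x else x / rate
    return p_u, p_c, p_d, p_d * rate * t_stop

# Hypothetical linear demand on the price range [1, 9] (illustration only).
p_u, p_c, p_d, j_d = deterministic_benchmark(
    lambda p: 30.0 - 3.0 * p, 1.0, 9.0, x=20.0, T=1.0)
```

In this example the revenue rate p(30 − 3p) peaks at p = 5, where the demand rate 15 is below x/T = 20, so the inventory constraint is slack and the unconstrained price is applied over the whole horizon. Lowering x to 9 makes the constraint bind and moves the solution to the run-out price p = 7.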

In Step 3 the objective is to obtain an accurate estimate of $p^D$ based on the observations during the “exploration” phase, while at the same time keeping τ “small” in order to limit the revenue loss over this learning phase. The algorithm then applies this price on (τ, T]. With the exception of the “short” initial phase [0, τ], the expected revenues (in the real system) will be close to those achieved by $p^D$ over [0, T]. The analysis in Gallego and van Ryzin (1994) establishes that those revenues would be close to $J^D(x, T|\lambda)$, and as a result, the regret $R^\pi(x, T; \lambda)$ should be small.


To estimate $p^D$, the algorithm dedicates an initial portion [0, τ] of the total selling interval [0, T] to an exploration of the price domain. On this initial interval, the algorithm experiments with κ prices where each is kept fixed for τ/κ units of time. This structure leads to three main sources of error in the search for $p^D$. First, during the learning phase one incurs an exploration bias since the prices being tested there are not close to $p^D$ (or close to the optimal fixed price for that matter). This incurs losses of order τ. Second, experimenting with only a finite number of prices κ in the search for $p^D$ results in a deterministic error of order 1/κ in Step 3. Finally, only “noisy” demand observations are available at each of the κ price points, and the longer a price is held fixed, the more accurate the estimate of the mean demand rate at that price. This introduces a stochastic error of order $(\tau/\kappa)^{-1/2}$ stemming from the nature of the Poisson process. The crux of the matter is to balance these three error sources by a suitable choice of the tuning parameters τ and κ.

4.2 Model uncertainty: the class of demand functions

The nonparametric class of functions we consider consists of regular demand functions (satisfying the standard conditions laid out in Section 3) which in addition satisfy the following.

Assumption 1 For some finite positive constants $M, \underline{K}, \overline{K}, m$, with $\underline{K} \le \overline{K}$:

(i.) Boundedness: $|\lambda(p)| \le M$ for all $p \in [\underline{p}, \overline{p}]$.

(ii.) Lipschitz continuity: $|\lambda(p) - \lambda(p')| \le \overline{K}|p - p'|$ for all $p, p' \in [\underline{p}, \overline{p}]$ and $|\gamma(l) - \gamma(l')| \le \underline{K}^{-1}|l - l'|$ for all $l, l' \in [\lambda(\overline{p}), \lambda(\underline{p})]$.

(iii.) Minimum revenue rate: $\max\{p\lambda(p) : p \in [\underline{p}, \overline{p}]\} \ge m$.

Let $\mathcal{L} := \mathcal{L}(M, \underline{K}, \overline{K}, m)$ denote this class of demand functions. Assumption 1(i.) and 1(ii.) are quite benign, only requiring minimal smoothness of the demand function; (ii.) also ensures that when λ(·) is positive, it does not have “flat regions.” Assumption 1(iii.) states that a minimal revenue rate exists, hence avoiding trivialities.

Note that assumptions (i.)-(iii.) above hold for many models of the demand function used in the

revenue management and economics literature (e.g., linear, exponential and iso-elastic/Pareto with

parameters lying in a compact set; cf. Talluri and van Ryzin (2005, §7) for further examples).
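For a concrete demand function, the constants in Assumption 1 can be estimated numerically. The sketch below is a grid-based check and not a proof: it reads the Lipschitz condition on the inverse as a lower bound on the magnitude of the slope of λ (which is what that condition amounts to when γ is the inverse of λ), and the function name `assumption1_constants` and the linear demand function are hypothetical.

```python
def assumption1_constants(demand_fn, p_low, p_high, grid_size=1001):
    """Grid estimates of the tightest constants in Assumption 1.

    M        smallest bound on |lambda(p)|
    K_upper  smallest Lipschitz constant of lambda
    K_lower  largest lower bound on the slope magnitude (inverse condition)
    m        largest admissible minimum revenue rate
    """
    grid = [p_low + (p_high - p_low) * i / (grid_size - 1)
            for i in range(grid_size)]
    lam = [demand_fn(p) for p in grid]
    # Finite-difference slope magnitudes between neighbouring grid points.
    slopes = [abs(lam[i + 1] - lam[i]) / (grid[i + 1] - grid[i])
              for i in range(grid_size - 1)]
    return {
        "M": max(abs(v) for v in lam),
        "K_upper": max(slopes),
        "K_lower": min(slopes),
        "m": max(p * v for p, v in zip(grid, lam)),
    }

# Hypothetical linear demand on [1, 9] (illustration only).
constants = assumption1_constants(lambda p: 30.0 - 3.0 * p, 1.0, 9.0)
```

For the linear example both slope constants coincide with the slope magnitude 3, the bound M equals the demand rate at the lowest price, and m equals the peak revenue rate 75. A demand function with a flat stretch would produce `K_lower` equal to zero and so fall outside the class.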

4.3 Performance analysis

Since minimax regret is hardly a tractable quantity, we introduce in this section an asymptotic

regime characterized by a “high volume of sales,” which will be used to analyze the performance

of Algorithm 1. We consider a regime in which both the size of the initial inventory and potential demand grow proportionally large. In particular, for a market of “size” n, where n is a positive integer, the initial inventory and the demand function are now assumed to be given by

$$x_n = nx \quad \text{and} \quad \lambda_n(\cdot) = n\lambda(\cdot). \tag{11}$$

Thus, the index n determines the order of magnitude of both inventory and rate of demand. We will denote by $\mathcal{P}_n$ the set of admissible policies for a market of scale n, and the expected revenues under a policy $\pi_n \in \mathcal{P}_n$ will be denoted $J^\pi_n(x, T; \lambda)$. With some abuse of notation, we will occasionally use π to denote a sequence of policies $\{\pi_n, n = 1, 2, \ldots\}$ as well as any element of that sequence, omitting the subscript “n” to avoid cluttering the notation. For each n = 1, 2, ..., we denote by $J^D_n(x, T|\lambda)$ the value of the deterministic relaxation given in (5) with the scaling given in (11); it is straightforward to verify that $J^D_n(x, T|\lambda) = nJ^D(x, T|\lambda)$. Finally, let the regret be denoted as $R^\pi_n(x, T; \lambda) := 1 - J^\pi_n(x, T; \lambda)/J^D_n(x, T|\lambda)$. The following definition characterizes admissible policies that have “good” asymptotic properties.

Definition 1 (Asymptotic optimality) A sequence of admissible policies $\pi_n \in \mathcal{P}_n$ is said to be asymptotically optimal if

$$\sup_{\lambda \in \mathcal{L}} R^\pi_n(x, T; \lambda) \to 0 \quad \text{as } n \to \infty. \tag{12}$$

In other words, asymptotically optimal policies achieve the full information upper bound on revenues as n → ∞, uniformly over the class of admissible demand functions.

For the purpose of asymptotic analysis we use the following notation: for real valued positive

sequences $\{a_n\}$ and $\{b_n\}$ we write $a_n = O(b_n)$ if $a_n/b_n$ is bounded from above by a constant, and if $a_n/b_n$ is also bounded from below then we write $a_n \asymp b_n$. We now analyze the performance of

policies associated with Algorithm 1 in the asymptotic regime described above.

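Proposition 1 Let Assumption 1 hold. Set $\tau_n \asymp n^{-1/4}$, $\kappa_n \asymp n^{1/4}$ and let $\pi_n := \pi(\tau_n, \kappa_n)$ be given by Algorithm 1. Then, the sequence $\{\pi_n\}$ is asymptotically optimal, and for all n ≥ 1

$$\sup_{\lambda \in \mathcal{L}} R^{\pi}_n(x, T; \lambda) \le \frac{C(\log n)^{1/2}}{n^{1/4}}, \tag{13}$$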

for some finite positive constant C.

The constant C above depends only on the parameters characterizing the class L, the initial in-

ventory x, and the time horizon T . The exact dependence is somewhat complex and is omitted,

however, we note that it is fully consistent with basic intuition: as one expands the class L by

suitably increasing or decreasing the value of the parameters in Assumption 1, the magnitude

of C grows, and vice versa. (This can also be inferred by carefully inspecting the proof of the

proposition.)

We next present a lower bound on the minimax regret which establishes a fundamental limit on

the performance of any admissible pricing policy. Roughly speaking, the main idea behind this

result is to construct a “worst case” demand function such that the regret is large for any policy.


Proposition 2 Let Assumption 1 hold with M, K satisfying M ≥ max{2Kp, Kp + x/T}. Then

there exists a finite positive constant C ′ such that for any sequence of admissible policies {πn} and

for all n ≥ 1

$$\sup_{\lambda \in \mathcal{L}} R^{\pi}_n(x, T; \lambda) \ge \frac{C'}{n^{1/2}}. \tag{14}$$

Combining Propositions 1 and 2, one can characterize the magnitude of the minimax regret as

follows:

$$\frac{C'}{n^{1/2}} \le \inf_{\pi \in \mathcal{P}_n} \sup_{\lambda \in \mathcal{L}} R^{\pi}_n(x, T; \lambda) \le \frac{C(\log n)^{1/2}}{n^{1/4}}, \tag{15}$$

and hence the performance of Algorithm 1 is “not far” from being minimax optimal (i.e., achieving

the lower bound). A question that remains open is whether one can close this gap by further

refining the algorithm. We revisit this point in Sections 5 and 7.
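To get a feel for the size of the gap in (15), the two rates can be tabulated for a few market sizes. The sketch below is only illustrative: it sets the unspecified constants C and C′ to 1 (an assumption, since the text does not give their values), and the function names are hypothetical.

```python
import math

def upper_bound_rate(n):
    """Rate in the upper bound (13): (log n)^(1/2) / n^(1/4), with C set to 1."""
    return math.sqrt(math.log(n)) / n ** 0.25

def lower_bound_rate(n):
    """Rate in the lower bound (14): n^(-1/2), with C' set to 1."""
    return n ** -0.5

if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6):
        up, low = upper_bound_rate(n), lower_bound_rate(n)
        print(f"n={n:>8}: upper rate={up:.4f}, lower rate={low:.4f}, ratio={up / low:.1f}")
```

The ratio of the two rates grows with n, which is the sense in which a gap remains between the guarantee for Algorithm 1 and the lower bound.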

5 Main Results: The Parametric Case

In this section we assume the demand function is known to have a parametric form. Our goal is

to develop pricing policies that exploit this information and that will work well for a large class

of admissible parametric demand functions. One of the main questions of interest here is whether

these policies achieve a smaller regret relative to the nonparametric case, and if there exist policies

whose performance cannot be improved upon.

Preliminaries. Let k denote a positive integer and $\mathcal{L}(\Theta) = \{\lambda(\cdot; \theta) : \theta \in \Theta\}$ be a parametric family of demand functions, where $\Theta \subseteq \mathbb{R}^k$ is assumed to be a convex compact set, θ ∈ Θ is a parameter vector, and $\lambda : \mathbb{R}_+ \times \Theta \to \mathbb{R}_+$. We consider all parametric families that are subsets of the class of admissible regular demand functions defined in Section 4, i.e., $\mathcal{L}(\Theta) \subset \mathcal{L} := \mathcal{L}(M, \overline{K}, \underline{K}, m)$.

In the current setting, we will assume that the decision maker knows that λ ∈ L(Θ), i.e.,

s/he knows the parametric structure of the demand function, but does not know the value of the

parameter vector θ. We continue to denote by P the set of admissible pricing policies, i.e., policies

that satisfy (2)-(3). For any policy π ∈ P, let Jπ(x, T ; θ) denote the expected revenues under

π, and let JD(x, T |θ) denote the value of the deterministic relaxation (5) when the value of the

unknown parameter vector is revealed to the decision maker prior to the start of the selling season.

Let Rπ(x, T ; θ) denote the regret under a policy π, namely

$$R^{\pi}(x, T; \theta) = 1 - \frac{J^{\pi}(x, T; \theta)}{J^D(x, T|\theta)}. \tag{16}$$

5.1 The proposed method

We consider a simple modification of the approach taken in Algorithm 1 that now exploits the

assumed parametric structure of the demand function.


Algorithm 2: π(τ)

Step 1. Initialization:

(a) Set the learning interval to be [0, τ]. Put ∆ = τ/k, where k is the dimension of the parameter space Θ.

(b) Choose a set of k prices, $\{p_i, i = 1, \ldots, k\}$.

Step 2. Learning:

(a) On the interval [0, τ] apply $p_i$ from $t_{i-1} = (i-1)\Delta$ to $t_i = i\Delta$, i = 1, ..., k, as long as the inventory is positive. If no more units are in stock, apply $p_\infty$ up until time T and STOP.

(b) Compute

$$d_i = \frac{\text{total demand over } [t_{i-1}, t_i)}{\Delta}, \quad i = 1, \ldots, k.$$

(c) Let θ be a solution of $\{\lambda(p_i; \theta) = d_i, \; i = 1, \ldots, k\}$.

Step 3. Optimization:

$p^u(\theta) = \arg\max\{p\lambda(p; \theta) : p \in [\underline{p}, \overline{p}]\}$,

$p^c(\theta) = \arg\min\{|\lambda(p; \theta) - x/T| : p \in [\underline{p}, \overline{p}]\}$,

set $p = \max\{p^u(\theta), p^c(\theta)\}$.

Step 4. Pricing:

On the interval (τ, T] apply p as long as inventory is positive, then apply $p_\infty$ for the remaining time.

Note that Step 1(b) requires one to define the prices {p1, ..., pk} and Step 2(c) implicitly assumes

that the system of equations admits a solution. (We define the prices and state this assumption more

formally when analyzing the performance of Algorithm 2.) The intuition behind this algorithm is

similar to the one that underlies the construction of Algorithm 1, the only difference being that the

parametric structure allows one to infer accurate information about the demand function using only

a “small” number of “test” prices (k). In particular, recalling the discussion following Algorithm

1, a key difference is that the deterministic error source associated with price granularity does not

affect the performance of Algorithm 2.
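The steps of Algorithm 2 can be sketched in code for a two-parameter linear family. This is a minimal illustration under stated assumptions, not the authors' implementation: the demand model λ(p; θ) = (θ1 − θ2 p)+ with θ = (10, 2), the price range [0.1, 4.5], the test prices (1.0, 4.0), the fallback used when the fitted slope is nonpositive, and all function names are hypothetical choices. The benchmark uses the fact that, for these inputs, the deterministic relaxation prices at 2.5 and sells at rate 5.

```python
import random

def poisson_count(rate, duration, rng):
    """Arrivals of a homogeneous Poisson process over `duration`,
    generated by accumulating exponential inter-arrival times."""
    if rate <= 0 or duration <= 0:
        return 0
    count, t = 0, rng.expovariate(rate)
    while t < duration:
        count += 1
        t += rng.expovariate(rate)
    return count

def linear_demand(p, theta):
    """lambda(p; theta) = (theta1 - theta2 * p)^+ (hypothetical test model)."""
    return max(theta[0] - theta[1] * p, 0.0)

def clip(p, lo, hi):
    return min(max(p, lo), hi)

def algorithm2_path(n, x, T, theta, p_lo, p_hi, test_prices, rng):
    """Simulate one sample path; returns (revenue, committed price or None)."""
    p1, p2 = test_prices
    tau = n ** (-1.0 / 3.0)              # learning length, tau_n of order n^(-1/3)
    delta = tau / 2                      # k = 2 unknown parameters
    inventory, revenue, d = n * x, 0.0, []
    # Step 2(a)-(b): hold each test price for delta and record the demand rate.
    for p in (p1, p2):
        demand = poisson_count(n * linear_demand(p, theta), delta, rng)
        sales = min(demand, inventory)
        revenue += p * sales
        inventory -= sales
        if inventory <= 0:
            return revenue, None         # stocked out while learning: shut off
        d.append(demand / (n * delta))   # divide by n to estimate lambda(p)
    # Step 2(c): invert the two equations theta1 - theta2 * p_i = d_i.
    theta2_hat = (d[0] - d[1]) / (p2 - p1)
    theta1_hat = d[0] + p1 * theta2_hat
    # Step 3: plug the estimate into the two candidate prices.
    if theta2_hat <= 0:
        price = p_hi                     # fallback for an unusable fit (assumption)
    else:
        p_u = clip(theta1_hat / (2 * theta2_hat), p_lo, p_hi)
        p_c = clip((theta1_hat - x / T) / theta2_hat, p_lo, p_hi)
        price = max(p_u, p_c)
    # Step 4: apply the committed price until T or until stock runs out.
    demand = poisson_count(n * linear_demand(price, theta), T - tau, rng)
    revenue += price * min(demand, inventory)
    return revenue, price

def estimated_regret(n, reps, seed=0):
    """Monte Carlo estimate of the regret for the hypothetical instance above."""
    rng = random.Random(seed)
    x, T, theta = 8.0, 1.0, (10.0, 2.0)
    paths = [algorithm2_path(n, x, T, theta, 0.1, 4.5, (1.0, 4.0), rng)
             for _ in range(reps)]
    mean_revenue = sum(r for r, _ in paths) / reps
    j_det = n * 2.5 * 5.0 * T            # n * p^D * lambda(p^D) * T with p^D = 2.5
    return 1.0 - mean_revenue / j_det

if __name__ == "__main__":
    for n in (10**2, 10**3):
        print(n, round(estimated_regret(n, reps=20), 3))
```

Because only two prices are tested, the sketch has no price-grid error; the regret it reports comes from the revenue given up during the learning interval and from the sampling noise in the fitted parameters.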


5.2 The parametric class of demand functions

First note that the inclusion L(Θ) ⊂ L implies that λ(·; θ) has an inverse for all θ ∈ Θ; this

inverse will be denoted γ(·; θ). (Note also that conditions (i.)-(iii.) in Assumption 1 hold.) For any

parameter vector θ ∈ Θ we denote the revenue function by $r(l; \theta) := \gamma(l; \theta)\,l$. Let $\{P_\theta : \theta \in \Theta\}$ be the family of demand distributions corresponding to Poisson processes with controlled intensities λ(·; θ), θ ∈ Θ. We denote by $f_{\lambda(p;\theta)}(\cdot)$ the probability mass function of a Poisson random variable with intensity λ(p; θ).

The following technical conditions articulate standard regularity assumptions in the context of

Maximum Likelihood estimation (cf. Borovkov (1998)), and are used to define the admissible class

of parametric demand functions.

Assumption 2

(i.) There exists a vector of distinct prices $\vec{p} = (p_1, \ldots, p_k) \in [\underline{p}, \overline{p}]$ such that:

a) For some $l_0 > 0$, $\min_{1 \le i \le k} \inf_{\theta \in \Theta} \lambda(p_i; \theta) > l_0$.

b) For any vector $\vec{d} = (d_1, \ldots, d_k)$, the system of equations $\{\lambda(p_i; \theta) = d_i, \; i = 1, \ldots, k\}$ has a unique solution in θ. Let $g(\vec{p}, \vec{d})$ denote this solution. We assume in addition that $g(\vec{p}, \cdot)$ is Lipschitz continuous with constant α > 0.

c) For i = 1, ..., k, $\sqrt{\lambda(p_i; \theta)}$ is differentiable on Θ.

(ii.) For some $K_2 > 0$, $|\lambda(p; \theta) - \lambda(p; \theta')| \le K_2 \|\theta - \theta'\|_\infty$ for all $p \in [\underline{p}, \overline{p}]$ and θ, θ′ ∈ Θ.

Condition (i.) ensures that the parametric model is identifiable based on a sufficient set of obser-

vations. Condition (ii.) is a mild regularity assumption on the parametric class, controlling for

changes in the demand curve as parameters vary. We provide below an example of a parametric

class that satisfies Assumption 2.

Example 1 (Linear demand function) Let λ(p; θ) = θ1−θ2p, and set [p, p] = [1, 2]. Let L(Θ) =

{λ(·; θ) : θ ∈ [10, 20]× [1, 4]}. If one sets (p1, p2) = (1, 2), it is straightforward to verify conditions

(i.)a), (i.)c) and that (ii.) is satisfied with K2 = 3 (since |λ(p; θ) − λ(p; θ′)| ≤ |θ1 − θ′1| + p|θ2 − θ′2| ≤ (1 + p)‖θ − θ′‖∞ and p ≤ 2). It is also easy to see that the unique solution

associated with condition (i.)b) is given by

$$g(p_1, p_2, d_1, d_2) = \left( d_1 + \frac{p_1}{p_2 - p_1}\,[d_1 - d_2], \;\; \frac{1}{p_2 - p_1}\,[d_1 - d_2] \right)$$

and $g(p_1, p_2, d_1, d_2)$ is clearly Lipschitz continuous with respect to $(d_1, d_2)$.
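The inversion in Example 1 is easy to check numerically. The short sketch below (function names are hypothetical) maps a parameter pair to the two noiseless demand rates at (p1, p2) = (1, 2) and then recovers the pair with g.

```python
def demand_linear(p, theta1, theta2):
    """Linear demand of Example 1: lambda(p; theta) = theta1 - theta2 * p."""
    return theta1 - theta2 * p

def g_linear(p1, p2, d1, d2):
    """Unique solution (theta1, theta2) of theta1 - theta2 * p_i = d_i, i = 1, 2."""
    theta2 = (d1 - d2) / (p2 - p1)
    theta1 = d1 + p1 * theta2
    return theta1, theta2

if __name__ == "__main__":
    p1, p2 = 1.0, 2.0
    theta = (15.0, 2.0)                      # a point of [10, 20] x [1, 4]
    d1 = demand_linear(p1, *theta)
    d2 = demand_linear(p2, *theta)
    print(g_linear(p1, p2, d1, d2))
```

A perturbation of size ε in d1 moves θ2 by ε/(p2 − p1) and θ1 by ε(1 + p1/(p2 − p1)), which is the Lipschitz property in concrete form.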

5.3 Performance analysis

Suppose that the initial inventory level and the demand function are scaled according to (11), and denote the regret by $R^{\pi}_n(x, T; \theta) = 1 - J^{\pi}_n(x, T; \theta)/J^D_n(x, T|\theta)$.


Proposition 3 Let Assumptions 1 and 2 hold and let $\{p_1, \ldots, p_k\}$ be as in Assumption 2(i.). Set $\tau_n \asymp n^{-1/3}$ and let $\pi_n := \pi(\tau_n)$ be defined by Algorithm 2. Then the sequence of policies $\{\pi_n\}$ is asymptotically optimal and satisfies

$$\sup_{\theta \in \Theta} R^{\pi}_n(x, T; \theta) \le \frac{C(\log n)^{1/2}}{n^{1/3}}, \tag{17}$$

for all n ≥ 1 and some finite positive constant C.

Contrasting the above with Proposition 1, we observe that the parametric structure of the de-

mand function translates into improved performance bounds for the learning and pricing algo-

rithm that is designed with this knowledge in mind. In particular, the regret is now of order

$R_n = O((\log n)^{1/2} n^{-1/3})$ as opposed to $R_n = O((\log n)^{1/2} n^{-1/4})$ in the nonparametric case. In

addition, note that the upper bound above holds for all admissible parametric families. The im-

provement in terms of generated revenues, as quantified by the smaller magnitude of the regret,

spells out the advantages of using a parametric approach versus a nonparametric one. The down-

side, namely, model misspecification and its consequences, is illustrated and discussed in Section

6.

At an intuitive level, classical estimation theory tells us that parameter uncertainty cannot be

resolved faster than rate $n^{-1/2}$ with n observations. This suggests that for the asymptotic regime we consider in this paper, no admissible pricing policy would be able to achieve a convergence rate faster than $n^{-1/2}$. This intuition is made rigorous in the next proposition.

Proposition 4 Let Assumption 1 hold with M, K satisfying M ≥ max{2Kp, Kp + x/T}. Then,

there exists a parametric family L(Θ) ⊂ L(M, K, K, m) satisfying Assumption 2 such that for some

positive constant C ′ and for all admissible policies {πn}

$$\sup_{\theta \in \Theta} R^{\pi}_n(x, T; \theta) \ge \frac{C'}{n^{1/2}}, \tag{18}$$

for all n ≥ 1.

This result follows in a relatively straightforward manner from Proposition 2.

5.4 An optimal algorithm when a single parameter is unknown

Given the lower bound in (18), the natural question that arises is whether it is possible to close the

remaining gap with respect to the upper bound in Proposition 3. We study this question in the

context of a single unknown parameter, i.e., k = 1 and hence Θ ⊆ R. Consider the following ℓ-step

policy π(ℓ, ∆(1), . . . , ∆(ℓ)).


Algorithm 3: $\pi(\ell, \Delta^{(1)}, \ldots, \Delta^{(\ell)})$

Step 1. Initialization:

(a) Set the number of steps to be ℓ and define $\Delta^{(i)}$, i = 1, . . . , ℓ so that $\Delta^{(1)} + \cdots + \Delta^{(\ell)} = T$.

(b) Choose a price $p_1 \in [\underline{p}, \overline{p}]$ as in Assumption 2(i.).

Step 2. Learning/Optimization/Pricing:

Set $t_1 = 0$.

For i = 1, ..., ℓ,

(a) Learning/Pricing:

i.) Apply $p_i$ on the interval $[t_i, t_i + \Delta^{(i)})$ as long as the inventory is positive. If no more units are in stock, apply $p_\infty$ up until time T and STOP.

ii.) Compute

$$d_i = \frac{\text{total demand over } [t_i, t_i + \Delta^{(i)})}{\Delta^{(i)}}.$$

iii.) Set $t_{i+1} = t_i + \Delta^{(i)}$.

iv.) Let $\theta_i$ be the unique solution of $\lambda(p_i; \theta) = d_i$.

(b) Optimization:

$p^u(\theta_i) = \arg\max\{p\lambda(p; \theta_i) : p \in [\underline{p}, \overline{p}]\}$,

$p^c(\theta_i) = \arg\min\{|\lambda(p; \theta_i) - x/T| : p \in [\underline{p}, \overline{p}]\}$,

$p_{i+1} = \max\{p^u(\theta_i), p^c(\theta_i)\}$.

End For

The intuition underlying Algorithm 3 is as follows. When a single parameter is unknown, one

can infer information about the parameter from observations of demand at a single price. Given

this, the idea of Algorithm 3 is to price “close” to pD after the first stage, but to continue learning.

In particular, the estimate of pD is improved from stage to stage using the demand observations

from the previous stage. As a result, losses are mitigated by two effects: i.) after stage 1, the

price is always close to pD; and ii.) the estimate of pD becomes more precise. For what follows, we

slightly strengthen Assumption 2(i.)

Assumption 3 $\inf_{p \in [\underline{p}, \overline{p}]} \inf_{\theta \in \Theta} \lambda(p; \theta) > l_0$ and for any price $p \in [\underline{p}, \overline{p}]$ and any d ≥ 0, the equation λ(p; ·) = d has a unique solution. If g(p, d) denotes this solution, then g(p, ·) is Lipschitz continuous with constant α > 0.


We now analyze the performance of policies associated with Algorithm 3. In particular suppose

that the initial inventory level and the demand function are scaled according to (11), and define

the sequence of tuning parameters $\{\ell_n, \Delta_n^{(1)}, \ldots, \Delta_n^{(\ell_n)}\}$ as follows:

$$\ell_n = (\log 2)^{-1} \log\log n, \tag{19}$$

$$\Delta_n^{(m)} = \beta_n\, n^{(a_{\ell_n}/a_m) - 1}, \quad m = 1, \ldots, \ell_n, \tag{20}$$

where $a_m = 2^{m-1}/(2^m - 1)$ for m ≥ 1, and $\beta_n > 0$ is a normalizing constant chosen so that $\Delta_n^{(1)} + \cdots + \Delta_n^{(\ell_n)} = T$. We then have the following result.

Proposition 5 Let Assumptions 1, 2 and 3 hold. Let $\{\ell_n, \Delta_n^{(1)}, \ldots, \Delta_n^{(\ell_n)}\}$ be defined as in (19)-(20) and put $\pi_n := \pi(\ell_n, \Delta_n^{(1)}, \ldots, \Delta_n^{(\ell_n)})$, defined by Algorithm 3. Then the sequence of policies $\{\pi_n\}$ is asymptotically optimal and satisfies

$$\sup_{\theta \in \Theta} R^{\pi}_n(x, T; \theta) = O\left( \frac{(\log\log n)(\log n)^{1/2}}{n^{1/2}} \right). \tag{21}$$
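The stage lengths in (19)-(20) can be computed directly, which makes their shape easier to see. The sketch below is illustrative only: (19) gives a real number, so the code rounds it down and uses at least one stage, and that rounding rule, like the function names, is an assumption rather than something stated in the text.

```python
import math

def a(m):
    """a_m = 2^(m-1) / (2^m - 1), as defined after (20)."""
    return 2 ** (m - 1) / (2 ** m - 1)

def stage_lengths(n, T=1.0):
    """Stage lengths Delta_n^(m), m = 1..l_n, normalized to sum to T (needs n >= 2)."""
    # (19) is real-valued; rounding down with a floor of one stage is an assumption.
    l_n = max(1, math.floor(math.log(math.log(n)) / math.log(2)))
    raw = [n ** (a(l_n) / a(m) - 1.0) for m in range(1, l_n + 1)]
    beta_n = T / sum(raw)                # normalizing constant beta_n
    return [beta_n * r for r in raw]

if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6):
        print(n, [round(d, 4) for d in stage_lengths(n)])
```

Since $a_m$ decreases in m, the exponent in (20) increases toward 0, so the first stage is the shortest and the last stage takes up a constant fraction of the horizon.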

Note that the regret of the sequence of policies introduced in Proposition 5 achieves the lower

bound spelled out in Proposition 4 (up to logarithmic terms). In that sense, these policies cannot

be improved upon. On the other hand, Algorithm 3 is restricted to the case where only one

parameter is unknown, and exploits the fact that in this setting it is possible to learn the single

unknown parameter by conducting price experiments in the neighborhood of the near-optimal price

pD. Designing policies that achieve the lower bound in Proposition 4 in the multi-parameter case

remains an open question.

6 Numerical Results and Qualitative Insights

6.1 Performance of the parametric/nonparametric policies and the “price” of

uncertainty

We examine the performance of three policies developed in the previous sections: i.) The nonparametric policy defined in Algorithm 1; ii.) The parametric policy defined by means of Algorithm 2 and designed for a finite number of unknown parameters; and iii.) The parametric policy defined in Algorithm 3, designed for cases with a single unknown parameter. (The tuning parameters are taken as in Propositions 1, 3 and 5, respectively.)

The performance of these policies is measured by the magnitude of the regret. Note that $R^{\pi}_n(x, T; \lambda) \approx C/n^{\gamma}$ implies that $\log R^{\pi}_n(x, T; \lambda)$ should be approximately linear in log n with slope −γ. In Figure 1, we depict $R^{\pi}_n(x, T; \lambda)$ as a function of n in a log-log plot for large values of n, and compute the best (least squares) linear fit for these values. The results depicted are based on running $10^3$ independent simulation replications from which the performance indicators were derived by averaging. The standard error for $R^{\pi}_n(x, T; \lambda)$ was below 0.05% in all cases.
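The slope estimate described above is an ordinary least squares fit of log regret on log n. A minimal sketch follows; the data in the usage example are synthetic values generated from C/n^γ with C = 0.8 and γ = 1/4 (an assumption made purely to exercise the code, not results from the paper), and the function name is hypothetical.

```python
import math

def loglog_slope(ns, regrets):
    """Least squares fit of log(regret) = intercept + slope * log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(r) for r in regrets]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

if __name__ == "__main__":
    ns = [10 ** k for k in range(2, 7)]
    synthetic = [0.8 * n ** -0.25 for n in ns]   # synthetic regret values
    slope, intercept = loglog_slope(ns, synthetic)
    print(round(slope, 3), round(math.exp(intercept), 3))
```

On simulated regret values the fitted slope estimates −γ and the exponential of the intercept estimates the constant C.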


Figure 1(a) summarizes results for an underlying exponential demand model λ(p) = θ exp(−.5p),

where θ = 10 exp(1), and Figure 1(b) presents results for an underlying linear demand model

λ(p) = 30 − θp where θ = 3. The nonparametric algorithm does not make any assumptions with

regard to the structure of the demand function (beyond those spelled out in Section 4), while the

parametric algorithms are assumed to know the parametric structure but the true value of θ is

not revealed to them. In both cases, the initial normalized inventory level was x = 20, the selling

horizon was T = 1 and the feasible price set was $[\underline{p}, \overline{p}] = [0.1, 10]$.

[Figure 1 plot area: two log-log panels, (a) and (b), of the regret Rπn(x, T; λ) (vertical axis) against n (horizontal axis, 10^2 to 10^6). Fitted slopes annotated in the plots: −0.25, −0.33, −0.48 and −0.24, −0.37, −0.54.]

Figure 1: Performance of pricing policies as a function of the market size (n). Stars show the

performance of the nonparametric policy defined in Algorithm 1; dots represent the performance of the parametric policy defined in Algorithm 2; and squares depict the performance of the parametric

policy defined in Algorithm 3. The dashed lines represent the best linear fit to each set of points;

in panel (a), the demand function is exponential and in panel (b) linear.

Discussion. The slopes of the best linear fit in Figures 1(a) and 1(b) are very close to −1/4, −1/3 and −1/2 (i.e., γ = 1/4, γ = 1/3 and γ = 1/2), the values predicted by the upper bounds in (13), (17) and (21), respectively. These results provide a “picture proof” of Propositions 1, 3 and 5. As is evident, the less structure is

assumed a priori, the higher the profit loss relative to the full information benchmark: informally

speaking, this is the “price” paid due to increasing uncertainty with regard to the demand model.

An additional fundamental difference between the various policies concerns the degree to which

the price domain is explored. The nonparametric policy (Algorithm 1) essentially needs to explore

the “entire” price domain; the general parametric policy (Algorithm 2) needs to test a number of

prices equal to the number of unknown parameters; and the parametric policy designed for the case


of a single unknown parameter (Algorithm 3) explores only prices in the vicinity of the price pD.

In other words, as uncertainty “decreases,” the price exploration region shrinks as well.

6.2 The “price” of hedging misspecification risk

We have seen in the previous section that more refined information regarding the demand model

yields higher revenues. Thus, it becomes tempting to postulate parametric structure, which in

turn may be incorrect relative to the true underlying demand model. This misspecification risk

can be eliminated via nonparametric pricing policies, but at the price of settling for more modest

performance revenue-wise. We now provide an illustration of this trade-off.

We fix the time horizon to be T = 1 and the set of feasible prices to be $[\underline{p}, \overline{p}] = [0.1, 10]$. Figure

2 depicts the regret Rπn(x, T ; λ) for two underlying demand models, various values of n, and three

policies. The first policy assumes an exponential parametric structure for the demand function,

$\lambda(p; \theta) = \theta_1 \exp(-\theta_2 p)$ with $\theta = (\theta_1, \theta_2)$, and applies Algorithm 2 with $\tau_n = n^{-1/3}$. The second policy assumes a linear parametric structure $\lambda(p; \theta) = (\theta_1 - \theta_2 p)^+$ and applies Algorithm 2 with $\tau_n = n^{-1/3}$. Finally, the third policy is the one given by the nonparametric method described

in Algorithm 1 (with tuning parameters as in Proposition 1). The point of this comparison is

to illustrate that the parametric algorithm outperforms its nonparametric counterpart when the

assumed parametric model is consistent with the true demand function, otherwise the parametric

algorithm leads to a regret that does not converge to zero due to a model misspecification error.

In Figures 2(a),(b) the underlying demand model is given by λ(p) = a exp(−αp) with a =

10 exp(1) and α = 1, and hence the policy that assumes the exponential structure (squares) is

well specified, while the policy that assumes a linear demand function (crosses) suffers from model

misspecification.

In Figures 2(c),(d) the underlying demand model is given by λ(p) = (a − αp)+ with a = 30

and α = 3. Now the policy that assumes the exponential structure (squares) corresponds to a

misspecified case, while the policy that assumes a linear demand function (crosses) corresponds to

a well specified case.

In Figures 2(a),(c) the normalized inventory is x = 8 and in Figures 2(b),(d) it is taken to

be x = 20. The results depicted in the figures are based on running $10^3$ independent simulation

replications from which the performance indicators were derived by averaging. The standard error

was below 0.9% in all cases.

Discussion. We focus on Figures 2(a),(b) as similar remarks apply to Figures 2(c),(d). First,

note that the parametric algorithm is able to achieve close to 90% of the full information revenues

when the parametric model is well specified, when the market size is n = 100 or more. In addition,

the convergence of the regret for Algorithm 2 is faster for small n under the well specified parametric

assumption (squares) in comparison to the nonparametric algorithm (stars). [We note that the non-monotonic performance of the nonparametric algorithm (stars) stems from the fact that the price grid used by this algorithm changes with n, and consequently the minimal distance of the sought fixed price to any price on the grid need not be monotonic with respect to n.]

[Figure 2 plot area: four panels, (a)-(d), of the regret Rπn(x, T; λ) (vertical axis, 0 to 1) against n (horizontal axis, 10^1 to 10^4). In each panel, arrows label the curves as [np], [p] misspecified, and [p] well specified.]

Figure 2: Performance of the parametric (Algorithm 2) and nonparametric (Algorithm 1) policies as a function of the market size: the parametric algorithm that assumes a linear model (crosses) and exponential model (squares); and the nonparametric algorithm (stars). The underlying demand model is exponential in Figures (a) and (b) and linear in Figures (c) and (d). Arrows indicate if the approach used is nonparametric [np] or parametric [p], and whether the model is well specified or misspecified in the case of the parametric algorithm.

In contrast, if the model is misspecified, the regret Rπn(x, T ; λ) converges to a strictly positive

value (crosses), as expected, and hence the performance of the parametric algorithm fails to asymp-

totically achieve the full information revenues. The takeaway message here is that a nonparametric

approach eliminates the risk stemming from model misspecification but at a price of extracting lower


revenues than its parametric counterpart. The regret bounds derived in earlier sections precisely

spell out the magnitude of this “price” of misspecification.

6.3 Numerical analysis of the bounds in Propositions 1 and 3

The results of Propositions 1 and 3 derive upper bounds on the regret that involve constants whose

values have not been specified. Ideally, one would like to characterize the smallest possible constants

for which the upper bounds hold, in order to determine the non-asymptotic behavior of the policies.

While an exact analysis along these lines is essentially intractable, it is possible to investigate this

point numerically and assess the “typical” magnitude of these constants.

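Fix T = 1 and $[\underline{p}, \overline{p}] = [5, 10]$, and consider two families of demand functions: the first corresponds to an exponential class $\mathcal{L}_1 = \{a \exp\{-\alpha p\} : a \in [\underline{a}, \overline{a}], \alpha \in [\underline{\alpha}, \overline{\alpha}]\}$ where $[\underline{a}, \overline{a}] = [5, 10]$ and $[\underline{\alpha}, \overline{\alpha}] = [0.1, 0.2]$; the second corresponds to a linear class $\mathcal{L}_2 = \{(b - \beta p)^+ : b \in [\underline{b}, \overline{b}], \beta \in [\underline{\beta}, \overline{\beta}]\}$ where $[\underline{b}, \overline{b}] = [10, 20]$ and $[\underline{\beta}, \overline{\beta}] = [0.2, 1]$. Note that all of the functions in the union of $\mathcal{L}_1$ and $\mathcal{L}_2$ are bounded by $M = \max\{\overline{a} \exp\{-\underline{\alpha}\,\underline{p}\}, \overline{b} - \underline{\beta}\,\underline{p}\} = 19$, are $\overline{K}$-Lipschitz with $\overline{K} = \max\{\overline{a}\,\overline{\alpha} \exp\{-\underline{\alpha}\,\underline{p}\}, \overline{\beta}\} = 1.21$, and have a first derivative bounded below by $\underline{K} = \min\{\underline{a}\,\underline{\alpha} \exp\{-\overline{\alpha}\,\overline{p}\}, \underline{\beta}\} = 0.067$. Similarly, the maximum revenue rate is bounded below by $m = \min\{\underline{p}\,\underline{a} \exp\{-\overline{\alpha}\,\overline{p}\}, \underline{p}(\underline{b} - \overline{\beta}\,\underline{p})\} = 3.38$.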

To estimate the constants characterizing the revenue losses of the parametric and nonparametric

policies we proceed as follows. We draw 100 demand functions from each class of demand models

above, using a uniform distribution over each interval defining the space of parameters. For each

function, we estimate the regret associated with the proposed nonparametric (Algorithm 1) and

parametric (Algorithm 2) policies for market sizes of $n = 10^2$, $n = 10^3$ and $n = 10^4$, and for two initial (normalized) inventory levels (x = 5, x = 10). This step is executed as in the previous subsections. For each draw and each market size, we compute the value of $R^{\pi}_n(x, T; \lambda)\, n^{\gamma}$, where γ = 1/4 for the nonparametric policy, and γ = 1/3 for the parametric one. Taking the maximum

of these values over all 100 draws, we arrive at an estimate, Cmax, for each market size, which

corresponds to the worst case constant observed for all models. This value serves as a proxy for

the magnitude of the constants characterizing the performance of the proposed algorithms. Results

are summarized in Table 1.

The reasonable magnitude of the constants (all fall in the range of 0.5-1.3) suggests that our

policies should perform well in practical settings. In particular, based on these results, we expect

the nonparametric policy to achieve at most a regret of 35% for $n = 10^2$, 23% for $n = 10^3$ and 14% for $n = 10^4$, while the parametric policy should achieve a regret of 24% for $n = 10^2$, 12% for $n = 10^3$ and 6% for $n = 10^4$.
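The percentages quoted above can be traced back to Table 1 by evaluating $C_{max}\, n^{-\gamma}$ with the largest constant reported for each policy and market size. The sketch below does this; the dictionary layout and function name are hypothetical, and the observation that the quoted figures match these values rounded up to the next whole percent is an inference from the numbers, not something the text states.

```python
import math

# Largest C_max over both demand classes and both inventory levels (Table 1).
C_MAX = {
    "nonparametric": {10**2: 1.09, 10**3: 1.25, 10**4: 1.31},
    "parametric": {10**2: 1.11, 10**3: 1.11, 10**4: 1.11},
}
GAMMA = {"nonparametric": 1 / 4, "parametric": 1 / 3}

def implied_regret(policy, n):
    """Regret level C_max * n^(-gamma) implied by the worst observed constant."""
    return C_MAX[policy][n] * n ** (-GAMMA[policy])

if __name__ == "__main__":
    for policy in C_MAX:
        for n in (10**2, 10**3, 10**4):
            value = implied_regret(policy, n)
            print(f"{policy:>13}, n={n:>5}: {100 * value:.1f}% "
                  f"(rounded up: {math.ceil(100 * value)}%)")
```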


demand functions          L1 (Cmax)                 L2 (Cmax)
market size (n)      10^2    10^3    10^4      10^2    10^3    10^4
x = 5,   πnp         0.92    1.04    1.07      1.09    1.25    1.31
         πp          0.79    0.79    0.79      1.11    1.11    1.11
x = 10,  πnp         0.66    0.75    0.77      0.74    0.85    0.87
         πp          0.61    0.53    0.39      0.58    0.60    0.54

Table 1: Estimates of the magnitude of the constants characterizing the performance of the algo-

rithms: πnp corresponds to the nonparametric policy; and πp to the parametric one.

7 Sequential Learning Strategies in the Nonparametric Setting

The performance guarantees associated with Algorithm 1 rely heavily on how well one is able to

estimate the two prices pu and pc arising from the solution of the deterministic relaxation (5).

In Algorithm 1, this estimation is executed in a single stage over the interval [0, τ ]. A natural

extension would be to consider policies in which the learning phase searches for the optimal price

in a sequential and dynamic manner.

Recall the definitions of $p^u$ and $p^c$, the unconstrained and constrained solutions of the deterministic relaxation: $p^u = \arg\max_{p \in [\underline{p}, \overline{p}]} \{r(\lambda(p))\}$, $p^c = \arg\min_{p \in [\underline{p}, \overline{p}]} |\lambda(p) - x/T|$. Suppose for simplicity that the two critical prices belong to the interior of the feasible price set $[\underline{p}, \overline{p}]$. Then $p^u$ is the

unique maximizer of the revenue function and pc is the x/T -crossing of the demand function, which

is assumed to be decreasing. Using this interpretation, one could use stochastic approximation

schemes to search for the two prices rather than via the estimation of the demand function over

the whole price domain (as in Algorithm 1). We detail below how such schemes could be designed.

Background on stochastic approximations. Suppose one wants to estimate the maximizer

$z^*$ of a real valued function M(z) on a domain $[\underline{z}, \overline{z}]$, and that the available observations are in the form $H(z) = M(z) + \epsilon(z)$, where ε(z) is a stochastic noise term with $E[\epsilon(z)] = 0$ and $E[\epsilon^2(z)] < \infty$ for all $z \in [\underline{z}, \overline{z}]$. The basic Kiefer-Wolfowitz (KW) scheme estimates $z^*$ using the recursion

$$z_{k+1} = z_k + a_k\, \frac{y_{2k} - y_{2k-1}}{c_k}, \quad k = 1, 2, \ldots, \tag{22}$$

where $y_{2k} = H(z_k + c_k)$ and $y_{2k-1} = H(z_k - c_k)$, and $\{a_k\}$ and $\{c_k\}$ are predefined deterministic real-valued positive sequences. Under suitable technical conditions on the function M(·) and the sequences $\{a_k\}$ and $\{c_k\}$, Kiefer and Wolfowitz (1952) showed that the sequence $z_k \to z^*$ in probability as k → ∞. Similar ideas lead to a sequence $\{z_k\}$ that converges to an α-crossing {z : M(z) = α} via the Robbins-Monro (RM) procedure (see Robbins and Monro (1951)), viz.

$$z_{k+1} = z_k + a_k(\alpha - y_k), \tag{23}$$


where yk = H(zk). For further details and other variants the reader is referred to Benveniste,

Metivier, and Priouret (1990).
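The KW recursion (22) is short enough to sketch directly. The example below is a minimal illustration under several assumptions that are not in the text: the objective is the revenue curve M(p) = p(10 − 2p), the noise is additive Gaussian, the sequences are a_k = 0.25/k and c_k = k^(−1/3), and the iterate is projected back onto [0.1, 4.5] after every step. All names are hypothetical.

```python
import random

def kiefer_wolfowitz(noisy_value, z_start, lo, hi, steps, a0=0.25):
    """Run recursion (22) with a_k = a0 / k and c_k = k^(-1/3),
    projecting each iterate onto [lo, hi]."""
    z = z_start
    for k in range(1, steps + 1):
        a_k = a0 / k
        c_k = k ** (-1.0 / 3.0)
        y_plus = noisy_value(z + c_k)     # y_{2k}   = H(z_k + c_k)
        y_minus = noisy_value(z - c_k)    # y_{2k-1} = H(z_k - c_k)
        z = z + a_k * (y_plus - y_minus) / c_k
        z = min(max(z, lo), hi)           # projection onto the feasible interval
    return z

def make_noisy_revenue(rng, sigma=1.0):
    """H(p) = M(p) + noise for the hypothetical objective M(p) = p * (10 - 2p)."""
    def noisy_revenue(p):
        return p * (10.0 - 2.0 * p) + rng.gauss(0.0, sigma)
    return noisy_revenue

if __name__ == "__main__":
    rng = random.Random(0)
    estimate = kiefer_wolfowitz(make_noisy_revenue(rng), 0.1, 0.1, 4.5, steps=2000)
    print(round(estimate, 2))
```

The chosen sequences satisfy the usual summability conditions (the sum of a_k diverges while the sums of a_k c_k and of (a_k/c_k)^2 converge), so the iterate settles near the maximizer 2.5 of the hypothetical objective.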

A stochastic approximation-based algorithm. A straightforward way to blend adaptive

search procedures into Algorithm 1 is given below. Here h(·) denotes the projection operator

$h(y) := \min\{\max\{y, \underline{p}\}, \overline{p}\}$.

Algorithm 4: $\pi^{SA}(\tau, \Delta, \{a_k\}, \{c_k\})$

Step 1. Initialization:

Set the learning interval to be [0, τ], and set ∆ to be the holding time for each price during that period.

Step 2. Learning:

(a) KW-type scheme:

Set $\kappa_1 = \lfloor \tau/(4\Delta) \rfloor$, $t_1 = 0$, $z_1 = p$ and $t_i = t_1 + (i-1)\Delta$, i = 2, . . .

For k = 1, ..., $\kappa_1$

i.) Apply $p_k = \min\{z_k + c_k, \overline{p}\}$ on $[t_{2k-2}, t_{2k-1})$ and compute $y_{2k} = (\text{total demand over } [t_{2k-2}, t_{2k-1}))/\Delta$

ii.) Apply $p_k = \max\{z_k - c_k, \underline{p}\}$ on $[t_{2k-1}, t_{2k})$ and compute $y_{2k-1} = (\text{total demand over } [t_{2k-1}, t_{2k}))/\Delta$

iii.) Update: $z_{k+1} = h\big(z_k + (a_k/c_k)(y_{2k} - y_{2k-1})\big)$

end For

Set $p^u = z_{\kappa_1+1}$

(b) RM-type scheme:

Set $\kappa_2 = \lfloor \tau/(2\Delta) \rfloor$, $t_1 = 2\kappa_1\Delta$, $z_1 = p$ and $t_i = t_1 + (i-1)\Delta$, i = 2, . . .

For k = 1, ..., $\kappa_2$

i.) Apply $p_k = \min\{z_k, \overline{p}\}$ on $[t_k, t_{k+1})$ and compute $y_k = (\text{total demand over } [t_k, t_{k+1}))/\Delta$

ii.) Update: $z_{k+1} = h(z_k + a_k(x/T - y_k))$

end For

Set $p^c = z_{\kappa_2+1}$

Step 3. Optimization: Set $p = \max\{p^u, p^c\}$.

Step 4. Pricing: Set $\tau' = (2\kappa_1 + \kappa_2)\Delta$. On the interval (τ′, T] apply p as long as inventory is positive, then apply $p_\infty$ for the remaining time.

Comments on the structure of the algorithm. The algorithm exploits the structure of

the single product problem, adaptively estimating each of the two prices pu and pc based on its

characterization as either a point of maximum or a level crossing; the former is executed in Step

2(a) and the latter in the step 2(b). This also requires two consecutive price testing phases to be

carried out. Note that in Step 2 above, the number of iterations is different for the KW and RM

procedures. This is due to the fact that a KW-based procedure uses two prices at each iteration

in order to estimate an improvement direction. (For brevity, in Step 2 it was left implicit that as

soon as inventory is depleted, the shut-off price p∞ is applied.)

Numerical results. We present a comparison between the performance of the policy defined by

Algorithm 1 with tuning parameters as in Proposition 1, and the policy given by Algorithm 4 that

uses stochastic approximations. We use $\pi^{SA}_1$ to denote the policy which uses Algorithm 4 taking $\tau_n = n^{-1/4}$ and $\Delta_n = n^{-1}$; this corresponds to the KW-scheme using $\lfloor (1/4) n^{3/4} \rfloor$ steps. Similarly, we let $\pi^{SA}_2$ denote the policy which corresponds to taking $\tau_n = n^{-1/4}$ and $\Delta_n = n^{-1/2}$. (The choice of $\tau_n$ is driven by our previous asymptotic considerations and is meant only for illustrative purposes.) The key difference between $\pi^{SA}_1$ and $\pi^{SA}_2$ is that in the latter there are fewer test prices being used, and the observations for a given price are less noisy since each price is held for a longer time interval.

Table 2 presents results for the three policies discussed above and for two normalized initial

inventory levels (x); if the problem size is n and the total inventory is y then x = y/n. For

each policy we record the regret Rπn(x, T ; λ), and the number of prices κ that are used during the

interval [0, T ]. To generate the results in Table 2 we take the underlying demand function to be

linear, $\lambda(p) = (10 - 2p)^+$ and the set of feasible prices is taken to be $[\underline{p}, \overline{p}] = [0.1, 4.5]$. In the

first case where x = 3, and hence pc = 3.5 and pu = 2.5, the constrained price pc is optimal in

the deterministic relaxation (5). In the second case, there is a larger initial normalized inventory

x = 8 and the prices of interest are given by pc = 1 and pu = 2.5, hence the unconstrained price is

optimal in the deterministic relaxation.
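The two critical prices quoted for this example can be reproduced with a simple grid search over the feasible price set. The sketch below is only a check of the arithmetic; the grid resolution and the function names are hypothetical choices.

```python
def critical_prices(demand, x, T, p_lo, p_hi, grid_size=4400):
    """Grid search for p^u (revenue maximizer) and p^c (x/T-crossing);
    returns (p_u, p_c, max(p_u, p_c))."""
    prices = [p_lo + (p_hi - p_lo) * i / grid_size for i in range(grid_size + 1)]
    p_u = max(prices, key=lambda p: p * demand(p))
    p_c = min(prices, key=lambda p: abs(demand(p) - x / T))
    return p_u, p_c, max(p_u, p_c)

def linear_demand(p):
    """Demand function of the example: lambda(p) = (10 - 2p)^+."""
    return max(10.0 - 2.0 * p, 0.0)

if __name__ == "__main__":
    for x in (3.0, 8.0):
        p_u, p_c, p_d = critical_prices(linear_demand, x, 1.0, 0.1, 4.5)
        print(f"x={x}: p_u={p_u:.3f}, p_c={p_c:.3f}, selected price={p_d:.3f}")
```

For x = 3 the search selects the constrained price near 3.5, and for x = 8 it selects the unconstrained price near 2.5, in line with the two cases described above.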

The results depicted in the table are based on running 500 independent simulation replications from which the performance indicators were derived by averaging. The standard error for $R^{\pi}_n(x, T; \lambda)$ was below 0.8% in all cases. (We note that numerical experiments with other simple demand models give rise to similar results and therefore, for space considerations, we only report on the linear demand example.)

Problem “size”          n = 10^2           n = 10^3           n = 10^4
                     regret     κ       regret     κ       regret     κ
x = 3,  π             .44       5        .19       7        .12       12
        π^SA_1        .25       33       .12       180      .05       1002
        π^SA_2        .27       12       .12       18       .05       33
x = 8,  π             .86       5        .08       7        .04       12
        π^SA_1        .28       33       .17       180      .10       1002
        π^SA_2        .22       12       .08       18       .01       33

Table 2: Performance of stochastic approximation based schemes. The regret column reports $R^{\pi}_n(x, T; \lambda)$, the regret associated with a given policy, and x is the normalized inventory level. Here π is the basic nonparametric pricing policy defined by Algorithm 1, $\pi^{SA}_1$ and $\pi^{SA}_2$ correspond to two implementations of Algorithm 4, and κ = number of prices used by a given policy.

Discussion. A quick inspection of the results in Table 2 leads to the following observations. First, all three policies are seen to have comparable performance, achieving a regret which is smaller than 20% for problems where the inventory is of the order of a thousand; this translates to in excess of 80% of the optimal full information revenues. It is also interesting to contrast the structure of the two adaptive policies $\pi^{SA}_1$ and $\pi^{SA}_2$. We observe that the former uses significantly more prices than the latter (180 prices for a problem with $n = 10^3$, compared with 18 for $\pi^{SA}_2$) and hence is not very practical to implement. Considering that the performance of $\pi^{SA}_1$ and $\pi^{SA}_2$ is comparable, it seems that choosing the length ∆ so that observations are less noisy is potentially preferable, even though fewer price changes are allowed. The nonparametric policy π uses less than half of the price changes that $\pi^{SA}_2$ utilizes, and is overall the simplest to implement.

A detailed performance analysis of Algorithm 4 is beyond the scope of the paper. However,

based on convergence rate results associated with stochastic approximation schemes (see Polyak

and Tsybakov (1990) and Polyak and Juditsky (1990)), we expect that the policies associated with

Algorithm 4 would achieve at best a regret of order $R_n \approx n^{-1/3}$.

In Algorithm 4, we use a KW scheme that seeks an estimate of pu (the unconstrained maximizer

of r(·)) and an RM-based scheme that seeks an estimate of pc (the inventory-constrained “run

out” price). Only after these estimates are obtained, by running each stochastic approximation

procedure separately, can one attempt to deduce which of the two prices is “near-optimal” for

the given problem. An interesting direction is to investigate methods for finding an estimate of

the optimal price of the deterministic relaxation using a single combined procedure. Stochastic

approximations for constrained problems exist (see, e.g., Kushner and Sanvicente (1975)), however,

to the best of our knowledge there are still open questions with regard to their performance and

rates of convergence.


A Proofs of Main Results

Preliminaries and notation. For any real number $x$, $x^+$ will denote $\max\{x, 0\}$. $C_1, C_2, \ldots$ will be used to denote positive constants which are independent of a given demand function, but may depend on the parameters of the class $\mathcal{L}$ of admissible demand functions and on $x$ and $T$. A sequence $\{a_n\}$ of real numbers is said to increase to infinity at a polynomial rate if there exists a constant $\beta > 0$ such that $\liminf_{n\to\infty} a_n / n^{\beta} > 0$. To lighten the notation, we will occasionally omit the arguments of $J^{\pi}_n(x, T; \lambda)$ and $J^D_n(x, T|\lambda)$. In what follows we will make use of the following facts.

Fact 1. Recall the definition of $J^D(x, T|\lambda)$, the optimal value of the deterministic relaxation (5). First note that $J^D_n = nJ^D$. We will also use the fact that

$$\inf_{\lambda \in \mathcal{L}} J^D(x, T|\lambda) \ge m^D,$$

where $m^D = mT' > 0$ and $T' = \min\{T, x/M\}$. Indeed, for any $\lambda \in \mathcal{L}$, there is a price $q \in [\underline{p}, \overline{p}]$ such that $r(q) \ge m$. Consider the policy that applies $q$ on $[0, T']$ and then applies $p_\infty$ up until $T$. This solution is feasible since $\lambda(q)T' \le MT' \le x$. In addition the revenues generated from this policy are at least $mT'$.

We provide below a result that generalizes Gallego and van Ryzin (1994, Proposition 2) and that

is used throughout the proofs of the main results. Its proof can be found in Appendix B.

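Lemma 1 The solution to problem (5) is given by $p(s) = p^D := \max\{p^u, p^c\}$ for $s \in [0, T']$ and $p(s) = p_\infty$ for $s > T'$, where $p^u = \arg\max_{p \in [\underline{p}, \overline{p}]}\{r(\lambda(p))\}$, $p^c = \arg\min_{p \in [\underline{p}, \overline{p}]} |\lambda(p) - x/T|$ and $T' = \min\{T, x/\lambda(p^D)\}$. In addition, for all $\lambda \in \mathcal{L}$, the optimal value of (5), $J^D(x, T|\lambda)$, serves as an upper bound on $J^{\pi}(x, T; \lambda)$ for all $\pi \in \mathcal{P}$.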
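The structure described in Lemma 1 can be illustrated numerically. The following is a minimal sketch, not the paper's procedure: it approximates the two prices by a grid search (the grid size and all function names are choices made here for illustration) and then evaluates the stopping time and the deterministic value as the revenue rate at the chosen price multiplied by the time that price is applied.

```python
def deterministic_relaxation(demand, p_lo, p_hi, x, T, grid_size=4401):
    """Grid approximation of (p_D, T', J_D) from Lemma 1.

    `demand` maps a price to a mean demand rate. The grid search is an
    illustrative stand-in for the exact arg max / arg min in the lemma.
    """
    prices = [p_lo + (p_hi - p_lo) * i / (grid_size - 1) for i in range(grid_size)]
    # p_u: maximizer of the revenue rate p * lambda(p) on the grid.
    p_u = max(prices, key=lambda p: p * demand(p))
    # p_c: price whose demand rate is closest to x / T on the grid.
    p_c = min(prices, key=lambda p: abs(demand(p) - x / T))
    p_d = max(p_u, p_c)
    rate = demand(p_d)
    # Apply p_d until the horizon ends or the inventory runs out.
    t_prime = T if rate <= 0 else min(T, x / rate)
    return p_d, t_prime, p_d * rate * t_prime


def linear(p):
    # Linear demand used in the numerical section of the paper.
    return max(10.0 - 2.0 * p, 0.0)


for x in (3.0, 8.0, 0.5):
    print(x, deterministic_relaxation(linear, 0.1, 4.5, x, 1.0))
```

The case x = 0.5 illustrates the situation where even the highest feasible price depletes the inventory before T, so that the price is switched off at T' < T and the value equals x times the maximum price.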

Before proceeding, we state the following lemma which is needed throughout the analysis and

whose proof is deferred to Appendix B.

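Lemma 2 Suppose that $\mu \in [0, M]$ and $r_n \ge n^{\beta}$ with $\beta > 0$. If $\epsilon_n = 2\eta^{1/2} M^{1/2} (\log n)^{1/2} r_n^{-1/2}$, then for all $n \ge 1$

$$P\big(N(\mu r_n) - \mu r_n > r_n \epsilon_n\big) \le \frac{C}{n^{\eta}}, \qquad P\big(N(\mu r_n) - \mu r_n < -r_n \epsilon_n\big) \le \frac{C}{n^{\eta}},$$

for some suitably chosen constant $C > 0$.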

This lemma bounds the deviations of a Poisson process from its mean. It will be used to control

for the estimates of the demand function evaluated at given prices.
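To get a feel for the upper-tail bound in Lemma 2, one can compare an exact Poisson tail probability with the optimized Chernoff bound and with the target rate n^{-η}. The sketch below is a numerical illustration only; the parameter values and function names are assumptions chosen here, and the tail is summed in log space to avoid overflow.

```python
import math


def poisson_upper_tail(mean, threshold):
    """P(Z > threshold) for Z ~ Poisson(mean), assuming threshold >= mean.

    Beyond the mean the pmf terms decrease, so the sum is stopped once a
    term no longer changes the running total at double precision.
    """
    k = math.floor(threshold) + 1
    total = 0.0
    while True:
        term = math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))
        total += term
        if term <= 1e-17 * total:
            return total
        k += 1


def lemma2_quantities(mu, M, eta, n, r_n):
    """Return (eps_n, exact tail, Chernoff bound, n^{-eta}) for mu > 0."""
    eps = 2.0 * math.sqrt(eta * M * math.log(n) / r_n)
    exact = poisson_upper_tail(mu * r_n, mu * r_n + r_n * eps)
    # Chernoff bound evaluated at the optimal theta = log(1 + eps / mu).
    chernoff = math.exp(r_n * (eps - (mu + eps) * math.log(1.0 + eps / mu)))
    return eps, exact, chernoff, n ** (-eta)


print(lemma2_quantities(mu=5.0, M=10.0, eta=2.0, n=1000, r_n=1000.0))
```

In this configuration the deviation level is below one, the exact tail sits below the Chernoff bound, and the Chernoff bound sits below n^{-η}, which is the ordering the lemma relies on.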


A.1 Proofs of the results in Section 4

Proof of Proposition 1. Fix $\lambda \in \mathcal{L}$, $\eta = 2$. We consider the sequence of policies $\pi_n := \pi(\tau_n, \kappa_n)$, $n = 1, 2, \ldots$ defined by means of Algorithm 1. The proof is organized in four steps. The first step develops an expression for a lower bound on the expected revenues achieved by the proposed policy. This lower bound highlights the terms that need to be analyzed. The second step provides probabilistic bounds on the estimate $\hat{p}$ of the price $p^D$ and the associated revenue rate. This is done by controlling the deviations of a Poisson process from its mean (see Lemma 3). The third step finalizes the analysis of the lower bound. For this step, two cases have to be considered separately depending on the starting level of inventory. The key issue here is to control for stochastic fluctuations of customer requests at the estimated price $\hat{p}$. In the last step, we conclude the proof by combining the results from the two cases and plugging in the tuning parameters.

Let $\tau_n$ be such that $\tau_n \to 0$ and $n\tau_n \to \infty$ at a polynomial rate as $n \to \infty$. Let $\kappa_n$ be a sequence of integers such that $\kappa_n \to \infty$ and $n\Delta_n := n\tau_n/\kappa_n \to \infty$ at a polynomial rate. We divide the interval $[\underline{p}, \overline{p}]$ into $\kappa_n$ equal length intervals and we let $P_n = \{p_i, i = 1, \ldots, \kappa_n\}$ be the left endpoints of these intervals. Now partition $[0, \tau_n]$ into $\kappa_n$ intervals of length $\Delta_n$ and apply the price $p_i$ on the $i$th interval. Define

$$\hat{\lambda}(p_i) = \frac{N\big(\sum_{j=1}^{i} n\lambda(p_j)\Delta_n\big) - N\big(\sum_{j=1}^{i-1} n\lambda(p_j)\Delta_n\big)}{n\Delta_n}, \quad i = 1, \ldots, \kappa_n, \tag{A-1}$$

where $N(\cdot)$ is a unit rate Poisson process. Thus $\hat{\lambda}(p_i)$ denotes the number of product requests over successive intervals of length $\Delta_n$, normalized by $n\Delta_n$.

Step 1. Here, we derive a lower bound on the expected revenues under the policy $\pi_n$. Let $X^{(L)}_n = \sum_{i=1}^{\kappa_n} \lambda(p_i) n\Delta_n$, $X^{(P)}_n = \lambda(\hat{p}) n (T - \tau_n)$ and put $Y_n = N(X^{(L)}_n + X^{(P)}_n)$, $Y^{(L)}_n = N(X^{(L)}_n)$ and $Y^{(P)}_n = Y_n - Y^{(L)}_n$. $Y^{(L)}_n$ represents the maximum number of requests during the learning phase if the system would not run out of resources, $Y^{(P)}_n$ the maximum number of requests during the pricing phase and $Y_n$ the maximum total number of requests throughout the sales horizon. Now note that one can lower bound the revenues achieved by $\pi_n$ by those accumulated during the pricing phase, and during the latter the maximum number of units that can be sold is exactly $\min\{Y^{(P)}_n, (nx - Y^{(L)}_n)^+\}$. We deduce that

$$J^{\pi}_n \ge E\Big[\hat{p} \min\big\{Y^{(P)}_n, (nx - Y^{(L)}_n)^+\big\}\Big]. \tag{A-2}$$

Next, we analyze the lower bound above by getting a handle on $\hat{p}$, the estimate of $p^D$, and the quantities $Y^{(P)}_n$ and $Y^{(L)}_n$.

Step 2. Here, we analyze the estimate $\hat{p}$ through the estimates $\hat{p}^u$ and $\hat{p}^c$. Recall the definition of

$$\hat{p}^u = \arg\max_{1 \le i \le \kappa_n} \{p_i \hat{\lambda}(p_i)\}, \qquad \hat{p}^c = \arg\min_{1 \le i \le \kappa_n} |\hat{\lambda}(p_i) - x/T|,$$

which are estimates of $p^u$ and $p^c$, respectively. We define the following quantity

$$u_n = (\log n)^{1/2} \max\big\{1/\kappa_n, 1/(n\Delta_n)^{1/2}\big\}, \tag{A-3}$$

that will be used throughout the analysis to quantify deviations of various quantities associated with the proposed policy from their full information counterparts. We provide below a result that characterizes the revenue rate at $\hat{p} = \max\{\hat{p}^c, \hat{p}^u\}$ as well as $\hat{p}^c$. The proof of this lemma can be found in Appendix B.

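Lemma 3 For some positive constants $C_1, C_2, C_3$, and for all $n \ge 1$,

$$P\big\{r(\lambda(p^D)) - r(\lambda(\hat{p})) > C_1 u_n\big\} \le \frac{C_3}{n^{\eta-1}}, \tag{A-4}$$

$$P\big\{|\hat{p}^c - p^c| > C_2 u_n\big\} \le \frac{C_3}{n^{\eta-1}}. \tag{A-5}$$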
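The learning phase analyzed in Steps 1 and 2 can be mimicked in a short simulation. The sketch below is an illustration under stated assumptions, not Algorithm 1 itself (which is not reproduced in this appendix): it ignores the inventory constraint during learning, draws Poisson counts by summing unit-rate exponential inter-arrival times, and uses hypothetical function names and parameter values.

```python
import random


def poisson_count(mean, rng):
    """Number of arrivals of a unit-rate Poisson process in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def learn_and_price(demand, p_lo, p_hi, x, T, n, tau, kappa, rng):
    """Estimate demand on a price grid as in (A-1), then pick max{p_u_hat, p_c_hat}."""
    delta = tau / kappa  # length of each experimentation interval
    # Left endpoints of kappa equal-length sub-intervals of [p_lo, p_hi].
    grid = [p_lo + (p_hi - p_lo) * i / kappa for i in range(kappa)]
    # Requests observed at each price, normalized by n * delta.
    est = [poisson_count(n * demand(p) * delta, rng) / (n * delta) for p in grid]
    p_u_hat = max(zip(grid, est), key=lambda pe: pe[0] * pe[1])[0]
    p_c_hat = min(zip(grid, est), key=lambda pe: abs(pe[1] - x / T))[0]
    return grid, est, max(p_u_hat, p_c_hat)


def linear(p):
    # Linear demand from the numerical section.
    return max(10.0 - 2.0 * p, 0.0)


grid, est, p_hat = learn_and_price(linear, 0.1, 4.5, 3.0, 1.0, 10_000, 0.1, 10,
                                   random.Random(0))
print(p_hat, [round(e, 2) for e in est])
```

With a coarse grid the selected price can differ from the full information price by up to one grid step plus estimation noise, which is the trade-off captured by the two terms inside the maximum in (A-3).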

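Step 3. In what follows we will separate two cases: $\lambda(\overline{p}) \le x/T$ and $\lambda(\overline{p}) > x/T$. The latter case is one where the decision-maker would on average run out of inventory during the sales horizon and has to be analyzed separately.

Case 1. Suppose first that $\lambda(\overline{p}) \le x/T$. Note that $\min\{Y^{(P)}_n, (nx - Y^{(L)}_n)^+\} = Y^{(P)}_n - (Y_n - nx)^+$. Hence, using (A-2) and the fact that $\hat{p} \le \overline{p}$, we have

$$\begin{aligned}
J^{\pi}_n &\ge E\big[\hat{p}\, Y^{(P)}_n\big] - \overline{p} E\big[(Y_n - nx)^+\big] \\
&= E\Big[E\big[\hat{p}\, N\big(\lambda(\hat{p}) n (T - \tau_n)\big) \,\big|\, \hat{p}\big]\Big] - \overline{p} E\big[(Y_n - nx)^+\big] \\
&= E[r(\lambda(\hat{p}))]\, n(T - \tau_n) - \overline{p} E\big[(Y_n - nx)^+\big].
\end{aligned} \tag{A-6}$$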

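Now,

$$\begin{aligned}
E[r(\lambda(\hat{p}))] &= r(\lambda(p^D)) \\
&\quad + E\big[r(\lambda(\hat{p})) - r(\lambda(p^D)) \,\big|\, r(\lambda(p^D)) - r(\lambda(\hat{p})) \le C_1 u_n\big]\, P\big\{r(\lambda(p^D)) - r(\lambda(\hat{p})) \le C_1 u_n\big\} \\
&\quad + E\big[r(\lambda(\hat{p})) - r(\lambda(p^D)) \,\big|\, r(\lambda(p^D)) - r(\lambda(\hat{p})) > C_1 u_n\big]\, P\big\{r(\lambda(p^D)) - r(\lambda(\hat{p})) > C_1 u_n\big\} \\
&\overset{(a)}{\ge} r(\lambda(p^D)) - C_1 u_n - \frac{C_4}{n^{\eta-1}},
\end{aligned} \tag{A-7}$$

where (a) follows from Lemma 3 and one can take $C_4 = 2\overline{p} M C_3$ (recall that $\lambda(\cdot)$ is bounded above by $M$ under Assumption 1(i) and hence $r(\cdot)$ is bounded above by $\overline{p}M$). We now turn to analyze the second term on the RHS of (A-6). The following result gives a probabilistic bound on the deviations of the random variable $Y_n$ from $nx$ and its proof is deferred to Appendix B.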

Lemma 4 For some positive constants $C_5, C_6$, and for all $n \ge 1$,

$$P\big(Y_n - nx > C_5 n u_n\big) \le \frac{C_6}{n^{\eta-1}}. \tag{A-8}$$

Using (A-8), we have

$$\begin{aligned}
E\big[(Y_n - nx)^+\big] &\le C_5 n u_n + E\big[(Y_n - nx)^+ ;\, Y_n - nx \ge C_5 n u_n\big] \\
&\overset{(a)}{\le} C_5 n u_n + (nx + C_5 n u_n + 1 + nM)\frac{C_6}{n^{\eta-1}} \\
&\le C_7 n u_n,
\end{aligned} \tag{A-9}$$

where $C_7$ is a suitably large positive constant and (a) follows from the fact that for a Poisson random variable $Z$ with mean $\mu$, $E[Z \mid Z > a] \le a + 1 + \mu$. We thus conclude from (A-7) and (A-9) that

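$$\begin{aligned}
J^{\pi}_n &\ge \Big[r(\lambda(p^D)) - C_1 u_n - \frac{C_3}{n^{\eta}}\Big] n(T - \tau_n) - \overline{p} C_7 n u_n \\
&\ge n r(\lambda(p^D)) T - C_8 n (u_n + \tau_n),
\end{aligned} \tag{A-10}$$

where $C_8$ is a suitably chosen constant. Now, from (A-10) and the definition of $m^D$ in Fact 1 in the preamble of the appendix, we have that

$$\frac{J^{\pi}_n}{J^D_n} = \frac{J^{\pi}_n}{nJ^D} \ge 1 - \frac{C_8}{m^D}(u_n + \tau_n). \tag{A-11}$$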

Case 2. Now suppose $\lambda(\overline{p}) > x/T$. In this case, note that $p^c = p^D = \overline{p}$. Recalling the lower bound expression in (A-2), we seek to get a handle on $\min\{Y^{(P)}_n, (nx - Y^{(L)}_n)^+\}$. The following result, whose proof can be found in Appendix B, says that the previous quantity will be "close" to the total units in inventory, $nx$, with very high probability.

Lemma 5 Define $A := \{\omega : \min\{Y^{(P)}_n, (nx - Y^{(L)}_n)^+\} \ge nx - C_9 u_n,\ |\hat{p} - p^D| \le C_2 u_n\}$. Then for some positive constants $C_9, C_{10}$, and for all $n \ge 1$,

$$P(A) \ge 1 - \frac{C_{10}}{n^{\eta-1}}. \tag{A-12}$$

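The revenues generated by $\pi_n$ can be bounded below as follows

$$\begin{aligned}
J^{\pi}_n &\ge E\Big[\hat{p} \min\big\{Y^{(P)}_n, (nx - Y^{(L)}_n)^+\big\}\Big] \\
&\ge E\Big[\hat{p} \min\big\{Y^{(P)}_n, (nx - Y^{(L)}_n)^+\big\} \,\Big|\, A\Big] P(A) \\
&\overset{(a)}{\ge} E\Big[(p^D - C_2 u_n) \min\big\{Y^{(P)}_n, (nx - Y^{(L)}_n)^+\big\} \,\Big|\, A\Big] P(A) \\
&\overset{(b)}{\ge} (p^D - C_2 u_n)(nx - C_9 n u_n)\Big(1 - \frac{C_{10}}{n^{\eta-1}}\Big) && \text{(A-13)} \\
&\ge p^D nx - n C_{11} u_n, && \text{(A-14)}
\end{aligned}$$

where both (a) and (b) follow from the definition of $A$ and (A-12) and where $C_{10} > 0$ is suitably large. Now, recall that since $\lambda(\overline{p}) > x/T$, $p^D = \overline{p}$ and $J^D_n = nx\overline{p}$. Hence

$$\frac{J^{\pi}_n}{J^D_n} = \frac{J^{\pi}_n}{nx\overline{p}} \ge 1 - \frac{C_{11}}{\overline{p}x} u_n. \tag{A-15}$$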

Step 4. Let $C_{12} = \max\{C_8/m^D, C_{11}/(\overline{p}x)\}$. Combining (A-11) and (A-15), we have for all $\lambda \in \mathcal{L}$, $R^{\pi}_n(x, T; \lambda) \le C_{12}(u_n + \tau_n)$, and note that $C_{12}$ does not depend on the specific function $\lambda \in \mathcal{L}$. Note that the choice of tuning parameters that minimizes the order of the upper bound on the regret is exactly $\tau_n \asymp n^{-1/4}$ and $\kappa_n \asymp n^{1/4}$. Plugging in, we get for some $C_{13} > 0$

$$\sup_{\lambda \in \mathcal{L}} R^{\pi}_n(x, T; \lambda) \le \frac{C_{13} (\log n)^{1/2}}{n^{1/4}}.$$

The proof is complete.
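The choice of tuning parameters in Step 4 can be checked by balancing the three terms that make up the bound. The following is a sketch of that arithmetic only; it ignores constants and logarithmic factors when balancing and treats $\kappa_n$ as a real number, both of which are simplifications made here.

```latex
% Using n\Delta_n = n\tau_n/\kappa_n in (A-3):
u_n + \tau_n
  = (\log n)^{1/2}\max\Big\{\frac{1}{\kappa_n},\ \Big(\frac{\kappa_n}{n\tau_n}\Big)^{1/2}\Big\} + \tau_n .
% Balance the three terms: \tau_n = 1/\kappa_n and (\kappa_n/(n\tau_n))^{1/2} = 1/\kappa_n.
% Substituting \tau_n = 1/\kappa_n into the second condition gives
\Big(\frac{\kappa_n^2}{n}\Big)^{1/2} = \frac{1}{\kappa_n}
  \;\Longrightarrow\; \kappa_n^4 = n
  \;\Longrightarrow\; \kappa_n = n^{1/4},\quad \tau_n = n^{-1/4}.
% With these choices n\Delta_n = n^{1/2}, so that
u_n = (\log n)^{1/2} n^{-1/4},
\qquad
u_n + \tau_n \le 2(\log n)^{1/2} n^{-1/4} \quad \text{for } n \ge 3 .
```

The last inequality uses $\log n \ge 1$ for $n \ge 3$, which is where the factor $(\log n)^{1/2}$ in the final regret bound comes from.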

Proof of Proposition 2. Since $J^{\pi}_n(x, T; \lambda) \le J^*_n(x, T|\lambda)$, the result will be established if we can prove that for some $C > 0$

$$\sup_{\lambda \in \mathcal{L}} \Big(1 - \frac{J^*_n(x, T|\lambda)}{J^D_n(x, T|\lambda)}\Big) > \frac{C}{n^{1/2}}. \tag{A-16}$$

Consider the demand function λ(p) = (a − bp)+ with b ∈ [K, K] and a = max{2Kp, Kp + x/T}.

Note in particular that λ ∈ L since a ≤ M . In addition, for this demand function we have

pu = pc = p. By Proposition 1 in Gallego and van Ryzin (1994), any dynamic pricing policy will

never price below pu, hence the optimal dynamic pricing policy is to price at pu until T or the time

when inventory is depleted. We deduce that

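$$\begin{aligned}
J^*_n(x, T|\lambda) &= p^u E[N(n\lambda(p^u)T)] - p^u E[(N(n\lambda(p^u)T) - nx)^+] \\
&= n r(p^u) T - p^u E[(N(nx) - nx)^+] \\
&= J^D_n(x, T|\lambda) - \overline{p} E[(N(nx) - nx)^+] \\
&= J^D_n(x, T|\lambda) - \overline{p}\, nx\, e^{-nx} \frac{(nx)^{nx}}{(nx)!}.
\end{aligned}$$

Hence,

$$\frac{J^D_n(x, T|\lambda) - J^*_n(x, T|\lambda)}{J^D_n(x, T|\lambda)} = \frac{1}{n\overline{p}\lambda(\overline{p})T}\, \overline{p}\, nx\, e^{-nx} \frac{(nx)^{nx}}{(nx)!}.$$

Using Stirling's approximation, $n^{1/2}(1 - J^*_n(x, T|\lambda)/J^D_n(x, T|\lambda)) \to C_1 > 0$ as $n \to \infty$ and hence (A-16) is established. This completes the proof.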
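The closed form used above for the expected excess of a Poisson variable over its (integer) mean, and its square-root growth via Stirling's approximation, can be checked numerically. The sketch below is an illustration only; the function names and the truncation length of the sum are choices made here.

```python
import math


def expected_excess_exact(m, extra=2000):
    """E[(Z - m)^+] for Z ~ Poisson(m), by direct summation of the pmf.

    The sum is truncated `extra` terms above the mean, which is far in the
    tail for the moderate values of m used here.
    """
    total = 0.0
    for k in range(m + 1, m + extra):
        log_pmf = k * math.log(m) - m - math.lgamma(k + 1)
        total += (k - m) * math.exp(log_pmf)
    return total


def expected_excess_closed_form(m):
    """m * e^{-m} * m^m / m!, evaluated in log space to avoid overflow."""
    return math.exp(math.log(m) - m + m * math.log(m) - math.lgamma(m + 1))


for m in (10, 100, 1000):
    closed = expected_excess_closed_form(m)
    print(m, expected_excess_exact(m), closed, closed / math.sqrt(m / (2 * math.pi)))
```

The last printed column is the ratio of the closed form to sqrt(m / (2 pi)); it approaches one as m grows, which is the square-root behaviour that yields the n^{-1/2} rate in (A-16).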

A.2 Proofs of the results in Section 5

In this section, we define for any θ ∈ Θ, pD(θ) := max{pu(θ), pc(θ)}, where pu(θ) = arg max{pλ(p; θ) :

p ∈ [p, p]}, pc(θ) = arg min{|λ(p; θ) − x/T | : p ∈ [p, p]}. In addition, we define Dλ(θ) = {l : l =

λ(p; θ) for some p ∈ [p, p]}.

Proof of Proposition 3. In what follows, we let $\theta^*$ denote the true parameter value. The proof follows the structure of the proof of Proposition 1. The first step provides a lower bound on the revenues achieved by the proposed policy. The second step bounds the difference between the estimated parameter vector and its true value and analyzes the revenues at $p^u(\hat{\theta})$, $p^c(\hat{\theta})$ and $\hat{p} = \max\{p^u(\hat{\theta}), p^c(\hat{\theta})\}$. Step 3 finalizes the analysis of the lower bound. The proof concludes with the resulting upper bound obtained for the regret.

Step 1. Here, we derive a lower bound on the expected revenues under the policy $\pi_n$. Let $X^{(L)}_n = \sum_{i=1}^{k} \lambda(p_i) n\Delta_n$, $X^{(P)}_n = \lambda(\hat{p}) n (T - \tau_n)$ and put $Y_n = N(X^{(L)}_n + X^{(P)}_n)$, $Y^{(L)}_n = N(X^{(L)}_n)$ and $Y^{(P)}_n = Y_n - Y^{(L)}_n$. As in the proof of Proposition 1, we have

$$J^{\pi}_n \ge E\Big[\hat{p} \min\big\{Y^{(P)}_n, (nx - Y^{(L)}_n)^+\big\}\Big]. \tag{A-17}$$

Next, we analyze the lower bound above by getting a handle on $\hat{p}$, the estimate of $p^D$, and the quantities $Y^{(P)}_n$ and $Y^{(L)}_n$.

Step 2. Here, we analyze the estimate p.

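We first analyze $p^u(\hat{\theta})$. Note that for any $p \in [\underline{p}, \overline{p}]$ and $\theta, \theta' \in \Theta$, we have

$$\begin{aligned}
|r(\lambda(p; \theta); \theta) - r(\lambda(p; \theta'); \theta')| &= |p\lambda(p; \theta) - p\lambda(p; \theta')| \\
&\le \overline{p}\, |\lambda(p; \theta) - \lambda(p; \theta')| \\
&\overset{(a)}{\le} \overline{p} K_2 \|\theta - \theta'\|_\infty,
\end{aligned} \tag{A-18}$$

where (a) follows from Assumption 2(ii.). Now

$$\begin{aligned}
0 &\le r(\lambda(p^u(\theta^*); \theta^*); \theta^*) - r(\lambda(p^u(\hat{\theta}); \theta^*); \theta^*) \\
&= \big[r(\lambda(p^u(\theta^*); \theta^*); \theta^*) - r(\lambda(p^u(\theta^*); \hat{\theta}); \hat{\theta})\big] + \big[r(\lambda(p^u(\theta^*); \hat{\theta}); \hat{\theta}) - r(\lambda(p^u(\hat{\theta}); \hat{\theta}); \hat{\theta})\big] \\
&\quad + \big[r(\lambda(p^u(\hat{\theta}); \hat{\theta}); \hat{\theta}) - r(\lambda(p^u(\hat{\theta}); \theta^*); \theta^*)\big] \\
&\overset{(a)}{\le} K_2 \overline{p} \|\hat{\theta} - \theta^*\|_\infty + 0 + K_2 \overline{p} \|\hat{\theta} - \theta^*\|_\infty,
\end{aligned}$$

where (a) follows from (A-18) and the fact that $p^u(\hat{\theta})$ maximizes $r(\lambda(\cdot; \hat{\theta}); \hat{\theta})$ on $[\underline{p}, \overline{p}]$. Hence, we have established that

$$0 \le r(\lambda(p^u(\theta^*); \theta^*); \theta^*) - r(\lambda(p^u(\hat{\theta}); \theta^*); \theta^*) \le 2K_2 \overline{p} \|\hat{\theta} - \theta^*\|_\infty. \tag{A-19}$$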

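We now turn to analyze $p^c(\hat{\theta})$. Note that under Assumption 1(i.),(ii.), $r(\cdot, \theta)$ is Lipschitz with constant $\overline{p} + M\underline{K}^{-1}$. Using this,

$$|r(\lambda(p^c(\theta^*); \theta^*); \theta^*) - r(\lambda(p^c(\hat{\theta}); \theta^*); \theta^*)| \le (\overline{p} + M\underline{K}^{-1})\, |\lambda(p^c(\theta^*); \theta^*) - \lambda(p^c(\hat{\theta}); \theta^*)|. \tag{A-20}$$

Now note that

$$\begin{aligned}
|\lambda(p^c(\theta^*); \theta^*) - \lambda(p^c(\hat{\theta}); \theta^*)| &\le |\lambda(p^c(\theta^*); \theta^*) - \lambda(p^c(\hat{\theta}); \hat{\theta})| + |\lambda(p^c(\hat{\theta}); \hat{\theta}) - \lambda(p^c(\hat{\theta}); \theta^*)| \\
&\overset{(a)}{\le} |\lambda(p^c(\theta^*); \theta^*) - \lambda(p^c(\hat{\theta}); \hat{\theta})| + K_2 \|\hat{\theta} - \theta^*\|_\infty,
\end{aligned}$$

where (a) follows from Assumption 2(ii.). Let $h_2(y; \theta)$ denote the projection of $y$ on $D_\lambda(\theta)$, i.e., $h_2(y; \theta) = \max\{\min\{y, \lambda(\underline{p}; \theta)\}, \lambda(\overline{p}; \theta)\}$, and note that $\lambda(p^c(\theta); \theta) = h_2(x/T; \theta)$. It is easy to show that for all $y$, $h_2(y, \cdot)$ is Lipschitz with constant $K_2$, where the latter was defined in Assumption 2(ii.). Using this, we have $|\lambda(p^c(\theta^*); \theta^*) - \lambda(p^c(\hat{\theta}); \hat{\theta})| = |h_2(x/T; \theta^*) - h_2(x/T; \hat{\theta})|$, and we deduce that

$$|\lambda(p^c(\theta^*); \theta^*) - \lambda(p^c(\hat{\theta}); \theta^*)| \le 2K_2 \|\hat{\theta} - \theta^*\|_\infty. \tag{A-21}$$

Combining (A-20) and (A-21), we get

$$|r(\lambda(p^c(\theta^*); \theta^*); \theta^*) - r(\lambda(p^c(\hat{\theta}); \theta^*); \theta^*)| \le 2(\overline{p} + M\underline{K}^{-1}) K_2 \|\hat{\theta} - \theta^*\|_\infty. \tag{A-22}$$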

We now turn to analyze the revenues achieved by p = pD(θ). For this purpose, we divide the

analysis into two cases:

Case 1. Suppose first that pu(θ∗) ≥ pc(θ∗), i.e., pD(θ∗) = pu(θ∗).

i.) If pu(θ) ≥ pc(θ), then

|r(λ(pD(θ∗); θ∗); θ∗)− r(λ(p; θ∗); θ∗)| = |r(λ(pu(θ∗); θ∗); θ∗)− r(λ(pu(θ); θ∗); θ∗)| ≤ 2K2p‖θ− θ∗‖∞.

ii.) If pu(θ) < pc(θ) ≤ pu(θ∗), then the fact that r(λ(·; θ∗), θ∗) is nondecreasing to the left of pu(θ∗)

implies that

|r(λ(pD(θ∗); θ∗); θ∗) − r(λ(p; θ∗); θ∗)| = |r(λ(pu(θ∗); θ∗); θ∗) − r(λ(pc(θ); θ∗); θ∗)|

≤ |r(λ(pu(θ∗); θ∗); θ∗) − r(λ(pu(θ); θ∗); θ∗)|

≤ 2K2p‖θ − θ∗‖∞.

iii.) If pu(θ) < pc(θ) and pc(θ) > pu(θ∗), then

|r(λ(pD(θ∗); θ∗); θ∗) − r(λ(p; θ∗); θ∗)| = |r(λ(pu(θ∗); θ∗); θ∗) − r(λ(pc(θ); θ∗); θ∗)|

≤ (p + MK−1)|λ(pu(θ∗); θ∗) − λ(pc(θ); θ∗)|

(a)

≤ (p + MK−1)|λ(pc(θ∗); θ∗) − λ(pc(θ); θ∗)|

(b)

≤ 2(p + MK−1)K2‖θ − θ∗‖∞.

where (a) follows from the fact that pc(θ∗) ≤ pu(θ∗) < pc(θ) and that λ(·; θ∗) is nonincreasing and

(b) follows from (A-21).

Case 2. Suppose now that pu(θ∗) < pc(θ∗), implying that pD(θ∗) = pc(θ∗).

i.) If pc(θ) ≥ pu(θ), then

|r(λ(pD(θ∗); θ∗); θ∗)−r(λ(p; θ∗); θ∗)| = |r(λ(pc(θ∗); θ∗); θ∗)−r(λ(pc(θ); θ∗); θ∗)| ≤ 2(p+MK−1)K2‖θ−θ∗‖∞.

ii.) If pc(θ) < pu(θ), then

r(λ(pD(θ∗); θ∗); θ∗) − r(λ(p; θ∗); θ∗) = r(λ(pc(θ∗); θ∗); θ∗) − r(λ(pu(θ); θ∗); θ∗)

= r(λ(pc(θ∗); θ∗); θ∗) − r(λ(pc(θ); θ∗); θ∗)

+r(λ(pc(θ); θ∗); θ∗) − r(λ(pu(θ∗); θ∗); θ∗)

+r(λ(pc(θ∗); θ∗); θ∗) − r(λ(pu(θ); θ∗); θ∗)

(a)

≤(2(p + MK−1)K2 + 2K2p

)‖θ − θ∗‖∞.


where (a) follows from the fact that pu(θ∗) maximizes r(λ(·; θ∗); θ∗) on [p, p] and the inequalities

(A-19) and (A-22). Combining the results from cases 1 and 2, we have established that

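$$r(\lambda(p^D(\theta^*); \theta^*); \theta^*) - r(\lambda(\hat{p}; \theta^*); \theta^*) \le C_1 \|\hat{\theta} - \theta^*\|_\infty, \tag{A-23}$$

where $C_1 = 2(\overline{p} + M\underline{K}^{-1})K_2 + 2K_2\overline{p}$. We are now left with controlling $\hat{\theta}$. This is done in Lemma 6, whose proof is deferred to Appendix B.

Lemma 6 Under Assumption 2, for some $C_2 > 0$,

$$E\|\hat{\theta} - \theta^*\|_\infty \le \frac{C_2}{(n\tau_n)^{1/2}} \quad \text{for all } n \ge 1. \tag{A-24}$$

Using the result above and (A-23), we have

$$\begin{aligned}
E\big[r(\lambda(\hat{p}; \theta^*); \theta^*)\big] &= r(\lambda(p^D(\theta^*); \theta^*); \theta^*) + E\big[r(\lambda(\hat{p}; \theta^*); \theta^*) - r(\lambda(p^D(\theta^*); \theta^*); \theta^*)\big] \\
&\ge r(\lambda(p^D(\theta^*); \theta^*); \theta^*) - \frac{C_1 C_2}{(n\tau_n)^{1/2}}.
\end{aligned}$$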

Step 3. Finalizing the analysis of the lower bound can be conducted in a similar manner as in the proof of Proposition 1 (Step 3) by letting in this case $u_n = (\log n)^{1/2}(n\tau_n)^{-1/2}$.

Step 4. As in the proof of Proposition 1 (Step 4), we get that for some $C_3 > 0$,

$$\Big(1 - \frac{J^{\pi}_n}{J^D_n}\Big) \le \frac{C_3}{m^D}\Big(\tau_n + (\log n)^{1/2}(n\tau_n)^{-1/2}\Big), \tag{A-25}$$

and the result follows by plugging in $\tau_n \asymp n^{-1/3}$. This completes the proof.
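The rate obtained from (A-25) can be read off by balancing its two terms. The following short derivation is a sketch of that step, ignoring constants and the logarithmic factor when choosing $\tau_n$.

```latex
% Balance the learning cost \tau_n against the estimation error (n\tau_n)^{-1/2}:
\tau_n = (n\tau_n)^{-1/2}
  \;\Longrightarrow\; \tau_n^{3/2} = n^{-1/2}
  \;\Longrightarrow\; \tau_n = n^{-1/3}.
% With this choice n\tau_n = n^{2/3}, hence (n\tau_n)^{-1/2} = n^{-1/3} and (A-25) gives
1 - \frac{J^{\pi}_n}{J^D_n}
  \le \frac{C_3}{m^D}\Big(1 + (\log n)^{1/2}\Big) n^{-1/3}.
```

Compared with the nonparametric case, only one learning term has to be balanced against the length of the learning phase, which is why the exponent improves from 1/4 to 1/3.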

Proof of Proposition 5. Throughout the proof, we let $\theta^*$ denote the true underlying parameter value. Also, to simplify notation, we let $p^D := p^D(\theta^*)$. The proof is organized as follows. We first lower bound the expected revenues achieved by the proposed policy. The key issue here is to account for the performance losses throughout the various stages of the algorithm. This is done by analyzing the estimates $\hat{\theta}_i$ and the prices $\hat{p}_i$ at each stage and controlling for the amount of inventory depleted as well as the revenues generated.

For $i = 1, \ldots, \ell$, let $X^i_n = \sum_{j=1}^{i} n\lambda(\hat{p}_j; \theta^*)\Delta^{(j)}_n$, and put $Y^i_n = N(X^i_n) - N(X^{i-1}_n)$. Finally, let $Y_n = \sum_{i=1}^{\ell} Y^i_n$ and recall that $p^D(\theta) = \max(p^u(\theta), p^c(\theta))$. We restrict attention to the case $\lambda(\overline{p}; \theta^*) \le x/T$ (the case of $\lambda(\overline{p}; \theta^*) > x/T$ can be handled by following arguments similar to those in the second case of Step 3 in the proof of Proposition 1). The total revenues over the selling horizon under the policy $\pi_n$ can be bounded below as follows

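$$\begin{aligned}
J^{\pi}_n &\ge E\Big[p_1 Y^1_n + \sum_{i=2}^{\ell_n} \hat{p}_i Y^i_n\Big] - \overline{p} E\big[(Y_n - nx)^+\big] \\
&= n p_1 \lambda(p_1; \theta^*)\Delta^{(1)}_n + \sum_{i=2}^{\ell_n} E\Big[E\big[\hat{p}_i N(n\lambda(\hat{p}_i; \theta^*)\Delta^{(i)}_n) \,\big|\, \hat{p}_i\big]\Big] - \overline{p} E\big[(Y_n - nx)^+\big] \\
&= n r(\lambda(p_1; \theta^*); \theta^*)\Delta^{(1)}_n + n \sum_{i=2}^{\ell_n} E\big[\hat{p}_i \lambda(\hat{p}_i; \theta^*)\big]\Delta^{(i)}_n - \overline{p} E\big[(Y_n - nx)^+\big] \\
&= n r(\lambda(p^D; \theta^*); \theta^*) \sum_{i=1}^{\ell_n} \Delta^{(i)}_n - n\big[r(\lambda(p^D; \theta^*); \theta^*) - r(\lambda(p_1; \theta^*); \theta^*)\big]\Delta^{(1)}_n \\
&\quad - n \sum_{i=2}^{\ell_n} \big(r(\lambda(p^D; \theta^*); \theta^*) - E[r(\lambda(\hat{p}_i; \theta^*); \theta^*)]\big)\Delta^{(i)}_n - \overline{p} E\big[(Y_n - nx)^+\big] && \text{(A-26)} \\
&= r(\lambda(p^D; \theta^*); \theta^*)\, nT - n\big[r(\lambda(p^D; \theta^*); \theta^*) - r(\lambda(p_1; \theta^*); \theta^*)\big]\Delta^{(1)}_n \\
&\quad - n \sum_{i=2}^{\ell_n} \big(r(\lambda(p^D; \theta^*); \theta^*) - E[r(\lambda(\hat{p}_i; \theta^*); \theta^*)]\big)\Delta^{(i)}_n - \overline{p} E\big[(Y_n - nx)^+\big], && \text{(A-27)}
\end{aligned}$$

where $\hat{p}_i$, $i = 2, \ldots, \ell_n$ were defined in Algorithm 3 (Step 2(b)).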

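Note that $\big(r(\lambda(p^D; \theta^*); \theta^*) - r(\lambda(p_1; \theta^*); \theta^*)\big)\Delta^{(1)}_n \le r(\lambda(p^D; \theta^*); \theta^*)\Delta^{(1)}_n \le C_1 n^{a_\ell - 1}$, where $C_1/2 = M\overline{p}$ bounds the revenue rate. The main task is to derive bounds on $r(\lambda(p^D; \theta^*); \theta^*) - E[r(\lambda(\hat{p}_i; \theta^*); \theta^*)]$. Using Lemma 6 and a parallel reasoning to the one in Step 2 of the proof of Proposition 3, we have that for some $C_2 > 0$

$$r(\lambda(p^D; \theta^*); \theta^*) - E[r(\lambda(\hat{p}_i; \theta^*); \theta^*)] \le \frac{C_2}{(n\Delta^{(i-1)}_n)^{1/2}},$$

and hence,

$$\sum_{i=2}^{\ell_n} \big(r(\lambda(p^D; \theta^*); \theta^*) - E[r(\lambda(\hat{p}_i; \theta^*); \theta^*)]\big)\Delta^{(i)}_n \le C_1 \sum_{i=2}^{\ell_n} \frac{\Delta^{(i)}_n}{(n\Delta^{(i-1)}_n)^{1/2}} = C_2 \beta^{1/2} \sum_{i=2}^{\ell_n} n^{a_\ell - 1} \le C_2 T^{1/2} \ell_n n^{a_\ell - 1},$$

where the last inequality follows since $\Delta^{(\ell)}_n = \beta$ and hence $\beta \le T$. We deduce that

$$\begin{aligned}
& n\big[r(\lambda(p^D; \theta^*); \theta^*) - r(\lambda(p_1; \theta^*); \theta^*)\big]\Delta^{(1)}_n \\
&\quad + n \sum_{i=2}^{\ell_n} \big(r(\lambda(p^D; \theta^*); \theta^*) - E[r(\lambda(\hat{p}_i; \theta^*); \theta^*)]\big)\Delta^{(i)}_n + \overline{p} E[(Y_n - nx)^+] \\
&\le n \max\{C_2 T^{1/2}, C_1\}\, \ell_n n^{a_\ell - 1} + \overline{p} E[(Y_n - nx)^+].
\end{aligned} \tag{A-28}$$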


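Let $\epsilon_n$ be a sequence of positive numbers to be defined later. We have

$$\begin{aligned}
E\big[(Y_n - nx)^+\big] &= E\big[(Y_n - nx)^+ ;\, Y_n - nx \le n\epsilon_n\big] + E\big[(Y_n - nx)^+ ;\, Y_n - nx > n\epsilon_n\big] \\
&\le n\epsilon_n + E\big[(Y_n - nx)^+ \,\big|\, Y_n - nx > n\epsilon_n\big]\, P(Y_n - nx > n\epsilon_n) \\
&\overset{(a)}{\le} n\epsilon_n + (n\epsilon_n + nM + 1)\, P(Y_n - nx > n\epsilon_n),
\end{aligned}$$

where (a) follows from the fact that for a Poisson random variable $Z$ with mean $\mu$, $E[Z \mid Z > a] \le a + 1 + \mu$. Now note that $nx = n(x/T)T \ge n\lambda(p^c; \theta^*) \sum_{i=1}^{\ell_n} \Delta^{(i)}_n$. (Note that under the assumption that $\lambda(\overline{p}; \theta^*) \le x/T$ and since $\lambda(\cdot; \theta^*)$ is continuous and decreasing, either $\lambda(\underline{p}; \theta^*) \ge x/T$, in which case $\lambda(p^c; \theta^*) = x/T$, or $\lambda(\underline{p}; \theta^*) < x/T$, in which case $\lambda(p^c; \theta^*) < x/T$.) Hence we have

$$\begin{aligned}
P(Y_n - nx > \ell_n n\epsilon_n) &\le P\big\{N(n\lambda(p_1; \theta^*)\Delta^{(1)}_n) > n\lambda(p^c; \theta^*)\Delta^{(1)}_n + n\epsilon_n\big\} \\
&\quad + \sum_{i=2}^{\ell_n} P\big\{N(n\lambda(\hat{p}_i; \theta^*)\Delta^{(i)}_n) > n\lambda(p^c; \theta^*)\Delta^{(i)}_n + n\epsilon_n\big\}.
\end{aligned} \tag{A-29}$$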

Choose $\epsilon_n = C_3 (\log n)^{1/2} \Delta^{(1)}_n = C_3 (\log n)^{1/2} n^{a_{\ell_n} - 1}$, where $C_3 > 0$ is suitably large. Note that $\Delta^{(1)}_n/\epsilon_n \to 0$ as $n \to \infty$, hence by Lemma 2 the first term on the RHS of (A-29) is bounded from above by $C_4/n^{\eta}$ for some $C_4 > 0$ and some $\eta \ge 2$. Consider any term in the sum which constitutes the second term on the RHS of (A-29). Using the same reasoning as the one leading to (B-10), we have for some $C_5, C_6 > 0$

$$P\Big\{\lambda(\hat{p}_i; \theta^*) - \lambda(p^c; \theta^*) > \frac{C_5}{(\Delta^{(i-1)}_n)^{1/2}}\Big\} \le \frac{C_6}{n^{\eta-1}}. \tag{A-30}$$

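Now, for $i = 2, \ldots, \ell$

$$\begin{aligned}
& P\big\{N(n\lambda(\hat{p}_i; \theta^*)\Delta^{(i)}_n) > n\lambda(p^c; \theta^*)\Delta^{(i)}_n + n\epsilon_n\big\} \\
&\le P\Big\{N(n\lambda(\hat{p}_i; \theta^*)\Delta^{(i)}_n) > n\lambda(p^c; \theta^*)\Delta^{(i)}_n + n\epsilon_n,\ \lambda(\hat{p}_i; \theta^*) \le \lambda(p^c; \theta^*) + \frac{C_6}{(n\Delta^{(i-1)}_n)^{1/2}}\Big\} \\
&\quad + P\Big\{\lambda(\hat{p}_i; \theta^*) - \lambda(p^c; \theta^*) > \frac{C_6}{(n\Delta^{(i-1)}_n)^{1/2}}\Big\} \\
&\overset{(a)}{\le} P\Big\{N\big(n\lambda(p^c; \theta^*) + nC_6 \Delta^{(i)}_n/(\Delta^{(i-1)}_n)^{1/2}\big) > n\lambda(p^c; \theta^*)\Delta^{(i)}_n + n\epsilon_n\Big\} + \frac{C_6}{n^{\eta-1}} \\
&\le P\Big\{N\big(n\lambda(p^c; \theta^*) + C_6 n^{a_\ell}\big) - \big(n\lambda(p^c; \theta^*)\Delta^{(i)}_n + C_6 n^{a_\ell}\big) > -nC_6 n^{a_\ell - 1} + n\epsilon_n\Big\} + \frac{C_6}{n^{\eta-1}} \\
&\overset{(b)}{\le} P\Big\{N\big(n\lambda(p^c; \theta^*) + C_6 n^{a_\ell}\big) - \big(n\lambda(p^c; \theta^*)\Delta^{(i)}_n + C_6 n^{a_\ell}\big) > \tfrac{1}{2} n\epsilon_n\Big\} + \frac{C_6}{n^{\eta-1}},
\end{aligned}$$

where (a) follows from (A-30) and (b) follows from the definition of the sequence $\Delta^{(i)}_n$, $i = 1, \ldots, \ell$.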

Appealing to Lemma 2 yields that the first term on the RHS above can be bounded by $C_7/n^{\eta}$ for some $C_7 > 0$. We conclude that for some $C_8 > 0$,

$$E\big[(Y_n - nx)^+\big] \le nC_8 (\log n)^{1/2} n^{a_{\ell_n} - 1},$$

and this in turn, in conjunction with (A-27) and (A-28), implies that for some $C_9 > 0$

$$\frac{J^{\pi}_n}{J^D_n} \ge 1 - \frac{C_9}{m^D}\Big[\ell_n n^{a_{\ell_n} - 1} + (\log n)^{1/2} n^{a_{\ell_n} - 1}\Big]. \tag{A-31}$$

By taking $a_{\ell_n} = 2^{\ell_n - 1}/(2^{\ell_n} - 1)$ with $\ell_n = (\log 2)^{-1} \log\log n$, we get that $n^{a_{\ell_n} - 1} \le \exp(1)\, n^{-1/2}$. Plugging in (A-31), we get the final assertion of the proposition. A similar result holds when $\lambda(\overline{p}; \theta^*) > x/T$ and the proof is thus complete.
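The final bound on $n^{a_{\ell_n} - 1}$ follows from a short computation. The sketch below spells it out under the simplifying assumption that $\ell_n$ is treated as a real number (so that $2^{\ell_n} = \log n$ exactly) and that $\log n > 1$.

```latex
% Distance of the exponent a_{\ell} from 1/2:
a_{\ell} - \frac{1}{2}
  = \frac{2^{\ell-1}}{2^{\ell}-1} - \frac{1}{2}
  = \frac{2^{\ell} - (2^{\ell}-1)}{2\,(2^{\ell}-1)}
  = \frac{1}{2\,(2^{\ell}-1)} .
% With \ell_n = \log\log n / \log 2 we have 2^{\ell_n} = \log n, so
n^{a_{\ell_n}-1}
  = n^{-1/2}\, n^{a_{\ell_n}-1/2}
  = n^{-1/2} \exp\Big\{\frac{\log n}{2(\log n - 1)}\Big\} .
% The exponent is at most 1 exactly when \log n \le 2\log n - 2, i.e. \log n \ge 2:
n^{a_{\ell_n}-1} \le \exp(1)\, n^{-1/2}
  \qquad \text{for all } n \ge e^{2} .
```

The number of stages therefore only needs to grow like $\log\log n$ for the exponent to be within a constant factor of the $n^{-1/2}$ rate.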

B Proofs of Auxiliary Results

In what follows, C ′i, i = 1, 2, . . . will denote positive constants that depend only on x, T and the

class L (but not on a specific function λ ∈ L).

Proof of Lemma 1. Fix $\lambda \in \mathcal{L}$ and let $\mathcal{P}_\lambda$ denote the class of policies that "know" the demand function prior to the start of the selling season, and whose corresponding price process is non-anticipating and satisfies the admissibility conditions (2)-(3). Put

$$J^*(x, T|\lambda) := \sup_{\pi \in \mathcal{P}_\lambda} E\Big[\int_0^T p(s)\, dN^{\pi}(s)\Big]. \tag{B-1}$$

Problem (B-1) is the one that the decision maker would solve if s/he knew the demand function $\lambda$ prior to the start of the selling season. Clearly $J^{\pi}(x, T; \lambda) \le J^*(x, T|\lambda)$ for all $\pi \in \mathcal{P}$. Hence, to establish the bound of the lemma, it is sufficient to show that $J^*(x, T|\lambda) \le J^D(x, T|\lambda)$.

Consider the optimization problem given in (5) and its equivalent formulation in terms of rates:

$$J^D(x, T|\lambda) = \max\Big\{\int_0^T r(\lambda(s))\, ds : \int_0^T \lambda(s)\, ds \le x,\ \lambda(s) \in [\underline{l}, \overline{l}] \cup \{0\} \text{ for all } 0 \le s \le T\Big\}. \tag{B-2}$$

If $\underline{l} = 0$, then the set of feasible rates is convex and Gallego and van Ryzin (1994, Theorem 2) yields the result. Suppose now that $\underline{l} > 0$ and let

$$\lambda^c = \arg\min_{l \in [\underline{l}, \overline{l}]} |l - x/T|, \qquad \lambda^u = \arg\max_{l \in [\underline{l}, \overline{l}]} r(l), \qquad \lambda^D = \min\{\lambda^u, \lambda^c\}.$$

Claim 1: If $x/T \ge \underline{l}$, then setting $\lambda(s) = \lambda^D$ for $0 \le s \le T$ is an optimal solution to the deterministic problem (B-2). In addition the value of the deterministic problem serves as an upper bound on the original stochastic problem.

To verify the claim we consider two cases. Suppose first that $x/T \in [\underline{l}, \overline{l}]$, in which case $\lambda^c = x/T$. If $\lambda^u \le x/T$, then applying $\lambda^u$ throughout is feasible and is hence optimal. In addition, the optimal value of the deterministic problem is given by $r(\lambda^u)T$, which is an upper bound on the stochastic problem (since $r(l) \le r(\lambda^u)$ for all $l \in [\underline{l}, \overline{l}]$). If $\lambda^u > x/T$, then $\lambda^u > \underline{l}$ and hence $r(\cdot)$ is necessarily non-decreasing on $[0, \lambda^u]$ (by the concavity assumption). Thus if we relax the domain of feasible rates to the convex set $[0, \overline{l}]$, applying $\lambda^D$ throughout is an optimal solution (see Gallego and van Ryzin (1994)). This is a feasible solution for the constrained problem and is hence optimal for that problem as well. Let $J^D[0, \overline{l}]$ be the value of the deterministic problem with feasible rates in $[0, \overline{l}]$ and let $J^*[0, \overline{l}]$ be the value of the stochastic problem. We clearly have that

$$J^D[\underline{l}, \overline{l}] \overset{(a)}{=} J^D[0, \overline{l}] \overset{(b)}{\ge} J^*[0, \overline{l}] \overset{(c)}{\ge} J^*[\underline{l}, \overline{l}],$$

where: (a) follows from the result we just established; (b) follows from Gallego and van Ryzin (1994, Theorem 2); and (c) is a clear consequence of constraint relaxation.

Suppose now that $x/T > \overline{l}$. In this case, $\lambda^u \le \overline{l} < x/T$. Applying $\lambda^u$ throughout is feasible and hence optimal. The optimal value of the deterministic problem is hence given by $r(\lambda^u)T$, which is an upper bound on the stochastic problem. The assertion of the claim follows.

Claim 2: If $x/T < \underline{l}$, then $\lambda(s) = \lambda^D = \underline{l}$ for $s \in [0, T']$ and $\lambda(s) = 0$ for $s \in (T', T]$, where $T' = x/\underline{l}$, is an optimal solution to the deterministic relaxation.

To verify the claim, note that all units are sold at the maximum allowable price with the proposed solution and hence the revenues achieved are $x\overline{p}$, which is a clear upper bound on the performance of any solution of the deterministic or the stochastic problem. Thus the assertion of the claim follows.

Combining the two claims yields the result of the lemma in the case $\underline{l} > 0$.

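Proof of Lemma 2. If $\mu = 0$, then the result is immediate. Now suppose $\mu \in (0, M]$. Using a Chernoff bound, we have for any $\theta > 0$

$$P\big(N(\mu r_n) - \mu r_n > r_n \epsilon_n\big) \le \exp\big\{-\theta r_n(\mu + \epsilon_n) + (\exp\{\theta\} - 1)\mu r_n\big\}. \tag{B-3}$$

The exponent is minimized by setting $\theta = \log(1 + \epsilon_n/\mu)$. Plugging this into (B-3) yields

$$\begin{aligned}
P\big(N(\mu r_n) - \mu r_n > r_n \epsilon_n\big) &\le \exp\Big\{r_n\Big(-\log\Big(1 + \frac{\epsilon_n}{\mu}\Big)(\mu + \epsilon_n) + \epsilon_n\Big)\Big\} \\
&\le \exp\Big\{r_n\Big(-\log\Big(1 + \frac{\epsilon_n}{M}\Big)(M + \epsilon_n) + \epsilon_n\Big)\Big\}.
\end{aligned} \tag{B-4}$$

For the last inequality, note that the derivative of the term in the exponent with respect to $\mu$ is given by $-\log(1 + \epsilon_n/\mu) + \epsilon_n/\mu$, which is always positive for $\epsilon_n > 0$. Now, using a Taylor expansion we get that for some $\xi \in [0, \epsilon_n]$,

$$-\log\Big(1 + \frac{\epsilon_n}{M}\Big)(M + \epsilon_n) + \epsilon_n = -\frac{1}{2}\, \frac{1}{1 + \xi}\, \frac{\epsilon_n^2}{M}.$$

If $\epsilon_n \le 1$, we have

$$-\frac{1}{2}\, \frac{1}{1 + \xi}\, \frac{\epsilon_n^2}{M} \le -\frac{\epsilon_n^2}{4M}.$$

If $\epsilon_n > 1$, then

$$-\frac{1}{2}\, \frac{1}{1 + \xi}\, \frac{\epsilon_n^2}{M} \le -\frac{1}{2M}\, \frac{\epsilon_n}{1 + \epsilon_n} \le -\frac{1}{4M}.$$

Letting $C(\eta) = 2\eta^{1/2}M^{1/2}$ and substituting for $\epsilon_n$, we get

$$P\big(N(\mu r_n) - \mu r_n > r_n \epsilon_n\big) \le \exp\Big\{-r_n \min\Big\{\frac{1}{4M}, \frac{C^2(\eta) \log n}{4M r_n}\Big\}\Big\} = \max\Big\{\exp\Big\{-\frac{r_n}{4M}\Big\}, \frac{1}{n^{\eta}}\Big\}.$$

Now it is easy to check that $\exp\{-r_n/4M\} \le (4M\eta/\beta)^{1/\beta} \exp\{-\eta/\beta\}\, 1/n^{\eta}$. Hence the first result follows with $C_1 = \max\{1, (4M\eta/\beta)^{1/\beta} \exp\{-\eta/\beta\}\}$. The other inequality is established using identical arguments. This completes the proof.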

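Proof of Lemma 3. The proof is divided into three main steps. The first step analyzes key properties of the estimate $\hat{p}^c$ and establishes (A-5). The second step focuses on $\hat{p}^u$, and the third step uses the results from the previous two steps to derive properties of $\hat{p} = \max\{\hat{p}^u, \hat{p}^c\}$ and establishes (A-4).

Step 1. Here, we focus on $\hat{p}^c$ and show the following. Put $C'_1 = (M + \overline{K}\,\overline{p})C'_2$ where $C'_2 = \underline{K}^{-1} \max\{8\eta^{1/2}M^{1/2}, 2\overline{K}(\overline{p} - \underline{p})\}$, then for some $C'_3 > 0$ and for all $n \ge 1$

$$P\big\{|r(\lambda(\hat{p}^c)) - r(\lambda(p^c))| > C'_1 u_n\big\} \le \frac{C'_3}{n^{\eta-1}}, \tag{B-5}$$

$$P\big\{|\hat{p}^c - p^c| > C'_2 u_n\big\} \le \frac{C'_3}{n^{\eta-1}}. \tag{B-6}$$

For each $n$, let $[p_i, p_{i+1}]$ be the interval that contains $p^c$. Here, we drop the dependence on $n$ to avoid cluttering the notation. Now

$$\begin{aligned}
|\lambda(\hat{p}^c) - \lambda(p^c)| &\le |\lambda(\hat{p}^c) - \hat{\lambda}(\hat{p}^c)| + |\hat{\lambda}(\hat{p}^c) - \lambda(p^c)| \\
&\overset{(a)}{\le} |\lambda(\hat{p}^c) - \hat{\lambda}(\hat{p}^c)| + |\hat{\lambda}(p_i) - \lambda(p^c)| \\
&\overset{(b)}{\le} 2 \max_{1 \le k \le \kappa_n} |\hat{\lambda}(p_k) - \lambda(p_k)| + |\lambda(p_i) - \lambda(p^c)| \\
&\overset{(c)}{\le} 2 \max_{1 \le k \le \kappa_n} |\hat{\lambda}(p_k) - \lambda(p_k)| + \frac{\overline{K}(\overline{p} - \underline{p})}{\kappa_n},
\end{aligned}$$

where (a) follows from the definition of $\hat{p}^c = \arg\min_{1 \le j \le \kappa_n} |\hat{\lambda}(p_j) - \lambda(p^c)|$, (b) follows from the triangle inequality, and (c) follows from the fact that $\lambda(\cdot)$ is $\overline{K}$-Lipschitz (Assumption 1(ii.)). We have

$$\begin{aligned}
P\big\{|\hat{p}^c - p^c| > C'_2 u_n\big\} &\overset{(a)}{\le} P\big\{|\lambda(\hat{p}^c) - \lambda(p^c)| > C'_2 \underline{K} u_n\big\} \\
&\le P\Big\{\max_{1 \le k \le \kappa_n} |\hat{\lambda}(p_k) - \lambda(p_k)| > \frac{(C'_2 \underline{K}) u_n}{2} - \frac{\overline{K}(\overline{p} - \underline{p})}{2\kappa_n}\Big\} \\
&\overset{(b)}{\le} \sum_{k=1}^{\kappa_n} P\Big\{|\hat{\lambda}(p_k) - \lambda(p_k)| > \frac{(C'_2 \underline{K}) u_n}{4}\Big\} \\
&\overset{(c)}{\le} \frac{C'_3}{n^{\eta-1}},
\end{aligned}$$

where (a) follows from the fact that $\gamma(\cdot)$ is $\underline{K}^{-1}$-Lipschitz (Assumption 1(ii.)), (b) follows from a union bound and the fact that $C'_2 \ge 2\overline{K}(\overline{p} - \underline{p})\underline{K}^{-1}$, and (c) follows from Lemma 2 in conjunction with the inequality $C'_2/(4\underline{K}^{-1}) \ge 2\eta^{1/2}M^{1/2}$. Note that (A-5) has now been established by setting $C_3 = C'_2$. The inequality for revenues now follows. Indeed, note that under Assumption 1, $r(\lambda(\cdot))$ is Lipschitz with constant $M + \overline{p}\,\overline{K}$. Hence, $|r(\lambda(\hat{p}^c)) - r(\lambda(p^c))| \le (M + \overline{K}\,\overline{p})|\hat{p}^c - p^c|$, and

$$\begin{aligned}
P\big\{|r(\lambda(\hat{p}^c)) - r(\lambda(p^c))| > C'_1 u_n\big\} &\le P\big\{|\hat{p}^c - p^c| > C'_1 u_n/(M + \overline{K}\,\overline{p})\big\} \\
&= P\big\{|\hat{p}^c - p^c| > C'_2 u_n\big\} \\
&\le \frac{C'_3}{n^{\eta-1}}.
\end{aligned}$$

This concludes the proof of (B-6) and (B-5). In Step 2, we derive properties of $\hat{p}^u$ which are subsequently used in Step 3 to prove (A-4).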

Step 2. Here, we focus on pu and show the following. Put C ′4 = max{8η1/2M1/2p, 2M + pK},

then for some C ′5 > 0 and all n ≥ 1

P{r(λ(pu)) − r(λ(pu)) > C ′

4un

}≤

C ′5

nη−1, (B-7)

First, for each n, let j be the index such that pu ∈ [pj , pj+1]. Recall that under Assumption 1, r(λ(·))

is Lipschitz with constant M +pK. This, in conjunction with the fact that |pu−pj+1| ≤ (p−p)/κn

yields∣∣r(λ(pu)) − r(λ(pj+1))

∣∣ ≤(M + pK)(p − p)

κn· (B-8)

We now establish an upper bound on the difference in revenues r(λ(pu)) − r(λ(pu)) as follows

r(λ(pu)) − r(λ(pu)) = r(λ(pu)) − pj+1λ(pj+1) + pj+1λ(pnj+1) − puλ(pu) + puλ(pu) − puλ(pu)

(a)

≤ r(λ(pu)) − pj+1λ(pj+1) + puλ(pu) − puλ(pu)

≤∣∣r(λ(pu)) − r(λ(pj+1))

∣∣ + 2 max1≤k≤κn

∣∣pkλ(pk) − pkλ(pk)∣∣

(b)

≤(M + pK)(p − p)

κn+ 2 max

1≤k≤κn

∣∣pkλ(pk) − pkλ(pk)∣∣,

where (a) follows from the definition of pu given in (10), and (b) follows from (B-8). Now

P

{r(λ(pu)) − r(λ(pu)) > C ′

4un

}≤ P

{max

1≤k≤κn

∣∣pkλ(pk) − pkλ(pk)∣∣ >

un

2−

(M + pK)(p − p)

2κn

}

(a)

≤ P

{max

1≤k≤κn

∣∣pkλ(pk) − pkλ(pk)∣∣ >

C ′4un

4

}

(b)

≤

κn∑

k=1

P

{∣∣λ(pk) − λ(pk)∣∣ >

C ′4un

4p

}

(c)

≤κnC ′

5

nη

≤C ′

5

nη−1,


where (a) follows from the fact that (C ′4/2)un ≥ C ′

3(p− p)/κn, (b) follows from a union bound and

(c) follows from the fact that C ′4/(4p) ≥ 2η1/2M1/2 and Lemma 2. We have now established (B-7).

Step 3. In this last step, we now turn to analyze the revenue function evaluated at p =

max{pu, pc}, which is an estimate of pD = max{pu, pc}. Define C ′7 = max{2C ′

1, 2C ′4, C

′2(M + Kp)}.

We divide the analysis in two cases: pu ≥ pc and pu < pc.

Case 1. Suppose first that pu ≥ pc, i.e., pD = pu.

i.) If pu ≥ pc, then r(λ(pD)) − r(λ(p)) = r(λ(pu)) − r(λ(pu)).

ii.) If pu < pc ≤ pu, then r(λ(pD)) − r(λ(p)) = r(λ(pu)) − r(λ(pc)) ≤ r(λ(pu)) − r(λ(pu)), where

the last inequality follows from the fact that r(λ(·)) is non-decreasing on [pu, pu]. Since, C ′7 ≥ C ′

4,

in both i.) and ii.), we have by (B-7) that P{r(λ(pD)) − r(λ(p)) > C ′

7un

}≤ C ′

5/nη−1.

iii.) The last subcase we analyze is when pu < pc and pc > pu. In this case, note that

P{r(λ(pD)) − r(λ(p)) > C ′

7un

}= P

{r(λ(pu)) − r(λ(pc)) > C ′

4un

}

≤ P{r(λ(pu)) − r(λ(pc)) > C ′

7un ; |pc − pc| ≤ C ′2un

}

+P{|pc − pc| > C ′

2un

}

(a)

≤ P{(M + Kp)|pu − pc| > C ′

7un ; |pc − pc| ≤ C ′2un

}+

C ′3

nη−1

(b)

≤ P{(M + Kp)|pc − pc| > C ′

7un ; |pc − pc| ≤ C ′2un

}+

C ′3

nη−1

(c)=

C ′3

nη−1,

where (a) follows the fact that r(λ(·)) is (M +Kp)-Lipschitz under Assumption 1, and (B-6). Note

that (b) stems from the fact that in case iii.), |pu − pc| ≤ |pc − pc| since pc > pu ≥ pc and (c) follows

since C ′7/(M + Kp) ≥ C ′

2. In all subcases, we have

P{r(λ(pD)) − r(λ(p)) > C ′

7un

}≤

max{C ′5, C

′3}

nη−1.

Case 2. We now turn to the case where pu < pc, implying that pD = pc.

i.) If pc ≥ pu, then r(λ(pD)) − r(λ(p)) = r(λ(pc)) − r(λ(pc)). Hence,

P{r(λ(pD)) − r(λ(p)) > C ′

7un

}= P

{r(λ(pc)) − r(λ(pc))u)) > C ′

7un

} (a)

≤C ′

3

nη−1,

where (a) follows from (B-5) and the fact that C ′7 ≥ C ′

1.

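ii.) If $\hat{p}^c < \hat{p}^u$, then

$$
\begin{aligned}
r(\lambda(p^D)) - r(\lambda(\hat{p})) &= r(\lambda(p^c)) - r(\lambda(\hat{p}^u)) \\
&= r(\lambda(p^c)) - r(\lambda(\hat{p}^c)) + r(\lambda(\hat{p}^c)) - r(\lambda(p^u)) + r(\lambda(p^u)) - r(\lambda(\hat{p}^u)) \\
&\le r(\lambda(p^c)) - r(\lambda(\hat{p}^c)) + r(\lambda(p^u)) - r(\lambda(\hat{p}^u)),
\end{aligned}
$$

where the inequality follows from the definition of $p^u$. We deduce that

$$
\begin{aligned}
\mathbb{P}\{r(\lambda(p^D)) - r(\lambda(\hat{p})) > C'_7 u_n\}
&\le \mathbb{P}\{r(\lambda(p^c)) - r(\lambda(\hat{p}^c)) + r(\lambda(p^u)) - r(\lambda(\hat{p}^u)) > C'_7 u_n\} \\
&\le \mathbb{P}\{r(\lambda(p^c)) - r(\lambda(\hat{p}^c)) > C'_7 u_n/2\} + \mathbb{P}\{r(\lambda(p^u)) - r(\lambda(\hat{p}^u)) > C'_7 u_n/2\} \\
&\overset{(a)}{\le} \frac{C'_5 + C'_3}{n^{\eta-1}},
\end{aligned}
$$

where (a) follows from (B-5) and (B-7) and the fact that $C'_7 \ge 2\max\{C'_1, C'_4\}$.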

Putting together the results from cases 1 and 2, we have shown that

$$\mathbb{P}\{r(\lambda(p^D)) - r(\lambda(\hat{p})) > C'_7 u_n\} \le \frac{C'_5 + C'_3}{n^{\eta-1}},$$

which establishes (A-4). The proof is now complete.
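As an illustration of the quantities compared in Step 3, the following minimal sketch computes the three prices on a finite grid. It assumes, consistently with how they are used in the proof, that $p^u$ maximizes the revenue rate and that $p^c$ is the price whose demand rate is closest to $x/T$. The linear demand function, the grid, the values of `x` and `T`, and the Gaussian noise standing in for estimation error are all hypothetical choices, not taken from the paper.

```python
import random

def pick_prices(prices, rates, x, T):
    """Return (p_u, p_c, p_d) on a price grid.

    p_u maximizes the revenue rate p * rate(p), p_c brings the demand
    rate closest to the run-out rate x / T, and p_d = max(p_u, p_c).
    """
    p_u = max(zip(prices, rates), key=lambda pr: pr[0] * pr[1])[0]
    p_c = min(zip(prices, rates), key=lambda pr: abs(pr[1] - x / T))[0]
    return p_u, p_c, max(p_u, p_c)

def demand(p):
    """Hypothetical demand function (an assumption): lambda(p) = 10 - p."""
    return 10.0 - p

def revenue(p):
    """Revenue rate r(lambda(p)) = p * lambda(p) under the assumed demand."""
    return p * demand(p)

grid = [1.0 + 0.5 * i for i in range(17)]   # prices 1.0, 1.5, ..., 9.0
x, T = 3.0, 1.0                             # assumed inventory and horizon

# Prices computed from the true demand rates.
true_rates = [demand(p) for p in grid]
p_u, p_c, p_d = pick_prices(grid, true_rates, x, T)

# Estimated counterparts computed from noisy rates (stand-in for learning).
rng = random.Random(0)
noisy_rates = [r + rng.gauss(0.0, 0.05) for r in true_rates]
hat_p_u, hat_p_c, hat_p = pick_prices(grid, noisy_rates, x, T)

print("true prices:", p_u, p_c, p_d)
print("estimated price:", hat_p, "revenue gap:", revenue(p_d) - revenue(hat_p))
```

With these assumed inputs the constraint is binding, so $p^D = p^c$; the printed revenue gap is the quantity that the proof controls with high probability.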

Proof of Lemma 4. Put $C'_1 = 4\max\{KTC_2, M, 2\eta^{1/2}M^{1/2}\}$, where $C_2$ was defined in Lemma 3. First, using the fact that $Y_n = Y^{(L)}_n + Y^{(P)}_n$, note that

$$\mathbb{P}(Y_n - nx > C'_1 n u_n) \le \mathbb{P}(Y^{(L)}_n > C'_1 n u_n/2) + \mathbb{P}(Y^{(P)}_n > nx + C'_1 n u_n/2). \tag{B-9}$$

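Prior to bounding the second term on the RHS above, note that $p^c - \hat{p} = p^c - \max\{\hat{p}^u, \hat{p}^c\} \le p^c - \hat{p}^c$, and hence

$$\mathbb{P}\{p^c - \hat{p} > C_2 u_n\} \le \mathbb{P}\{p^c - \hat{p}^c > C_2 u_n\} \le \frac{C_3}{n^{\eta-1}},$$

where the last inequality follows from Lemma 3. Let $C'_2 = KC_2$. Recalling that $\lambda(\cdot)$ is monotone decreasing and $K$-Lipschitz, we have that $\lambda(\hat{p}) - \lambda(p^c) > C'_2 u_n$ implies that $p^c - \hat{p} > C_2 u_n$. Hence

$$\mathbb{P}\{\lambda(\hat{p}) > \lambda(p^c) + C'_2 u_n\} \le \mathbb{P}\{p^c - \hat{p} > C_2 u_n\} \le \frac{C_3}{n^{\eta-1}}. \tag{B-10}$$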

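Coming back to the second term on the RHS of (B-9), we have

$$
\begin{aligned}
\mathbb{P}(Y^{(P)}_n > nx + C'_1 n u_n/2)
&\le \mathbb{P}\big(N(\lambda(\hat{p})n(T - \tau_n)) > nx + C'_1 n u_n/2 \,,\, \lambda(\hat{p}) \le \lambda(p^c) + C'_2 u_n\big) + \mathbb{P}\big(\lambda(\hat{p}) > \lambda(p^c) + C'_2 u_n\big) \\
&\overset{(a)}{\le} \mathbb{P}\big(N((\lambda(p^c) + C'_2 u_n)nT) > nx + C'_1 n u_n/2\big) + \frac{C_3}{n^{\eta-1}} \\
&\overset{(b)}{\le} \mathbb{P}\big(N((x/T + C'_2 u_n)nT) > nx + C'_1 n u_n/2\big) + \frac{C_3}{n^{\eta-1}} \\
&= \mathbb{P}\big(N(x(1 + C'_2 T u_n/x)n) - x(1 + C'_2 T u_n/x)n > -C'_2 T n u_n + C'_1 n u_n/2\big) + \frac{C_3}{n^{\eta-1}},
\end{aligned}
$$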

where (a) follows from (B-10), and (b) follows from the fact that $\lambda(p^c) \le x/T$. For the latter, note that under the assumption that $\lambda(\overline{p}) \le x/T$ and since $\lambda(\cdot)$ is continuous and decreasing, either $\lambda(\underline{p}) \ge x/T$, in which case $\lambda(p^c) = x/T$, or $\lambda(\underline{p}) < x/T$, in which case $\lambda(p^c) < x/T$. Now, using the fact that $C'_1 - C'_2 T \ge C'_2 T$, we have

$$
\begin{aligned}
\mathbb{P}\big(N((\lambda(p^c) + C'_2 u_n)nT) > nx + C'_1 n u_n/2\big)
&\le \mathbb{P}\big(N(x(1 + C'_2 T u_n/x)n) - x(1 + C'_2 T u_n/x)n > C'_2 T n u_n\big) \\
&\le \frac{C'_3}{n^{\eta}},
\end{aligned}
\tag{B-11}
$$

where $C'_3$ is suitably large and we have used Lemma 2 for the last inequality (since $u_n/((\log n)^{1/2}n^{-1/2}) \to \infty$ as $n \to \infty$). The first term on the RHS of (B-9) can be bounded using Lemma 2.

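$$
\begin{aligned}
\mathbb{P}(Y^{(L)}_n > C'_1 n u_n/2)
&= \mathbb{P}\Big(N\Big(\sum_{i=1}^{\kappa_n} \lambda(p_i)n\Delta_n\Big) > C'_1 n u_n/2\Big) \\
&\le \sum_{i=1}^{\kappa_n} \mathbb{P}\Big(N(\lambda(p_i)n\Delta_n) > \frac{C'_1 n u_n}{2\kappa_n}\Big) \\
&\overset{(a)}{\le} \kappa_n \mathbb{P}\Big(N(Mn\Delta_n) - Mn\Delta_n > \frac{C'_1 n u_n}{4\kappa_n}\Big) \\
&\overset{(b)}{\le} \frac{\kappa_n C'_3}{n^{\eta}} \\
&\overset{(c)}{\le} \frac{C'_4}{n^{\eta-1}},
\end{aligned}
$$

where $C'_4 > 0$ is suitably large, (a) follows from the fact that $C'_1 n u_n/(4\kappa_n) \ge Mn\Delta_n$, (b) follows from Lemma 2 and the fact that $C'_1 n u_n/(4\kappa_n) \ge 2\eta^{1/2}M^{1/2}(\log n)^{1/2}(n\Delta_n)^{1/2}$, and (c) follows from the fact that $\kappa_n = o(n)$. Now, combining the above, we have

$$\mathbb{P}(Y_n - nx > C'_1 n u_n) \le \frac{C_3 + C'_3 + C'_4}{n^{\eta-1}}. \tag{B-12}$$

The result is now established.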
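Step (b) above applies a Poisson deviation bound at the threshold $2\eta^{1/2}M^{1/2}(\log n)^{1/2}(n\Delta_n)^{1/2}$. The sketch below is a numerical sanity check only: it computes the exact upper tail of a Poisson variable with mean $Mn\Delta_n$ at that deviation and compares it with $n^{-\eta}$. The values of `M` and `eta` and the choice $\Delta_n = n^{-1/4}$ are assumptions made for illustration, and the constant in Lemma 2 is taken to be 1 here for simplicity.

```python
import math

def poisson_upper_tail(mu, threshold):
    """Exact P(N > threshold) for N ~ Poisson(mu), with terms computed in log space."""
    k = max(math.floor(threshold) + 1, 0)
    total = 0.0
    while True:
        term = math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))
        total += term
        # Past the mean the terms decay geometrically, so stop once negligible.
        if k > mu and term <= 1e-18 * total:
            break
        k += 1
    return total

M, eta = 1.0, 2.0                      # assumed rate bound and exponent
results = []
for n in (100, 1_000, 10_000):
    delta_n = n ** -0.25               # assumed learning-interval length
    mu = M * n * delta_n               # mean of N(M n Delta_n)
    deviation = 2.0 * math.sqrt(eta * M * math.log(n) * n * delta_n)
    tail = poisson_upper_tail(mu, mu + deviation)
    results.append((n, tail, n ** -eta))
    print(f"n={n:>6}  tail={tail:.3e}  n^-eta={n ** -eta:.3e}")
```

For these assumed parameters the exact tail sits below $n^{-\eta}$ at every listed $n$, which is the behavior that steps (b) and (c) rely on after multiplying by $\kappa_n$.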

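Proof of Lemma 5. Note that $A = \{\omega : Y^{(P)}_n \ge nx - C_9 n u_n,\ Y^{(L)}_n \le C_9 n u_n,\ |\hat{p} - p^D| \le C_2 u_n\}$. Now,

$$
\begin{aligned}
\mathbb{P}(A^c)
&\overset{(a)}{\le} \mathbb{P}(Y^{(P)}_n < nx - C_9 n u_n) + \mathbb{P}(Y^{(L)}_n > C_9 n u_n) + \mathbb{P}(|\hat{p} - p^D| > C_2 u_n) \\
&\overset{(b)}{\le} \mathbb{P}(Y^{(P)}_n < nx - C_9 n u_n) + \mathbb{P}(Y^{(L)}_n > C_9 n u_n) + \frac{C_3}{n^{\eta-1}},
\end{aligned}
\tag{B-13}
$$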

where (a) follows from a union bound and (b) follows from Lemma 3 (since $|\hat{p} - p^D| = |\hat{p} - \overline{p}| \le |\hat{p}^c - \overline{p}| = |\hat{p}^c - p^c|$). We now focus on the first term on the RHS of (B-13).

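$$
\begin{aligned}
\mathbb{P}(Y^{(P)}_n < nx - C_9 n u_n)
&= \mathbb{P}\big(N(\lambda(\hat{p})n(T - \tau_n)) < nx - C_9 n u_n\big) \\
&\le \mathbb{P}\big(N(\lambda(\hat{p})n(T - \tau_n)) < n\lambda(\hat{p})T - n\epsilon_n\big) \\
&\le \mathbb{P}\big(N(\lambda(\hat{p})n(T - \tau_n)) - \lambda(\hat{p})n(T - \tau_n) < n(M\tau_n - C_9 u_n)\big) \\
&\overset{(a)}{\le} \frac{C'_1}{n^{\eta-1}},
\end{aligned}
$$

where (a) follows from Lemma 2 as long as $C_9$ is chosen large enough and $C'_1 > 0$ is suitably large. We now turn to the second term on the RHS of (B-13).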

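$$
\begin{aligned}
\mathbb{P}(Y^{(L)}_n > C_9 n u_n)
&\le \sum_{i=1}^{\kappa_n} \mathbb{P}\big(\lambda_i n\Delta_n > C_9 (n/\kappa_n) u_n\big) \\
&\le \sum_{i=1}^{\kappa_n} \mathbb{P}\big(\lambda_i n\Delta_n - \lambda(p_i)n\Delta_n > C_9 (n/\kappa_n) u_n - Mn\Delta_n\big) \\
&\overset{(a)}{\le} \frac{\kappa_n C'_2}{n^{\eta-1}} \le \frac{C'_2}{n^{\eta-1}},
\end{aligned}
$$

where (a) follows from Lemma 2 as long as $C_9$ is chosen large enough and $C'_2 > 0$ is suitably chosen. Coming back to (B-13), we get

$$\mathbb{P}(A^c) \le \frac{C'_1 + C'_2 + C_3}{n^{\eta-1}}, \tag{B-14}$$

and the result follows.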

Proof of Lemma 6. Consider each interval $[(i-1)\Delta_n, i\Delta_n]$ for $i = 1, \ldots, k$. Suppose each is divided into $\lfloor n\Delta_n \rfloor$ intervals of length $1/n$ each and measurements are taken every $1/n$ units of time. Let $X^i_n = (x^i_1, \ldots, x^i_{\lfloor n\Delta_n \rfloor})$ denote the vector of total demand observed in each interval. Focusing on the $i$th interval where price $p_i$ is applied, and letting $\mu_i = \lambda(p_i, \theta)$, the log-likelihood function can be written as

$$L^i_n(X^i_n, \mu_i) = \sum_{j=1}^{\lfloor n\Delta_n \rfloor} \log f_{\mu_i}(x^i_j).$$

This expression is maximized by $\hat{\mu}_i = \sum_{j=1}^{\lfloor n\Delta_n \rfloor} x^i_j/\Delta_n$. This follows since for a Poisson process with unknown constant intensity $\mu$, given $n$ observations of the number of events in non-overlapping intervals of unit length $(y_1, \ldots, y_n)$, the ML estimate $\hat{\mu}$ is given by $\hat{\mu} = (y_1 + \cdots + y_n)/n$.
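The closing statement about the Poisson ML estimate can be checked directly. The sketch below evaluates the Poisson log-likelihood for counts over unit-length intervals and confirms that the sample mean does at least as well as nearby intensities. The counts and the function names are hypothetical and serve only as an illustration.

```python
import math

def poisson_log_likelihood(mu, counts):
    """Log-likelihood of i.i.d. Poisson(mu) counts over unit-length intervals."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)

def poisson_mle(counts):
    """Closed-form ML estimate of the intensity: the sample mean of the counts."""
    return sum(counts) / len(counts)

counts = [3, 5, 4, 6, 2, 4, 5, 3]     # hypothetical observations (sum is 32)
mu_hat = poisson_mle(counts)          # 32 / 8 = 4.0

# Compare the log-likelihood at the estimate with nearby candidate intensities.
for mu in (3.5, 3.9, mu_hat, 4.1, 4.5):
    print(f"mu={mu:.1f}  log-likelihood={poisson_log_likelihood(mu, counts):.4f}")
```

The log-likelihood is concave in $\mu$, with derivative $\sum_j y_j/\mu - n$, so setting it to zero gives the sample mean as the unique maximizer.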

Now note that for a Poisson process with unknown intensity $\mu \in (l_0, M]$, the Fisher information $I$ is bounded and positive, where $I$ is given by $I = \mathbb{E}_\mu[\partial \log f_\mu(x)/\partial\mu]^2$. In addition, note that for any real number $s \ge 2$, we have $\eta := \mathbb{E}_\mu|\partial \log f_\mu(x)/\partial\mu|^s = \sup_{\mu \in (l_0, M]} \mu^{-s}\,\mathbb{E}_\mu|x - \mu|^s < \infty$. Hence, under Assumption 2, the conditions of Theorem 3 in Borovkov (1998, II.36) are satisfied and there exist values $0 < c < \infty$ and $\delta > 0$ such that for all $v$ and $n \ge 1$,

$$\mathbb{P}\big((n\Delta_n)^{1/2}|\hat{\mu}_i - \mu^*_i| \ge v\big) \le c\,e^{-\delta v^2}. \tag{B-15}$$

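Setting $\mu^*_i = \lambda(p_i, \theta^*)$, where $\theta^*$ is the true parameter, and using condition (i.)(b) in Assumption 2, we deduce that

$$\mathbb{P}\big((n\Delta_n)^{1/2}\|\hat{\theta} - \theta^*\|_\infty \ge \alpha v\big) \le \sum_{i=1}^{k} \mathbb{P}\big((n\Delta_n)^{1/2}|\hat{\mu}_i - \mu^*_i| \ge v\big) \le c\,k\,e^{-\delta v^2}, \tag{B-16}$$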

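and in turn,

$$
\begin{aligned}
\mathbb{E}\|\hat{\theta} - \theta^*\|_\infty
&= \int_0^\infty \mathbb{P}\big\{(n\Delta_n)^{1/2}\|\hat{\theta} - \theta^*\|_\infty > (n\Delta_n)^{1/2}s\big\}\,ds \\
&\le a + \int_a^\infty c\,k\,e^{-\delta n\Delta_n s^2/\alpha^2}\,ds \\
&\le a + \frac{c\,k\,\alpha^2}{2\delta n\Delta_n a}\,e^{-\delta n\Delta_n a^2/\alpha^2}.
\end{aligned}
$$

Taking $a = (n\Delta_n)^{-1/2}$, one gets

$$\mathbb{E}\|\hat{\theta} - \theta^*\|_\infty \le \frac{C'_1}{(n\tau_n)^{1/2}}, \tag{B-17}$$

where $C'_1 = 1 + \frac{c\,k\,\alpha^2}{2\delta}\,e^{-\delta/\alpha^2}$. This concludes the proof.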
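The last inequality in the chain above uses the elementary tail bound $\int_a^\infty e^{-\beta s^2}\,ds \le e^{-\beta a^2}/(2\beta a)$, obtained by inserting the factor $s/a \ge 1$ into the integrand, with $\beta = \delta n\Delta_n/\alpha^2$. The sketch below checks this numerically against the exact value (via the complementary error function) and confirms that the choice $a = (n\Delta_n)^{-1/2}$ gives a bound of order $(n\Delta_n)^{-1/2}$. The constants `c`, `k`, `alpha`, `delta` and the values of `m` (standing in for $n\Delta_n$) are hypothetical.

```python
import math

def tail_integral(beta, a):
    """Exact value of the integral of exp(-beta * s**2) over [a, infinity)."""
    return 0.5 * math.sqrt(math.pi / beta) * math.erfc(a * math.sqrt(beta))

def tail_bound(beta, a):
    """Upper bound exp(-beta * a**2) / (2 * beta * a), using s / a >= 1 on [a, inf)."""
    return math.exp(-beta * a * a) / (2.0 * beta * a)

c, k, alpha, delta = 1.0, 2, 1.0, 0.5    # hypothetical constants
c1 = 1.0 + c * k * alpha ** 2 / (2.0 * delta) * math.exp(-delta / alpha ** 2)

checks = []
for m in (10.0, 100.0, 1000.0):          # m plays the role of n * Delta_n
    a = m ** -0.5                        # the choice a = (n Delta_n)^(-1/2)
    beta = delta * m / alpha ** 2
    exact = a + c * k * tail_integral(beta, a)   # middle expression of the chain
    bound = a + c * k * tail_bound(beta, a)      # final expression of the chain
    checks.append((m, exact, bound, c1 / math.sqrt(m)))
    print(f"m={m:>7.1f}  exact={exact:.5f}  bound={bound:.5f}")
```

With this choice of $a$ the final expression equals $C'_1 (n\Delta_n)^{-1/2}$ exactly, which is what the last column of `checks` records.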


References

Araman, V. F. and Caldentey, R. A. (2005), ‘Dynamic pricing for non-perishable products with demand learning’, working paper, New York University.

Auer, P., Cesa-Bianchi, N., Freund, Y. and Schapire, R. E. (2002), ‘The nonstochastic multiarmed bandit problem’, SIAM Journal on Computing 32, 48–77.

Aviv, Y. and Pazgal, A. (2005), ‘Pricing of short life-cycle products through active learning’, working paper, Washington University.

Ball, M. and Queyranne, M. (2006), ‘Toward robust revenue management: Competitive analysis of online booking’, working paper, University of Maryland.

Benveniste, A., Metivier, M., and Priouret, P. (1990), Adaptive algorithms and stochastic approximations, Springer Verlag.

Bertsimas, D. and Perakis, G. (2003), ‘Dynamic pricing: A learning approach’, working paper, Massachusetts Institute of Technology.

Bitran, G. and Caldentey, R. (2003), ‘An overview of pricing models for revenue management’, Manufacturing & Service Operations Management 5, 203–229.

Borovkov, A. (1998), Mathematical Statistics, Gordon and Breach Science Publishers.

Carvalho, A. X. and Puterman, M. L. (2005), ‘Dynamic pricing and learning over short time horizons’, working paper, University of British Columbia.

Cesa-Bianchi, N. and Lugosi, G. (2006), Prediction, learning, and games, Cambridge University Press.

Elmaghraby, W. and Keskinocak, P. (2003), ‘Dynamic pricing in the presence of inventory considerations: Research overview, current practices and future directions’, Management Science 49, 1287–1309.

Eren, S. and Maglaras, C. (2007), ‘Pricing without market information’, working paper, Columbia University.

Farias, V. F. and Van Roy, B. (2006), ‘Dynamic pricing with a prior on market response’, working paper, Stanford University.

Foster, D. P. and Vohra, R. (1999), ‘Regret in the on-line decision problem’, Games and Economic Behavior 29, 7–35.

Gallego, G. and van Ryzin, G. (1994), ‘Optimal dynamic pricing of inventories with stochastic demand over finite horizons’, Management Science 40, 999–1020.

Hannan, J. (1957), ‘Approximation to Bayes risk in repeated play’, Contributions to the Theory of Games, Princeton University Press III, 97–139.

Huh, W. T. and Rusmevichientong, P. (2006), ‘An adaptive algorithm for multiple-fare-class capacity control problems’, working paper, Columbia University.

Kiefer, J. and Wolfowitz, J. (1952), ‘Stochastic estimation of the maximum of a regression function’, Annals of Mathematical Statistics 23, 462–466.

Kleinberg, R. and Leighton, F. T. (2003), ‘The value of knowing a demand curve: Bounds on regret for online posted-price auctions’, Proc. of the 44th Annual IEEE Symposium on Foundations of Computer Science.


Kushner, H. and Sanvicente, E. (1975), ‘Stochastic approximation of constrained systems with system and constraint noise’, Automatica 11, 375–380.

Lai, T. L. and Robbins, H. (1985), ‘Asymptotically efficient adaptive allocation rules’, Advances in Applied Mathematics 6, 4–22.

Lim, A. and Shanthikumar, J. (2006), ‘Relative entropy, exponential utility, and robust dynamic pricing’, forthcoming, Operations Research.

Lobo, M. S. and Boyd, S. (2003), ‘Pricing and learning with uncertain demand’, working paper, Duke University.

Perakis, G. and Roels, G. (2006), ‘Distribution-free revenue management’, 6th Informs Revenue Management Conference, June 2006, http://www.demingcenter.com.

Polyak, B. T. and Juditsky, A. B. (1990), ‘Acceleration of stochastic approximation by averaging’, SIAM J. Control and Optimization 30, 838–855.

Polyak, B. T. and Tsybakov, A. (1990), ‘Optimal order of accuracy of search algorithms in stochastic optimization’, Problems of Information Transmission 26, 126–133.

Robbins, H. (1952), ‘Some aspects of the sequential design of experiments’, Bull. Amer. Math. Soc. 58, 527–535.

Robbins, H. and Monro, S. (1951), ‘A stochastic approximation method’, Annals of Mathematical Statistics 22, 400–407.

Rusmevichientong, P., Van Roy, B. and Glynn, P. W. (2006), ‘A non-parametric approach to multi-product pricing’, Operations Research 54, 82–98.

Scarf, H. (1959), ‘Bayes solutions of the statistical inventory problem’, Annals of Mathematical Statistics 30, 490–508.

Talluri, K. T. and van Ryzin, G. J. (2005), Theory and Practice of Revenue Management, Springer-Verlag.

van Ryzin, G. and McGill, J. (2000), ‘Revenue management without forecasting or optimization: An adaptive algorithm for determining airline seat protection levels’, Management Science 46, 760–775.
