
Improving Design Preference Prediction Accuracy using Feature Learning

Alex Burnap*
Design Science, University of Michigan
Ann Arbor, Michigan 48109
Email: [email protected]

Yanxin Pan*
Design Science, University of Michigan
Ann Arbor, Michigan 48109
Email: [email protected]

Ye Liu
Computer Science and Engineering, University of Michigan
Ann Arbor, Michigan 48109
Email: [email protected]

Yi Ren
Mechanical Engineering, Arizona State University
Tempe, AZ 85287
Email: [email protected]

Honglak Lee
Computer Science and Engineering, University of Michigan
Ann Arbor, Michigan 48109
Email: [email protected]

Richard Gonzalez
Psychology, University of Michigan
Ann Arbor, Michigan 48109
Email: [email protected]

Panos Y. Papalambros
Mechanical Engineering, University of Michigan
Ann Arbor, Michigan 48109
Email: [email protected]

* Authors contributed equally to this work.

Quantitative preference models are used to predict customer choices among design alternatives by collecting prior purchase data or survey answers. This paper examines how to improve the prediction accuracy of such models without collecting more data or changing the model. We propose to use features as an intermediary between the original customer-linked design variables and the preference model, transforming the original variables into a feature representation that captures the underlying design preference task more effectively. We apply this idea to automobile purchase decisions using three feature learning methods (principal component analysis, low rank and sparse matrix decomposition, and exponential sparse restricted Boltzmann machine), and show that the use of features offers improvement in prediction accuracy using over 1 million real passenger vehicle purchase data. We then show that the interpretation and visualization of these feature representations may be used to help augment data-driven design decisions.

1 Introduction

Much research has been devoted to developing design preference models that predict customer design choices. A common approach is to: (i) collect a large database of previous purchases that includes customer data (e.g., age, gender, income) and purchased product design data (e.g., # cylinders, length, curb weight for an automobile); and (ii) statistically infer a design preference model that links customer and product variables, using conjoint analysis or discrete choice analysis such as logit, mixed logit, and nested logit models [1, 2].

However, a customer may not purchase a vehicle solely due to interactions between these two sets of variables, e.g., a 50-year-old male prefers 6-cylinder engines. Instead, a customer may purchase a product for more 'meaningful' design attributes that are functions of the original variables, such as environmental sustainability or sportiness [3, 4]. These meaningful intermediate functions of the original variables, both of the customer and of the design, are hereafter termed features. We posit that using customer and product features, instead of just the original customer and product variables, may increase the prediction accuracy of the design preference model.

Our goal then is to find features that improve this preference prediction accuracy. To this end, one common approach is to ask design and marketing domain experts to choose these features intuitively, such as a design's social context [5] and visual design interactions [6]. For example, eco-friendly vehicles may be a function of miles per gallon (MPG) and emissions, whereas environmentally active customers may be a function of age, income, and geographic region. An alternative explored in this paper is to find features 'automatically' using feature learning methods studied in computer science and statistics.


Fig. 1. The concept of feature learning as an intermediate mapping between variables and a preference model. The diagram on top depicts conventional design preference modeling (e.g., conjoint analysis), where an inferred preference model discriminates between alternative design choices for a given customer. The diagram on the bottom depicts the use of features as an intermediate modeling task.

As shown in Figure 1, feature learning methods create an intermediate step between the original data and the design preference model by forming a more efficient "feature representation" of the original data. Certain well-known methods such as principal component analysis may be viewed similarly, but more recent feature learning methods have shown impressive results in 1D waveform prediction [7] and 2D image object recognition [8].

We conduct an experiment on automobile purchasing preferences to assess whether three feature learning methods increase design preference prediction accuracy: (1) principal component analysis, (2) low-rank + sparse matrix decomposition, and (3) exponential family sparse restricted Boltzmann machines [9, 10]. We cast preference prediction as a binary classification task by asking the question, "given customer x, do they purchase vehicle p or vehicle q?" Our data set comprises 1,161,056 data points generated from 5582 real passenger vehicle purchases in the United States during model year 2006 (MY2006).

The first contribution of this work is an increase in preference prediction accuracy of 2%-7% using simple "single-layer" feature learning methods, as compared with the original data representation. These results suggest features indeed better represent the customer's underlying design preferences, thus offering deeper insight to inform decisions during the design process. Moreover, this finding is complementary to recent work in crowdsourced data gathering [11, 12] and nonlinear preference modeling [13, 14], since feature learning does not affect the preference model or the data set itself.

The second contribution of this work is to show how features may be used in the design process. We show that feature interpretation and feature visualization offer designers additional tools for augmenting design decisions. First, we interpret the most influential pairings of vehicle features and customer features for the preference task, and contrast this with the same analysis using the original variable representation. Second, we visualize the theoretically optimal vehicle for a given customer within the learned feature representation, and show how this optimal vehicle, which does not exist, may be used to suggest design improvements upon current models of vehicles that do exist in the market.

Methodological contributions include being the first to use recent feature learning methods on heterogeneous design and marketing data. Recent feature learning research has focused on homogeneous data, in which all variables are real-valued numbers such as pixel values for image recognition [8, 15]; in contrast, we explicitly model the heterogeneous distribution of the input variables, for example 'age' being a real-valued variable and 'General Motors' being a categorical variable. Subsequently, we give a number of theoretical extensions: First, we use exponential family generalizations for the sparse restricted Boltzmann machines, enabling explicit modeling of statistical distributions for heterogeneous data. Second, we derive theoretical bounds on the reconstruction error of the low-rank + sparse matrix decomposition feature learning method.

This paper is structured as follows: Section 2 discusses efforts to increase prediction accuracy by the design community, as well as feature learning advances in the machine learning community. Section 3 sets up the preference prediction task as a binary classification problem. Section 4 details three feature learning methods and their extension to suit heterogeneous design and market data. Section 5 details the experimental setup of the preference prediction task, followed by results showing improvement of preference prediction accuracy. Section 6 details how features may be used to inform design decisions through feature interpretation and feature visualization. Section 7 concludes this work.

2 Background and Related Work

Design preference modeling has been investigated in design for market systems, where quantitative engineering and marketing models are linked to improve enterprise-wide decision making [16-18]. In such frameworks, the design preference model is used to aggregate input across multiple stakeholders, with special importance on the eventual customer within the targeted market segment [19].

These design preference models have been shown to be especially useful for the design of passenger vehicles, as demonstrated across a variety of applications such as engine design [20], vehicle packaging [21], brand recognition [22], and vehicle styling [3, 6, 23]. Connecting many of these research efforts is the desire for improved prediction accuracy of the underlying design preference model. With increased prediction accuracy, measured using "held out" portions of the data, greater confidence may be placed in the fidelity of the resulting design conclusions.

Efforts to improve prediction accuracy involve: (i) developing more complex statistical models to capture the heterogeneous and stochastic nature of customer preferences; examples include mixed and nested logit models [1, 2], consideration sets [24], and kernel-based methods [13, 14, 25]; and (ii) creating adaptive questionnaires to obtain stated information more efficiently using a variety of active learning methods [26, 27].

This work is different from (i) above in that the set of features learned is agnostic of the particular preference model used. One can just as easily switch out the l2 logit design preference model used in this paper for another model, whether it be mixed logit or a kernel machine. This work is also different from (ii) in that we are working with a set of revealed data on actual vehicle purchases, rather than eliciting this data through a survey. Accordingly, this work is among recent efforts towards data-driven approaches in design [28], including design analytics [29] and design informatics [30], in that we are directly using data to augment existing modeling techniques and ultimately suggest actionable design decisions.

2.1 Feature learning

Feature learning methods capture statistical dependencies implicit in the original variables by "encoding" the original variables in a new feature representation. This representation keeps the number of data points the same while changing the length of each data point from M variables to K features. The idea is to minimize an objective function defining the reconstruction error between the original variables and their new feature representation. If this representation is more meaningful for the discriminative design preference prediction task, we can use the same supervised model (e.g., logit model) as before to achieve higher predictive performance. More details are given in Section 4.

The first feature learning method we examined is principal component analysis (PCA). While not conventionally referred to as a feature learning method, PCA is chosen for its ubiquitous use and its qualitative difference from the other two methods. In particular, PCA makes the strong assumption that the data is Gaussian noise distributed around a linear subspace of the original variables, with the goal of learning the eigenvectors spanning this subspace [31]. The features in our case are the coefficients of the original variables when projected onto this subspace or, equivalently, the inner product with the learned eigenvectors.

The second feature learning method is low-rank + sparse matrix decomposition (LSD). This method is chosen as it defines the features implicitly within the preference model. In particular, LSD decomposes the "part-worth" coefficients contained in the design preference model (e.g., conjoint analysis or discrete choice analysis) into a low-rank matrix plus a sparse matrix. This additive decomposition is motivated by results from the marketing literature suggesting certain purchase considerations are linearly additive [32], and thus well captured by decomposed matrices [33]. An additional motivation for a linear decomposition model is the desire for interpretability [34]. Predictive consumer marketing oftentimes uses these learned coefficients to work hand-in-hand with engineering design to generate competitive products or services [35]. Such advantages are bolstered by the separation of factors captured by matrix decomposition, as separation may lead to better capture of heterogeneity among market segments [36]. Readers are referred to [37] for further in-depth discussion.

The third feature learning method is the exponential family sparse restricted Boltzmann machine (RBM) [9, 38]. This method is chosen as it explicitly represents the features, in contrast with the LSD. The method is a special case of a Boltzmann machine, an undirected graphical model in which the energy associated with each state in an energy state space defines the probability of finding the system in that state [9]. In the RBM, each state is determined by both visible and hidden nodes, where each node corresponds to a random variable. The visible nodes are the original variables, while the hidden nodes are the feature representation. The "restricted" portion of the RBM refers to the restriction on visible-visible connections and hidden-hidden connections, later detailed and depicted in Section 4 and Figure 4, respectively.

All three feature learning methods are considered "simple" in that they are single-layer models. The aforementioned results in 1D waveform speech recognition and 2D image object recognition have been achieved using hierarchical models, built by stacking multiple single-layer models. We chose single-layer feature learning methods here as an initial effort and to explore parameter settings more easily; as earlier noted, there is limited work on feature learning methods for heterogeneous data (e.g., categorical variables) and most advances are currently only on homogeneous data (e.g., real-valued 2D image pixels).

3 Preference Prediction as Binary Classification

We cast the task of predicting a customer's design preferences as a binary classification problem: Given customer j, represented by a vector of heterogeneous customer variables x_c^{(j)}, as well as two passenger vehicle designs p and q, each represented by a vector of heterogeneous vehicle design variables x_d^{(p)} and x_d^{(q)}, which passenger vehicle will the customer purchase? We use a real data set of customers and their passenger vehicle purchase decisions as detailed below [39].

3.1 Customer and vehicle purchase data from 2006

The data used in this work combines the Maritz vehicle purchase survey from 2006 [39], the Chrome vehicle variable database [40], and the 2006 estimated U.S. state income and living cost data from the U.S. Census Bureau [41] to create a data set with both customer and passenger vehicle variables. These combined data result in a matrix of purchase records, with each row corresponding to a separate customer and purchased vehicle pair, and each column corresponding to a variable describing the customer (e.g., age, gender, income) or the purchased vehicle (e.g., # cylinders, length, curbweight).

From this original data set, we focus only on the customer group who bought passenger vehicles of size classes between mini-compact and large vehicles, thus excluding data for station wagons, trucks, minivans, and utility vehicles. In addition, purchase data for customers who did not consider other vehicles before their purchases were removed, as well as data for customers who purchased vehicles for another party.

The resulting database contained 209 unique passenger vehicle models bought by 5582 unique customers. The full list of customer variables and passenger vehicle variables can be found in Tables 1 and 2. The variables in these tables are grouped into three unit types (real, binary, and categorical) based on the nature of the variables.

Table 1. Customer variables x_c and their variable types

  Customer Variable             Type
  Age                           Real
  Number of House Members       Real
  Number of Small Children      Real
  Number of Med. Children       Real
  Number of Large Children      Real
  Number of Children            Real
  U.S. State Average Income     Real
  U.S. State Cost of Living     Real
  Gender                        Binary
  Income Bracket                Categorical
  House Region                  Categorical
  Education Level               Categorical
  U.S. State                    Categorical

Table 2. Design variables x_d and their variable types

  Design Variable               Type
  Invoice                       Real
  MSRP                          Real
  Curbweight                    Real
  Horsepower                    Real
  MPG (Combined)                Real
  Length                        Real
  Width                         Real
  Height                        Real
  Wheelbase                     Real
  Final Drive                   Real
  Diesel                        Binary
  AWD/4WD                       Binary
  Automatic Transmission        Binary
  Turbocharger                  Binary
  Supercharger                  Binary
  Hybrid                        Binary
  Luxury                        Binary
  Vehicle Class                 Categorical
  Manufacturer                  Categorical
  Passenger Capacity            Categorical
  Engine Size                   Categorical

3.2 Choice set training, validation, and testing split

We converted the data set of 5582 passenger vehicle purchases into a binary choice set by generating all pairwise comparisons between the purchased vehicle and the other 208 vehicles in the data set for all 5582 customers. This resulted in N = 1,161,056 data points, where each datum indexed by n consisted of a triplet (j, p, q) of a customer indexed by j and two passenger vehicles indexed by p and q, as well as a corresponding indicator variable y^{(n)} ∈ {0, 1} describing which of the two vehicles was purchased.

This full data set was then randomly shuffled and split into training, validation, and testing sets. As previous studies have shown the impact on prediction performance of different choice set generation schemes [42], we created 10 random shufflings and subsequent data splits of our data set, and ran the design preference prediction experimental procedure of Section 5 on each one independently. This work is therefore complementary to studies on developing appropriate choice set generation schemes such as [43]. Full details of the data processing procedure are given in Section 5.
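To make the choice-set construction concrete, the following is a minimal sketch (not the authors' code) of generating the pairwise comparisons described above; the data frames, column names, and the random assignment of which alternative is listed first are all hypothetical assumptions of this sketch.

```python
# Minimal sketch of pairwise choice-set construction; data layout is hypothetical.
import numpy as np
import pandas as pd

def build_choice_set(purchases: pd.DataFrame, vehicles: pd.DataFrame, seed: int = 0):
    """Pair each purchased vehicle p against every other vehicle q in the data set.

    purchases: one row per customer, with customer variables and a 'vehicle_id' column.
    vehicles:  one row per vehicle model, indexed by vehicle id.
    Returns a list of (customer_row, first_id, second_id, y) tuples with y in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    all_ids = vehicles.index.to_numpy()
    data = []
    for _, row in purchases.iterrows():
        p = row["vehicle_id"]
        for q in all_ids:
            if q == p:
                continue
            # Randomize which alternative is presented first so y is not always 1
            # (an assumption; the paper only states that y indicates the purchase).
            if rng.random() < 0.5:
                data.append((row, p, q, 1))   # first alternative was purchased
            else:
                data.append((row, q, p, 0))   # second alternative was purchased
    return data

# With 5582 customers and 209 vehicles this yields 5582 * 208 = 1,161,056 comparisons.
```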

3.3 Bilinear design preference utility

We adopt the conventions of utility theory for the measure of customer preference over a given product [44]. Formally, each data point consists of a pairwise comparison between vehicles p and q for customer j, with corresponding customer variables x_c^{(j)} for j ∈ {1, ..., 5582} and original variables of the two vehicle designs, x_d^{(p)} and x_d^{(q)} for p, q ∈ {1, ..., 209}. We assume a bilinear utility model for customer j and vehicle p:

U_{jp} = \big[ \mathrm{vec}\big( x_c^{(j)} \otimes x_d^{(p)} \big)^T , \; \big( x_d^{(p)} \big)^T \big] \, \omega,    (1)

where ⊗ is the outer product for vectors, vec(·) is the vectorization of a matrix, [·, ·] is the concatenation of vectors, and ω is the part-worth vector.
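As an illustration of Eq. (1), the following is a minimal sketch of computing the bilinear utility for one customer-vehicle pair; the toy arrays x_c, x_d, and omega are hypothetical placeholders rather than the paper's data.

```python
# Minimal sketch of the bilinear utility in Eq. (1) with hypothetical toy arrays.
import numpy as np

def bilinear_utility(x_c: np.ndarray, x_d: np.ndarray, omega: np.ndarray) -> float:
    """U = [vec(x_c outer x_d)^T, x_d^T] . omega"""
    interaction = np.outer(x_c, x_d).ravel()   # vec(x_c ⊗ x_d)
    psi = np.concatenate([interaction, x_d])   # concatenation [·, ·]
    return float(psi @ omega)

# Toy dimensions: 3 customer variables and 2 design variables,
# so omega has 3*2 + 2 = 8 part-worth coefficients.
x_c = np.array([1.0, 0.5, -0.2])
x_d = np.array([0.3, 1.2])
omega = np.ones(x_c.size * x_d.size + x_d.size)
print(bilinear_utility(x_c, x_d, omega))
```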

3.4 Design preference model

The preference model refers to the assumed relationship between the bilinear utility model described in Section 3.3 and a label indicating which of the two vehicles the customer actually purchased. While the choice of preference model is not the focus of this paper, we pilot-tested popular models including the l1 and l2 logit model, naive Bayes, l1 and l2 linear as well as kernelized support vector machines, and random forests.

Based on these pilot results, we chose the l2 logit model due to its widespread use in the design and marketing communities [37, 45]; in particular, we used the primal form of the logit model. Equation (2) captures how the logit model describes the probabilistic relationship between customer j's preference for either vehicle p or vehicle q as a function of their associated utilities given by Equation (1). Note that ε are Gumbel-distributed random variables accounting for noise over the underlying utility of customer j's preference for either vehicle p or vehicle q.

P^{(n)} = P^{(j,p,q)} = P\big( U_{jp} + \varepsilon_{jp} > U_{jq} + \varepsilon_{jq} \big) = \frac{ e^{U_{jp}} }{ e^{U_{jp}} + e^{U_{jq}} }    (2)


Fig. 2. The concept of principal component analysis, shown using an example in which a data point represented by three original variables x is projected onto a two-dimensional subspace spanned by w to obtain features h.

Parameter Estimation

We estimate the parameters of the logit model in Eq. (2) using conventional convex loss function minimization, namely the log-loss regularized with the l2 norm:

\min_{\omega} \; -\frac{1}{N} \sum_{n=1}^{N} \Big( y^{(n)} \log P^{(n)} + \big(1 - y^{(n)}\big) \log \big(1 - P^{(n)}\big) \Big) + \alpha \, \| \omega \|_2^2    (3)

where y^{(n)} = y^{(jpq)} is 1 if customer j chose vehicle p to purchase, and 0 if vehicle q was purchased; and α is the l2 regularization hyperparameter. The optimization algorithm used to minimize this regularized loss function was stochastic gradient descent, with details of hyperparameter settings given in Section 5.
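The following is a minimal sketch, under stated assumptions, of fitting the l2-regularized logit model of Eqs. (2)-(3) with stochastic gradient descent; X and y are hypothetical arrays holding one utility-difference representation and purchase label per comparison, and the learning-rate schedule is simplified to a constant.

```python
# Minimal NumPy sketch of estimating omega by minimizing the l2-regularized
# log-loss in Eq. (3) with stochastic gradient descent. X and y are placeholders.
import numpy as np

def fit_logit_sgd(X, y, alpha=1e-4, lr=0.01, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    omega = np.zeros(D)
    for _ in range(epochs):
        for n in rng.permutation(N):
            p = 1.0 / (1.0 + np.exp(-X[n] @ omega))        # P^(n) from Eq. (2)
            grad = (p - y[n]) * X[n] + 2 * alpha * omega   # per-sample log-loss + l2 gradient
            omega -= lr * grad
    return omega

def predict(X, omega):
    # Choose vehicle p whenever the modeled utility difference U_jp - U_jq is positive.
    return (X @ omega > 0).astype(int)
```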

4 Feature Learning

We present three qualitatively different feature learning methods as introduced in Section 2: (1) principal component analysis, (2) low-rank + sparse matrix decomposition, and (3) exponential family sparse restricted Boltzmann machine. Furthermore, we discuss their extensions to better suit the market data described in Section 3, as well as the derivation of theoretical guarantees.

4.1 Principal Component Analysis

Principal component analysis (PCA) maps the original data representation x = [x_1, x_2, ..., x_M]^T ∈ R^{M×1} to a new feature representation h = [h_1, h_2, ..., h_K]^T ∈ R^{K×1}, K ≤ M, with an orthogonal transformation W ∈ R^{M×K}. Assume that the original data representation x has zero empirical mean (otherwise we simply subtract the empirical mean from x). The mapping is given by:

h = W^T x    (4)

The PCA representation has the following properties: (1) h_1 has the largest variance, and the variance of h_i is not smaller than the variance of h_j for all j > i; (2) the columns of W are orthogonal unit vectors; and (3) h and W minimize the reconstruction error ε:

\varepsilon = \| x - W h \|_2^2    (5)

When the K columns of W consist of the first K eigenvectors of the empirical covariance matrix of the data, the above properties are all satisfied, and the PCA feature representation can be calculated by Equation (4). Since PCA is a projection onto a subspace, the features h in this case are not "higher order" functions of the original variables, but rather a linear mapping from the original variables to a strictly smaller number of linear coefficients over the eigenvectors.
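A minimal sketch of the PCA mapping in Eqs. (4)-(5), assuming a hypothetical zero-mean data matrix X; it uses the SVD of the data matrix rather than an explicit eigendecomposition of the covariance.

```python
# Minimal sketch of extracting K PCA features; X is a hypothetical N x M data matrix.
import numpy as np

def pca_features(X: np.ndarray, K: int):
    # Leading eigenvectors of the empirical covariance via SVD of the centered data.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:K].T                 # M x K matrix of leading eigenvectors
    H = X @ W                    # N x K feature representation, h = W^T x per row
    return H, W

X = np.random.randn(1000, 25)
X -= X.mean(axis=0)              # zero empirical mean, as assumed above
H, W = pca_features(X, K=10)
reconstruction_error = np.linalg.norm(X - H @ W.T) ** 2   # Eq. (5), summed over the data
```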

4.2 Low-Rank + Sparse Matrix Decomposition

The utility model U_{jp} given in Equation (1) can be rewritten in matrix form, in which Ω is a matrix reshaped from the "part-worth" coefficient vector ω:

U_{jp} = \big[ \big( x_c^{(j)} \big)^T , 1 \big] \, \Omega \, x_d^{(p)}    (6)

The decomposition of the original part-worth coefficients into a low-rank matrix and a sparse matrix may better represent customer purchase decisions than the large coefficient matrix of all pairwise interactions given in Equation (1), as detailed in Section 2. Accordingly, we decompose Ω into a low-rank matrix L of rank r superimposed with a sparse matrix S, i.e., Ω = L + S. In the general case, this problem may be posed exactly as the following optimization problem:

\min_{L,S} \; l(L, S; X_c, X_d, y)    (7)
\text{s.t.} \quad \mathrm{rank}(L) \le r, \quad S \in C

where X_c and X_d are the full sets of customer and vehicle data, y is the vector indicating whether customer j chose vehicle p or vehicle q, and l(·) is the log-loss without the l2 norm,

l(L, S; X_c, X_d, y) = -\frac{1}{N} \sum_{n=1}^{N} \Big( y^{(n)} \log P^{(n)} + \big(1 - y^{(n)}\big) \log \big(1 - P^{(n)}\big) \Big)    (8)

and C is a convex set corresponding to the sparse matrix S. As this problem is intractable (NP-hard), we instead learn this decomposition of matrices using an approximation obtained via regularized loss function minimization:

\min_{L,S} \; l(L, S; X_c, X_d, y) + \lambda_1 \| L \|_* + \lambda_2 \| S \|_1    (9)


Fig. 3. The concept of low-rank + sparse matrix decomposition, using an example "part-worth coefficients" matrix of size 10 × 10 decomposed into two 10 × 10 matrices with low-rank or sparse structure. Lighter colors represent larger values of elements in each decomposed matrix.

where ||·||_* is the nuclear norm, which promotes low-rank structure, and ||·||_1 is the l1 norm.

A number of low-rank regularizations may be used in Eq. (9), e.g., the trace norm and the log-determinant norm [46]. We choose the nuclear norm as it may be applied to any general matrix, while the trace norm and log-determinant regularization are limited to positive semidefinite matrices. Moreover, the nuclear norm is often considered optimal since ||L||_* is the convex envelope of rank(L), implying that ||L||_* is the largest convex function smaller than rank(L) [46].

Definition 1. For a matrix L, the nuclear norm is defined as

\| L \|_* := \sum_{i=1}^{\min(\dim(L))} s_i(L)

where s_i(L) is a singular value of L.

4.2.1 Parameter Estimation

The non-differentiability of the convex low-rank + sparse approximation given in Eq. (9) necessitates optimization techniques such as augmented Lagrangian methods [47], semidefinite programming [48], and proximal methods [49]. Due to theoretical guarantees on convergence, we choose to train our model using proximal methods, which are defined as follows.

Definition 2. Let f : R^n → R ∪ {+∞} be a closed proper convex function. The proximal operator of f is defined as

\mathrm{prox}_f(v) = \arg\min_x \Big( f(x) + \frac{1}{2} \| v - x \|_2^2 \Big)

With these preliminaries, we now detail the proximal gradient algorithm used to solve Eq. (9) using low-rank and l1 proximal operators. Denote f(·) = ||·||_*, with proximal operator prox_f; similarly, denote the proximal operator for the l1 regularization term by prox_S. Details of calculating prox_f and prox_S may be found in Appendix A. With this notation, the proximal optimization algorithm to solve Equation (9) is given by Algorithm 1.

Algorithm 1 Low-Rank + Sparse Matrix Decomposition
Input: Data X_c, X_d, y
Initialize L_0 = 0, S_0 = 0
repeat
    L_{t+1} = prox_f( L_t - η_t ∇_{L_t} l(L, S; X_c, X_d, y) )
    S_{t+1} = prox_S( S_t - η_t ∇_{S_t} l(L, S; X_c, X_d, y) )
until L_t and S_t have converged

Moreover, this algorithm is guaranteed to converge with a constant step size, as given by the following lemma [49].

Lemma 1. Convergence Property
When ∇l is Lipschitz continuous with constant ρ, this method can be shown to converge with rate O(1/k) when a fixed step size η_t = η ∈ (0, 1/ρ] is used. If ρ is not known, the step sizes η_t can be found by a line search; that is, their values are chosen in each step.
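The two proximal operators used in Algorithm 1 have standard closed forms: singular-value soft-thresholding for the nuclear norm and element-wise soft-thresholding for the l1 term. The following is a minimal sketch of one proximal gradient iteration; the gradients grad_L and grad_S, the step size eta, and the penalties are hypothetical inputs, not the paper's implementation.

```python
# Minimal sketch of the proximal operators and one step of Algorithm 1.
import numpy as np

def prox_nuclear(A: np.ndarray, tau: float) -> np.ndarray:
    """prox of tau*||.||_* : shrink the singular values of A by tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_l1(A: np.ndarray, tau: float) -> np.ndarray:
    """prox of tau*||.||_1 : element-wise soft-thresholding."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def proximal_step(L, S, grad_L, grad_S, eta, lambda1, lambda2):
    """One iteration of the proximal gradient updates in Algorithm 1."""
    L_next = prox_nuclear(L - eta * grad_L, eta * lambda1)
    S_next = prox_l1(S - eta * grad_S, eta * lambda2)
    return L_next, S_next
```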

4.2.2 Error Bound on Low-Rank + Sparse Estimation

We additionally prove a variational bound that guarantees this parameter estimation method converges to a unique solution with bounded error, as given by the following theorem.

Theorem 1. Error Bound on Low-Rank + Sparse Estimation

|\Delta l| \le \min(\dim(L_0)) \, \| L^* - L_0 \|_2

where L^* is the optimum of problem (9) and L_0 is the matrix minimizing the loss function l(·).

The proof of this theorem is given in Appendix A.

4.3 Restricted Boltzmann machine

The restricted Boltzmann machine (RBM) is an energy-based model in which an energy state is defined by a layer of M visible nodes corresponding to the original variables x and a layer of K features denoted as h. The energy for a given pair of original variables and features determines the probability associated with finding the system in that state; like nature, systems tend to states that minimize their energy and thus maximize their probability. Accordingly, maximizing the likelihood of the observed data x^{(1)}, ..., x^{(N)} ∈ R^M and its corresponding feature representation h^{(1)}, ..., h^{(N)} ∈ R^K is a matter of finding the set of parameters that minimizes the energy for all observed data.

While traditionally this likelihood consists of binary variables and binary features, as described in Table 1 and Table 2, our passenger vehicle purchase data set consists of M_G Gaussian variables, M_B binary variables, and M_C categorical variables. We accordingly define three corresponding energy functions E_G, E_B, and E_C, in which each energy function connects the original variables and features via a weight matrix W, as well as biases b and a for each original variable and feature, respectively.


Fig. 4. The concept of the exponential family sparse restricted Boltzmann machine. The original data are represented by nodes in the visible layer [x_1, x_2], while the feature representation of the same data is represented by nodes in the hidden layer [h_1, h_2, h_3, h_4]. Undirected edges are restricted to being only between the visible layer and the hidden layer, thus enforcing conditional independence between nodes in the same layer.

Real-valued random variables (e.g., vehicle curb weight) are modeled using the Gaussian density. The energy function for Gaussian inputs and binary hidden nodes is:

E_G(x, h; \theta) = -\sum_{m=1}^{M_G} \sum_{k=1}^{K} h_k w_{km} x_m + \frac{1}{2} \sum_{m=1}^{M_G} (x_m - b_m)^2 - \sum_{k=1}^{K} a_k h_k    (10)

where the variance term is clamped to unity under the assumption that the input data are standardized.

Binary random variables (e.g., gender) are modeled using the Bernoulli density. The energy function for Bernoulli nodes in both the input layer and hidden layer is:

E_B(x, h; \theta) = -\sum_{m=1}^{M_B} \sum_{k=1}^{K} h_k w_{km} x_m - \sum_{m=1}^{M_B} x_m b_m - \sum_{k=1}^{K} a_k h_k    (11)

Categorical random variables (e.g., vehicle manufacturer) are modeled using the categorical density. The energy function for categorical inputs with Z_m classes for the m-th categorical input variable (e.g., Toyota, General Motors, etc.) is given by:

E_C(x, h; \theta) = -\sum_{m=1}^{M_C} \sum_{k=1}^{K} \sum_{z=1}^{Z_m} h_k w_{kmz} \delta_{mz} x_{mz} - \sum_{m=1}^{M_C} \sum_{z=1}^{Z_m} \delta_{mz} x_{mz} b_{mz} - \sum_{k=1}^{K} a_k h_k    (12)

where δ_{mz} = 1 if x_{mz} = 1 and 0 otherwise.

Given these energy functions for the heterogeneous original variables, the probability of a state with energy E(x, h; θ) = E_G(x, h; θ) + E_B(x, h; θ) + E_C(x, h; θ), in which θ = {W, a, b} are the energy function weights and bias parameters, is defined by the Boltzmann distribution:

P(x, h) = \frac{ e^{-E(x,h;\theta)} }{ \sum_x \sum_h e^{-E(x,h;\theta)} }    (13)

The "restriction" on the RBM is to disallow visible-visible and hidden-hidden node connections. This restriction results in conditional independence of each individual hidden unit h_k given the vector of inputs x, and of each visible unit x_m given the vector of hidden units h:

P(h | x) = \prod_{k=1}^{K} P(h_k | x)    (14)

P(x | h) = \prod_{m=1}^{M} P(x_m | h)    (15)

The conditional density for a single binary hidden unit h_k given the combined M_G Gaussian, M_B binary, and M_C categorical input variables is then:

P(h_k = 1 | x) = \sigma\Big( a_k + \sum_{m=1}^{M_G} w_{km} x_m + \sum_{m=1}^{M_B} w_{km} x_m + \sum_{m=1}^{M_C} \sum_{z=1}^{Z_m} w_{kmz} \delta_{mz} x_{mz} \Big)    (16)

where σ(s) = 1 / (1 + exp(-s)) is the sigmoid function.

For an input data point x^{(n)}, its corresponding feature representation h^{(n)} is given by sampling the "activations" of the hidden nodes:

\big[ P(h_1 = 1 | x^{(n)}, \theta) , \; \ldots , \; P(h_K = 1 | x^{(n)}, \theta) \big]    (17)
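A minimal sketch of Eqs. (16)-(17), computing the hidden-unit activation probabilities for one heterogeneous input; the weight blocks W_g, W_b, W_c, the bias vector a, and the one-hot coding of the categorical inputs are assumptions of this sketch.

```python
# Minimal sketch of the hidden-unit activations for one heterogeneous input vector.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def hidden_activations(x_gauss, x_binary, x_onehot, W_g, W_b, W_c, a):
    """Return [P(h_1=1|x), ..., P(h_K=1|x)].

    W_g: K x M_G, W_b: K x M_B, W_c: K x (sum of Z_m) weight blocks; a: K hidden biases.
    x_onehot stacks the one-hot codes of all categorical variables.
    """
    pre_activation = a + W_g @ x_gauss + W_b @ x_binary + W_c @ x_onehot
    return sigmoid(pre_activation)

# The feature representation h for a datum can be taken as these probabilities
# or as a Bernoulli sample drawn from them.
```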

Parameter Estimation

To train the model, we optimize the weight and bias parameters θ = {W, b, a} by minimizing the negative log-likelihood of the data x^{(1)}, ..., x^{(N)} using gradient descent. The gradient of the negative log-likelihood is:

-\frac{\partial}{\partial\theta} \sum_{n=1}^{N} \log P\big( x^{(n)} \big)
  = -\frac{\partial}{\partial\theta} \sum_{n=1}^{N} \log \sum_{h} P\big( x^{(n)}, h \big)
  = -\frac{\partial}{\partial\theta} \sum_{n=1}^{N} \log \frac{ \sum_{h} e^{-E(x^{(n)},h)} }{ \sum_{x,h} e^{-E(x,h)} }
  = \sum_{n=1}^{N} \Big( \mathbb{E}_{h|x^{(n)}}\Big[ \frac{\partial}{\partial\theta} E\big(x^{(n)},h\big) \Big] - \mathbb{E}_{h,x}\Big[ \frac{\partial}{\partial\theta} E(x,h) \Big] \Big)    (18)

The gradient is the difference of two expectations, the first of which is easy to compute since it is "clamped" at the input datum x, but the second of which requires the joint density over the entire x space for the model.

In practice, this second expectation is approximated using the contrastive divergence algorithm: by Gibbs sampling, the hidden nodes are sampled given the visible nodes, then the visible nodes given the hidden nodes, iterating a sufficient number of steps for the approximation [50]. During training, we induce sparsity of the hidden layer by setting a target activation β_k, fixed to 0.1, for each hidden unit h_k [38]. The overall objective to be minimized is then the negative log-likelihood from Equation (18) plus a penalty on the deviation of the hidden layer from the target activation. Since the hidden layer is made up of sigmoid densities, the overall objective function is:

-\sum_{n=1}^{N} \log \sum_{h} P\big( x^{(n)}, h \big) \; - \; \lambda_3 \sum_{n=1}^{N} \sum_{k=1}^{K} \Big( \beta_k \log h_k^{(n)} + \big(1 - \beta_k\big) \log\big(1 - h_k^{(n)}\big) \Big),    (19)

where h_k^{(n)} = P(h_k = 1 | x^{(n)}, θ) and λ_3 is the hyperparameter trading off the sparsity penalty with the log-likelihood.
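For concreteness, the following is a minimal sketch of one contrastive-divergence (CD-1) parameter update; it is simplified to a Bernoulli-Bernoulli RBM and omits the Gaussian and categorical energy blocks as well as the sparsity term, so it illustrates the training loop rather than the paper's exact model.

```python
# Minimal sketch of one CD-1 update for a Bernoulli-Bernoulli RBM;
# W, a, b, and the minibatch X are hypothetical placeholders.
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def cd1_update(X, W, a, b, lr=0.01, rng=np.random.default_rng(0)):
    """X: N x M minibatch of binary visibles; W: K x M; a: K hidden biases; b: M visible biases."""
    # Positive phase: hidden probabilities "clamped" to the data.
    ph_data = sigmoid(X @ W.T + a)                        # N x K
    h_sample = (rng.random(ph_data.shape) < ph_data) * 1.0
    # Negative phase: one Gibbs step (reconstruct visibles, then hiddens).
    pv_recon = sigmoid(h_sample @ W + b)                  # N x M
    ph_recon = sigmoid(pv_recon @ W.T + a)                # N x K
    # Approximate gradient of the negative log-likelihood (Eq. (18)), averaged over the batch.
    N = X.shape[0]
    W += lr * (ph_data.T @ X - ph_recon.T @ pv_recon) / N
    a += lr * (ph_data - ph_recon).mean(axis=0)
    b += lr * (X - pv_recon).mean(axis=0)
    return W, a, b
```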

5 Experiment

The goal of this experiment was to assess how preference prediction accuracy changes when using the same preference model on four different representations of the same data set. The preference model used, as discussed in Section 3.4, was the l2 logit, while the four representations were the original variables, PCA features, low-rank + sparse features, and RBM features. The same experimental procedure was run on each of these representations, where the first acts as a baseline for prediction accuracy, and the remaining three demonstrate the relative gain in preference prediction accuracy when using features.

In addition, we analyzed how the hyperparameters used in the PCA, LSD, and RBM feature learning methods affected design preference prediction accuracy. For PCA, the hyperparameter was the dimensionality K of the subspace spanned by the eigenvectors of the PCA method. For LSD, the hyperparameters were the rank penalty λ_1, which affects the rank of the low-rank matrix L, and the sparsity penalty λ_2, which influences the number of non-zero elements in the sparse matrix S, both found in Equation (9). For RBM, the hyperparameters were the sparsity penalty λ_3, found in Equation (19), which controls the number of features activated for a given input datum, and the overcompleteness factor γ, which defines by what factor the dimensionality of the feature space is larger than the dimensionality of the original variable space.

The detailed experiment flow is summarized below and illustrated in Figure 5:

Fig. 5. Data processing, training, validation, and testing flow.

1. The raw choice data set of pairs of customers and purchased designs, described in Section 3.1, was randomly split 10 times into 70% training, 10% validation, and 20% test sets. This was done at the beginning to ensure no customers in the training sets ever existed in the validation or test sets.

2. Choice sets were generated for each training, validation, and test set for all 10 randomly shuffled splits as described in Section 3.2. This process created a training data set of 832,000 data points, a validation data set of 104,000 data points, and a testing data set of 225,056 data points, for each of the 10 shuffled splits.

3. Feature learning was conducted on the training sets of customer variables and vehicle variables for a vector of 5 different values of K for PCA features, a grid of 25 different pairs of low-rank penalty λ_1 and sparsity penalty λ_2 for the LSD features, and a grid of 56 different pairs of sparsity λ_3 and overcompleteness γ hyperparameters for RBM features. For PCA features, these hyperparameters were K ∈ {30, 50, 70, 100, 150}. For LSD features, these hyperparameters were λ_1 ∈ {0.005, 0.01, 0.05, 0.1, 0.5} and λ_2 ∈ {0.005, 0.01, 0.05, 0.1, 0.5}. For RBM, these hyperparameters were λ_3 ∈ {4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0} and γ ∈ {0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 3.0}. These hyperparameter settings were selected by pilot testing large ranges of parameter settings to find relevant regions for upper and lower hyperparameter bounds, with the number of hyperparameter settings selected based on computational constraints.

4. Each of the validation and testing data sets was encoded using the feature learning methods learned for each of the 5 PCA hyperparameters K, 25 (λ_1, λ_2) LSD hyperparameter pairs, and 56 (λ_3, γ) RBM hyperparameter pairs.

5. The encoded feature data was combined with the original variable data in order to separate linear term effects of the original variables from higher-order effects of the features. While this introduces a degree of information redundancy between features and original variables, the regularization term in Equation (3) mitigates effects of collinearity. Each datum consists of the features concatenated with the original variables, then input into the bilinear utility model. Specifically, for customer features h_c and customer variables x_c, we used h_{c'}^T := [x_c^T, h_c^T] to define the new representation of the customer; likewise, for vehicle features h_d and vehicle variables x_d, we used h_{d'}^T := [x_d^T, h_d^T] to define the new representation of the vehicle. Combined with Equation (1), a single data point used for training is the difference in utilities between vehicle p and vehicle q for a given customer j (a minimal sketch of this construction is given after this list):

\big[ \mathrm{vec}\big( h_{c'}^{(j)} \otimes \big( h_{d'}^{(p)} - h_{d'}^{(q)} \big) \big) , \; h_{d'}^{(p)} - h_{d'}^{(q)} \big]    (20)

Note that the dimensionality of each datum could range above 100,000 dimensions for the largest values of γ.

Table 3. Averaged preference prediction accuracy on held-out test data using the logit model with the original variables or the three feature representations. Averages and standard deviations were calculated from 10 random training and testing splits common to each method, while test parameters for each method were selected via cross validation on the training set.

  Design Preference Model   Feature Representation                   Prediction Accuracy (std. dev.) (p-value)
                                                                     N = 10,000                 N = 1,161,056
  Logit Model               Original Variables (No Features)         69.98% (1.82%) (N/A)       75.29% (0.98%) (N/A)
  Logit Model               Principal Component Analysis             61.69% (1.24%) (1.081e-7)  62.03% (0.89%) (8.22e-10)
  Logit Model               Low-Rank + Sparse Matrix Decomposition   76.59% (0.89%) (3.276e-8)  77.58% (0.81%) (4.286e-8)
  Logit Model               Exponential Family Sparse RBM            74.99% (0.64%) (2.3e-5)    75.15% (0.81%) (0.136)

6. For each of these training sets, 6 logit models were trained in parallel over minibatches of the training data, corresponding to 6 different settings of the l2 regularization parameter α ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0}. These logit models were optimized using stochastic gradient descent, with learning rates inversely related to the number of training examples seen [51].

7. Each logit model was then scored according to its respective held-out validation data set. The hyperparameter settings (α_BASELINE) for the original variables, (K_PCA, α_PCA) for PCA feature learning, (λ_1, λ_2) for LSD feature learning, and (λ_3, γ, α_RBM) for RBM feature learning with the best validation accuracy were saved. For each of these four sets of best hyperparameters, Step 3 was repeated to obtain the set of corresponding features on each of the 10 randomly shuffled training plus validation sets.

8. Logit models corresponding to the baseline, PCA features, LSD features, and RBM features were retrained for each of the 10 randomly shuffled and combined training and validation sets. The prediction accuracy for each of these 10 logit models was assessed on the corresponding "held out" test sets in order to give averages and standard deviations of the design preference prediction accuracy for the baseline, PCA features, LSD features, and RBM features.
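A minimal sketch of the datum construction in Eq. (20) referenced in step 5; all input arrays are hypothetical placeholders for one customer and two vehicle alternatives.

```python
# Minimal sketch of building one training datum as in Eq. (20): concatenate the
# original variables with their learned features, then take the utility-difference
# representation between the two alternatives. All arrays are hypothetical.
import numpy as np

def encode(x, h):
    """New representation: original variables concatenated with learned features."""
    return np.concatenate([x, h])

def choice_datum(x_c, h_c, x_d_p, h_d_p, x_d_q, h_d_q):
    customer = encode(x_c, h_c)                            # h_{c'} in step 5
    diff = encode(x_d_p, h_d_p) - encode(x_d_q, h_d_q)     # h_{d'}^(p) - h_{d'}^(q)
    return np.concatenate([np.outer(customer, diff).ravel(), diff])
```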

5.1 Results

Table 3 shows the averaged test set prediction accuracy of the logit model using the original variables, PCA features, LSD features, and RBM features. Prediction accuracies averaged over 10 random training and held-out testing data splits are given, both for the partial data (N = 10,000) and the full data (N = 1,161,056) cases. Furthermore, we include the standard deviation of the prediction accuracies and a 2-sided t-test relative to the baseline accuracy for each feature representation.

The logit model trained with LSD features achieved the highest predictive accuracy on both the partial and full data sets, at 76.59% and 77.58%, respectively. This gives evidence that using features can improve design preference prediction accuracy, as the logit model using the original variables achieved averaged accuracies of 69.98% and 75.29%, respectively. The improvement in design preference prediction accuracy is greatest for the partial data case, as evidenced by both the LSD and RBM, yet the improvement in the full data case shows that the LSD feature learning method is still able to improve prediction accuracy within the capacity of the logit model. The RBM results for the full data case do not show significant improvement in prediction accuracy. Finally, we note a relative loss in design preference prediction accuracy when using PCA as a feature learning method, both for the partial and full data sets, suggesting the heavy assumptions built into PCA are overly restrictive.

The parameter settings for the LSD feature learning method give additional insight into the preference prediction task. In particular, the optimal settings of λ_1 and λ_2 obtained through cross validation on the 10 random training sets resulted in low-rank matrices with ranks ranging from r = 29 to r = 31. This significantly reduced rank of the part-worth coefficient matrix given in Eq. (1) suggests that the vast majority of interactions between the customer variables and design variables given in Table 1 and Table 2 do not significantly contribute to overall design preferences. This insight allows us to introspect into important feature pairings on a per-customer basis to inform design decisions.

We have shown that even "simple" single-layer feature learning can significantly increase predictive accuracy for design preference modeling. This finding signifies that features capture design preferences more effectively than the original variables, as features form functions of the original variables more representative of the customer's underlying preference task. This offers designers the opportunity for new insights if these features can be successfully interpreted and translated to actionable design decisions; however, given the relatively recent advances in feature learning methods, interpretation and visualization of features remains an open challenge (see Section 6 for further discussion).

Further increases to prediction accuracy might be achieved by stacking multiple feature learning layers, often referred to as "deep learning". Such techniques have recently shown impressive results by breaking previous records in image recognition by large margins [8]. Another possible direction for increasing prediction accuracy may be in developing novel architectures that explicitly capture the conditional statistical structure between customers and designs. These efforts may be further aided through better understanding of the limitations of using feature learning methods for design and marketing research. For example, the large number of parameters associated with feature learning methods results in greater computational cost when performing model selection; in addition to the cross-validation techniques used in this paper, model selection metrics such as BIC and AIC may give further insight along these lines.

6 Using Features for Design

Using features can support the design process in at least two directions: (1) feature interpretation can offer deeper insights into customer preferences than the original variables, and (2) feature visualization can lead to a market segmentation with better clustering than with the original variables. These two directions are still open challenges given the relative nascence of feature learning methods. Further investigation is necessary to realize the above design opportunities and to justify the computational cost and implementation challenges associated with feature learning methods.

The interpretation and visualization methods may be used with conventional linear discrete choice modeling (e.g., logit models). However, deeper insights are possible through interpreting and visualizing features, assuming that features more effectively capture the underlying design preference prediction task of the customer, as shown through improved prediction accuracy on held-out data. Since we are capturing "functions" of the original data, we are more likely to interpret and visualize feature pairings such as an "eco-friendly" vehicle and an "environmentally conscious" customer; such pairings may ultimately lead to actionable design decisions.

6.1 Feature Interpretation of Design Preferences

Similar to PCA, LSD provides an approach to interpret the learned features by looking at linear combinations of the original variables. The major difference between features learned using PCA versus LSD lies in their different linear combinations; in particular, features learned by LSD are more representative as they contain information from both the data distribution and the preference task, while PCA features only contain information from the data distribution.

As introduced in Section 4.2, the weight matrix Ω is decomposed into a low-rank matrix L and a sparse matrix S, i.e., Ω = L + S. The nonzero elements in the sparse matrix S may be interpreted as the weights of the products of their corresponding original design variables and customer variables. As for the low-rank matrix L, features can be extracted by linearly combining the original variables according to the singular value decomposition (SVD) of L. The singular value decomposition is a factorization of the (m+1) × n matrix L in the form L = UΣV, where U is an (m+1) × (m+1) unitary matrix, Σ is an (m+1) × n rectangular diagonal matrix with non-negative real numbers σ_1, σ_2, ..., σ_{min(m+1,n)} on the diagonal, and V is an n × n unitary matrix. Rewriting Equation (6):

U_{jp} = \big[ \big( x_c^{(j)} \big)^T , 1 \big] L \, x_d^{(p)} + \big[ \big( x_c^{(j)} \big)^T , 1 \big] S \, x_d^{(p)}
       = \big[ \big( x_c^{(j)} \big)^T , 1 \big] U \Sigma V \, x_d^{(p)} + \big[ \big( x_c^{(j)} \big)^T , 1 \big] S \, x_d^{(p)}
       = \sum_{i=1}^{\min(m+1,n)} \sigma_i \big[ \big( x_c^{(j)} \big)^T , 1 \big] u_i v_i \, x_d^{(p)} + \big[ \big( x_c^{(j)} \big)^T , 1 \big] S \, x_d^{(p)}    (21)

where u_i is the i-th column of matrix U, and v_i is the i-th row of matrix V. The i-th customer feature [ ( x_c^{(j)} )^T , 1 ] u_i is a linear combination of the original customer variables; the i-th design feature v_i x_d^{(p)} is a linear combination of the original design variables; and σ_i represents the importance of this pair of features for the customer's design preferences.
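A minimal sketch of extracting the most influential customer/design feature pairings from the estimated low-rank matrix L via its SVD, as in Eq. (21); L and the variable orderings are hypothetical placeholders.

```python
# Minimal sketch of ranking feature pairings by singular value; L is hypothetical.
import numpy as np

def top_feature_pairings(L: np.ndarray, top: int = 3):
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    pairings = []
    for i in range(top):
        pairings.append({
            "importance": s[i],             # sigma_i
            "customer_direction": U[:, i],  # loadings over [customer variables, 1]
            "design_direction": Vt[i, :],   # loadings over design variables
        })
    return pairings

# Inspecting the largest-magnitude loadings in each direction gives qualitative
# readings of the kind discussed in the next paragraph.
```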

Interpreting these features in the vehicle preference case study, we found that the most influential feature pairing (i.e., largest σ_i) corresponds to preference trends at the population level: low-price but luxury vehicles are preferred, and Japanese vehicles receive the highest preference while GM vehicles receive the lowest preference. The second most influential feature pairing represents a wealthy customer group, with preferred vehicle groups being both expensive and luxurious. The third most influential feature pairing represents an older customer group, whose preferred vehicles are large but have low net horsepower.

6.2 Feature Visualization of Design Preferences

We now visualize features to understand what insights they offer for design decision making. Specifically, we make early-stage inroads toward visual market segmentation performed in an estimated feature space, thus clustering customers in a representation that better captures their underlying design preference decisions.

We begin by looking at the utility model U_{jp} given in Equation (1) and note that the inner product between Ω and the variables x_c^{(j)} representing customer j may be interpreted as customer j's optimal vehicle, denoted x_{opt}^{(j)}:

x_{opt}^{(j)} = \big( x_c^{(j)} \big)^T \Omega_{out} + \mathbf{1}^T \Omega_{main}    (22)

where Ω_out is the matrix reshaped from the coefficients of ω corresponding to the outer product given in Equation (1), Ω_main is the matrix reshaped from the remaining coefficients, and 1 is a vector consisting of 1's with the same dimension as x_c^{(j)}. We rewrite the utility model U_{jp} given in Equation (1) in terms of the optimal vehicle x_{opt}^{(j)}:

U_{jp} = \big( x_{opt}^{(j)} \big)^T x_d^{(p)}    (23)

According to the geometric meaning of the inner product, the smaller the angle between x_d^{(p)} and x_{opt}^{(j)}, the larger the utility U_{jp}. In this way, we have an interpretable method of improving upon the actual purchased vehicle design in the form of an 'optimal' vehicle vector. This optimal vehicle vector could be useful for a manufacturer developing a next-generation design from a current design, particularly as the manufacturer would target a specific market segment.

Fig. 6. Optimal vehicle distribution visualization. Every point represents the optimal vehicle for one customer. In the left column, the optimal vehicle is inferred using the utility model with the original variables; in the right column, LSD features are used to infer the optimal vehicle. In the first row, the optimal vehicles of SCI-XA customers are marked as large red points; similarly, the optimal vehicles of MAZDA6, ACURA-TL, and INFINM35 customers are marked as large red points, respectively.

We now provide a visual demonstration of using an optimal vehicle derived from feature learning to suggest a design improvement direction. First, we calculate the optimal vehicle using Equation (22) for every customer in the data set. Then, we visualize these optimal vehicle points by reducing their dimension using t-distributed stochastic neighbor embedding (t-SNE), an advanced nonlinear dimension reduction technique that embeds similar objects into nearby points [52]. Finally, optimal vehicles from targeted market segments are marked in red.
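A minimal sketch of this visualization pipeline using scikit-learn's t-SNE implementation; X_c, Omega_out, Omega_main, and segment_mask are hypothetical placeholders for the quantities defined above, and the plotting details are illustrative only.

```python
# Minimal sketch: optimal-vehicle vectors per Eq. (22), embedded with t-SNE,
# with one (hypothetical) target market segment highlighted.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def optimal_vehicles(X_c: np.ndarray, Omega_out: np.ndarray, Omega_main: np.ndarray):
    """One optimal-vehicle vector per customer (rows of X_c), per Eq. (22)."""
    ones = np.ones(X_c.shape[1])
    return X_c @ Omega_out + ones @ Omega_main

def plot_segment(X_opt: np.ndarray, segment_mask: np.ndarray):
    emb = TSNE(n_components=2, random_state=0).fit_transform(X_opt)
    plt.scatter(emb[:, 0], emb[:, 1], s=2, c="gray")                    # all customers
    plt.scatter(emb[segment_mask, 0], emb[segment_mask, 1], s=20, c="red")  # target segment
    plt.show()
```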

Figure 6 shows the optimal vehicles for the SCI-XA, MAZDA6, ACURA-TL, and INFINM35 customer groups using red points. We observe that, when using the LSD features, the optimal vehicle moves from the top-left corner to the bottom-right corner as the purchased vehicles become more luxurious, while the optimal vehicles in the original variable representation show overlap, especially for MAZDA6 and ACURA-TL customers. In other words, we are visualizing what has been shown quantitatively through increased preference prediction accuracy; namely, that optimal vehicles represented using LSD features, as opposed to the original variables, result in a larger separation of the various market segments' optimal vehicles.

The contribution of this demonstration is not the particular introspection on the chosen example with MAZDA6 and ACURA-TL customers. Instead, this demonstration is significant as it suggests it is possible to perform feature-based market segmentation purely using visual analysis. Such visual analysis is likely to be more useful to practicing designers and marketers, as it abstracts away the underlying mathematical mechanics of feature learning.

7 Conclusion

Feature learning is a promising method to improve design preference prediction accuracy without changing the design preference model or the data set. This improvement is obtained by transforming the original variables to a feature space acting as an intermediate step, as shown in Figure 1. Thus, feature learning complements advances in both data gathering and design preference modeling.

We presented three feature learning methods (principal component analysis, low-rank plus sparse matrix decomposition, and sparse exponential family restricted Boltzmann machines) and applied them to a design preference data set consisting of customer and passenger vehicle variables with heterogeneous unit types, e.g., gender, age, # cylinders.

We then conducted an experiment to measure design preference prediction accuracy involving 1,161,056 data points generated from a real purchase data set of 5582 customers. The experiment showed that feature learning methods improve preference prediction accuracy by 2-7% for a partial and the full data set, respectively. This finding is significant, as it shows that features offer a better representation of the customer's underlying design preferences than the original variables. Moreover, the finding shows that feature learning methods may be successfully applied to design and marketing data sets made up of variables with heterogeneous data types; this is a new result, as feature learning methods have primarily been applied to homogeneous data sets made up of variables of the same distribution.

Feature interpretation and visualization offer promise for using features to support the design process. Specifically, interpreting features can give designers deeper insights into the most influential pairings of vehicle features and customer features, while visualization of the feature space can offer deeper insights when performing market segmentation. These new findings suggest opportunities to develop feature learning algorithms that are not only more representative of the customer preference task as measured by prediction accuracy, but also easier to interpret and visualize by a domain expert. Methods allowing easier interpretation of features would be valuable when translating the results of more sophisticated feature learning and preference prediction models into actionable design decisions.

Acknowledgments

An earlier conference version of this work appeared at the 2014 International Design Engineering Technical Conference. This work has been supported by the National Science Foundation under Grant No. CMMI-1266184. This support is gratefully acknowledged. The authors would like to thank Bart Frischknecht and Kevin Bolon for their assistance in coordinating data sets, Clayton Scott for useful suggestions, and Maritz Research Inc. for generously making the use of their data possible.

References

[1] Berkovec, J., and Rust, J., 1985. "A nested logit model of automobile holdings for one vehicle households". Transportation Research Part B: Methodological, 19(4), pp. 275–285.
[2] McFadden, D., and Train, K., 2000. "Mixed MNL models for discrete response". Journal of Applied Econometrics, 15(5), pp. 447–470.
[3] Reid, T. N., Frischknecht, B. D., and Papalambros, P. Y., 2012. "Perceptual attributes in product design: Fuel economy and silhouette-based perceived environmental friendliness tradeoffs in automotive vehicle design". Journal of Mechanical Design, 134, p. 041006.
[4] Norman, D. A., 2007. Emotional Design: Why We Love (or Hate) Everyday Things. Basic Books.
[5] He, L., Wang, M., Chen, W., and Conzelmann, G., 2014. "Incorporating social impact on new product adoption in choice modeling: A case study in green vehicles". Transportation Research Part D: Transport and Environment, 32, pp. 421–434.
[6] Sylcott, B., Michalek, J. J., and Cagan, J., 2013. "Towards understanding the role of interaction effects in visual conjoint analysis". In ASME International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, American Society of Mechanical Engineers, pp. V03AT03A012–V03AT03A012.
[7] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al., 2012. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups". IEEE Signal Processing Magazine, 29(6), pp. 82–97.
[8] Krizhevsky, A., Sutskever, I., and Hinton, G. E., 2012. "ImageNet classification with deep convolutional neural networks". In Advances in Neural Information Processing Systems 25, pp. 1097–1105.
[9] Smolensky, P., 1986. "Information processing in dynamical systems: Foundations of harmony theory". In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA, USA, pp. 194–281.
[10] Salakhutdinov, R., Mnih, A., and Hinton, G., 2007. "Restricted Boltzmann machines for collaborative filtering". In Proceedings of the 24th International Conference on Machine Learning, pp. 791–798.
[11] Burnap, A., Ren, Y., Gerth, R., Papazoglou, G., Gonzalez, R., and Papalambros, P. Y., 2015. "When crowdsourcing fails: A study of expertise on crowdsourced design evaluation". Journal of Mechanical Design, 137(3), p. 031101.
[12] Panchal, J., 2015. "Using crowds in engineering design: Towards a holistic framework". In Proceedings of the 2015 International Conference on Engineering Design, Design Society, pp. 1–10.
[13] Chapelle, O., and Harchaoui, Z., 2004. "A machine learning approach to conjoint analysis". Advances in Neural Information Processing Systems, pp. 257–264.
[14] Evgeniou, T., Pontil, M., and Toubia, O., 2007. "A convex optimization approach to modeling consumer heterogeneity in conjoint estimation". Marketing Science, 26(6), pp. 805–818.
[15] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y., 2011. "Unsupervised learning of hierarchical representations with convolutional deep belief networks". Communications of the Association for Computing Machinery, 54(10), pp. 95–103.
[16] Wassenaar, H. J., and Chen, W., 2003. "An approach to decision-based design with discrete choice analysis for demand modeling". Journal of Mechanical Design, 125, p. 490.
[17] Lewis, K. E., Chen, W., and Schmidt, L. C., 2006. Decision Making in Engineering Design. American Society of Mechanical Engineers.
[18] Michalek, J., Feinberg, F., and Papalambros, P., 2005. "Linking marketing and engineering product design decisions via analytical target cascading". Journal of Product Innovation Management, 22(1), pp. 42–62.
[19] Chen, W., Hoyle, C., and Wassenaar, H. J., 2013. Decision-Based Design. Springer London, London.
[20] Wassenaar, H., Chen, W., Cheng, J., and Sudjianto, A., 2005. "Enhancing discrete choice demand modeling for decision-based design". Journal of Mechanical Design, 127(4), pp. 514–523.
[21] Kumar, D., Hoyle, C., Chen, W., Wang, N., Gomez-Levi, G., and Koppelman, F., 2007. "Incorporating customer preferences and market trends in vehicle packaging design". International Journal of Production Design.
[22] Burnap, A., Hartley, J., Pan, Y., Gonzalez, R., and Papalambros, P. Y., 2015. "Balancing design freedom and brand recognition in the evolution of automotive brand styling". Design Science.
[23] Orsborn, S., Cagan, J., and Boatwright, P., 2009. "Quantifying aesthetic form preference in a utility function". Journal of Mechanical Design, 131(6), p. 061001.
[24] Morrow, W. R., Long, M., and MacDonald, E. F., 2014. "Market-system design optimization with consider-then-choose models". Journal of Mechanical Design, 136(3), p. 031003.
[25] Ren, Y., Burnap, A., Papalambros, P., et al., 2013. "Quantification of perceptual design attributes using a crowd". In DS 75-6: Proceedings of the 19th International Conference on Engineering Design (ICED13), Design for Harmonies, Vol. 6: Design Information and Knowledge, Seoul, Korea, 19-22.08.2013.
[26] Toubia, O., Simester, D. I., Hauser, J. R., and Dahan, E., 2003. "Fast polyhedral adaptive conjoint estimation". Marketing Science, 22(3), pp. 273–303.
[27] Abernethy, J., Evgeniou, T., Toubia, O., and Vert, J.-P., 2008. "Eliciting consumer preferences using robust adaptive choice questionnaires". IEEE Transactions on Knowledge and Data Engineering, 20(2), pp. 145–155.
[28] Tuarob, S., and Tucker, C. S., 2015. "Automated discovery of lead users and latent product features by mining large scale social media networks". Journal of Mechanical Design, 137(7), p. 071402.
[29] Van Horn, D., and Lewis, K., 2015. "The use of analytics in the design of sociotechnical products". Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 29(01), pp. 65–81.
[30] Dym, C. L., Agogino, A. M., Eris, O., Frey, D. D., and Leifer, L. J., 2005. "Engineering design thinking, teaching, and learning". Journal of Engineering Education, 94(1), pp. 103–120.
[31] Friedman, J., Hastie, T., and Tibshirani, R., 2001. The Elements of Statistical Learning, Vol. 1. Springer Series in Statistics, Springer, Berlin.
[32] Gonzalez, R., and Wu, G., 1999. "On the shape of the probability weighting function". Cognitive Psychology, 38(1), pp. 129–166.
[33] Evgeniou, T., Boussios, C., and Zacharia, G., 2005. "Generalized robust conjoint estimation". Marketing Science, 24(3), pp. 415–429.
[34] Hauser, J. R., and Rao, V. R., 2004. "Conjoint analysis, related modeling, and applications". Advances in Marketing Research: Progress and Prospects, pp. 141–68.
[35] Papalambros, P. Y., and Wilde, D. J., 2000. Principles of Optimal Design: Modeling and Computation. Cambridge University Press.
[36] Lenk, P. J., DeSarbo, W. S., Green, P. E., and Young, M. R., 1996. "Hierarchical Bayes conjoint analysis: Recovery of partworth heterogeneity from reduced experimental designs". Marketing Science, 15(2), pp. 173–191.
[37] Netzer, O., Toubia, O., Bradlow, E. T., Dahan, E., Evgeniou, T., Feinberg, F. M., Feit, E. M., Hui, S. K., Johnson, J., Liechty, J. C., Orlin, J. B., and Rao, V. R., 2008. "Beyond conjoint analysis: Advances in preference measurement". Marketing Letters, 19(3-4), pp. 337–354.
[38] Lee, H., Ekanadham, C., and Ng, A. Y., 2008. "Sparse deep belief net model for visual area V2". Advances in Neural Information Processing Systems 20, pp. 873–880.
[39] Maritz Research Inc., 2007. Maritz Research 2006 New Vehicle Customer Satisfaction Survey. Information online at: http://www.maritz.com.
[40] Chrome Systems Inc., 2008. Chrome New Vehicle Database. Information online at: http://www.chrome.com.
[41] United States Census Bureau, 2006. 2006 U.S. Census estimates. Information online at: http://www.census.gov.
[42] Shocker, A. D., Ben-Akiva, M., Boccara, B., and Nedungadi, P., 1991. "Consideration set influences on consumer decision-making and choice: Issues, models, and suggestions". Marketing Letters, 2(3), pp. 181–197.
[43] Wang, M., and Chen, W., 2015. "A data-driven network analysis approach to predicting customer choice sets for choice modeling in engineering design". Journal of Mechanical Design, 137(7), p. 071410.
[44] Von Neumann, J., and Morgenstern, O., 2007. Theory of Games and Economic Behavior (60th Anniversary Commemorative Edition). Princeton University Press.
[45] Fuge, M., 2015. "A scalpel not a sword: On the role of statistical tests in design cognition". In ASME 2015 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, American Society of Mechanical Engineers, pp. 1–11.
[46] Fazel, M., 2002. "Matrix rank minimization with applications". PhD thesis.
[47] Tomioka, R., Suzuki, T., Sugiyama, M., and Kashima, H., 2010. "A fast augmented Lagrangian algorithm for learning low-rank matrices". In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1087–1094.
[48] Liu, G., and Yan, S., 2014. "Scalable low-rank representation". In Low-Rank and Sparse Modeling for Visual Analysis, Y. Fu, ed. Springer International Publishing, pp. 39–60.
[49] Parikh, N., and Boyd, S., 2013. "Proximal algorithms". Foundations and Trends in Optimization, 1(3).
[50] Hinton, G. E., 2002. "Training products of experts by minimizing contrastive divergence". Neural Computation, 14(8), pp. 1771–1800.
[51] Bottou, L., 2010. "Large-scale machine learning with stochastic gradient descent". In Proceedings of COMPSTAT'2010. Springer, pp. 177–186.
[52] van der Maaten, L., 2008. "Visualizing data using t-SNE". Journal of Machine Learning Research, 9, pp. 2579–2605.


A APPENDIX: Proof of Low-Rank Matrix Estimation Guarantee

Though the low-rank matrix is estimated jointly with the decomposed matrices as well as the loss function, an accurate estimate of the low-rank matrix can still be obtained, as guaranteed by the bound derived in this section. We subsequently provide a variational bound on the divergence of the estimated likelihood from the true likelihood.

To simplify the notation in our proof, we redefine the following notation:

$$(m, n) = \dim(\Omega)$$

Before our proof, however, we state the following relevant propositions.

Proposition 1. The proximal operator for the nuclear norm $f = \lambda_1 \lVert \cdot \rVert_*$ is the singular value shrinkage operator $D_{\lambda_1}$. Consider the singular value decomposition (SVD) of a matrix $X \in \mathbb{R}^{m \times n}$ with rank $r$:

$$X = U \Sigma V^T, \qquad D_{\lambda_1}(X) = U S_{\lambda_1}(\Sigma) V^T \tag{24}$$

where the soft-thresholding operator is $S_{\lambda_1}(\Sigma) = \mathrm{diag}\left(\max(s_i - \lambda_1, 0)_{i=1,\dots,\min(m,n)}\right)$. Moreover, $S_{\lambda_2}(\cdot)$ is also the proximal operator for the $\ell_1$ norm.
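A small NumPy rendering of Proposition 1 may help make the operators concrete: the singular value shrinkage operator applies elementwise soft-thresholding to the singular values of its argument. This is only a sketch; the function names are ours, not the paper's.

```python
# Sketch of Proposition 1: soft-thresholding and singular value shrinkage.
import numpy as np

def soft_threshold(x, lam):
    """Elementwise soft-thresholding S_lam(x) = sign(x) * max(|x| - lam, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def singular_value_shrinkage(X, lam):
    """D_lam(X) = U S_lam(Sigma) V^T, the proximal operator of lam * ||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(soft_threshold(s, lam)) @ Vt

X = np.random.default_rng(0).normal(size=(6, 4))
print(np.linalg.matrix_rank(singular_value_shrinkage(X, lam=1.0)))  # rank typically drops
```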

The matrix decomposition structure of our model builds on the separable sum property [49]:

Proposition 2. Separable Sum Property. If $f$ is separable across two variables $x$ and $y$, i.e., $f(x, y) = f_1(x) + f_2(y)$, then

$$\mathrm{prox}_f(x, y) = \left(\mathrm{prox}_{f_1}(x), \mathrm{prox}_{f_2}(y)\right) \tag{25}$$
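Under the separable sum property, the proximal operator of a combined low-rank plus sparse regularizer splits into independent blockwise operators. A hedged sketch of that blockwise computation is shown below; the regularizer $\lambda_1\lVert L\rVert_* + \lambda_2\lVert S\rVert_1$, the step size, and all numerical values are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of Proposition 2 applied to a low-rank plus sparse regularizer:
# prox of f(L, S) = lam1*||L||_* + lam2*||S||_1 acts blockwise on (L, S).
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def singular_value_shrinkage(X, lam):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def prox_low_rank_plus_sparse(L, S, lam1, lam2, eta):
    """Separable prox: nuclear-norm shrinkage on L, l1 soft-thresholding on S."""
    return singular_value_shrinkage(L, eta * lam1), soft_threshold(S, eta * lam2)

rng = np.random.default_rng(0)
L, S = rng.normal(size=(8, 5)), rng.normal(size=(8, 5))
L_new, S_new = prox_low_rank_plus_sparse(L, S, lam1=1.0, lam2=0.1, eta=0.5)
```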

Our proof proceeds as follows. Let us denote the optimum of problem (9) as $L^*$, the gradient of the loss function $l(\cdot)$ with respect to $L^*$ as $\nabla_{L^*} l$, and the matrix minimizing the loss function $l(\cdot)$ as $L_0$.

We next prove the following theorems: Theorem 2 provides a tight bound on $\nabla_{L^*} l$; Corollary 1 bounds the estimation error for the learned matrix $L^*$; Theorem 3 follows by bounding the divergence of the likelihood from the true data distribution when $l(\cdot)$ is a likelihood function.

First, we make the weak assumption that the optimization problem given in Equation (9) is strictly convex, since a necessary and sufficient condition is that the saddle points of $l(\cdot)$ and of the regularization terms do not overlap.

Theorem 2. Loss Function Gradient Bound.

$$\lVert \nabla_{L^*} l \rVert_2 \le \min(m, n)$$

Proof. Under the strict convexity assumption, the stationary point (i.e., the optimum $L^*$ of the optimization problem (9)) is unique. By Lemma 1, the iterates $L_k$ of the proximal gradient optimization method converge to this optimum $L^*$. According to the fixed-point equation for $L$ (Algorithm 1), we have

$$L^* = \mathrm{prox}_f(L^* - \eta \nabla_{L^*} l) \tag{26}$$
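As an aside, Equation (26) is exactly the stationary condition of a proximal gradient iteration. A minimal sketch of such an iteration is given below, with a hypothetical least-squares loss standing in for the actual loss $l(\cdot)$ and an arbitrary step size; it is meant only to show where the fixed point comes from, not to reproduce the paper's algorithm.

```python
# Sketch of the proximal gradient iteration whose fixed point is Eq. (26):
# L_{k+1} = prox_f(L_k - eta * grad l(L_k)), with a stand-in least-squares loss.
import numpy as np

def singular_value_shrinkage(X, lam):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 10))          # hypothetical observed data matrix
L = np.zeros_like(A)
eta, lam1 = 0.5, 1.0

for _ in range(200):
    grad = L - A                       # gradient of the stand-in loss 0.5*||L - A||_F^2
    L = singular_value_shrinkage(L - eta * grad, eta * lam1)

print("rank of fixed point:", np.linalg.matrix_rank(L))
```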

Denote $L^* - \eta \nabla_{L^*} l$ as $M$, representing the argument of the proximal operator at the optimal low-rank estimate. The singular value decompositions (SVD) of $L^*$, $M$, and $\mathrm{prox}_f(M)$ are

$$L^* = U \Sigma V^T \tag{27}$$

$$M = U_M \Sigma_M V_M^T \tag{28}$$

$$\mathrm{prox}_f(M) = U_{\mathrm{prox}} \Sigma_{\mathrm{prox}} V_{\mathrm{prox}}^T \tag{29}$$

where $U, U_{\mathrm{prox}} \in \mathbb{R}^{m \times r}$; $V^T, V_{\mathrm{prox}}^T \in \mathbb{R}^{r \times n}$; $\Sigma, \Sigma_{\mathrm{prox}} \in \mathbb{R}^{r \times r}$ with $\Sigma = \mathrm{diag}(s_i)_{i=1,\dots,r}$ and $\Sigma_{\mathrm{prox}} = \mathrm{diag}(s^{\mathrm{prox}}_i)_{i=1,\dots,r}$; $U_M \in \mathbb{R}^{m \times m}$, $V_M^T \in \mathbb{R}^{n \times n}$; and $\Sigma_M$ is an $m \times n$ rectangular diagonal matrix.

Without loss of generality, assume that $s_1 > s_2 > \dots > s_r > 0$, i.e., the singular values are distinct and positive, thus ensuring that column orderings are unique. We may therefore assert that $U = U_{\mathrm{prox}}$, $V = V_{\mathrm{prox}}$, and $\Sigma = \Sigma_{\mathrm{prox}}$, due to the uniqueness of the SVD for distinct singular values in $L^* = \mathrm{prox}_f(M)$.

According to Proposition 1,

$$\mathrm{prox}_f(M) = U_M \max(\Sigma_M - \eta I, 0)\, V_M^T \tag{30}$$

Note that the dimensionality of $\mathrm{prox}_f(M)$ is less than that of $M$. To bridge the gap between them, we partition $\Sigma_M$ into two diagonal sub-matrices $\Sigma^M_+$ and $\Sigma^M_-$: for each singular value $s^M_i$ of $M$, $i = 1, 2, \dots, \min(m, n)$, if $s^M_i - \eta \ge 0$, then $s^M_i$ is a diagonal element of $\Sigma^M_+$; otherwise, $s^M_i$ is a diagonal element of $\Sigma^M_-$. Hence, $\max(\Sigma^M_+ - \eta I, 0) = \Sigma^M_+ - \eta I$ and $\max(\Sigma^M_- - \eta I, 0) = 0$, so that

$$\mathrm{prox}_f(M) = U^+_M \left(\Sigma^M_+ - \eta I\right) (V^+_M)^T$$

where $U^+_M$ ($V^+_M$) are the left-singular (right-singular) vectors corresponding to $\Sigma^M_+$; $U^-_M$ and $V^-_M$ are defined analogously. Again, due to the uniqueness of the SVD, we have $U^+_M = U$ and $V^+_M = V$.

We now rewrite the SVD formulas for $\mathrm{prox}_f(M)$ and $M$ as

$$\mathrm{prox}_f(M) = U \left(\Sigma^M_+ - \eta I\right) V^T \tag{31}$$

$$M = U_M \Sigma_M V_M^T = U^+_M \Sigma^M_+ (V^+_M)^T + U^-_M \Sigma^M_- (V^-_M)^T = U \Sigma^M_+ V^T + U^-_M \Sigma^M_- (V^-_M)^T \tag{32}$$

By the definition of $M$,

$$M = U \Sigma V^T - \eta \nabla_{L^*} l \tag{33}$$

Equations (26) and (31) indicate that

$$U \Sigma V^T = U \left(\Sigma^M_+ - \eta I\right) V^T \tag{34}$$

$$\Sigma = \Sigma^M_+ - \eta I \tag{35}$$

By Equations (32), (33), and (35), we have

$$-\nabla_{L^*} l = \frac{1}{\eta}\left[U \left(\Sigma^M_+ - \Sigma\right) V^T + U^-_M \Sigma^M_- (V^-_M)^T\right] = U V^T + \frac{1}{\eta} U^-_M \Sigma^M_- (V^-_M)^T \tag{36}$$

Note that every diagonal element $s^{M-}_i$ of $\Sigma^M_-$ satisfies $0 \le s^{M-}_i \le \eta$. Hence,

$$\lVert \nabla_{L^*} l \rVert_2 \le \lVert U V^T \rVert_2 + \frac{1}{\eta} \left\lVert U^-_M \Sigma^M_- (V^-_M)^T \right\rVert_2 \le \lambda_1 \sum_{i=1}^{r} \left\lVert U_{:i} V_{:i}^T \right\rVert_2 + \sum_{j=1}^{\min(m,n)-r} \frac{s^{M-}_j}{\eta} \left\lVert [U^-_M]_{:j} ([V^-_M]_{:j})^T \right\rVert_2 \tag{37}$$

$$\le \min(m, n) \tag{38}$$

where $U_{:i}$ or $V_{:i}$ is the $i$-th column of matrix $U$ or $V$, and $[U^-_M]_{:j}$ or $[V^-_M]_{:j}$ is the $j$-th column of matrix $U^-_M$ or $V^-_M$.
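As an informal numerical check of this step (not part of the paper's proof), one can recover the gradient implied by the fixed-point relation $L^* = \mathrm{prox}_f(L^* - \eta \nabla_{L^*} l)$ from a randomly generated proximal argument and confirm that its spectral norm stays within the stated bound. All quantities below are synthetic, and the threshold $\eta$ follows Equation (30).

```python
# Informal numerical check: with L* = prox_f(M) and M = L* - eta * grad, the
# implied gradient is (L* - M) / eta; its spectral norm should respect Theorem 2.
import numpy as np

def singular_value_shrinkage(X, lam):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(0)
m, n, eta = 12, 8, 0.3
M = rng.normal(size=(m, n))                   # synthetic argument of the proximal operator
L_star = singular_value_shrinkage(M, eta)     # Eq. (30): shrinkage with threshold eta
grad = (L_star - M) / eta                     # gradient implied by the fixed point

spectral_norm = np.linalg.norm(grad, 2)       # largest singular value of the gradient
print(spectral_norm, "<=", min(m, n), ":", spectral_norm <= min(m, n))
```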

Summarizing the proof of Theorem 2, the gradient of the loss function at the estimated low-rank matrix is bounded by a ball within the original problem space whose radius is governed by the low-rank regularization parameter $\lambda_1$. The relaxation of the bound comes partially from the second term in inequality (37); this implies that the bound is tighter if the rank of $L^*$ is increased.

Based on the gradient bound given in Theorem 2, we now bound the estimation error of the learned low-rank matrix $L^*$. Although the value of the bound is not explicit in this proof, in some cases we are able to calculate its value explicitly.

Corollary 1. Learned Low-Rank Matrix Estimation Error. The error $\lVert L^* - L_0 \rVert_2$ is bounded by the diameter of the minimum-sized ball that includes the set

$$\left\{ L : \lVert \nabla_{L} l \rVert_2 \le \min(m, n) \right\}$$

Proof. The proof follows directly from Theorem 2 and the fact that $\nabla_{L_0} l = 0$.

Since the loss function $l(\cdot)$ is convex, the Euclidean norm of its gradient $\nabla_{L} l$ is non-decreasing as the Euclidean distance $\lVert L - L_0 \rVert_2$ increases. When the loss function is sharp around its minimum, the set $\{L : \lVert \nabla_{L} l \rVert_2 \le \min(m, n)\}$ is a small region, which implies that $L^*$ is a good estimate of $L_0$.

We next bound the likelihood divergence when the loss function $l(\cdot)$ is a likelihood function. To do this, we use Theorem 2 and Corollary 1 to construct a variational bound.

Theorem 3. Variational Bound on Estimated Likelihood.

$$|\Delta l| \le \min(m, n)\, \lVert L^* - L_0 \rVert_2$$

Proof. By the Lagrange mean value theorem, there exists $L_1 \in \{L : L_{ij} \in [L^*_{ij}, L_{0,ij}]\}$ such that

$$l(L^*; X, S) - l(L_0; X, S) = \langle \nabla_{L_1} l, (L^* - L_0) \rangle \le \lVert \nabla_{L_1} l \rVert_2\, \lVert L^* - L_0 \rVert_2 \tag{39}$$

where $\langle A, B \rangle$ denotes the inner product of $\mathrm{vec}(A)$ and $\mathrm{vec}(B)$, in which $\mathrm{vec}(\cdot)$ is the matrix vectorization operator.

Because of the convexity of $l(\cdot)$, $\lVert \nabla_{L_1} l \rVert_2 \le \lVert \nabla_{L^*} l \rVert_2$. By Theorem 2,

$$|\Delta l| = l(L^*; X, S) - l(L_0; X, S) \tag{40}$$

$$\le \min(m, n)\, \lVert L^* - L_0 \rVert_2 \tag{41}$$

Summarizing the proof of Theorem 3, the variational bound on the estimated likelihood depends both on the bound on the gradient of the likelihood function $l(\cdot)$ given in Theorem 2 and on the behavior of the likelihood function in the neighborhood of its optimum $L_0$, as described by Corollary 1.


Table 4. Notation and description of symbols used in this study.

x_c : Customer
x_d : Design
h : Features
N : Number of data points
M : Dimension of original variables
K : Dimension of features
n, m, k : Indices for N, M, K
p, q : Indices for an arbitrary design pair
j : Index for an arbitrary customer
y^(jpq) : Indicator variable of the purchased design
X_c : All customers
X_d : All designs
y : All purchased design indicators
α : l2 norm regularization parameter
ω, Ω : Part-worth coefficients of the preference model (vector, matrix)
ε_jp : Gumbel random variable
T : Transpose
[·, ·] : Vector concatenation
⊗ : Outer product
S_s(·) : Soft-threshold operator
D(·) : Singular value threshold shrinkage operator
P(·) : Probability
E[·] : Expectation
θ : Parameters (generic)
||·|| : l2 norm
||·||_1 : l1 norm
||·||_* : Nuclear norm
L : Low-rank matrix
S : Sparse matrix
λ1 : Nuclear norm regularization parameter
λ2 : Sparsity regularization parameter (LSD)
λ3 : Sparsity regularization parameter (RBM)
u, v : Arbitrary vectors (used for the proof)
prox_f(·) : Proximal operator for f
U, Σ, V : Matrices of the SVD decomposition
s_i : Singular value
t : Step index for proximal gradient
η : Step size for proximal gradient
E_G, E_B, E_C : Energy functions for Gaussian, binary, and categorical variables
M_G, M_B, M_C : Dimensions of Gaussian, binary, and categorical original variables
Z_m : Dimension of categorical variable
w_mk : Weight parameter for the RBM
a_k, b_m : Bias parameters for RBM units
δ : Delta function
σ : Sigmoid function
r : Rank of the low-rank matrix
γ : Feature / original variable size ratio
β : Sparsity activation target for the RBM