f ýepf v ˆ wenzw/bigdata2019.html ...bicmr.pku.edu.cn/~wenzw/bigdata/lect1.pdf · 4/68 ’pn...
TRANSCRIPT
大数据分析中的算法
文再文
课程代码:00136720 (本科生),00100863(本研合)
北京大学北京国际数学研究中心
http://bicmr.pku.edu.cn/~wenzw/[email protected]
1/68
2/68
课程信息
大数据分析中的算法侧重数值代数和最优化算法
课程代码:00136720 (本科生),00100863(本研合)
教师信息:文再文,[email protected],微信:wendoublewen
助教信息:杨明瀚,柳伊扬
上课地点:二教301
上课时间:1-16周,每周周二1-2节,双周周四1-2节,8:00am-9:50am
课程主页:http://bicmr.pku.edu.cn/~wenzw/bigdata2020.html
3/68
参考资料
class notes, and reference books or papers“Mining of Massive Datasets”, Jure Leskovec, Anand Rajaraman,Jeff Ullman, Cambridge University Press
“Convex optimization”, Stephen Boyd and Lieven Vandenberghe
“Numerical Optimization”, Jorge Nocedal and Stephen Wright,Springer
“Optimization Theory and Methods”, Wenyu Sun, Ya-Xiang Yuan
“Matrix Computations”, Gene H. Golub and Charles F. Van Loan.The Johns Hopkins University Press
“Deep Learning", Ian Goodfellow and Yoshua Bengio and AaronCourville
4/68
大数据分析相关的网站
Mining Massive Data Sets, Stanford University
Jure Leskovec @ Stanford - Stanford Computer Science
Mining Massive Data Sets: Hadoop Labs, Stanford University
Big Data Analytics, Columbia University
Theoretical Foundations of Big Data Analysis, The SimonsInstitute for the Theory of Computing, UC Berkeley
Core Methods in Data Science, University of Washington
“Theories of Deep Learning”, STATS 385, Stanford University,Fall 2017
courses from Berkeley Artificial Intelligence Research (BAIR)Lab
5/68
课程计划
侧重数值代数和最优化的模型与算法
线性规划,半定规划
压缩感知和稀疏优化基本理论和算法
低秩矩阵恢复的基本理论和算法
随机特征值算法,随机优化算法
图和网络流问题: shorted path, maximum flow等等
机器学习算法:聚类分析,高维数据降维,链接分析,推荐系统,SVM
现代医学成像与高维图像分析:相位恢复以及低温电子显微镜
Optimal transport
深度学习(deep learning)
强化学习(reinforcement learning)
6/68
课程信息
教学方式:课堂讲授
成绩评定办法:
迟交一天(24小时)打折10%,不接受晚交4天的作业和项目(任何时候理由都不接受)
大作业,包括习题和程序:40%
课程项目一:30%
课程项目二:30%
作业要求:i)计算题要求写出必要的推算步骤,证明题要写出关键推理和论证。数值试验题应该同时提交书面报告和程序,其中书面报告有详细的推导和数值结果及分析。ii)可以同学间讨论或者找助教答疑,但不允许在讨论中直接抄袭,应该过后自己独立完成。iii)严禁从其他学生,从互联网,从往年的答案,其它课程等等任何途径直接抄袭。iv)如果有讨论或从其它任何途径取得帮助,请列出来源。
请谨慎选课
7/68
Why Optimization in Machine Learning?
Many problems in ML can be written as
minw∈W
N∑i=1
12‖w>xi − yi‖2
2 + λ‖w‖1 linear regression
minw∈W
1N
N∑i=1
log(1 + exp(−yiw>xi)) + λ‖w‖1 logistic regression
minw∈W
N∑i=1
`(f (w, xi), yi) + λr(w) general formulation
The pairs (xi, yi) are given data, yi is the label of the data point xi
`i(·): measures model fit for data point i (avoids under-fitting)r(w): measures model “complexity" (avoids over-fitting)f (w, x): linear function or models constructed from deep neuralnetworks
8/68
Convolutional neural network (CNN)
A pioneering digit classification neural network by LeCun et. al. Itwas applied by several banks to recognise hand-written numbers onchecks.https://stats385.github.io/networks
9/68
AlexNet
Designed by Hinton and etc in 2012 and achieved a top-5 error of15.3%. The network consisted of five convolutional layers, some ofwhich were followed by max-pooling layers, and three fully-connectedlayers with a final 1000-way softmax. All the convolutional and fullyconnected layers were followed by the ReLU nonlinearity.https://stats385.github.io/networks
10/68
VGGNet
A 19 layer convolutional neural network from VGG group, Oxford, thatwas simpler and deeper than AlexNet. All large-sized filters inAlexNet were replaced by cascades of 3x3 filters (with nonlinearity inbetween). Max pooling was placed after two or three convolutionsand after each pooling the number of filters was always doubled.https://stats385.github.io/networks
11/68
Deconvolutional networks
A generative network that is a special kind of convolutional networkthat uses transpose convolutions, also known as a deconvolutionallayers. https://stats385.github.io/architectures
12/68
Recurrent neural networks (RNN)
RNNs are built on the same computational unit as the feed forwardneural network. RNNs do not have to be organized in layers anddirected cycles are allowed. This allows them to have internalmemory and as a result to process sequential data. One can convertan RNN into a regular feed forward neural network by "unfolding" it intime, as depicted in the figures.https://stats385.github.io/architectures
13/68
Optimization algorithms in Deep learning
随机梯度类算法
pytorch/caffe2里实现的算法有adadelta, adagrad, adam,nesterov, rmsprop, YellowFinhttps://github.com/pytorch/pytorch/tree/master/caffe2/sgd
pytorch/torch里有:sgd, asgd, adagrad, rmsprop, adadelta,adam, adamaxhttps://github.com/pytorch/pytorch/tree/master/torch/optim
tensorflow实现的算法有:Adadelta, AdagradDA, Adagrad,ProximalAdagrad, Ftrl, Momentum, adam, Momentum,CenteredRMSProp具体实现:https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/training_ops.cc
14/68
Stochastic Algorithms in Deep learning
Consider problem minx∈Rd1n
∑ni=1 fi(x)
References: chapter 8 inhttp://www.deeplearningbook.org/
Gradient descent
xt+1 = xt − αt
n
n∑i=1
∇fi(xt)
Stochastic gradient descent
xk+1 = xt − αt∇fi(xt)
SGD with momentum
vt+1 = µtvt − αt∇fi(xt)
xt+1 = xt + vt+1
15/68
Stochastic Algorithms in Deep learning
Adaptive Subgradient Methods(Adagrad): let gt = ∇fi(xt),g2
t = diag[gtgTt ] ∈ Rd, and initial G1 = g2
1. At step t
xt+1 = xt − αt√
Gt + ε1d∇fi(xt)
Gt+1 = Gt + g2t+1
in the upper and the following iterations we use element-wisevector-vector multiplication.
16/68
Reinforcement Learning
AlphaGo: supervised learning + policy gradients + valuefunctions + Monte-Carlo tree search
17/68
What is RL
Reinforcement learning
Branch of machine learning concerned with taking sequences ofactions Usually described in terms: Policies (select next action), Valuefunctions (measure goodness of states or state-action pairs),Dynamics Models (predict next states and rewards)
is learning what to do–how to map situations to actions–so as tomaximize a numerical reward signal. The decision-maker is called theagent, the thing it interacts with, is called the environment.
A reinforcement learning task that satisfies the Markov property iscalled a Markov Decision process, or MDP
We assume that all RL tasks can be approximated with Markovproperty. So this talk is based on MDP.
18/68
Agent and Environment
The agent selects actions based on the observations and rewardsreceived at each time-step t
The environment selects observations and rewards based on theactions received at each time-step t
19/68
Some applications
20/68
Definition
A Markov Decision Process is a tuple (S,A,P, r, γ):S is a finite set of states, s ∈ SA is a finite set of actions, a ∈ AP is the transition probability distribution.probability from state s with action a to state s′: P(s′|s, a)also called the model or the dynamicsr is a reward function, r(s, a, s′)sometimes just r(s) or ra
sor rt after time step tγ ∈ [0, 1] is a discount factor
21/68
Tetris
State: the current board, the current falling tileAction: Rotate or shift the falling shapeOne-step reward: if a level is cleared by the current action,reward 1, o.w. reward 0;Transition Probability: future tile is uniformly distributed
22/68
Optimization
Maximize the expected total discounted return of an episode
maxπ
E[
∞∑t=0
γtrt|π]
or,maxπ
E[R(τ)|τ = s0, a0, s1, a1, ... ∼ π]
Policy π(a|s) is a probability
23/68
Linear Programming Formulation
Solving Markov Decision Processes (MDP) can be formulated as LP:Primal
max∑
a
raxa
s.t. ∀s ∈ S,∑a∈A
xa = 1 + γPa,sxa
x ≥ 0
Dualmin
∑s
vs
s.t. ∀s ∈ S, a ∈ A, vs ≥ ra + γ∑
s′Pa,s′vs′
24/68
线性规划,半定规划,锥规划
Primal
min c>x
s.t. Ax = b
x ∈ K
Dual
max b>y
s.t. A>y + s = c
s ∈ K∗
“大”问题:LP: K is the nonnegative orthantvector x ∈ Rn, millions of variables
SOCP: K is the circular or Lorentz conevector x ∈ Rn, millions of variables
SDP: K is the cone of positive semidefinite matrices matrixx ∈ Sn×n, n up to 5000
25/68
Compressive Sensing
Find the sparest solutionGiven n=256, m=128.A = randn(m,n); u = sprandn(n, 1, 0.1); b = A*u;
50 100 150 200 250−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
min
x‖x‖0
s.t. Ax = b(a) `0-minimization
50 100 150 200 250−0.6
−0.4
−0.2
0
0.2
0.4
min
x‖x‖2
s.t. Ax = b(b) `2-minimization
50 100 150 200 250−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
1
min
x‖x‖1
s.t. Ax = b(c) `1-minimization
26/68
Wavelets and Images (Thanks: Richard Baraniuk)
27/68
Wavelet Approximation (Thanks: Richard Baraniuk)
28/68
Compressive sensing
Given (A, b,Ψ), find the sparsest point:
x∗ = arg min‖Ψx‖0 : Ax = b
From combinatorial to convex optimization:
x = arg min‖Ψx‖1 : Ax = b
1-norm is sparsity promotingBasis pursuit (Donoho et al 98)Many variants: ‖Ax− b‖2 ≤ σ for noisy b
Theoretical question: when is ‖ · ‖0 ↔ ‖ · ‖1 ?
29/68
Restricted Isometry Property (RIP)
Definition (Candes and Tao [2005])Matrix A obeys the restricted isometry property (RIP) with constant δs
if(1− δs)‖c‖2
2 ≤ ‖Ac‖22 ≤ (1 + δs)‖c‖2
2
for all s-sparse vectors c.
Theorem (Candes and Tao [2006])If x is a k-sparse and A satisfies δ2k + δ3k < 1, then x is the unique `1minimizer.
RIP essentially requires that every set of columns with cardinalityless than or equal to s behaves like an orthonormal system.
30/68
MRI: Magnetic Resonance Imaging
(a) MRI Scan (b) Fourier Coefficients (c) Image
Is it possible to cut the scan time into half?
31/68
MRI (Thanks: Wotao Yin)
MR images often have sparse sparse representations undersome wavelet transform Φ
Solvemin
u‖Φu‖1 +
µ
2‖Ru− b‖2
R: partial discrete Fourier transformThe higher the SNR (signal-noise ratio) is, the better the imagequality is.
(a) full sampling (b) 39% sampling,SNR=32.2
32/68
MRI: Magnetic Resonance Imaging
(a) full sampling (b) 39% sampling,SNR=32.2
(c) 22% sampling,SNR=21.4
(d) 14% sampling,SNR=15.8
33/68
The Netflix Prize
Training data100 million ratings, 480,000 users, 17,770 movies6 years of data: 2000-2005
Test dataLast few ratings of each user (2.8 million)Evaluation criterion: root mean squared error (RMSE)Netflix Cinematch RMSE: 0.9514
Competition2700+ teams$1 million prize for 10% improvement on Cinematch
34/68
Netflix Problem: 1 million dollar award
Given m movies x ∈ X and n customers y ∈ Ypredict the “rating” W(x, y) of customer y for movie xtraining data: known ratings of some customers for some moviesGoal: complete the matrixother applications: collaborative filtering, system identification,etc.
35/68
Matrix Rank Minimization
Given X ∈ Rm×n, A : Rm×n → Rp, b ∈ Rp, we considermatrix completion problem:
min rank(X), s.t. Xij = Mij, (i, j) ∈ Ω
nuclear norm minimization:
min ‖X‖∗ s.t. A(X) = b
where ‖X‖∗ =∑
i σi and σi = ith singular value of matrix X.SDP reformulation
min Tr(W1) + Tr(W2)
s.t.(
W1 XX> W2
) 0
A(X) = b
36/68
Video separation
Partition the video into moving and static parts
37/68
Sparse and low-rank matrix separation
Given a matrix M, we want to find a low rank matrix W and asparse matrix E, so that W + E = M.Convex approximation:
minW,E‖W‖∗ + µ‖E‖1, s.t. W + E = M
Robust PCA
38/68
图与网络流问题
39/68
图与网络流问题
2014 ACM SIGMOD Programming ContestShortest Distance Over Frequent Communication Paths定义社交网络的边: 相互直接至少有x条回复并且相互认识。给定网络里两个人p1和p2以及另外一个整数x,寻找图中p1和p2之间数量最少节点的路径
Interests with Large Communities
Socialization Suggestion
Most Central People (All pairs shorted path)定义网络:论坛中有标签t的成员,相互直接认识。给定整数k和标签t,寻找k个有highest closeness centrality values的人
40/68
DIMACS Implementation Challenges
http://dimacs.rutgers.edu/Challenges/
2014, 11th: Steiner Tree Problems2012, 10th: Graph Partitioning and Graph Clustering2005, 9th: The Shortest Path Problem2001, 8th: The Traveling Salesman Problem2000, 7th: Semidefinite and Related Optimization Problems1998, 6th: Near Neighbor Searches1995, 5th: Priority Queues, Dictionaries, and Multi-Dim. PointSets1994, 4th: Two Problems in Computational Biology: FragmentAssembly and Genome Rearrangements1993, 3rd: Effective Parallel Algorithms for CombinatorialProblems1992, 2nd: Maximum Clique, Graph Coloring, and Satisfiability1991, 1st: Network Flows and Matching
41/68
Combinatorial optimization problems
Maximum cut problemgiven weights W for all edges in a graph, find the partition of thenodes of G into two sets V1 and V2 so that the sum of the weightsof the edges between V1 and V2 is maximal
Maximum k-cut problem (Frequency assignment problem)given a graph G, assign colors to the vertices in such a way thatthe number of non-defect edges (with endpoints of differentcolors) is maximal
Maximum stable set problemsX (G): clique number of a graph G = largest clique in Gω(G): coloring number of a graph G = smallest number of colorsneeded to color a graph, so that vertices connected by an edgehave different colorsθ(G) (θ+(G)): Lovasz theta function of G: defined by a SDP
ω(G) ≤ θ(G) ≤ θ+(G) ≤ X (G).
42/68
Semidefinite programming (SDP) relaxations
Maximum cut problem
minX∈Sn〈C,X〉 s.t. Xi,i = 1, X 0
Frequency assignment problem (Maximum k-cut problem)
min⟨ 1
2k diag(We) + k−12k W,X
⟩s.t. Xi,j ≥ −1
k−1 , ∀(i, j) ∈ E\T, Xi,j = −1k−1 , ∀(i, j) ∈ T,
diag(X) = e, X 0.
Maximum stable set problems
θ(G) := max⟨ee>,X
⟩s.t. Xij = 0, (i, j) ∈ E, 〈I,X〉 = 1,X 0,
θ+(G) := max⟨ee>,X
⟩s.t. Xij = 0, (i, j) ∈ E, 〈I,X〉 = 1,X 0,X ≥ 0
43/68
Communities in the Networks
Many networks have community structures. Nodes in the samecluster have high connection intensity.
Figure: https://www.slideshare.net/NicolaBarbieri/community-detection
44/68
Communities in the Networks
Figure: Simmons College Facebook Network, the four clusters are labeledby different graduation year: 2006 in green, 2007 in light blue, 2008 inpurple and 2009 in red. Figure from Chen, Li and Xu, 2016.
45/68
Partition Matrix and Assignment Matrix
For any partition ∪ka=1Ca = [n], define the partition matrix X
Xij =
1, if i, j ∈ Ca, for some a,0, else .
Low rank solution
X =
11 1 11 1 11 1 1
1 11 1
=
1111
11
×1
1 1 11 1
46/68
Modularity Maximization
The modularity (MEJ Newman, M Girvan, 2004) is defined by
Q = 〈A− 12λ
ddT ,X〉
where λ = |E|.The Integral modularity maximization problem:
max 〈A− 12λddT ,X〉
s.t. X ∈ 0, 1n×n is a partiton matrix.
Probably hard to solve.
47/68
Convex Relaxation
doubly nonnegative relaxation of modularity maximization:
min 〈−A +1
2λddT ,X〉
s.t. Xii = 1, i = 1, . . . , n,
X 0,
0 ≤ Xij ≤ 1
This is an SDP problemwith theoretical guarantee under certain conditionsBut it is a huge-scale problem to solve
48/68
Dimensionality Reduction
Given n data points Xi ∈ Rp and assume that this dataset has intrinsicdimensionality d where d p. How transform the data set X into anew dataset Y with dimensionality d?
max∑
ij
‖yi − yj‖2
s.t. ‖yi − yj‖2 = ‖xi − xj‖2.
Maximal variance unfolding (MVU):
max Tr(K)
s.t. Kii + Kjj − 2Kij = ‖xi − xj‖2∑ij
Kij = 0
K 0.
which is SDP problem.
49/68
Supporting Vector Machine (SVM)
Suppose we have two-class discrimination data. We assign the firstclass with 1 and the second with 1. A powerful discrimination methodis the Supporting Vector Machine (SVM). With yi = 1 (in the firstclass), let the data point i be given by ai ∈ Rd, i = 1, . . . , n1, and withyi = −1 (in the second class), let the data point j be given by bj ∈ Rd,j = 1, . . . , n2. We wish to find a hyper-plane in Rd to separate ais frombjs. Mathematically, we wish to find ω ∈ Rd and β ∈ R such that
a>i ω + β > 1,∀i
andb>j ω + β < −1, ∀j,
where x : ω>x + β = 0 is the desired separating hyper-plane. This is alinear program!
50/68
Supporting Vector Machine (SVM)
When the data is noisy. Then we solve
min∑
i
|πi|+∑
j
|σj|
s.t. a>i ω + β ≥ 1− πi,∀ib>j ω + β ≤ −1 + σj,∀jπ, σ ≥ 0
which is a linear program, or
min ω>ω + µ
∑i
π2i +
∑j
σ2j
s.t. a>i ω + β ≥ 1− πi,∀i
b>j ω + β ≤ −1 + σj,∀jπ, σ ≥ 0
which is a quadratic program.
51/68
Supply chain
52/68
Supply chain
53/68
facility localization
54/68
Medical imaging
55/68
Medical imaging
56/68
Medical imaging
57/68
Phase Retrieval
Phase carries more information than magnitude
Y |F(Y)| phase(F(Y)) iF(|F(Y)|.*phase(F(S)))
S |F(S)| phase(F(S)) iF(|F(S)|.*phase(F(Y)))
Question: recover signal without knowing phase?
58/68
Classical Phase Retrieval
Feasibility problem
find x ∈ S ∩M or find x ∈ S+ ∩M
given Fourier magnitudes:
M := x(r) | |x(ω)| = b(ω)
where x(ω) = F(x(r)), F : Fourier transformgiven support estimate:
S := x(r) | x(r) = 0 for r /∈ D
orS+ := x(r) | x(r) ≥ 0 and x(r) = 0 if r /∈ D
59/68
Ptychographic Phase Retrieval (Thanks: Chao Yang)
Given bi = |F(Qiψ)| for i = 1, . . . , k, can we recover ψ?
Ptychographic imaging along with advances in detectors andcomputing have resulted in X-ray microscopes with increased spatialresolution without the need for lenses
60/68
Recent Phase Retrieval Model Problems
Given A ∈ Cm×n and b ∈ Rm
find x, s.t. |Ax| = b.
(Candes et al. 2011b, Alexandre d’Aspremont 2013)
SDP Relaxation: |Ax|2 is a linear function of X = xx∗
minX∈Sn
Tr(X)
s.t. Tr(aia∗i X) = b2i , i = 1, . . . ,m,
X 0
Exact recovery conditions
61/68
Single Particle Cryo-Electron Microscopy
Projection images Pi(x, y) =∫∞−∞ φ(xR1
i + yR2i + zR3
i )dz+ “noise”.
φ : R3 → R is the electric potential of the molecule.Cryo-EM problem: Find φ and R1, . . . ,Rn given P1, . . . ,Pn.
62/68
A Toy Example
63/68
Single Particle Cryo-Electron Microscopy
Least Squares Approach
minR1,...,RK∈SO(3)
∑i 6=j
wij
∥∥∥Ri (~cij, 0)T − Rj (~cji, 0)T∥∥∥2
Since∥∥∥Ri (~cij, 0)T
∥∥∥ =∥∥∥Rj (~cji, 0)T
∥∥∥ = 1, we obtain
maxR1,...,RK∈SO(3)
∑i 6=j
wij〈Ri (~cij, 0)T ,Rj (~cji, 0)T〉
SDP Relaxation:
maxG∈R2K×2K trace ((W S) G)
s.t. Gii = I2, i = 1, 2, · · · ,K,G < 0
64/68
Randomized Linear Algebra Algorithms
(Thanks: Petros Drineas and Michael W. Mahoney)Goal: To develop and analyze (fast) Monte Carlo algorithms forperforming useful computations on large (and later not so large!)matrices and tensors.
Matrix MultiplicationComputation of the Singular Value DecompositionComputation of the CUR DecompositionTesting Feasibility of Linear ProgramsLeast Squares ApproximationTensor computations: SVD generalizationsTensor computations: CUR generalization
Such computations generally require time which is superlinear in thenumber of nonzero elements of the matrix/tensor, e.g., O(n3) for n× nmatrices.
65/68
CPU vs. GPU
March 1, 2015:Intel Xeon Processor E7-8880 v2 15 cores, $5729.00Intel Xeon Phi Coprocessor 7120A, 61 cores, $4235.00Tesla K80, 4992 ( 2496 per GPU) cores, $4959.99
66/68
Super Computer
http://www.top500.org/lists/2014/11/
Top 1: Tianhe-2 (MilkyWay-2): TH-IVB-FEP Cluster, Intel XeonE5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P,NUDT, 3,120,000 cores#24: Edison is NERSC’s newest supercomputer, a Cray XC30,with a peak performance of 2.57 petaflops/sec, 133,824 computecores, 357 terabytes of memory, and 7.56 petabytes of disk.
67/68
Optimization Formulation
Mathmatical optimization problem
minf (x)
s.t. ci(x) = 0, i ∈ Eci(x) ≥ 0, i ∈ I
x = (x1, . . . , xn)>: variablef (x) : Rn → R: objective functionci(x) : Rn → R: constraintsoptimal solution x∗: a feasible point with the smallest value of f
68/68
Classification
Continuous versus discrete optimizationUnconstrained versus constrained optimizationGlobal and local optimizationStochastic and deterministic optimizationLinear/nonlinear/quadratic programming, Convex/nonconvexoptimizationLeast square problem, equation solvingsparse optimization, PDE-constrained optimization, robustoptimization