from pixel to visual intelligencevalser.org/2017/ppt/vooc/valse_2017_lcw.pdf · yao xiao, cewu lu,...

From pixel to Visual Intelligence

Speaker: Cewu Lu (卢策吾)

Shanghai Jiao Tong University

Outline

• About me.

• My understanding of the computer vision big picture.

• My research within that big picture.

About Me

• Professor • Ph.D. supervisor • National Thousand Young Talents Program (国家青年千人计划)

Machine Vision and Intelligence Group

Before I joined SJTU

Postdoc and research fellow with

Prof. Fei-Fei Li, Director of the Stanford AI Lab

Prof. Leonidas J. Guibas, member of the NAE (U.S. National Academy of Engineering)

• Core member of the Stanford-Toyota self-driving car project

• Published (accepted) 21 CVPR/ICCV/PAMI/IJCV papers (77% as first author); 31 CCF-A papers

• Most-cited SIGGRAPH paper of the past 5 years, among 1000+ papers.

• Two papers are included in OpenCV


Computer Vision

Machine can See

NSF white paper: let machines see like humans

Computer Vision

Machine can See

Pixel level Patch level Human Understanding

Object level Super object

[SIFT Feature, 2004]

[Deep Learning, 2012]

[Figure: map of computer vision tasks. Labels: Image, Video, RGBD, Scene Understanding, Object Detection, Fine-grained, Event Understanding, Action Recognition, Gradient Processing, Image Abstraction, Stereo, Deblur, Denoise, Patch Representation, Tracking, Face, 3D Reconstruction, Visual QA, Image2caption, Video Storying, Object level (recognition), Saliency, Scene Parsing, Point Cloud Segmentation]

[Figure: the same task map, with each task tagged by the venue of my paper on it. Venue tags: SIGGRAPH Asia, IJCV, CVPR, ICCV, ECCV, TIP, TVCG, PAMI, ICCP]

My Research Work

Representative Work on Patch Level

L0-norm smoothing

Cewu Lu*, Li Xu*, Yi Xu, Jiaya Jia, "Image Smoothing via L0 Gradient Minimization", ACM Transactions on Graphics, Vol. 30, No. 5 (SIGGRAPH Asia 2011). * indicates co-first author.

Main Structure Extraction

[Figure panels: smoothing result; edges extracted from our result; edges extracted by Canny]
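A minimal 1D sketch of L0 gradient minimization, to make the idea of keeping only a few non-zero gradients concrete. It assumes the alternating scheme with an auxiliary gradient variable (threshold the gradients, then solve a quadratic problem, while increasing a coupling weight). The paper works on 2D images; this sketch swaps in a 1D signal and a tridiagonal solver, and the function names and default parameters are illustrative, not the authors' implementation.

```python
import math


def _solve_tridiagonal(beta, rhs):
    """Solve (I + beta * D^T D) x = rhs with the Thomas algorithm.

    D is the forward-difference operator on a 1D signal (needs len(rhs) >= 2).
    """
    n = len(rhs)
    diag = [1.0 + beta * (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
    off = -beta  # value on both off-diagonals
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def l0_smooth_1d(signal, lam=0.02, kappa=2.0, beta_max=1e5):
    """Approximately minimize sum (s - signal)^2 + lam * #{non-zero gradients}."""
    n = len(signal)
    s = list(signal)
    beta = 2.0 * lam
    while beta < beta_max:
        # Gradient step: keep a gradient only if it is worth paying lam for it.
        h = []
        for i in range(n - 1):
            g = s[i + 1] - s[i]
            h.append(g if g * g > lam / beta else 0.0)
        # Signal step: solve (I + beta D^T D) s = signal + beta D^T h.
        rhs = []
        for i in range(n):
            left = h[i - 1] if i > 0 else 0.0
            right = h[i] if i < n - 1 else 0.0
            rhs.append(signal[i] + beta * (left - right))
        s = _solve_tridiagonal(beta, rhs)
        beta *= kappa
    return s


# Usage: a unit step corrupted by a small deterministic ripple.
noisy_step = [(0.0 if i < 50 else 1.0) + 0.02 * math.sin(3 * i) for i in range(100)]
smoothed = l0_smooth_1d(noisy_step)
```

On this toy input the ripple is flattened while the single large jump is kept, which is the behaviour that makes the smoothed result useful for main structure extraction.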

Stationary Estimation

L0 Regularized Stationary Time Estimation for Crowd Group Analysis, [CVPR 2014] [PAMI 2016]

Abnormal Event Detection at 1000 FPS

[Cewu Lu et al. ICCV]

Cewu Lu, Jianping Shi, Jiaya Jia. Abnormal Event Detection at 150 FPS in MATLAB, IEEE International Conference on Computer Vision [ICCV 2013] [IJCV 2017]

Results (UCSD Ped1 Dataset)

Compared methods: MPPCA, MPPCA+SF, SF, and MDT [Mahadevan et al. 2009]; Sparse [Cong et al. 2011]; Adam [Adam et al. 2008].

Pixel-level comparison. FPR: False Positive Rate. TPR: True Positive Rate. Subspace: our combination learning replaced by [Ehsan et al. 2009].
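For reference, the two rates used on the ROC axes can be computed as below. This is a sketch of the generic definitions only; the benchmark's exact pixel-level protocol may differ, and the label and prediction lists are made-up toy data.

```python
def tpr_fpr(labels, predictions):
    """Return (TPR, FPR) for binary labels and predictions (1 = abnormal)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy data (hypothetical): 3 abnormal pixels, 5 normal pixels.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(labels, predictions)
```

Sweeping the detection threshold and recomputing the pair at each setting traces out one ROC curve per method.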

Results

Method                   Sec per frame   Platform      CPU      Memory
[Mahadevan et al. 2009]  25              -             3.0GHz   2.0GB
[Cong et al. 2011]       3.8             -             2.6GHz   2.0GB
[Antic et al. 2011]      10              MATLAB        -        -
Ours                     0.00098         MATLAB 2012   3.4GHz   8.0GB

Testing time comparison on the UCSD Ped1 dataset.
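The seconds-per-frame figures convert to frames per second by taking the reciprocal. The short sketch below does that arithmetic on the table's numbers; note that the rows were measured on different machines, so the ratio is only indicative.

```python
# Seconds per frame, copied from the table above.
seconds_per_frame = {
    "Mahadevan et al. 2009": 25.0,
    "Cong et al. 2011": 3.8,
    "Antic et al. 2011": 10.0,
    "Ours": 0.00098,
}

# Frames per second is the reciprocal of seconds per frame.
fps = {method: 1.0 / sec for method, sec in seconds_per_frame.items()}

# Ratio between the fastest baseline in the table and our method.
speedup_vs_cong = seconds_per_frame["Cong et al. 2011"] / seconds_per_frame["Ours"]

for method, value in fps.items():
    print(f"{method}: {value:.2f} FPS")
```

At 0.00098 seconds per frame the method runs at a little over 1000 FPS, which matches the slide title.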


Representative Work on Object Level

Personal Object Discovery [Cewu Lu et al. TIP]

Object Scene Distribution

Highlight Projects (Personal Object Discovery)

Cewu Lu, Renjie Liao, Jiaya Jia, "Personal Object Discovery", IEEE Transactions on Image Processing.

Weather Understanding [Cewu Lu et al. CVPR 2014] [Cewu Lu et al. TPAMI 2014]

Highlight Projects (Weather Classification)

Cewu Lu, Di Lin, Jiaya Jia, Chi-Keung Tang, "Two-class Weather Classification", IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2014, (TPAMI) 2017.

Real-Time Video Stylization Using Object Flows. Cewu Lu, Yao Xiao, and Chi-Keung Tang. IEEE Transactions on Visualization and Computer Graphics (TVCG), 2017.

Combining Sketch and Tone for Pencil Drawing Production. Cewu Lu, Li Xu, Jiaya Jia. Non-Photorealistic Animation and Rendering (NPAR), 2012 (Best Paper Award).

Cewu Lu et al. Real-Time Video Stylization Using Object Flows

Papers (Object Detection)

Cewu Lu, Hao Chen, Qifeng Chen, Hei Law, Yao Xiao, Chi-Keung Tang. ECCV 2014 workshop - ImageNet Large Scale Visual Recognition Challenge

Di Lin, Xiaoyong Shen, Cewu Lu, Jiaya Jia, Deep LAC: Deep Localization, Alignment and Classification for Fine-grained Recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015.

Yao Xiao, Cewu Lu, Chi-Keung Tang, Complexity-Adaptive Distance Metric for Object Proposals Generation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015.

Cewu Lu, Yongyi Lu, CK Tang, Efficient Square Localization for Efficient and Accurate Object Detection, submitted to IEEE International Conference on Computer Vision (ICCV), 2015.

Cewu Lu, Yongyi Lu, CK Tang, Explicit Closed-Curve Optimization for Objectness Estimation, submitted to IEEE International Conference on Computer Vision (ICCV), 2015.

Cewu Lu, Yongyi Lu, CK Tang, Unobjectness for Object Proposals Generation, submitted to IEEE International Conference on Computer Vision (ICCV), 2015.


Comparison to NLP

Natural Language Understanding: noun, verb, phrase, sentence, story

Computer Vision: object level [Deep Learning 2012]

What can I do here?

Representative Work on Beyond Object Level

Visual Relationship Detection with Language Priors. Cewu Lu, Ranjay Krishna, Michael Bernstein, Li Fei-Fei. ECCV 2016 (oral) (reported by ECCV Daily)

Detecting <Subject, Predicate, Object>

Difficulties

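(1) Detection error of individual relationship detectors is huge (accuracy under 5%)

(2) Training data is sparse

<Subject, Predicate, Object>: 100 subject classes × 70 predicate classes × 100 object classes = 700,000 relationship classes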
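The class counts above explain why training data is sparse. The sketch below works through the arithmetic and contrasts it with a factorized alternative that detects objects and predicates separately; the assumption that subjects and objects share one set of 100 categories is mine, not stated on the slide.

```python
num_subject_classes = 100
num_predicate_classes = 70
num_object_classes = 100

# One class per full <subject, predicate, object> triple.
joint_relationship_classes = (
    num_subject_classes * num_predicate_classes * num_object_classes
)

# Factorized view (assumption: subjects and objects share the same
# 100 object categories, so only objects and predicates are learned).
factorized_classes = num_object_classes + num_predicate_classes

print(joint_relationship_classes)  # 700000
print(factorized_classes)          # 170
```

With hundreds of thousands of possible triples, most of them will have few or no training examples, which motivates sharing information across related triples.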

Difficulties

（1）detection errors by individual is huge （under 5%）

（2）Training data is sparse

Linkage from Language Prior

Person-ride-horse

Person-ride-elephant

Person-ride-motorcycle
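One way to picture the linkage between these triples is through word vectors: triples whose words are close in an embedding space end up close to each other, so a rarely seen triple can borrow strength from a frequent one. The sketch below is a toy illustration only. The 4-dimensional vectors are made up, and cosine similarity over concatenated vectors is a stand-in chosen for brevity, not the model from the paper.

```python
import math

# Hypothetical toy word vectors (invented for illustration).
WORD_VECTORS = {
    "person": (1.0, 0.0, 0.0, 0.0),
    "horse": (0.0, 1.0, 0.1, 0.0),
    "elephant": (0.0, 0.9, 0.3, 0.0),
    "car": (0.0, 0.0, 0.2, 1.0),
    "ground": (0.1, 0.0, 1.0, 0.0),
    "ride": (0.0, 0.5, 0.0, 0.5),
    "on": (0.0, 0.0, 0.7, 0.3),
}


def embed(triple):
    """Concatenate the word vectors of <subject, predicate, object>."""
    vector = []
    for word in triple:
        vector.extend(WORD_VECTORS[word])
    return vector


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


ride_horse = embed(("person", "ride", "horse"))
ride_elephant = embed(("person", "ride", "elephant"))
car_on_ground = embed(("car", "on", "ground"))

sim_related = cosine(ride_horse, ride_elephant)
sim_unrelated = cosine(ride_horse, car_on_ground)
```

With these toy vectors, the two riding triples are far more similar to each other than either is to the car triple, which is the kind of structure a language prior can exploit.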


• Discover and predict relationships in image.

Mining relationship tuples:

<man, wear, glasses>

<man, carry, bag>

<car, on, ground>

<trash bin, next to, streetlight>

...

Some Results

Using relationship: Human-ride-horse

Accuracy

Slide for more details: http://cs.stanford.edu/people/ranjaykrishna/vrd/slides.pdf


A problem: sub-object level information is missing!

Leg stamps on something

Scale pan is stamped on by something

Beneath Holistic Object Recognition

Richer semantics on parts help to infer the story.

[Figure, panels (a) to (e): example images annotated with part states]

• sth sits on saddle; wheel in the air; wheel on sth; sth holds handlebar

• sth touches head; leg in the air; leg on sth; torso wears sth

• head with bridle rein; sth rides torso; torso wears sth; leg in the air; leg on sth

• hand embraces sth; torso sits on sth; leg is bent

• sth sits on saddle; sth holds handlebar; wheel on sth; wheel on sth

Beyond Holistic Object Recognition: Enriching Image Understanding with Part States. Cewu Lu et al. (with Stanford University), arXiv:1612.07310

Beyond Holistic Object Recognition

Regional Multi-person Pose Estimation. Haoshu Fang, Shuqin Xie, Cewu Lu (corresponding author)

arXiv:1612.00137v2

SST network

STN: spatial transformer network

SDTN: spatial de-transformer network

SPPE: single-person pose estimator
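The STN/SDTN pairing can be read as a transform and its inverse: one maps a person region into a canonical frame for the single-person estimator, the other maps the estimated pose back to image coordinates. The sketch below shows only the generic inverse of a 2D affine transform, under the assumption that both modules are affine; the function names and the example parameters are hypothetical.

```python
def apply_affine(theta, point):
    """Apply a 2x3 affine transform ((a, b, tx), (c, d, ty)) to (x, y)."""
    (a, b, tx), (c, d, ty) = theta
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def invert_affine(theta):
    """Return the 2x3 affine transform that undoes theta."""
    (a, b, tx), (c, d, ty) = theta
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("affine transform is not invertible")
    # Inverse of the 2x2 linear part.
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    # The inverse translation is minus the inverse linear part times (tx, ty).
    return (
        (ia, ib, -(ia * tx + ib * ty)),
        (ic, id_, -(ic * tx + id_ * ty)),
    )


# Hypothetical transform parameters: scale x by 2, y by 0.5, then shift.
theta = ((2.0, 0.0, 1.0), (0.0, 0.5, -3.0))
gamma = invert_affine(theta)

joint_in_image = (4.0, 8.0)
joint_in_crop = apply_affine(theta, joint_in_image)
joint_restored = apply_affine(gamma, joint_in_crop)
```

Applying `gamma` after `theta` returns the original coordinates, which is the property needed to place the estimated joints back in the original image.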

Comparison

"CMU" indicates Real-time Multi-Person 2D Pose Estimation using Part Affinity Fields, Cao et al., CVPR 2017.

Method   MPII   COCO
Ours     77.4   64.7
CMU      75.6   61.8

Computer Vision

Machine can See


Object level Super object

Part level

Computer Vision Big Picture

Machine can See

Machine can Act

Without Action…

Without Action…

To acquire perception, we need daily action indeed!

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., et al. (2016, February 5). Asynchronous Methods for Deep Reinforcement Learning. arXiv.org.

Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., & Farhadi, A. (2016, September 17). Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning. arXiv.org.

Reinforcement Learning

Shuqin Xie, Cewu Lu (corresponding author), Reinforcement Learning for Pose Estimation

Yurong You, Cewu Lu (corresponding author), Reinforcement Learning for Car Self-Driving

Learning steps:
1. Low speed straight
2. Low speed curve
3. Stuck
4. High speed straight
5. Low speed curve
6. Collision

Virtual to Real Reinforcement Learning for Autonomous Driving (with Berkeley)

Yurong You, Xinlei Pan, Ziyan Wang, and Cewu Lu, arXiv:1704.03952v1

B-RL: training the vehicle in the virtual car racing simulator TORCS [31] with virtual images as input

Method   Ours     B-RL
Result   43.40%   28.33%

https://arxiv.org/abs/1704.03952v1

The pain point of reinforcement learning: interaction! Interaction! Interaction!

Why is everything virtual?!


Visual Intelligence Big Picture

Machine can See

Machine can Act Machine has Knowledge

ShapeNet (Stanford, Princeton, Adobe)

A Scalable Active Framework for Region Annotation in 3D Shape Collections. ACM Transactions on Graphics (ACM SIGGRAPH Asia 2016) (with Stanford, Adobe, UCB)

Editable real world

Promising for one-shot learning

Unsupervised Image Group Distance Inference. ZhengTian Xu, Cewu Lu (corresponding author), to be submitted to arXiv soon


See

Act Knowledge

See: finer and finer
• Object recognition (2013)
• Detection (2014)
• Segmentation (2015)
• Part level, such as pose (2016)

My goal: (1) information exploration beyond object level to mine high-level semantics and better object-level recognition (partly solve the long-tail effect).


See

Act Knowledge


My direction: information exploration beyond object level to mine high-level semantics and better object-level recognition (partly reduce the long-tail effect).

Not merely increasing the quantity of data, but its depth (information content)!




See

Act Knowledge


Challenges:

(1) How do we benchmark whether we visually understand the world?

Subjective part (task-driven) + objective part (doing the task)

My thinking: leave it to Act

(2) How to obtain low-shot (even one-shot) learning?

My thinking: leave it to Knowledge

Our lab is recruiting students... please spread the word...

My Research Directions

Machine can See


• Better performance (deep learning)!
• Sub- and super-object level
• In video and image

• Real-world interaction
• Learning speed
• Reward function (inverse RL)
• Huge action space

• One-shot learning with O(1) effort
• Visual knowledge base (self-learning and scale-up)

Applications

Machine can See


11 students:
• Pose estimation
• Video action understanding
• Visual relationship
• Object detection
• Deep learning on mobile phones

9 students:
• Autonomous car
• Robot arm
• Autonomous navigation

5 students

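Send email to: [email protected]

Enrolling in 2018: master's students, Ph.D. students, and postdocs (salary is negotiable, funding is not a problem)

Benefits: recommendations for summer internships at top North American universities. Successful recommendations this year: Stanford (vision group), MIT, CMU

Current group members come from: the SJTU ACM class, the Fudan ACM team, the USTC Special Class for the Gifted Young, and the Chu Kochen Honors College of Zhejiang University. On average, 1.6 National Scholarships per person.

For 2018 enrollment, two offers have been made so far: a top-three student of the SJTU Computer Science department (first author at ICCV 2017) and a top-three student of the SJTU Electronic Engineering department. Places are still available...

Interns are welcome! • Students who have interned with us so far come from UC Berkeley, HKUST, and Zhejiang University. We provide accommodation.

Thanks!

Follow our lab's WeChat official account: MVIG@SJTU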
