From pixel to Visual Intelligence · valser.org/2017/ppt/vooc/valse_2017_lcw.pdf · yao xiao, cewu lu,...
TRANSCRIPT
From pixel to Visual Intelligence
Speaker: Cewu Lu (卢策吾)
Shanghai Jiao Tong University (上海交通大学)
Outline
• About me.
• My understanding of the computer vision big picture.
• My research within that big picture.
About Me
• Professor • Ph.D. supervisor • National Thousand Young Talents Program for overseas scholars (国家青年千人计划)
Machine Vision and Intelligence Group
Before I joined SJTU
Postdoc and research fellow with
Prof. Fei-Fei Li, Director of the Stanford AI Lab
Prof. Leonidas J. Guibas, member of the U.S. National Academy of Engineering (NAE)
• Stanford-Toyota Self-Driving Cars: core member
• Published (accepted) 21 CVPR/ICCV/PAMI/IJCV papers (77% as first author); 31 CCF-A papers
• Most-cited SIGGRAPH paper of the past 5 years, among 1000+ papers
• Two papers are included in OpenCV
Computer Vision
Machine can See
NSF white paper: Let machines see like humans
Pixel level, Patch level, Object level, Super object, Human Understanding
[SIFT Feature, 2004]
[Deep Learning, 2012]
[Figure: map of vision tasks over image, video, and RGBD data, arranged along Pixel level, Patch level, Object level (recognition), Human Understanding. Tasks shown: Gradient Processing, Image Abstraction, Stereo, Deblur, Denoise, Patch Representation, Saliency, Scene Parsing, Point cloud segmentation, Tracking, Face, 3D reconstruction, Object Detection, Fine-grained, Scene Understanding, Action Recognition, Event Understanding, Visual QA, Image2caption, Video storying]
My Research Work
[Figure: the same task map labeled with publication venues: SIGGRAPH Asia, IJCV, CVPR, ICCV, ECCV, TIP, TVCG, PAMI, ICCP]
Representative Work on Patch Level
L0-norm smoothing
Cewu Lu*, Li Xu*, Yi Xu, Jiaya Jia, "Image Smoothing via L0 Gradient Minimization", ACM Transactions on Graphics, Vol. 30, No. 5 (SIGGRAPH Asia 2011). * indicates co-first author
Main Structure Extraction
[Figure panels: Smoothing result; Ours, Extracted Edge; Canny, Extracted Edge]
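The idea behind L0 gradient smoothing can be sketched in 1D. This is a minimal illustration, not the authors' code: it assumes the published objective (a squared data term plus lambda times the number of non-zero gradients) solved by alternating minimization with an auxiliary gradient variable. The function name, the circular boundary handling, and the default parameters are choices made here for illustration.

```python
import numpy as np

def l0_smooth_1d(signal, lam=0.02, kappa=2.0, beta_max=1e5):
    """Approximately minimize sum((S - I)^2) + lam * #{p : S[p+1] != S[p]}.

    Illustrative sketch: circular boundary, forward differences,
    alternating minimization with an auxiliary gradient variable h.
    """
    I = np.asarray(signal, dtype=float)
    n = I.size
    # Circular forward difference (S[p+1] - S[p]) written as a convolution kernel.
    kernel = np.zeros(n)
    kernel[0] = -1.0
    kernel[-1] = 1.0
    K = np.fft.fft(kernel)
    I_hat = np.fft.fft(I)

    S = I.copy()
    beta = 2.0 * lam
    while beta < beta_max:
        # h-step: keep a gradient only if it is worth paying lam for it.
        grad = np.roll(S, -1) - S
        h = np.where(grad ** 2 > lam / beta, grad, 0.0)
        # S-step: quadratic problem, solved exactly in the Fourier domain.
        numerator = I_hat + beta * np.conj(K) * np.fft.fft(h)
        denominator = 1.0 + beta * np.abs(K) ** 2
        S = np.real(np.fft.ifft(numerator / denominator))
        beta *= kappa
    return S

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.concatenate([np.zeros(50), np.ones(50)])
    noisy = clean + 0.03 * rng.standard_normal(clean.size)
    smooth = l0_smooth_1d(noisy)
    print("non-trivial gradients before:", int(np.sum(np.abs(np.diff(noisy)) > 0.01)))
    print("non-trivial gradients after: ", int(np.sum(np.abs(np.diff(smooth)) > 0.01)))
```

Larger values of lam remove more edges; the 2D image case uses the same two steps with horizontal and vertical gradients.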
Stationary Estimation
L0 Regularized Stationary Time Estimation for Crowd Group Analysis, [CVPR 2014] [PAMI 2016]
Abnormal Event Detection at 1000 FPS
[Cewu Lu et al. ICCV]
Cewu Lu, Jianping Shi, Jiaya Jia. Abnormal Event Detection at 150 FPS in MATLAB, IEEE International Conference on Computer Vision [ICCV 2013] [IJCV 2017]
Results (UCSD Ped1 Dataset)
MPPCA: [Mahadevan et al. 2009] MPPCA+SF: [Mahadevan et al. 2009] SF: [Mahadevan et al. 2009] MDT: [Mahadevan et al. 2009] Sparse: [Cong et al. 2011] Adam: [Adam et al. 2008]
Pixel-level comparison. FPR: False Positive Rate. TPR: True Positive Rate. Subspace: replacing our combination learning by [Ehsan et al. 2009].
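For readers unfamiliar with the two rates, here is a minimal sketch of how TPR and FPR can be computed from a predicted binary mask and a ground-truth mask. It uses plain per-pixel counting as a simplifying assumption; the benchmark's actual pixel-level protocol may differ, and the function name is hypothetical.

```python
import numpy as np

def tpr_fpr(pred_mask, gt_mask):
    """Return (TPR, FPR) for two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    tp = np.sum(pred & gt)     # abnormal pixels correctly flagged
    fn = np.sum(~pred & gt)    # abnormal pixels missed
    fp = np.sum(pred & ~gt)    # normal pixels wrongly flagged
    tn = np.sum(~pred & ~gt)   # normal pixels correctly left alone
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return float(tpr), float(fpr)

if __name__ == "__main__":
    gt = np.array([1, 1, 0, 0])
    pred = np.array([1, 0, 1, 0])
    print(tpr_fpr(pred, gt))
```

Sweeping the detection threshold and plotting TPR against FPR gives the ROC curve used for this kind of comparison.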
Results
Method                    Sec per Frame   Platform      CPU      Memory
[Mahadevan et al. 2009]   25              -             3.0GHz   2.0GB
[Cong et al. 2011]        3.8             -             2.6GHz   2.0GB
[Antic et al. 2011]       10              MATLAB        -        -
Ours                      0.00098         MATLAB 2012   3.4GHz   8.0GB
Testing time comparison on the UCSD Ped1 dataset.
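The per-frame times in the table convert directly to frames per second, which is where a figure on the order of 1000 FPS comes from. A small sketch of that arithmetic, using only the numbers listed above (the dictionary and function names are introduced here for illustration):

```python
# Seconds per frame as listed in the testing-time comparison (UCSD Ped1).
SEC_PER_FRAME = {
    "Mahadevan et al. 2009": 25.0,
    "Cong et al. 2011": 3.8,
    "Antic et al. 2011": 10.0,
    "Ours": 0.00098,
}

def frames_per_second(sec_per_frame):
    """Convert a per-frame processing time into a frame rate."""
    return 1.0 / sec_per_frame

def speedup_over(baseline, method="Ours"):
    """How many times faster `method` is than `baseline`."""
    return SEC_PER_FRAME[baseline] / SEC_PER_FRAME[method]

if __name__ == "__main__":
    for name, sec in SEC_PER_FRAME.items():
        print(f"{name}: {frames_per_second(sec):.2f} FPS")
    print(f"Speed-up over Cong et al. 2011: {speedup_over('Cong et al. 2011'):.0f}x")
```

Note that the platforms and CPUs differ between rows, so the speed-up figures are indicative rather than a controlled comparison.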
Results
[Figure: Others vs. Ours]
Representative Work on Object Level
Personal Object Discovery [Cewu Lu et al. TIP]
Object Scene Distribution
Highlight Projects (Personal Object Discovery)
Cewu Lu, Renjie Liao, Jiaya Jia, "Personal Object Discovery", IEEE Transactions on Image Processing.
Weather Understanding [Cewu Lu et al. CVPR 2014] [Cewu Lu et al. TPAMI 2014]
Highlight Projects (Weather classification)
Cewu Lu, Di Lin, Jiaya Jia, Chi-Keung Tang, "Two-class Weather Classification", IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2014, (TPAMI) 2017.
Real-Time Video Stylization Using Object Flows. Cewu Lu, Yao Xiao, and Chi-Keung Tang. IEEE Transactions on Visualization and Computer Graphics (TVCG), 2017.
Combining Sketch and Tone for Pencil Drawing Production. Cewu Lu, Li Xu, Jiaya Jia. Non-Photorealistic Animation and Rendering (NPAR), 2012 (Best Paper Award).
Cewu Lu et al. Real-Time Video Stylization Using Object Flows
Papers (Object Detection)
Cewu Lu, Hao Chen, Qifeng Chen, Hei Law, Yao Xiao, Chi-Keung Tang ECCV 2014 workshop - ImageNet Large Scale Visual Recognition Challenge
Di Lin, Xiaoyong Shen, Cewu Lu, Jiaya Jia, Deep LAC: Deep Localization, Alignment and Classification for Fine-grained Recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015.
Yao Xiao, Cewu Lu, Chi-Keung Tang, Complexity-Adaptive Distance Metric for Object Proposals Generation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015.
Cewu Lu, Yongyi Lu, CK Tang, Efficient Square Localization for Efficient and Accurate Object Detection, submitted to IEEE International Conference on Computer Vision (ICCV), 2015.
Cewu Lu, Yongyi Lu, CK Tang, Explicit Closed-Curve Optimization for Objectness Estimation, submitted to IEEE International Conference on Computer Vision (ICCV), 2015.
Cewu Lu, Yongyi Lu, CK Tang, Unobjectness for Object Proposals Generation, submitted to IEEE International Conference on Computer Vision (ICCV), 2015.
Comparison to NLP
Computer Vision: Pixel level, Patch level, Object level, Human Understanding
[Deep Learning 2012]
Natural Language Understanding: Noun, Verb, Phrase, Sentence, Story
What can I do here?
Representative Work on Beyond Object Level
Visual Relationship Detection with Language Priors
Cewu Lu, Ranjay Krishna, Michael Bernstein, Li Fei-Fei. ECCV 2016 (oral) (reported by ECCV Daily)
Detecting <Subject, Predicate, Object>
Difficulties
(1) Individual detection errors are huge (under 5%)
(2) Training data is sparse
Subject: 100 classes
Predicate: 70 classes
Object: 100 classes
<Subject, Predicate, Object>: 700,000 classes
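The 700,000 figure is simply the product of the three class counts, and it shows why training data is sparse when every triple is treated as its own class. A short sketch of the arithmetic; the comparison with modelling objects and predicates separately assumes subjects and objects share one set of 100 categories, which is an assumption made here rather than something stated on the slide.

```python
N_SUBJECT = 100    # subject classes
N_PREDICATE = 70   # predicate classes
N_OBJECT = 100     # object classes

def joint_triple_classes(n_subject, n_predicate, n_object):
    """Number of classes if every <subject, predicate, object> triple is its own class."""
    return n_subject * n_predicate * n_object

def separate_classes(n_object_categories, n_predicate):
    """Number of classes if object categories and predicates are modelled separately.

    Assumes subjects and objects are drawn from the same category set.
    """
    return n_object_categories + n_predicate

if __name__ == "__main__":
    print("joint triples:  ", joint_triple_classes(N_SUBJECT, N_PREDICATE, N_OBJECT))
    print("separate models:", separate_classes(N_OBJECT, N_PREDICATE))
```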
Linkage from Language Prior
Person-ride-horse
Person-ride-elephant
Person-ride-moto
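One way to picture the linkage is with word vectors: relationships whose words are close in a language embedding end up close to each other, so a rarely seen triple can borrow evidence from a frequent one. The toy sketch below uses hand-made 2-d vectors that are purely hypothetical (they are not real word2vec values), and concatenation followed by cosine similarity is an illustrative choice, not necessarily the paper's formulation.

```python
import numpy as np

# Hypothetical toy "word vectors", invented for illustration only.
TOY_VECTORS = {
    "person":   np.array([1.0, 0.0]),
    "ride":     np.array([1.0, 0.0]),
    "wear":     np.array([0.0, 1.0]),
    "horse":    np.array([1.0, 0.2]),
    "elephant": np.array([0.9, 0.3]),
    "hat":      np.array([0.1, 1.0]),
}

def relationship_vector(subject, predicate, obj):
    """Represent a triple by concatenating the vectors of its three words."""
    return np.concatenate([TOY_VECTORS[subject], TOY_VECTORS[predicate], TOY_VECTORS[obj]])

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    ride_horse = relationship_vector("person", "ride", "horse")
    ride_elephant = relationship_vector("person", "ride", "elephant")
    wear_hat = relationship_vector("person", "wear", "hat")
    print("ride-horse vs ride-elephant:", cosine_similarity(ride_horse, ride_elephant))
    print("ride-horse vs wear-hat:     ", cosine_similarity(ride_horse, wear_hat))
```

With these toy values the two riding relationships are far more similar to each other than either is to the wearing relationship, which is the kind of linkage the slide points at.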
• Discover and predict relationships in image.
Mining relationship tuples:
<man, wear, glass>
<man, carry, bag>
<Car, on, ground>
<trash bin, next to, streetlight>
…………
Some Results
Using relationship: Human-ride-horse
Accuracy
Slide for more details: http://cs.stanford.edu/people/ranjaykrishna/vrd/slides.pdf
A problem: sub-object-level information is missed!
Leg stamps on something
Scale pan is stamped by something
Beneath Holistic Object Recognition
Richer semantics on parts helps to infer the story.
sth sits on saddle; wheel in the air; wheel on sth; sth holds handlebar
sth touches head; leg in the air; leg on sth; torso wears sth
head with bridle rein; sth rides torso; torso wears sth; leg in the air; leg on sth
hand embraces sth; torso sits on sth; leg is bent
sth sits on saddle; sth holds handlebar; wheel on sth; wheel on sth
Beyond Holistic Object Recognition: Enriching Image Understanding with Part States. Cewu Lu et al. (with Stanford University), arXiv:1612.07310
Beyond Holistic Object Recognition
Regional Multi-person Pose Estimation. Haoshu Fang, Shuqin Xie, Cewu Lu (corresponding author)
arXiv:1612.00137v2
SST network
STN: spatial transformer network
SDTN: spatial de-transformer network
SPPE: single-person pose estimator
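Reading the abbreviations together, the SDTN maps the pose estimated on the transformed region back to the original image, i.e. it undoes the STN. A minimal sketch of that relationship, assuming the STN applies a 2D affine transform with parameters theta = [A | t]; the function names are hypothetical and this is an illustration of the inverse-transform idea, not the network code.

```python
import numpy as np

def inverse_affine(theta):
    """Given theta = [A | t] (shape 2x3), return [A^-1 | -A^-1 t]."""
    theta = np.asarray(theta, dtype=float)
    A, t = theta[:, :2], theta[:, 2]
    A_inv = np.linalg.inv(A)
    return np.hstack([A_inv, (-A_inv @ t)[:, None]])

def apply_affine(theta, points):
    """Apply x -> A x + t to an (N, 2) array of points."""
    theta = np.asarray(theta, dtype=float)
    points = np.asarray(points, dtype=float)
    return points @ theta[:, :2].T + theta[:, 2]

if __name__ == "__main__":
    theta = np.array([[2.0, 0.0, 1.0],
                      [0.0, 0.5, -3.0]])   # illustrative STN parameters
    joints = np.array([[10.0, 20.0], [15.0, 40.0]])
    transformed = apply_affine(theta, joints)                     # "STN" direction
    restored = apply_affine(inverse_affine(theta), transformed)   # "SDTN" direction
    print(np.allclose(restored, joints))
```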
Comparison
"CMU" indicates Real-time Multi-Person 2D Pose Estimation using Part Affinity Fields, Cao et al., CVPR 2017
        MPII   COCO
Ours    77.4   64.7
CMU     75.6   61.8
Computer Vision
Machine can See
Pixel level, Patch level, Part level, Object level, Super object, Human Understanding
Computer Vision Big Picture
Machine can See
Machine can Act
Without Action…
To acquire perception, we really do need everyday action!
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., et al. (2016, February 5). Asynchronous Methods for Deep Reinforcement Learning. arXiv.org.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., & Farhadi, A. (2016, September 17). Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning. arXiv.org.
Reinforcement Learning
Suiqin Xie, Cewu Lu (corresponding author). Reinforcement learning for pose estimation
Yourong You, Cewu Lu (corresponding author). Reinforcement learning for self-driving cars
Learning steps: 1. Low speed straight 2. Low speed curve 3. Stuck 4. High speed straight 5. Low speed curve 6. Collision
Virtual to Real Reinforcement Learning for Autonomous Driving (with Berkeley)
Yurong You, Xinlei Pan, Ziyan Wang, and Cewu Lu, arXiv:1704.03952v1
B-RL: training the vehicle in the virtual car racing simulator TORCS [31] with virtual images as input
Method   Ours     B-RL
Result   43.40%   28.33%
The pain point of reinforcement learning: interaction! Interaction! Interaction!
Why is everything virtual?!
Visual Intelligence Big Picture
Machine can See
Machine can Act Machine has Knowledge
ShapeNet (Stanford, Princeton, Adobe)
A Scalable Active Framework for Region Annotation in 3D Shape Collections. ACM Transactions on Graphics (ACM SIGGRAPH Asia 2016) (with Stanford, Adobe, UCB)
editable Real-world
Promising for one-shot learning
Unsupervised Image Group Distance Inference. ZhengTian Xu, Cewu Lu (corresponding author). Will be submitted to arXiv soon
From pixel to Visual Intelligence
See
Act Knowledge
See: finer and finer
Object recognition (2013)
Detection (2014)
Segmentation (2015)
Part level such as pose (2016)
My goal: explore information beyond the object level to mine high-level semantics and improve object-level recognition (partly addressing the long-tail effect).
Not just increasing the quantity of data, but its depth (information content)!
Challenges:
(1) How do we benchmark whether we visually understand the world?
Subjective part (task-driven) + objective part (doing that)
My thinking: leave it to Act
(2) How do we achieve low-shot (even one-shot) learning?
My thinking: leave it to Knowledge
Our lab is recruiting students... please spread the word...
My Research Directions
Machine can See
Machine can Act Machine has Knowledge
Better performance (deep learning)! Sub- and super-object level. In video and image.
• Real-world interaction • Learning speed • Reward function (inverse RL) • Huge action space
One-shot learning with O(1) effort. Visual knowledge base (self-learning and scale-up).
Applications
Machine can See
Machine can Act Machine has Knowledge
11 students: Pose estimation, Video action understanding, Visual relationship, Object detection, Deep learning on mobile phones
9 students: Auto-car, Robot arm, Auto-navigation
5 Students
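Send email here: [email protected]
Enrollment in 2018: master's students, Ph.D. students, and postdocs (salary is negotiable; funding is not a problem)
Benefits: recommendations for summer internships at top North American universities. Successful recommendations this year: Stanford (vision group), MIT, CMU
Current group members come from: the SJTU ACM Class, the Fudan ACM team, the USTC Special Class for the Gifted Young, and Zhejiang University's Chu Kochen Honors College. Members average 1.6 National Scholarship awards each.
For 2018 enrollment, two offers have been made so far: a top-three student from the SJTU computer science department (first author, ICCV 2017) and a top-three student from the SJTU electronic engineering department. Places are still available...
Interns welcome! • Past interns have come from UC Berkeley, HKUST, and Zhejiang University. We provide accommodation.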
Thanks!
Follow our lab's WeChat official account: MVIG@SJTU