
Deep-Learning Image Segmentation

Towards Tea Leaves Harvesting by Autonomous Machine

by Antony Ducommun dit Boudry

Bachelor of Science, HES-SO, August 2018

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science HES-SO

in the department of Ingénierie des technologies de l'information, Orientation Informatique matérielle

Supervised by Dr. Lixin Ran, Zhejiang University, China, and Dr. Andres Upegui, HES-SO, Switzerland

This thesis is licensed under Creative Commons Attribution-ShareAlike 4.0 International


Abstract

Tea beverages are made from the leaves of an evergreen shrub called "Camellia Sinensis", native to east and southeast Asia as well as the Indian subcontinent. To make the highest quality Chinese tea, suitable young leaf and leaf bud combinations are still picked by human hands. Harvesting requires a dedicated workforce, which has become more and more difficult to source in recent years near Hangzhou, Zhejiang province. Hence, ongoing research is studying how to improve harvest efficiency and processes in China.

In the scope of this project, the typical work environment on tea fields will be characterized, existing approaches analyzed and potential solutions outlined. The objective of this research is to explore if and how an automated or semi-automated harvesting machine could be guided to identify and localize suitable young tea leaves. Because the scope of research needed to achieve the complete objective is tremendously large, the analyzed solution in this paper is confined to the implementation and evaluation of a deep-learning model (CNN/U-Net) capable of segmenting suitable tea leaves in an image stream.


First of all, I would like to thank everyone who helped make this research project a reality: the HES-SO in Switzerland and Zhejiang University in China, Professor Martial Geiser who established this inter-university exchange program, Professor Ran Lixin who offered this research opportunity, Professor Chen Ping who provided information about tea plantations, and my dear wife and child who were supportive in this adventure.

A special thank you to Yi "William" Zhang, who was very helpful in guiding us through the various formalities required to establish ourselves temporarily in Hangzhou.

Last but not least, thank you to all the academic researchers and contributors of the great open-source software whose work is used throughout this project.


Table of Contents

1 Introduction
  1.1 A Brief History of Tea
  1.2 The Tea Industry
  1.3 Tea Plantations in Zhejiang Province
  1.4 Tea Harvesting in China
  1.5 Green Tea Manufacturing Process

2 Environmental Characteristics
  2.1 Geography
  2.2 Weather in Hangzhou
  2.3 Plantation Topologies

3 Problem Definition
  3.1 Motivation
  3.2 Known and Potential Solutions
    3.2.1 Labor Sourcing and Incentives
    3.2.2 Plucking Machines
    3.2.3 Pruning Logistics
    3.2.4 Leaf Sorting Machines
  3.3 Selected Approach
  3.4 Previous Work
  3.5 Scope of Research

4 Dataset Creation
  4.1 Methodology
  4.2 Pictures Acquisition
    4.2.1 Hardware
    4.2.2 Locations
  4.3 Camera Calibration
    4.3.1 Results
  4.4 3D Reconstruction
    4.4.1 Hardware
    4.4.2 OpenMVG and OpenMVS
    4.4.3 Meshlab and CloudCompare
    4.4.4 MVE and TSR
    4.4.5 Remarks
  4.5 3D Tagging Tool
    4.5.1 Hardware
    4.5.2 Implementation
    4.5.3 Results
  4.6 2D Tagging Tool
    4.6.1 Hardware
    4.6.2 Implementation
    4.6.3 Results
  4.7 Post-processing
    4.7.1 Mask Cleaning
    4.7.2 Per-pixel Weighting
    4.7.3 Geometric Variants
    4.7.4 Photometric Variants
    4.7.5 Training, Test and Evaluation Sets
    4.7.6 Building Final Datasets
  4.8 Results
  4.9 Conclusion

5 Machine-Learning Applied to Computer Vision
  5.1 Methodology
  5.2 Machine-Learning
    5.2.1 Artificial Neural Networks
    5.2.2 Deep Learning
    5.2.3 Convolutional Neural Networks
    5.2.4 Fully Convolutional Neural Networks
    5.2.5 Pre-trained Layers
    5.2.6 GoogLeNet
    5.2.7 ResNet
    5.2.8 U-Net
    5.2.9 DARTS
  5.3 Model Hyper-Parameters
    5.3.1 Batch Size, Steps and Epochs
    5.3.2 Loss Functions
    5.3.3 Weighted Per-pixel Loss
    5.3.4 Learning Schedules
    5.3.5 Optimizers
    5.3.6 Activation Functions
    5.3.7 Kernel Initializers
    5.3.8 Kernel and Gradient Regularizers
    5.3.9 Local Response Normalization
    5.3.10 Batch Normalization
    5.3.11 Dropout Rate
  5.4 Implementation
    5.4.1 TensorFlow
    5.4.2 Topology Template
    5.4.3 Metrics
    5.4.4 Training and Validation
    5.4.5 Scaling-up Using GPUs
  5.5 Model Architectures and Parameters
    5.5.1 High-level TensorFlow Graph
    5.5.2 Building Blocks
    5.5.3 Evaluated Topologies
  5.6 Results
    5.6.1 Tested Hyper-Parameters
    5.6.2 Remarks
    5.6.3 Flower Dataset
    5.6.4 Tea Leaves Dataset
  5.7 Conclusion

6 Future Work

Appendices
  OpenGL Shaders
  MVE Stability Patch
  Topology Format
  Training Parameters Format
  Official Requirements
  Official Summary

List of Tables

List of Figures

Bibliography


Chapter 1

Introduction

This first chapter presents some general knowledge about tea production and history.

Figure 1.1: Camellia Sinensis (source: Wikipedia).


1.1 A Brief History of Tea

Tea plants or bushes all belong to the same species, Camellia Sinensis, divided into three varieties, namely sinensis (small leaves), cambodi (medium leaves) and assamica (big leaves) [1, 2]. Thousands of subvarieties and cultivars exist today. The species is commonly found in tropical and subtropical climates, and tea plants of the finest quality grow at altitudes of up to 1500 meters. While the leaves can measure up to 15 centimeters long, the Chinese variety tends to have smaller mature leaves of up to four centimeters [3].

At the time of writing, the earliest physical evidence of tea preparation and consumption dates to around the second century BC, in China. On the other hand, myths and legends attribute this invention to a period somewhere between 2000 and 3000 years BC. Records show that tea was cultivated and prepared as a medical treatment during the Han Dynasty (202 BC - 220 AD). Then, new processing techniques gradually emerged to change the flavors and store tea longer, for instance: steaming, brewing, drying, fermenting and grinding tea into a powder [4].

By the 8th century, tea was also prepared and drunk for its refined taste and fragrance. This beverage began spreading to other countries, such as Japan, Korea and Vietnam. Similarly to other rare and refined items, like coffee in Europe, tea drinking became a rite, associated with numerous virtues, ceremonies and legends. Up to this day, the finest teas are brewed and served in very specialized teawares by a skillful master [5].

1.2 The Tea Industry

Tea drinks are very popular worldwide, ranking second only to water [6]. Accordingly, the tea market is considerably large, estimated to be worth between 10 and 40 billion USD in 2016. Tea is mostly grown commercially and at scale in China, India, Kenya, Sri Lanka and Turkey. Green, yellow, white and oolong teas are currently the most sought after and therefore the most lucrative tea products [7, 8].

Similarly to wine production, the soil, plant variety or cultivar, and the harvest and production methodology, among other factors, play a great role in the look, flavor and fragrance of the final product. For example, Chinese and Japanese green teas are nothing alike.

In China, the following types of tea are produced from freshly plucked leaves and buds. It is interesting to note that China is the only country in the world producing the complete range of the six kinds of tea listed here [1].

Green (绿茶): sun/tumble/oven-dried or basket/pan-fired or steamed.

Yellow (黄茶): similar to green tea with additional steaming to modify flavor.

White (白茶): made of the youngest/freshest leaves, minimal processing, could be lightly oxidized (less than 10%).


Oolong (乌龙茶): partially oxidized (15% to 85%), the most complicated to manufacture.

Black or Red (紅茶): fully oxidized (100%).

Pu-erh (黑茶): fully oxidized and fermented, can be aged and kept in storage for a long time.

Figure 1.2: Chinese tea products, from left to right: white, green and black tea (source: Wikipedia).

1.3 Tea Plantations in Zhejiang Province

Near the city of Hangzhou in Zhejiang, China, can be found the famous plantations of Longjing (龙井茶, literally dragon well tea). Plants are cultivated on the slopes of the small mountains surrounding Longjing village. The harvest is performed by hand and the green tea is pan-roasted, also manually for the highest quality variants. The harvest period usually spans only from March until the end of April. The harvest is divided between before and after the Qingming festival (清明节, April 4th, 5th or 6th depending on the year), the former being of highest quality and therefore more sought after. The bushes are then cut down drastically and parasite treatments applied in June. The trees grow until October, when they are pruned again before the winter season. There are also numerous similar exploitations in the surrounding area, some of which work under longer and multiple harvesting periods.

About two hours away from Hangzhou, still in Zhejiang province, is another famous mountain green tea called Anji bai cha (安吉白茶). Lastly, one notable green tea variation produced in the same province is the Gunpowder tea (珠茶, literally pearl leaf tea). Many other producers can be found in the province and, in fact, most teas produced in Zhejiang are of the green variety [9].

Green teas are also produced in other Chinese provinces and in foreign countries, with varying quality, pricing and methodology.


1.4 Tea Harvesting in China

Tea flushes are plucked by hand in the following ways:

One bud only: rare, for special teas (green, yellow, white, black).

One bud and one leaf: sometimes, for high quality teas (green, yellow, white).

One bud and two leaves: frequent, for standard teas.

Up to four leaves on a stem: frequent, reserved for oolong and black teas.

When making green tea, the standard is one bud and one leaf, also called the imperial cut, in early spring for the highest quality harvest, and one bud and two leaves during the rest of the year.

By contrast, this task is performed mostly with machines in Japan and Korea, increasing efficiency, although the resulting leaf mixture is irregular and would be regarded as a low-quality harvest in China. By some estimates, a worker using such a machine in Japan can cut up to 130 kilograms per day, while a person in China can gather at most 13 kilograms daily by hand-plucking [1].

1.5 Green Tea Manufacturing Process

After the freshly harvested leaves have reached the tea factory or artisan, they are exposed to air for a short while in large flat baskets or on straw mats. Once they are deemed ready for the next phase by the tea master, the leaves are processed in a way that reduces their water content while keeping all the aroma inside the leaves.

In Longjing and the surrounding area, this is done manually by compressing them gently in a large electric pan, until the leaves are flattened down and tightly packed. In other places, the tea leaves could simply be dried without further processing in terms of shape, or they could undergo a longer sequence of manipulations.

Green tea manufacturing outside of the Chinese mainland can be quite different from the process listed above. In particular, Japanese green teas are first steamed before further processing.


Figure 1.3: Freshly plucked leaves air-drying on a straw mat.

Figure 1.4: Longjing pan-firing equipment.

Figure 1.5: Longjing green tea in its final state.


Chapter 2

Environmental Characteristics

The next sections present a few notes about environmental factors to consider when designing a future machine capable of functioning in tea fields.

Figure 2.1: Typical tea plantation near Hangzhou, Zhejiang, China.


2.1 Geography

Camellia Sinensis can be found up to 1500 meters above sea level in tropical and sub-tropical regions of the world. Tea fields are grown on mainly two kinds of surface: flat patches of land and mountain flanks. The general consensus is that higher altitude plantations, whether on plateaus, terraces or slopes, produce finer raw material, namely tea leaves, which can be processed into higher quality products. The main reason is that the average temperature is lower and therefore the leaf growth slower, retaining more flavor over time.

The optimal soil must be acidic and tea plants require at least 127 centimeters of rain annually.

In Hangzhou, the plantations are found in plains, valleys and on the slopes of nearby mountains, up to 300 meters above sea level.

2.2 Weather in Hangzhou

The local temperature ranges from zero degrees Celsius in winter to close to forty in summer. The average air humidity is around 75% and the total yearly precipitation reaches up to 150 [cm] [10]. There is a peak of rainwater during the month of June and generous albeit short rainfalls throughout the summer. These are close to ideal weather conditions for growing Camellia Sinensis.

The sky is usually clear and the sun shines strongly from about 6 AM to 6 PM daily (spring to autumn). A future harvesting machine could benefit from photovoltaic panels providing some of the power required to function.

2.3 Plantation Topologies

The plants can reach 16 meters in height, but in practice they are pruned down to about one meter to facilitate harvest. They are planted in three major layouts:

A Straight and regular rows up to 1.5 [m] wide, on large flat grounds: highly suitable for large-scale automation.

B Curved and irregular rows up to 1.5 [m] wide, on terraces, smaller areas or on steep mountain slopes: pose a serious challenge to automation.

C Circular bushes up to 1 [m] in diameter, usually not planted in a particular pattern, mostly in mountain areas: very difficult for automation.

In Longjing (Hangzhou) and at other small-scale artisanal tea producers, large areas are divided into small irregular parcels owned by different families. The two kinds of plantations observed in this region were cases B and C described above.


Figure 2.2: Plantation in rows.

Figure 2.3: Plantation in circular bushes.


Chapter 3

Problem Definition

Why would there be a need for more automation in the Chinese tea industry? What are the existing and potential solutions? What could a contemporary robotic machine reasonably achieve in a tea plantation? These are some of the questions explored in this chapter.

Figure 3.1: A mechanical harvesting machine in Japan (source: discoveringtea.com).


3.1 Motivation

China has been undergoing massive urbanization for decades and this trend is forecast to continue. At the end of 2016, about two thirds (67%) of the population in Zhejiang province was living in cities [11]. By comparison, only 19% of the population on average were urban residents across China in 1980 [12]. This ratio in Zhejiang province is expected to grow to 80% by the year 2030 [13]. The advantages of a modern life for city dwellers mean that traditional rural areas are getting less and less populated, especially by younger generations.

At the same time, a higher fraction of Chinese citizens is pursuing longer studies, with over 87% finishing high school, of which around 60% pursue a university-level diploma [11, 14]. This highly qualified and specialized workforce aims to find matching job opportunities in the secondary or tertiary sectors of the industry, with higher demands in terms of wage level, working conditions and required expertise.

Additionally, tea harvesting cannot be exercised as a full-time job throughout the year. It is a punctual activity with peak demand in spring and autumn.

These three factors, among others, are putting pressure on tea manufacturers by increasing labor costs. Indeed, according to a few Zhejiang tea producers, it is getting more difficult to find suitable and affordable workers nearby to harvest their fields. The long and difficult working conditions, including heavy sunshine, high temperatures up to 40 [°C] and frequent mosquito bites, discourage potential candidates who can find better opportunities elsewhere.

An equilibrium must be found between the costs inherent to tea production and the revenue generated, namely: what sort of tea is being produced, at which quality, and what consumers are willing to pay for it.

The dream of Chinese tea producers, especially tea artisans, is to be able to continue the manufacture of traditional tea products and offer them to the local population and the rest of the world. At the same time, they would like to do so with controlled economic risks and with the highest profit margin possible.

These market conditions allow innovative solutions to be explored and put into practice. In particular, as is the trend in precision farming around the world, there is an opportunity to use advanced technology to more intelligently monitor, mend and harvest tea plantations.

3.2 Known and Potential Solutions

3.2.1 Labor Sourcing and Incentives

One option to alleviate the worker shortage is to source workers from rural or poorer areas, possibly from other provinces. In this case, transportation, living quarters and food supply need to be organized and maintained during the harvesting season.


That means increased operational issues to handle, and this could only function as long as it is cost effective to do so.

Another solution is to simply increase the wages or benefits of existing workers to keep them in the pool of available candidates when the harvesting period comes. As stated above, this will reduce profit margins or require an increase in the wholesale price, both of which need to be carefully evaluated, planned and executed.

3.2.2 Plucking Machines

Tea plucking machines already exist on the market and are used frequently in other countries. These machines are built around a simple rotative set of knives, including a mechanical cutting-depth control. Two workers are required to handle the machine and slowly walk over the row of tea bushes, collecting the freshly cut material in a bag.

In China, as explained in chapter one, tea manufacturing processes require carefully selected tea leaves and leaf buds. Hence tea plucking is still done mostly by hand. The machine described above is simply not selective enough and can potentially damage the old and precious tea trees.

In theory, a plucking machine could be designed in such a way that it is capable of selecting precisely which leaves and leaf buds to pick.

3.2.3 Pruning Logistics

There exists a special plant pruning technique for easier harvest. The trees are cut in such a way that all the fresh and young leaves will grow from the same level or height. This process is normally done in autumn and its benefits can be exploited during the spring harvest season.

As all the leaves and leaf buds are then found on the same level, it is much faster to harvest than on trees with organic leaf distributions. It could eventually be combined with a simple plucking machine with acceptable results. The downside is that this process can be used only once or twice a year.

3.2.4 Leaf Sorting Machines

If the harvested material is below the required standard, due for instance to the usage of a plucking machine, a way to tackle the problem could be to sort it out quickly on the factory floor. This can be done by hand or by a sorting machine aided by an advanced computer-vision system.

As tea leaves start to oxidize when they are in contact with air, and even more so if they are bruised, this process needs to be quick and gentle enough to keep oxidation and structural leaf damage to minimal levels, as required for green tea manufacturing.


3.3 Selected Approach

The envisioned approach in this research is to enhance the capabilities of a plucking machine up to the point where it can perform on par with human-level picking ability. It is a long-term research objective including many necessary components that need to be developed.

Firstly, the machine must be able to scan and analyze tea plants to find suitable leaves and leaf buds to harvest. That is the scope of the present research.

Secondly, a mechanical prototype needs to be manufactured, including the computer-vision equipment, a cutting mechanism and a leaf collection system.

Thirdly, the criteria of leaf selection must be configurable so that a tea master can define what kind of pick will be performed on any single day: bud-only, one bud plus one leaf, etc.

Then comes the design of a robust machine capable of working in the harsh exterior conditions. For flat and orderly plantations, a robot on wheels could move by itself and harvest fields mostly autonomously. In mountain areas, a light robot on rails up to a few meters in length could be positioned above rows of tea plants and pick autonomously. Human operators would move the robot once the current area is harvested. Another option on slopes would be to build a monorail system providing access to each tea plant row: a robot operating on the monorail could then move across the plantation and use an arm to pick leaves.

Finally, the performance of the machine should be improved over time in a semi-automated fashion. That means feedback could be given and the computer-vision models refined over time. One way to achieve this in practice could be to pick identifiable samples and store them separately from the main harvest material. A human operator could then provide feedback on such samples. The feedback would then be integrated into a newer leaf detection model and increase the quality of future harvests.

3.4 Previous Work

First and foremost, a research work in Chinese entitled "Researches on High-quality Tea Flushes Identification for Mechanical-plucking" [15] studies the approach of using standard computer vision algorithms, in particular the effect of various mathematical operations on the RGB channels and other color spaces: noise removal, contrast enhancement, mathematical morphology, region growing and so on. These are traditional algorithms in the toolset available in the field of computer vision. The results are interesting but not applicable to the general case of tea plucking in a natural environment.

"Developing Situations of Tea Plucking Machine" [16] is a study which reviewed the existing tea plucking machines and their usage around the world. This comparative study is focused on the business and social aspects and lists several prototypes being developed in China.


The machines discussed are applicable only to large-scale flat plantations.

Figure 3.2: Four kinds of existing machines (source: [16]).

A team of engineers in Nanjing, China, built a prototype using a robotic arm and sliding rails to pick tea leaves. Their findings are described in "Research on a Parallel Robot for Green Tea Flushes Plucking" [17].

In "Design and Development of Selective Tea Leaf Plucking Robot" [18], a team from Bharathiar University in India designed a prototype for a robotic arm. The study focused on the electrical and mechanical aspects and concludes that it is very challenging to solve the problem of harvesting tea leaves.

So far, no known solution can tackle the task at the same quality level as human workers, especially in mountainous regions.

Multiple research efforts have addressed similar topics, such as 3D reconstruction of vegetation [19, 20, 21, 22], leaf segmentation [23] or leaf counting [24].

3.5 Scope of Research

In this research, advanced computer-vision algorithms will be explored to perform tea plant image segmentation: for each input pixel at coordinates (u, v) in the captured picture, the software will try to predict whether the pixel belongs to a tea leaf to pick or not.


Chapter 4

Dataset Creation

A tea leaf segmentation dataset contains mainly two kinds of data: pictures of tea plants taken in plantations and their associated segmentations, highlighting leaves that should be picked by a machine.

No pre-existing dataset was found, hence a complete process of creating a dataset from scratch is described in the current chapter.

Figure 4.1: Segmentation dataset


4.1 Methodology

The first step is to find suitable tea plantation locations to take pictures. As data labeling is a time consuming task and can be a source of error from picture to picture, an idea is to try using a 3d point-cloud reconstruction software. Such an approach could be used to manually label points in a 3d scene and project these onto the original pictures. Hence the labels can be specified once by hand for the whole picture dataset. A 3d reconstruction software needs to be selected and the photo camera intrinsics calibrated using a printed pattern. This is not strictly necessary but will improve the stability of the 3d reconstruction.

Once the pipeline for 3d reconstruction is operational, the next step is to acquire tea plant pictures on a larger scale. A 3d point-cloud tagging software will be implemented to visualize the resulting 3d reconstruction, define the labels and project them back onto each picture separately.

As the quality of the final labels was not fully satisfactory, a second software tool will be implemented to refine the labels in two dimensions.

Additional algorithmic post-processing will then be put in place to remove any remaining noise and build the per-pixel weighting associated with the labels. The weighting will be particularly important for the machine-learning training phase because very few positive samples are labeled with respect to negative samples.

Finally, datasets are created using the original pictures, labels and weighting. They will be written to a binary file. As the source pictures are fairly large (around 20 mega-pixels) and because the machine-learning training phase might need many thousands of samples, they will be geometrically and optically transformed into many more variants at a final resolution of 256x256 pixels.

4.2 Pictures Acquisition

4.2.1 Hardware

The photo camera is a Sony RX100-II equipped with a 28-100mm f/1.8-4.9 lens (35mm equivalent). It produces pictures at a resolution of 5472x3648 pixels. The sensor is a one-inch Exmor-R BSI-CMOS sensor (13.2 x 8.8mm, 3:2 aspect ratio).

4.2.2 Locations

The following locations were explored and used during this phase of the project:

• Hangzhou botanical garden: this location was used to acquire a first picture dataset to test 3d reconstruction (ferns).


• Longjing dragon well village: an ancient and famous tea plantation exploited by many local families.

• South of Yuquan campus: a small plantation closer to the city.

• Guangmingsi Shuiku: a tea plantation further away from the Hangzhou city center.

• Huajiachi campus: an experimental tea research plantation owned by Zhejiang University.

As the commercial tea plantations are pruned almost down to the ground starting mid-April, only two weeks after the beginning of this research, the main source of pictures was the research plantation on the Huajiachi campus. Unfortunately, this means that the overall set of images captured was not representative of how green tea plantations look in spring.

4.3 Camera Calibration

A typical camera projects points in 3d space onto a 2d plane. The simplest model is the pinhole camera, which can be represented mathematically by a transformation from 3D homogeneous coordinates to 2D homogeneous coordinates using a (3x4) projection matrix [25, 26]:

$$
P = K \cdot \begin{bmatrix} R & t \end{bmatrix} =
\begin{bmatrix}
f_x & s & c_x \\
0 & f_y & c_y \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
r_{1,1} & r_{1,2} & r_{1,3} & t_x \\
r_{2,1} & r_{2,2} & r_{2,3} & t_y \\
r_{3,1} & r_{3,2} & r_{3,3} & t_z
\end{bmatrix}
\tag{4.1}
$$

Given a point q in 3d space represented by its homogeneous Cartesian coordinates, it can be projected onto the 2d picture plane as q′ using the above projection matrix P:

$$
q' = Pq
\quad\text{where}\quad
q' = \begin{bmatrix} u \\ v \\ w \end{bmatrix},
\qquad
q = \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
\tag{4.2}
$$

As can be seen above, the projection matrix P can be decomposed into two kinds of information: the intrinsic K and extrinsic [ R t ] parameters. Intrinsic parameters are dependent on the camera and assumed constant; they include mainly the focal length (fx, fy), the principal point offset (cx, cy) and an optional axis skew (s). On top of that, more complex camera models can also account for lens and other optical deformations in the intrinsic parameters. On the other hand, the extrinsic parameters are independent of the camera hardware and define its pose in the world coordinate frame: a rotation matrix (R) and a translation vector (t), which can be combined into a single matrix using homogeneous coordinates.
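
As a concrete illustration of equations (4.1) and (4.2), the sketch below composes a projection matrix from the ideal intrinsics of Table 4.1 and an arbitrary, hypothetical camera pose, then projects a 3d point; the pose and the point are placeholders, not values used in this project.

```python
import numpy as np

# Intrinsics (focal lengths, principal point, zero skew) from the "ideal" column of Table 4.1.
K = np.array([[4311.3,    0.0, 2736.0],
              [   0.0, 4311.3, 1824.0],
              [   0.0,    0.0,    1.0]])

# Hypothetical extrinsics: identity rotation, camera translated 2 meters along Z.
R = np.eye(3)
t = np.array([[0.0], [0.0], [2.0]])

# Projection matrix P = K [R | t]  (equation 4.1).
P = K @ np.hstack([R, t])

# A 3d point in homogeneous coordinates, projected with equation (4.2).
q = np.array([0.1, -0.05, 3.0, 1.0])
u, v, w = P @ q
print(u / w, v / w)   # pixel coordinates of the projected point
```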


Figure 4.2: Illustration of the pinhole camera model (source: OpenCV).

By splitting the camera parameters into a static intrinsic part (assuming the camera hardware is not subject to environmental factors) and an extrinsic part which is usually different for each picture taken, it becomes possible to solve them separately and more efficiently.

A camera calibration procedure, namely finding the constant intrinsic parameters for the given camera model, uses known visual patterns captured on several pictures taken by the same camera under varying poses. As the calibration pattern is geometrically constant during such a procedure, an algorithm can use this fact to solve for the focal length and principal point coordinates (in pixels), as well as the optical deformation parameters.

The open-source computer vision library OpenCV [27] offers such an implementation of the algorithm, capable of also estimating radial and tangential distortions (see functions findChessboardCorners, cornerSubPix and calibrateCamera). In OpenCV, the radial k1, k2, k3 and tangential p1, p2 distortion coefficients correct for non-linear lens deformations as follows:

$$
\begin{aligned}
x' &= x\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x y + p_2 (r^2 + 2x^2) \\
y' &= y\,(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2y^2) + 2 p_2 x y
\end{aligned}
\qquad\text{where}\quad
r = \left\lVert \begin{bmatrix} x \\ y \end{bmatrix} - \begin{bmatrix} c_x \\ c_y \end{bmatrix} \right\rVert_2
\tag{4.3}
$$

Alternatively, a software suite such as MRPT includes a camera calibration tool that can load pictures of the pattern and estimate the intrinsic parameters [28].


To compute the theoretical focal length in pixels, the following equation can be used:

$$
f_{px} = \frac{\max(W, H)\, f_{mm}}{ccd_{mm}}
\tag{4.4}
$$

where W, H are respectively the image width and height in pixels, f_mm is the focal length in millimeters and ccd_mm is the image sensor size in millimeters. Similarly, the theoretical principal point coordinates in pixels are:

$$
c_x = \frac{W}{2}, \qquad c_y = \frac{H}{2}
\tag{4.5}
$$

In a perfect camera, the axis skew factor s is zero. It is also often zero in reasonably good cameras, so most calibration procedures do not account for this parameter.
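
The snippet below is a minimal sketch of the chessboard calibration procedure using the OpenCV functions named above; the pattern size, picture folder and termination criteria are illustrative assumptions, not the values actually used in this project.

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)                     # assumed inner-corner count of the printed chessboard
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3)

for path in glob.glob("calibration/*.jpg"):          # hypothetical picture folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if not found:
        continue
    corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
    obj_points.append(objp)
    img_points.append(corners)

# Solves for the intrinsic matrix and the distortion coefficients (k1, k2, p1, p2, k3).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms)
print("K:\n", K, "\ndistortion:", dist.ravel())
```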

4.3.1 Results

Parameter   Ideal value   Calibrated
fx          4311.30       4446.69
fy          4311.30       4416.17
cx          2736.00       2764.50
cy          1824.00       1799.13
s           0.00          0.00
k1          0.00          3.20 · 10^-2
k2          0.00          -2.21 · 10^-2
k3          0.00          -1.59 · 10^-1
p1          0.00          6.00 · 10^-4
p2          0.00          4.39 · 10^-3

Table 4.1: Ideal and calibrated intrinsics for the Sony RX100-II.

It turns out that the camera is already producing close to ideal pictures.

Figure 4.3: A picture of the calibration pattern taken with the Sony RX100-II.


4.4 3D Reconstruction

If multiple overlapping views of the same scene are captured and the relative pose of each view with respect to the others is known, it is possible to estimate the distance at which a point-of-interest or feature lies relative to the views by triangulation [25, 29]. In the case of a pair of views:

$$
x_L = P_L X, \qquad x_R = P_R X
\tag{4.6}
$$

where x_L, x_R are known 2d coordinates on the respective image planes, P_L, P_R are the two camera matrices and the 3d coordinates X are unknown.

Figure 4.4: Epipolar line x_R e_R (source: Wikipedia).

Additionally, there is a relationship between x_L and x_R given by the fundamental matrix F satisfying:

$$
x_L^\top F x_R = 0
\tag{4.7}
$$

F x_i generates an epipolar line on which the matching point x_j must lie in the other view j. A fundamental matrix F is unique to a given pair of views i, j.

These definitions require both precise pose information of each view relative to the others and uniquely identifiable features localized in each view (2d pixel coordinates). Therefore, algorithms were developed for the more general case: automatically identifying features among a large set of pictures and estimating the poses and intrinsics of many views.

In the first class of algorithms (2d feature matching), many solutions exist today. The objective is to find specific areas in the image which are identifiable in some way (e.g. by their gradients, colors or edges), then encode or compress these into feature descriptors containing all the information necessary to find the same area in other images or locations.

Ideally, such an algorithm must be capable of creating feature descriptors which are invariant to rotation, scale, luminosity or other typical transformations, and produced in a format which can be indexed for faster lookup during subsequent matching.


A popular albeit patented solution is SIFT [30]; other commonly used solutions are SURF [31] and ORB [32].

In the second class of algorithms (3d reconstruction from multiple views, also called 3d reconstruction from motion), bundle adjustment (BA) is now widely used. It can estimate iteratively and efficiently the individual camera poses jointly with the intrinsic parameters from a random collection of images by minimizing the reprojection error of that estimation [33, 25, 34]:

$$
\min \sum_{i=1}^{n} \sum_{j=1}^{m} v_{ij}\, D\big(Q(P_j, X_i),\, x_{ij}\big)^2
\tag{4.8}
$$

where the reprojection error to be minimized is computed as the sum of the squared euclidean distances D(a, b) between the reprojected coordinates Q(P_j, X_i) and the image plane coordinates x_ij, for all n features and m views. The euclidean distance is defined as D(a, b) = ∥a − b∥₂, and in its simplest form, the reprojection can be done as X'_i = P_j X_i. The weighting v_ij is equal to one if the feature i is visible on view j, zero otherwise. A sub-optimal but more scalable solution is called incremental bundle adjustment, where only a subset of the views is inserted into the optimizer at each iteration.

Finding an efficient solution to this optimization problem is challenging and will not be explored in detail here.

After the poses are estimated, a triangulation algorithm can be used to estimate the depth at each point of the picture by matching pixels on pairs of images. This generates a depthmap for each view, which can be used to create a fused and relatively dense 3d point-cloud by combining all the pose/depth information available.

If the reconstructed scene is comparatively simple (mostly convex hulls and simple volumes like buildings or cityscapes), it is possible to estimate and define the textured surface of the object represented by the points, for instance by stitching triangles together. This process is already used with some success in the gaming industry [35], among other use-cases. Academic researchers are exploring ways to make it more robust and applicable to other types of volumes. Unfortunately, it is still experimental and, as of May 2018, no solution could produce robust results for organic and thin volumes such as the leaves of plants.
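
The reprojection error of equation (4.8) can be sketched in NumPy as follows, assuming ideal pinhole cameras without distortion; the function and argument names are illustrative, not taken from any library used in this project.

```python
import numpy as np

def reprojection_error(cameras, points3d, observations, visibility):
    """Sum of squared distances between reprojected 3d points and their
    observed 2d coordinates, as minimized by bundle adjustment (eq. 4.8).

    cameras:      list of m (3x4) projection matrices P_j
    points3d:     (n, 4) array of homogeneous 3d feature coordinates X_i
    observations: (n, m, 2) array of observed image coordinates x_ij
    visibility:   (n, m) array, 1 if feature i is visible in view j, else 0
    """
    total = 0.0
    for j, P in enumerate(cameras):
        proj = points3d @ P.T                    # reproject all features into view j
        proj = proj[:, :2] / proj[:, 2:3]        # perspective division
        diff = proj - observations[:, j, :]      # residuals D(Q(P_j, X_i), x_ij)
        total += np.sum(visibility[:, j] * np.sum(diff ** 2, axis=1))
    return total
```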


The general processing pipeline in a complete 3d reconstruction software suite is as follows:

1. Undistorting raw images with pre-calibrated camera parameters (optionalstep).

2. Indexing 2d features found on each image using a feature detector.

3. Matching 2d features across the picture set to find potential pairs of images with overlap.

4. Selecting a small subset of the pictures as a starting point (at least one pair of images with some overlap) and initializing poses and intrinsics.

5. Computing the reprojection error of the current set of poses.

6. Updating the pose and intrinsics of each view with respect to the reprojection error.

7. Including another small subset of the remaining pictures and continuing the error computation / pose refinement iterations until all views are processed.

8. Once all view poses are defined, generating a depthmap using stereovision or the more general N-view geometry.

9. Constructing a point-cloud from individual depthmaps.

10. Reconstructing surfaces delimited by the point-cloud.

4.4.1 Hardware

The complete process from 2d pictures to 3d point-clouds requires a fair amount of memory, storage and computing resources. A high-performance server equipped with two Intel Xeon CPUs at 2.5 GHz (20 cores in total) and 128 GB of RAM was used to reconstruct the biggest scenes.

4.4.2 OpenMVG and OpenMVS

OpenMVG can produce 3d point-clouds and OpenMVS can post-process them into surfaces and volumes [36]. They are interesting tools because they include many user-customizable parameters and user-selectable algorithms, and because they can be extended with new experimental algorithms. They are a great basis for academic research in this field. In practice, they did not produce good results on the biggest tea plant datasets (over 500 images) and the documentation to improve them was lacking.


Figure 4.5: Reconstructed 3D point-cloud from 51 individual pictures using OpenMVG: fern in the Hangzhou botanical garden (5.8 million points).

4.4.3 Meshlab and CloudCompare

Once the 3d point-clouds are computed, they can be visualized either in CloudCompare [37] or Meshlab [38]. These tools can also apply various post-processing operations on point-clouds, among which are alignment, noise removal and surface reconstruction.

4.4.4 MVE and TSR

MVE is a 3d reconstruction software suite developed and maintained by a team at TU-Darmstadt [39, 40]. These tools were applicable to the biggest dataset (550 images), though it was necessary to write software patches to fix a numerical equation solving stability issue (see MVE Stability Patch) and to add the ability to initialize the calibrated camera intrinsics.

TSR is an experimental algorithm to generate the thin surfaces present in 3d point-clouds [41]. It was unfortunately very difficult to build and was not producing usable results. Therefore it was discarded in this approach.

All the datasets could be processed using MVE. Some views were discarded by the bundle adjustment algorithm because they could not be fused with sufficient accuracy, mainly due to different lighting conditions during image capture or not enough overlap with the other set. Some views were also manually discarded because they were obviously not correctly reconstructed, mainly due to ambiguities in the features available in the images.


Figure 4.6: Reconstructed 3D point-cloud from 353 individual pictures using MVE: tea plantation on the Hangzhou Huajiachi campus (180.8 million points).

4.4.5 Remarks

The scale and pose (origin and absolute orientation) of the reconstruction are ambiguous and need to be manually adjusted if precise physical measurements are necessary. A typical solution is to capture an object with known dimensions within the scene in order to rescale the point-cloud later. The orientation and origin can be manually set according to requirements.

Lighting conditions can affect the performance of the feature detector and the subsequent pairwise matching, and cast shadows must be avoided. It is recommended to make the acquisition on a cloudy day with uniform ambient lighting.

The coverage of the target area, and therefore the overlap among pictures, needs to be high and depends on the type of scene.

A good camera equipped with a large low-noise image sensor and a high-performance lens is going to produce higher precision point-clouds.


4.5 3D Tagging Tool

The tool to be developed will have the following features:

• Import MVE’s outputs listed below.

• Render each camera pose, also called view in MVE terminology.

• Render 3d point-cloud with up to 1 billion points in near real-time.

• Render the scene from a user-defined point-of-view and from the views themselves.

• Project the original picture to verify alignment on each view.

• Allow to discard invalid or inaccurate views.

• Allow to select/unselect points.

• Flexible rendering using shaders.

• Render to files.

From the outputs generated by MVE, the following will be used as inputs to this tool:

• For each view, the camera intrinsics and extrinsics written to an INI file.

• The set of 3d points in PLY (Polygon File Format from Stanford University).

• The undistorted images as PNG (Portable Network Graphics).

The outputs of this tool will be two kinds of images:

• A mask or label associated with each view, indicating the areas containing potential tea leaf candidates in white and the non-interesting areas in black.

• The depth buffer, relative to the camera.

4.5.1 Hardware

A workstation equipped with 16 GB of RAM and a GPU (Nvidia Quadro K1100M).


4.5.2 Implementation

The software is implemented using Python 3 as the programming language, NumPy for numerical computations, Qt5 as the desktop widget toolkit accessed through the Pyside2 Python/C++ bindings, and relies on OpenGL to render the scene in 3D.

MVE stores each view's meta-data, images and point-cloud in its own folder, which contains:

Filename          Description
meta.ini          view's meta-data in a format similar to INI files
original.jpg      unaltered source image
undistorted.png   image rectified according to camera intrinsics
depth-Ln.mvei     computed 16-bit precision depthmap for the view
pointcloud.ply    3d vertices in world coordinates computed from the depthmap and camera parameters

Table 4.2: MVE output files.

Additionally, MVE writes two global files containing the feature descriptors and view pairings. These data are not used in this tool.

The camera intrinsics and extrinsics for each view are stored in meta.ini:

Parameter         Variable   Type          Description
focal_length      f          scalar        normalized focal length
principal_point   c          2d vector     normalized principal point coordinates
pixel_aspect      α          scalar        pixel's width/height ratio
distortion        k1, k2     two scalars   radial distortion coefficients
rotation          R          3x3 matrix    camera's rotation
translation       t          3d vector     camera's translation

Table 4.3: MVE view meta-data.

The undistorted image can be computed by mapping normalized pixel coordinates (x, y) as follows, though it is already precomputed by MVE in undistorted.png:

$$
\begin{aligned}
x' &= x\,(1 + k_1 r^2 + k_2 r^4) \\
y' &= y\,(1 + k_1 r^2 + k_2 r^4)
\end{aligned}
\qquad\text{where}\quad
r = \left\lVert \begin{bmatrix} x \\ y \end{bmatrix} - c \right\rVert_2
\tag{4.9}
$$


The camera's position q relative to the world coordinate frame can be found as follows:

$$
q = -R^\top t
\tag{4.10}
$$

R^T denotes the transpose of R and equals its inverse transformation (R^T R = I), because R is a 3x3 rotation matrix. The inverse of t is simply −t.

To render a 3d scene from the point-of-view of a camera, the absolute world coordinates must be transformed to bring the origin to the camera position and then projected onto the 2d screen surface.

Figure 4.7: OpenGL coordinate system (source: NTU).

Firstly, to model a simple pinhole camera, an OpenGL perspective projection matrix mapping homogeneous 3d world coordinates to 2d normalized coordinates in the range [−1, 1], with the center of the rendered surface at (0, 0), can be defined for each MVE view as [42]:

$$
M_{projection} =
\begin{bmatrix}
2f_x & 0 & c_x - 0.5 & 0 \\
0 & 2f_y & c_y - 0.5 & 0 \\
0 & 0 & \dfrac{z_{near}+z_{far}}{z_{near}-z_{far}} & \dfrac{2\, z_{near}\, z_{far}}{z_{near}-z_{far}} \\
0 & 0 & -1 & 0
\end{bmatrix}
\tag{4.11}
$$

where

$$
f_x = \begin{cases} f/\beta & \text{if } \beta < 1.0 \\ f & \text{otherwise} \end{cases},
\qquad
f_y = \begin{cases} f & \text{if } \beta < 1.0 \\ f\,\beta & \text{otherwise} \end{cases},
\qquad
\beta = \alpha \frac{W}{H},
\qquad
z_{near} \neq z_{far}
$$

z_near and z_far are arbitrary distances to the near and far clipping planes, and β is the rendering surface's width over height ratio, taking into account the pixel aspect from MVE.


The factor M_{4,3} = −1 is there to invert the z-axis, as is the convention in OpenGL (the positive Z axis points towards the screen).

Secondly, an OpenGL view matrix to render from the point-of-view of the camera is defined from MVE's output as:

$$
M_{view} =
\begin{bmatrix}
R_{11} & R_{12} & R_{13} & t_x \\
-R_{21} & -R_{22} & -R_{23} & t_y \\
-R_{31} & -R_{32} & -R_{33} & t_z \\
0 & 0 & 0 & 1
\end{bmatrix}
\tag{4.12}
$$

Because the OpenGL axis convention (right-handed), given below, is different from MVE's output, the rotation matrix R from MVE needs to be sign-adjusted as described above; a small construction sketch in NumPy follows the list.

• X is horizontal and positive in the right direction

• Y is vertical and positive in the up direction

• Z represents depth and is positive in the camera direction (towards the screen)
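
The sketch below builds the two matrices of equations (4.11) and (4.12) with NumPy from the MVE view parameters; the clipping distances are arbitrary placeholders and the function names are illustrative, not taken from the tool's source code.

```python
import numpy as np

def projection_matrix(f, c, alpha, width, height, znear=0.1, zfar=100.0):
    """OpenGL perspective projection matrix from MVE intrinsics (eq. 4.11)."""
    beta = alpha * width / height
    fx = f / beta if beta < 1.0 else f
    fy = f if beta < 1.0 else f * beta
    cx, cy = c
    return np.array([
        [2 * fx, 0.0,    cx - 0.5, 0.0],
        [0.0,    2 * fy, cy - 0.5, 0.0],
        [0.0,    0.0,    (znear + zfar) / (znear - zfar),
                         2 * znear * zfar / (znear - zfar)],
        [0.0,    0.0,    -1.0,     0.0]])

def view_matrix(R, t):
    """OpenGL view matrix from MVE extrinsics, with sign-adjusted rows (eq. 4.12)."""
    M = np.eye(4)
    M[0, :3], M[0, 3] = R[0], t[0]
    M[1, :3], M[1, 3] = -R[1], t[1]
    M[2, :3], M[2, 3] = -R[2], t[2]
    return M
```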

From there, an arbitrary point p in 3d world coordinates can be projected to normalized 2d screen coordinates q_2d [43]:

$$
q = (M_{projection} M_{view})\, p
\quad\text{where}\quad p, q \in \mathbb{R}^4
$$
$$
q_{2d} =
\begin{bmatrix}
0.5\,(\tfrac{q_x}{q_w} + 1) \\[1ex]
-0.5\,(\tfrac{q_y}{q_w} + 1) \\[1ex]
0.5\,(\tfrac{q_z}{q_w} + 1) \\[1ex]
q_w
\end{bmatrix}
\tag{4.13}
$$

The resulting x and y coordinates map to the rendering surface space, clipped to the range [0, 1], respectively from left to right and top to bottom. The z coordinate is the normalized depth at that point, which is written to the OpenGL depth buffer, and w is discarded.

The inverse transformation, namely unprojecting a 2d point q in screen coordinates to 3d world coordinates p_3d, is necessary when clicking with the mouse to find the 3d coordinates it represents on the screen. It can be done as follows [44]:

$$
p = (M_{projection} M_{view})^{-1}
\begin{bmatrix}
2 q_x - 1 \\
2 q_y - 1 \\
2 q_z - 1 \\
q_w
\end{bmatrix}
\quad\text{where}\quad p, q \in \mathbb{R}^4,
\qquad
p_{3d} = \frac{p}{p_w}
\tag{4.14}
$$


This means that a particular q_z and q_w need to be chosen to yield a unique set of 3d coordinates: q_z can be found in the OpenGL depth-buffer and q_w can simply be set to 1.0. Without a specific q_z, the mapping would yield a line in 3d space on which the point must lie.

As the point-clouds can have millions of points, it would be extremely inefficient to simply iterate over them to find which points are near the user-clicked coordinates. A solution to this issue is to index the points in a specialized data structure. A K-d tree is used in this tool to query the entire point-cloud for matching points x_i at a radius distance r from the user-clicked coordinates. To achieve this, the data structure sub-divides the space recursively into binary regions separated by a plane. The terminal condition is the minimum number of points in a given region. It can then be efficiently queried to retrieve the set of points S within a specified radius of the clicked coordinates p:

$$
S = \{\, x_i \mid \lVert p - x_i \rVert_2 \le r \,\}
\tag{4.15}
$$

Figure 4.8: Illustration of a K-d tree (source: Wikipedia).
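
A minimal sketch of the radius query of equation (4.15) is shown below; it uses SciPy's cKDTree for illustration, which is an assumption, since the text does not state which K-d tree implementation the tool relies on.

```python
import numpy as np
from scipy.spatial import cKDTree

# points: (N, 3) point-cloud vertices; p: 3d coordinates under the mouse click.
points = np.random.rand(1_000_000, 3)            # placeholder point-cloud
tree = cKDTree(points)                            # recursive binary space subdivision

p = np.array([0.5, 0.5, 0.5])
radius = 0.01
selected = tree.query_ball_point(p, r=radius)     # indices with ||p - x_i||_2 <= r
print(len(selected), "points selected")
```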

Using modern OpenGL means that the fixed transform pipeline is not available. Instead, small functions called shaders run on the graphics card to perform the necessary computations. These functions can receive fixed or varying parameters at runtime, such as a projection matrix, an RGBA color or a vertex array. In this tool, two kinds of shaders are used: vertex and fragment shaders. The former receives a 3d vertex and outputs it in normalized screen coordinates. The latter is called each time a pixel needs to be modified on screen and can alter the way this pixel is rendered, for instance by computing lighting equations or mixing colors sampled from a few textures (see OpenGL Shaders).

A 3d scene graph abstraction is implemented to store the geometry and its parameters and can render them in multiple passes. A typical scene contains:


• A grid plane (XZ).

• An axis marker indicating the world’s origin and coordinate frame (XYZ).

• A point and an axis marker for each camera, representing the pose of that view.

• A point-cloud object for each view, containing the 3d vertices reconstructed by MVE and also storing the user-selection as a separate attribute.

• A fullscreen quad (quadrilateral surface) on which a texture can be mapped to display the original picture taken by the camera.

• Various associated states: transformations, shaders, flags, colors and textures.

Figure 4.9: An MVE scene loaded in the tool, viewed from a user-defined camera with a perspective projection.

As can be seen above, the tool offers the flexibility to modify the shaders and various parameters at runtime. The user can switch the camera mode by choosing either an orthographic projection, a perspective projection or the exact projection of an MVE view. Each MVE camera location is represented by a yellow dot and its camera coordinate frame by a small set of lines pointing in the respective positive direction of each axis (X is red, Y is green and Z is blue). Invalid or discarded cameras are rendered as a purple dot. As rendering billions of points every frame was causing performance issues, the point-cloud density can be adjusted by the user, effectively selecting a subset of the total points according to a random uniform distribution.


4.5.3 Results

The process of tagging the points can be done in a short time, as expected. The application requires up to 12 GB of RAM after loading the 180 million points and building the K-d trees. The masks can be rendered to disk without difficulty and the depthmap can be exported too, which is then valuable information associated with each picture.

Figure 4.10: Rendering from a specific MVE view.

Figure 4.11: Rendering a view with the user-selection highlighted in blue in the point-cloud.

Figure 4.12: Rendering the same view with the original image in overlay.

Figure 4.13: Using a shader to render the final mask. Note: colors have been inverted in this document.

Below are some issues that were noticed for the specific case of tea bushes:

• The point-cloud is quite dense but holes are still visible through some leaves. A way to fix this is to increase the point size when rendering them.

• Not every visible leaf was reconstructed in the point-cloud, for mainly two reasons. First, the picture overlap was not sufficient, the lighting conditions were not ideal (sunshine and shadows) and the scene coverage was not varied enough. With this information in mind, more pictures would need to be taken, especially from many different angles and positions, possibly using an artificial light source. Second, the scene boundaries will always be reconstructed more sparsely than the main area.


more sparse than the main area. This is a problem when generating the binary mask because leaves in these surrounding areas cannot be tagged. But that can be solved by taking the depthmap as a reference to know which pixels are covered by the point-clouds and which are not. The depthmap could even be used as a weighting, where pixels with a depth close to the camera are considered more accurate or important than the ones far away.

• There is too much noise in the 3D reconstruction, with some areas suffering from what appears to be a precision of plus or minus one centimeter. In other use-cases this could be perfectly fine, but tea leaves can be as small as 1[cm] by 0.5[cm]. It could be due to movements of the leaves caused for instance by wind, or it could be caused by imprecisions during the 3D reconstruction process. This means the final masks cover a larger area than the leaves. This could be partially fixed by post-processing the masks with mathematical morphological operators (e.g. erosion). More experiments should be done to find out the exact causes of the noise.


4.6 2D Tagging Tool

As the 3D tagging alone was not precise enough, a second tool is implemented with the following features:

• Loading undistorted pictures and their mask

• Displaying the pictures and masks as overlays

• Refining masks using a digital tablet and pen

• Exporting modified masks to file

4.6.1 Hardware

A workstation with a 30” screen and a Gaomon 1060 Pro tablet (2048 pressure levels, A4 format).

4.6.2 Implementation

The software is implemented using Python 3 as the programming language and Qt5 as the desktop widget toolkit, accessed through the Pyside2 Python/C++ bindings. To enhance visual contrast, the mask m_rgb is rendered in pink and blended with the source image s_rgb using the following function:

y_rgb = clamp(s_rgb + m_rgb, 0, 255)    (4.16)
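As an illustration of equation 4.16, here is a minimal NumPy sketch of the blending (file names and the exact pink tint are placeholders, not the actual implementation):

import numpy as np
from PIL import Image

# Load a source photograph and its binary mask (placeholder file names).
source = np.asarray(Image.open("view_0001.png").convert("RGB"), dtype=np.int16)
mask = np.asarray(Image.open("view_0001_mask.png").convert("L"), dtype=np.int16)

# Tint the mask (red and blue channels) and blend per equation 4.16.
tint = np.stack([mask, np.zeros_like(mask), mask], axis=-1)
blended = np.clip(source + tint, 0, 255).astype(np.uint8)
Image.fromarray(blended).save("view_0001_overlay.png")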

Figure 4.14: A picture and associated mask loaded in the tool.


4.6.3 Results

This operation is extremely time-consuming, with an average of four pictures processed per hour of work. But the final result is much more accurate, as can be seen below.

Figure 4.15: A mask out of the 3D tagging tool. Note: colors have been inverted in this document.

Figure 4.16: The same mask after manual 2D tagging refinements. Note: colors have been inverted in this document.


4.7 Post-processing

4.7.1 Mask Cleaning

The masks out of 2d tagging are algorithmically cleaned using the following steps:

1. A mathematical morphology close operation (dilation then erosion) using a 5x5 kernel filled with a disk.

2. A mathematical morphology open operation (erosion then dilation) using a 5x5 kernel filled with a disk.

3. Only mask areas having more than 100 pixels are kept.

4. The mask is dilated by two pixels.

These operations ensure that noise (e.g. dots, small patches) is removed and that the shapes present in the mask are slightly smoothed compared with the hand-edited curves, as sketched below.
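A possible OpenCV sketch of this cleaning sequence (a 5x5 elliptical kernel approximates the disk; file names are placeholders and the exact implementation may differ):

import cv2
import numpy as np

mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)
disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

# 1-2. Morphological close then open with the 5x5 disk kernel.
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, disk)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, disk)

# 3. Keep only connected areas larger than 100 pixels.
count, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
cleaned = np.zeros_like(mask)
for i in range(1, count):  # label 0 is the background
    if stats[i, cv2.CC_STAT_AREA] > 100:
        cleaned[labels == i] = 255

# 4. Dilate by about two pixels (the radius of the 5x5 disk).
cleaned = cv2.dilate(cleaned, disk)
cv2.imwrite("mask_cleaned.png", cleaned)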

4.7.2 Per-pixel Weighting

Because the percentage of positive versus negative class instances in the mask is biased toward negative labels, a new grayscale 16-bit image is generated with a per-pixel weighting schema to compensate for this imbalance. To achieve this, first a tool is created to estimate the occurrence of positive versus negative classes. Then a second tool is given this estimation and takes the mask multiplied by the inverse of this ratio. Finally, it adds a bias for the negative instances. The first tool can then be run again to verify that the weighted occurrence is more balanced between the two classes, ideally close to 50% each.
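A minimal NumPy sketch of the two tools described above (the negative bias value is an illustrative assumption):

import numpy as np

def positive_ratio(masks):
    # First tool: fraction of positive pixels over a list of binary masks.
    positive = sum(int(np.count_nonzero(m)) for m in masks)
    return positive / sum(m.size for m in masks)

def weight_image(mask, ratio, negative_bias=1.0):
    # Second tool: mask scaled by the inverse ratio, plus a bias for the
    # negatives, stored as a 16-bit grayscale image.
    weights = (mask > 0) / ratio + (mask == 0) * negative_bias
    return np.clip(weights, 0, 65535).astype(np.uint16)

# With e.g. 2% positive pixels, positives get weight 50 and negatives weight 1,
# bringing the weighted occurrence of the two classes much closer to 50% each.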

4.7.3 Geometric Variants

The source images have a resolution of 20 mega-pixels. The machine-learning model will process images at a resolution of 256x256 pixels. One option is to downscale the image so that the smallest side is 256 pixels and then center-crop that image to obtain a single 256x256 image. That would be the easiest solution but it also means losing much of the source information at disposal. A better approach in this case is to generate or augment the dataset by randomly cropping the image at various locations with a window of 256x256 pixels. It is also possible to flip the image along the X and/or Y axis, apply 2D rotations and multiple scale factors.
The dataset used in the next chapter uses the following data augmentation:


• No flipping and flipping along X axis.

• Rotating by 0, -5, +5, -10, +10, -15 and +15 degrees.

• Scaling set at 25% and 12.5%.

• Random cropping 20 times.

By combining all these transformations we get 560 different regions of 256x256 pixels out of each source image.

4.7.4 Photometric Variants

In addition to the geometric transformations above, it is also possible to randomly tweak the brightness or contrast levels and add artificial noise from a uniform or Gaussian distribution. Each 256x256 image variant in the dataset has its brightness and contrast randomly adjusted by up to +/- 10% (the random factor is taken from a uniform distribution).
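As a sketch only (rotation and scaling are omitted, and the exact pipeline of figures 4.17-4.18 is not reproduced), one random crop with the +/- 10% photometric adjustments could look like this in TensorFlow:

import tensorflow as tf

def random_variant(image, mask):
    # Stack image and mask so the same crop and flip are applied to both
    # (both are assumed to be float32 tensors with values in [0, 1]).
    pair = tf.concat([image, mask], axis=-1)
    pair = tf.image.random_crop(pair, size=[256, 256, 4])
    pair = tf.image.random_flip_left_right(pair)
    image, mask = pair[..., :3], pair[..., 3:]

    # Photometric changes are applied to the image only, roughly +/- 10%.
    image = tf.image.random_brightness(image, max_delta=0.1)
    image = tf.image.random_contrast(image, lower=0.9, upper=1.1)
    return image, mask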

4.7.5 Training, Test and Evaluation Sets

The dataset will be split into three parts for machine-learning: a training set (72%), a validation or evaluation set (8%) and a final test set (20%). The training set will be used to feed the model during the training phase. The validation set will be used to evaluate performances independently at fixed time intervals; performance on this set is a good indicator of whether the model is over-fitting the training data or not. The final test set will be used to assess performances independently of the hyper-parameters of the model.

4.7.6 Building Final Datasets

The input images are randomly permuted before being written to disk. This random ordering must persist across multiple invocations, and a source image and all its variants must not cross the training/validation/test boundaries.
The dataset is transformed and written to disk using a TensorFlow parallel computation graph. TensorFlow is a framework that can execute highly distributed computations [45]. A detailed discussion on this framework is included in the next chapter (see 5.4.1).
The storage format is a sequence of TF-Records serialized in binary format, each record (about 400 KB) containing the following fields: the source and variant names, the dimensions, the RGB image (24-bit), the mask image (8-bit) and the weight image (16-bit).
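A sketch of how one such record could be serialized (field names mirror the list above, but the exact schema used by the project is not shown here):

import tensorflow as tf

def _bytes(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def make_example(source_name, variant_name, rgb_png, mask_png, weight_png):
    # One TF-Record example holding a 256x256 variant and its labels; the
    # images are assumed to be already PNG-encoded byte strings.
    features = {
        "source": _bytes(source_name.encode()),
        "variant": _bytes(variant_name.encode()),
        "width": _int64(256),
        "height": _int64(256),
        "rgb": _bytes(rgb_png),        # 24-bit image
        "mask": _bytes(mask_png),      # 8-bit mask
        "weight": _bytes(weight_png),  # 16-bit weight map
    }
    return tf.train.Example(features=tf.train.Features(feature=features))

# writer = tf.python_io.TFRecordWriter("train.tfrecord")   (TF 1.x API)
# writer.write(make_example(...).SerializeToString())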


Figure 4.17: High-level structure of the computation graph used to transform images into 256x256 variants. The structure is replicated N times according to a runtime parameter. The blocks ”scale” and ”variations” contain many operations themselves.

Figure 4.18: Computation sub-block used above to perform a single random crop and random brightness/contrast adjustments. This structure is replicated MxN times according to a runtime parameter.


4.8 Results

The small and medium datasets are used for prototyping on a local workstation. The small one can be built within a few minutes and the medium one in about 15 minutes. The large set is built in about two hours on a high-performance server deployed on the AWS EC2 computing cloud (c5.4xlarge, 16 virtual CPUs, 32 GB of RAM).
Having immutable datasets on disk brings two advantages: firstly, no computational resource is lost transforming input images into multiple variants during training, and secondly, the training session can be replicated on the exact same data every time.

Name     Training   Evaluation   Test    Size on disk
small    99         11           28      44 MB
medium   5940       660          1680    2.6 GB
large    55440      6160         15680   24 GB

Table 4.4: Generated datasets

Figure 4.19: Four variants of the same source image: from left to right, crop, another crop, flipped and crop, rotated and crop. Note: mask and weight colors have been inverted in this document.


The same kind of operations is performed on another dataset of flower segmentations, prepared by Maria-Elena Nilsback and Andrew Zisserman from Oxford University [46]. This dataset will be used as a comparison to the tea leaves dataset.

Figure 4.20: Four samples taken from the flower dataset. Note: mask and weight colors have been inverted in this document.

4.9 Conclusion

This chapter presented an approach to generate a labeled dataset of young tea leaf images and their segmentation in a natural environment. Reconstructing the scene in 3D to label images is an interesting concept and definitely useful if the captured point-cloud is accurate and dense enough. The result on tea plants came close to that level, but it was unfortunately not sufficient and required further manual processing on the raw 2D images.
In terms of the final application, which is building a machine-learning model capable of detecting tea leaves to be picked, the pictures are likely not directly usable, as they were taken in the tea plantation of a research facility after the spring season. At that location, the plants are not pruned and tended the way they are in commercial settings. Nonetheless, the acquired dataset seems diverse and realistic enough to evaluate the performance of the machine-learning models which will be explored in the next chapter.


Chapter 5

Machine-Learning Applied to Computer Vision

After acquiring and preparing the datasets, the next step is to build a machine-learning model capable of making an image segmentation prediction on previously unseen pictures. This chapter assumes the reader has some prior knowledge of machine-learning and neural networks.

Figure 5.1: Deep neural network (source: [47])


5.1 Methodology

Modern machine-learning techniques will be studied, more specifically DeepLearning Convolutional Neural Networks (CNNs). Some neural network architectures will be selected on that basis. To try to achieve the best possible results and compare performances, a list of the neural network components and parameters will be presented. Then a process will be put in place to train, evaluate and report performances on various configurations. Finally, the results will be presented and discussed.

5.2 Machine-Learning

5.2.1 Artificial Neural Networks

The invention of the digital equivalent to a biological neuron is attributed to Warren McCulloch and Walter Pitts in 1943. In 1957, Frank Rosenblatt presented the model of the Perceptron [48]. It tackles the task of binary classification using the following neuron activation function:

f(x) = { 1 if θx + b > 0;  0 otherwise }    (5.1)

where x is the input vector and θ, b are the weights and bias, set manually or by learning. It has been proven that a single-layer network built on this structure cannot represent non-linear functions such as XOR. Also, at that time, the methods to learn the weights of the network were very limited.

Figure 5.2: Illustration of a digital neuron (source: Wikipedia).

Figure 5.3: Illustration of an artificial neural network (source: Wikipedia).
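A tiny NumPy illustration of equation 5.1, with weights set by hand to realize a logical AND (purely illustrative):

import numpy as np

def perceptron(x, theta, b):
    # Fire (output 1) when the weighted sum plus bias is positive.
    return 1 if np.dot(theta, x) + b > 0 else 0

theta, b = np.array([1.0, 1.0]), -1.5  # hand-set weights for a logical AND
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, perceptron(np.array(x), theta, b))  # -> 0, 0, 0, 1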

From the 1960s, automatic differentiation in the context of control theory, and its particular usage in the back-propagation algorithm [49], finally offered a solution to train more complex neural network architectures made of multiple layers. Back-propagation is a method that computes the gradient of the error at the output of the neural network and updates the neuron weights accordingly to bring the network output closer to the desired state. During the training phase, the network is


first presented with an input vector; the neuron function or mapping is then applied from the input layer to the output layer in a sequential process. At the output layer, the error and its gradient are computed and finally the weights are updated in reverse order, hence the term back-propagation.
Perceptron-based neural networks with three or more layers are also called MLPs (Multi-Layer Perceptrons).
Since these early days, many other network architectures, activation functions and training methods have been studied.

5.2.2 DeepLearning

Even though the fundamental theory of artificial neural networks was invented in the seventies, they were not very popular and of limited use until about 20 years ago (2000s). The advent of more powerful computers, specialized hardware processing units such as GPUs, large-scale distributed computing and the digitalization of media allowed more and more complex networks to be trained. This general shift in computing paradigm, as well as the specific techniques to manage deep networks, is what could be called DeepLearning: the ability to train many layers of an artificial neural network on millions of data points.
Even if these new computing capabilities allowed the creation of deeper neural networks, several problems remained to be solved. Among the key recent discoveries are new solutions to issues such as vanishing and exploding gradients, related to the stability and learning speed of such complex neural architectures. It is currently a fast-evolving field of research.
Applying neural networks to input data like text and audio signals has been done with success for decades. Images and videos are also an interesting source of information but remained difficult to address due to the highly dimensional feature space. By comparison, whereas the total number of commonly used English words is around 170'000, which can be encoded in an efficient schema, a single small 256x256 color image already contains close to 200'000 separate values. One of the first well-known successes is hand-written digit recognition used at postal distribution centers [50].

5.2.3 Convolutional Neural Networks

Inspired by biological studies of the visual cortex in the fifties, researchers realized that images could be processed in a different way through an artificial neural network. Animals detect abstract patterns, textures, edges, contours, shadings, etc. These are very localized features and typically not dependent on the position in the image. Therefore, the assumption made when analyzing an image at a low level is that the neighboring pixels in a 2D region are more important than the ones far away. These basic features can then be combined in subsequent layers to form more complex representations, such as combinations of edges or contours.


Therefore a dedicated layer can rely on a spatial convolution, or more exactly a cross-correlation, between an image and a set of kernels (also called filters), instead of feeding the pixels one by one as a sequence into the fully connected neural network inputs. In this case, the equivalent of the weights that are learned in a Perceptron model are the values in these kernels. As an example using grayscale 32x32 images, whereas a fully connected input layer made of 100 neurons would need 100 * 1024 weights or about 100'000 parameters to be trained, a convolutional layer using 100 kernels of size 5x5 only needs to learn 2'500 parameters.
Here is how a convolutional kernel is applied to a particular pixel at coordinates (i, j), using the weight matrix K (MxN) and the input image x. This operation has to be performed for each pair of pixel coordinates (1 ≤ i ≤ H, 1 ≤ j ≤ W):

y_ij = Σ_{s=−M/2}^{M/2} Σ_{t=−N/2}^{N/2} K_{s+M/2, t+N/2} · x_{i+s, j+t}    (5.2)

For coordinates (i+s, j+t) outside of the input image boundaries, two strategies exist: either reduce the size of the output image y so that 1 ≤ i+s ≤ H and 1 ≤ j+t ≤ W, or pad the input image with a constant value.

Figure 5.4: 2D convolution (source: [51]).

Figure 5.5: Dilated 2D convolution (source: [51]).

Figure 5.6: 2D convolution with stride (source: [51]).

In reality, the operation works on inputs and outputs of any size. For instance, the input could be a 32x32x3 image (32 by 32 pixels, each pixel having 3 channels, for example red, green and blue) and the output could be a 32x32x64 image, having 64 channels generated from 64 different kernels. To handle multiple input channels, the kernels have an additional dimension: MxNxD (where D is the number of source channels to mix), k is the output channel index (1 ≤ k ≤ C) and K^(k) is the kernel associated with that channel:

y_ijk = Σ_{s=−M/2}^{M/2} Σ_{t=−N/2}^{N/2} ( Σ_{u=1}^{D} K^(k)_{s+M/2, t+N/2, u} · x_{i+s, j+t, u} )    (5.3)

As can be seen, even though the number of kernel parameters in a convolution is limited compared to fully connected layers, it is still a complex operation to compute, with no less than five levels of nesting.
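A deliberately naive NumPy transcription of equation 5.3 makes this nesting explicit (the innermost sums over s, t and u are vectorized; odd kernel sizes and zero padding are assumed):

import numpy as np

def conv2d_naive(x, kernels):
    # x: HxWxD input image, kernels: CxMxNxD filters, output: HxWxC ("same" padding).
    H, W, D = x.shape
    C, M, N, _ = kernels.shape
    padded = np.pad(x, ((M // 2, M // 2), (N // 2, N // 2), (0, 0)), mode="constant")
    y = np.zeros((H, W, C))
    for i in range(H):            # output rows
        for j in range(W):        # output columns
            for k in range(C):    # output channels
                y[i, j, k] = np.sum(padded[i:i + M, j:j + N, :] * kernels[k])
    return y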


This method can also be used as a space-reduction technique, making the output spatially smaller than the input, by applying the kernel at an interval greater than one, for instance at i, j ∈ {0, 2, 4, 6, ...}, where the increment is called the stride. More recently, transposed convolutions have been implemented and offer the possibility to increase the output size compared to the input, in a parametric and therefore trainable fashion. They function as an efficient inverse operation of a convolution.

Figure 5.7: Transposed 2D convolution (source: [51]).

Another type of layer is often found in the CNN literature, namely the pooling layer. These layers are not trainable and apply a simple function (max or average) on regions of the image, with the goal of reducing or increasing the spatial dimensions. They are particularly useful to limit computational complexity and can therefore decrease training and prediction time.
Once the spatial dimensionality is reduced enough, near the end of the network, a set of traditional fully connected layers is appended to perform classification or regression.

Figure 5.8: Illustration of a Convolutional Neural Network (source: Wikipedia).

This type of network has been applied with success to the task of image classification, predicting for instance what kind of object is in the image. It has also been applied to object localization, by predicting the coordinates of the bounding box in which the object lies.


5.2.4 Fully Convolutional Neural Networks

When multiple instances of an object need to be localized or identified, dense Perceptron layers are not applicable due to computing resource limitations. By using only convolutional layers between the inputs and outputs, it is possible to efficiently generate an output image of the same size as the input. The network then creates a mapping from a source image to a target image according to the learned parameters (kernels). This is the most popular method applied to image segmentation, where a per-pixel prediction is expected.
As the number of learnable parameters is kept reasonably small at each convolutional layer, it is possible to build deeper networks capable of detecting abstract and complex features.

Figure 5.9: Illustration of a Fully Convolutional Neural Network (source: [52]).

5.2.5 Pre-trained Layers

A general technique found in the literature and in practice is the concept of reusing pre-trained layers in other models. These layers could have initially been trained on a different but somewhat similar problem. For example, an Auto-Encoder on images of objects could be built and its first few feature maps reused as a starting point in an object classification model.
This approach has been used with success on deep networks, where training all layers from scratch (i.e. with randomly initialized weights) was either too slow or did not give good results. It can also be a way to build a better model with fewer training data, by reusing what has been learned from prior trainings.

5.2.6 GoogLeNet

GoogLeNet [53] is an interesting architecture, combining convolutions using different kernel sizes and local response normalization (see 5.3.9). This particular architecture was named the ”Inception module”:

A. Ducommun dit Boudry 49

Page 52: Deep-Learning Image Segmentation

CHAPTER 5. MACHINE-LEARNING APPLIED TO COMPUTER VISION

Figure 5.10: Inception module architecture (source: Google)

5.2.7 ResNet

The residual network [54] is a noteworthy architecture because instead of formulating the problem as:

y = f(x, θ_1)    (5.4)

where f(x, θ_1) is any function - typically one or several neural network layer(s) - mapping between input x and output y using learned weights θ_1, the problem is equivalently redefined as:

y = x + g(x, θ_2)    (5.5)

where g(x, θ_2) is effectively the learned difference from the input x to get the output y, using a different set of weights θ_2. It is not immediately obvious why this might be a good idea. The reason is that when a stack of layers is made of such modules, and assuming for now that the initial weights θ_2 are zeros, each layer output is initially equal to x, and therefore of the same magnitude across the whole network.

Figure 5.11: ResNet architecture (source: [54])

In practice, the weights are randomly initialized but the output will be close to x and in any case not weaker and weaker down the network. That ensures that


the signal is initially propagated throughout the whole neural network, even to the deepest layers. This architecture was demonstrated to improve the training speed and efficiency of deep networks.
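A minimal tf.keras sketch of such a residual module (the channel count is illustrative, and the input x is assumed to already have the same number of channels so that the addition is valid):

import tensorflow as tf

def residual_block(x, channels=64):
    # y = x + g(x): the convolutions only have to learn the difference from the input.
    g = tf.keras.layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    g = tf.keras.layers.Conv2D(channels, 3, padding="same")(g)
    return tf.keras.layers.Activation("relu")(tf.keras.layers.Add()([x, g]))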

5.2.8 U-Net

An application of FCNNs (Fully Convolutional Neural Networks) is the U-Net architecture [55]. The key aspect is to add forward connections from layers close to the input to layers close to the output having the same image scale (gray arrows in the diagram below).

Figure 5.12: U-Net architecture (source: [55])

By fusing the information at various scales, the network is capable of better pattern recognition for image segmentation. This architecture will be the main source of inspiration for the models studied in this chapter.
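A minimal tf.keras sketch of the skip-connection idea, reduced to a single scale level (channel counts are illustrative and do not correspond to the topologies evaluated later):

import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(256, 256, 3))
down = layers.Conv2D(8, 5, strides=2, padding="same", activation="relu")(inputs)
bottom = layers.Conv2D(16, 3, padding="same", activation="relu")(down)
up = layers.Conv2DTranspose(8, 5, strides=2, padding="same", activation="relu")(bottom)
# The skip connection: fuse up-scaled features with the features at the input scale.
merged = layers.Concatenate()([up, inputs])
outputs = layers.Conv2D(2, 1, padding="same")(merged)  # two classes, per pixel
model = tf.keras.Model(inputs, outputs)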

5.2.9 DARTS

A very recent paper [56] was published on the topic of finding the best model architecture using an approximation of the principle used to train the network weights. By representing the architectural choices and inter-layer/module connections by a continuous and differentiable function, it becomes possible to search for the best architecture using back-propagation too.
In the future, this concept will probably lead to notable advances in terms of model compactness and accuracy.


5.3 Model Hyper-Parameters

Any user-definable parameter that stays fixed during training is considered a hyper-parameter. Below is a list of such parameters or choices used in this study.

5.3.1 Batch Size, Steps and Epochs

The first parameter is the number of training samples that are fed into the layers at the same time. It is called the batch or mini-batch size. A batch size equal to one is effectively the same as feeding training samples one by one. In modern neural networks, samples are given in batches as it is computationally more efficient on GPUs. Also, batches have been shown to improve learning speed and stability. Each time a batch is processed (feed-forward and back-propagation), it is called a step. An epoch is a certain number of steps, after which an evaluation can be done. The number of steps and epochs defines the training duration and the frequency at which metrics are evaluated.

5.3.2 Loss Functions

A loss function computes the error between the prediction (network output) and the expected label. The function should ideally be continuous and differentiable. Many loss functions are used in practice, some of which are described below (y is the network output and t is the label).
The absolute difference and squared error are typical general-purpose loss functions:

ℓabs(y, t) = |y − t| where y, t ∈ R (5.6)

ℓmse(y, t) = (y − t)2 where y, t ∈ R (5.7)

Compared to squared error, the Huber loss function is less sensitive to values outside of the range defined by [−δ, δ]:

ℓ_huber(y, t) = { 0.5(y − t)² if |y − t| ≤ δ;  δ(|y − t| − 0.5δ) otherwise } where y, t ∈ R    (5.8)

The Hinge loss function is applicable to binary classification problems and is also used in SVMs (Support Vector Machines):

ℓhinge(y, t) = max(0, 1− yt) where y, t ∈ [−1, 1] (5.9)

The log loss or cross-entropy function is usually used when dealing with probabilities of multiple classes:

ℓ_log(y, t) = −t log(y) − (1 − t) log(1 − y) where y, t ∈ [0, 1]    (5.10)


The IOU (Intersection-Over-Union) loss [57] is based on an approximation of the Jaccard similarity coefficient. It is interesting for image segmentation, where the objective is to measure the accuracy of the segmentation:

ℓ_jaccard(y, t) = 1 − yt / (y + t − yt) where y, t ∈ [0, 1]    (5.11)

In the case of vectors or flattened matrices, the error can be reduced to a scalar by taking the sum over the components:

ℓ_sum(y, t) = Σ_{i=1}^{n} ℓ(y_i, t_i)    (5.12)

or the average, which differs only by a constant factor:

ℓ_avg(y, t) = (1/n) Σ_{i=1}^{n} ℓ(y_i, t_i)    (5.13)

5.3.3 Weighted Per-pixel Loss

In image segmentation, the distribution of classes can be uneven. For instance, in binary segmentation, positive pixels could be two orders of magnitude less frequent than negative pixels. To produce a loss function more adapted to such a distribution, the following mapping can be done (where the label weights are adjusted by ω):

ℓ_weighted(y, t, ω) = Σ_{i=1}^{n} ω_i ℓ(y_i, t_i)    (5.14)
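For binary segmentation, equation 5.14 can be sketched in TensorFlow as follows (the cross-entropy choice and tensor shapes are assumptions for illustration, not the exact loss used later in this chapter):

import tensorflow as tf

def weighted_pixel_loss(logits, labels, weights):
    # logits: BxHxWx2 network output, labels: BxHxW integer class indices,
    # weights: BxHxW per-pixel weight image from the dataset.
    per_pixel = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    return tf.reduce_sum(weights * per_pixel)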

5.3.4 Learning Schedules

The rate at which the learning updates the parameters in the model is controlled by a scalar learning rate. The higher the rate, the faster the parameters should converge towards the optimal solution, and vice-versa. But in reality, a high rate will likely overshoot the optimal solution, can produce oscillations around the optimum and can eventually diverge completely from the global optimum. Setting a rate too low will require a longer training time and the solution might converge to a local optimum and stay stuck there. Finding the ideal rate is a difficult problem with no definite solution. A typical approach is to try various rates and evaluate which one gives better results in a specific situation.
A second degree of freedom is the learning schedule, which is the function used to model how the learning rate evolves over time. Learning rate decay strategies exist to benefit from the advantages of a high learning rate at the beginning


of training while gradually reducing it to reach a stable optimum. Here are some strategies (i is the current training step, α is the base learning rate).
A constant rate:

sconst(i) = α (5.15)

A rate inversely proportional to time:

s_itd(i) = α / (1 + βi) where β > 0    (5.16)

Figure 5.13: Learning schedule where α = 1, β = 0.1 (inverse time decay).

A rate producing an exponential decay:

s_exp(i) = α e^(−βi) where 0 < β < 1    (5.17)

Figure 5.14: Learning schedule where α = 1, β = 0.1 (exponential decay).

Many other variants exist. One interesting possibility is to make the learning schedule cyclic, by resetting the rate to some (possibly decaying) value after a time interval and applying the decay formula again. This has been used successfully to overcome local minima problems. For instance, a linear cosine decay can be defined as:

s_lincos(i) = α ( j (1 + cos(2πf j)) / 2 + β ) where j = (n − min(i, n)) / n    (5.18)


Figure 5.15: Learning schedule where α = 1, β = 0.1, f = 3, n = 100 (linear cosine decay).

It is also possible to divide the global learning schedule into separate schedules for groups of layers in the neural network. When using pre-trained layers, this can be used to freeze or reduce the learning rate of the first layers (close to the inputs), forcing the optimizer to update and specialize in priority the deepest layers.
Finally, it is worth mentioning a heuristic designed to estimate the optimal learning rate. It works by using the exponential decay but with β < 0, which will increase the learning rate over time. If a training session is run using that schedule, it is possible to plot the loss with respect to the learning rate on a graph. The exponential increase should not start immediately at step zero, to ideally let the neural network converge a little towards the optimum. From there, the rate increases and should produce a smaller loss, until it gets too large and overshoots the optimum. That is the threshold right below which the learning rate might be optimal.

Figure 5.16: Learning schedule to find the optimal learning rate.

Figure 5.17: Learning rate versus loss: heuristic to find the optimal learning rate.


The optimal learning rate according to this heuristic is found at the bottom of the curve, where the loss is the smallest. Note that this is only a heuristic and there is no guarantee that it will work in all scenarios.
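A minimal sketch of such an increasing probe schedule (base rate, growth factor and warm-up length are illustrative assumptions):

import numpy as np

def lr_probe_schedule(step, base_rate=1e-5, growth=1.05, warmup=10):
    # Hold the rate constant for a few steps, then increase it exponentially.
    return base_rate * growth ** max(0, step - warmup)

# During a short training run, record (rate, loss) pairs at every step; the
# candidate optimal rate sits just below the point where the loss rises again.
rates = [lr_probe_schedule(i) for i in range(100)]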

5.3.5 Optimizers

The optimizer is the general term for the software component responsible for updating the learnable parameters of the model. An optimizer expects a loss function to be minimized, a learning schedule and the list of parameters to update. The optimizer computes the gradient of the loss function in order to update the parameters. The following optimizers were tested:

Gradient descent: traditional implementation of the back-propagation algorithm, available and used for decades. It is the reference implementation against which the other optimizers are compared.

Momentum: takes into account the previous errors by computing a running average or momentum of the loss function gradient. It has the potential to overcome local minima better than gradient descent [58].

Nesterov Momentum: a small improvement on the momentum optimizer: it looks ahead, trying to predict where the gradient will be according to the current momentum [59].

RMSProp: uses an adaptive learning rate and improves on the ADAGrad optimizer, which takes into account where the gradient is the steepest to normalize the descent toward the optimum [60, 61].

Adam: combines Momentum with RMSProp [62].

Nadam: combines Nesterov Momentum with RMSProp [63].

Note: describing in detail how these optimizers are implemented is not in the scope of this document.

5.3.6 Activation Functions

In a neural network layer, the inputs are mapped and accumulated by a set of weights. The resulting value is often a linear combination of the inputs. To model non-linear relationships and produce highly contrasted output signals, an activation function is added. From the definition of the Perceptron earlier in this chapter (see 5.2.1), the accumulation and activation can be separated into two functions:

f(x) = θx + b
h(y) = { 1 if y ≥ 0;  0 otherwise } where y = f(x)    (5.19)


In the formula above, f(x) is the linear combination of the input x with weights θ and bias b, and h(y) is the activation function used as the output of the neuron. This particular activation function is called the Heaviside step function and is used in other scientific fields. It is not used much in neural networks anymore because it is a non-continuous and non-differentiable function, therefore interfering with the back-propagation algorithm.

Figure 5.18: Heaviside activation

The most trivial activation function is the identity function, which is equal to having no activation function:

h(x) = x (5.20)

The sigmoid, also called the logistic function, is a popular and useful function in the unconstrained case x ∈ R, because it will normalize the output to the range [0, N]:

h(x) = N / (1 + e^(−x))    (5.21)

Figure 5.19: Sigmoid activation (N = 1)

The tanh function has a smooth output, ranging in [−1, 1]:

h(x) = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))    (5.22)


Figure 5.20: Hyperbolic tangent activation

In DeepLearning, a simple albeit effective function is the ReLU [64, 65] and its variant the Leaky-ReLU (Rectified Linear Unit). They are trivial to compute and non-saturating, which is an advantage in deep networks, but not differentiable at x = 0, which it turns out is not a big issue in practice (note: in ReLU, α = 0):

h(x) = { x when x ≥ 0;  αx otherwise }    (5.23)

Figure 5.21: ReLU activation (α = 0)

Leaky-ReLU is a variant used to avoid a problem called ”dying ReLU”, which is caused by h(x) = 0 when x is less than zero. If that situation persists for a long time for a neuron, the network could perpetually ignore that neuron by presenting an input x never capable of reaching zero or above given the input signal, rendering that neuron useless. A simple fix is to forward a small signal modulated by a factor α, which is usually set to 0.01 (scaling down the output signal by two orders of magnitude).

Figure 5.22: Leaky-ReLU activation (α = 0.1)
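The activations above fit in a few lines of NumPy (a sketch only; the α default follows the text):

import numpy as np

def sigmoid(x, n=1.0):
    return n / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

x = np.linspace(-5, 5, 11)
print(np.round(leaky_relu(x), 3))  # small negative slope below zero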


A specific variant exists for convolutional layers, called CReLU [66]. It works by separating the output signal into two distinct channels (therefore multiplying the total number of output channels by two), one for positive activations and one for negative activations:

h_+(x) = { x when x ≥ 0;  0 otherwise }
h_−(x) = { 0 when x ≥ 0;  −x otherwise }    (5.24)

Figure 5.23: CReLU activation: red is the positive activation channel, blue is the negative activation channel.

5.3.7 Kernel Initializers

A key problem in deep networks is the stability and magnitude of the output values at each layer. Before this problem was analyzed, the weights were initialized with random values taken from a uniform [−1, 1] or normal distribution (µ = 0, σ = 1), which caused the scale of the weights and/or output values at each layer to either increase or decrease with network depth and training time, producing untrainable networks.
Yann LeCun et al. [67] first proposed a specific initialization schema using a normal distribution with these parameters (f_in is the fan-in of the layer, i.e. the number of input weights per neuron in the layer):

µ = 0,  σ = √(1 / f_in)    (5.25)

or with a uniform distribution in the range [−L, L]:

L = √(3 / f_in)    (5.26)


Kaiming He and his team [68] found that for networks using the ReLU activation function, the following initialization schema using a normal distribution might produce better results:

µ = 0,  σ = √(2 / f_in)    (5.27)

or with a uniform distribution in the range [−L, L]:

L = √(6 / f_in)    (5.28)

Xavier Glorot and Yoshua Bengio [69] found in 2010 that, combined with a sigmoid activation function, this random initialization was not optimal and could be improved by choosing random weights from a normal distribution with the parameters below (f_out is the fan-out of the layer, i.e. the number of output connections per neuron in the layer):

µ = 0,  σ = √(2 / (f_in + f_out))    (5.29)

The weights could alternatively be picked randomly from a uniform distribution in the range [−L, L]:

L = √(6 / (f_in + f_out))    (5.30)
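A NumPy sketch of the three normal-distribution schemes for a dense layer (for a convolutional layer, the fan-in would also include the kernel area):

import numpy as np

def lecun_normal(fan_in, fan_out):
    return np.random.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    return np.random.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def glorot_normal(fan_in, fan_out):
    return np.random.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))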

5.3.8 Kernel and Gradient Regularizers

Gradient normalization can be done by clamping the gradient to a specific range of values [70]. This avoids abnormally large gradients affecting the weights of the network.
Finally, a method called kernel regularization applies constraints on the learned weights, which can limit over-fitting of the data. Instead of letting them change freely according to the back-propagated gradient, they are kept within specific boundaries or within specific shapes by adding a penalty on the unwanted weights. General-purpose methods to constrain the weights are the L1 and L2 norm regularizers, which are additional costs added to the loss function (λ is a user-defined factor):

ℓ_L1(θ) = λ Σ_{i=1}^{n} |θ_i|
ℓ_L2(θ) = λ Σ_{i=1}^{n} θ_i²    (5.31)


Here is an example of L2-regularized weights combined with a weighted loss on an arbitrary function:

ℓ_total(y, t, ω, θ) = Σ_{i=1}^{n} ω_i ℓ(y_i, t_i) + λ θ_i²    (5.32)

5.3.9 Local Response Normalization

An additional constraint on convolutional layer outputs can help each kernel or filter in the layer diversify or specialize. The idea is that if a convolutional layer has multiple kernels, each producing an output channel, it is desirable that each of them performs a different operation, transforming the input data in diverse ways.
A solution to accomplish this is called LRN (Local Response Normalization) [71]: in essence, it combines the output of pixel (i, j) on channel k with the output of the neighboring channels within a specified distance, in such a way that the strongest signal inhibits other similar signals in the neighborhood.

5.3.10 Batch Normalization

A technique to ensure the output of a layer is normalized, to avoid vanishing or exploding gradients, is called batch normalization [72]. It computes the empirical mean and standard deviation of the mini-batch to normalize the inputs. It can also optionally rescale (α) and add a bias (β) before outputting the values:

y_i = α (x_i − µ_batch) / σ_batch + β where σ_batch ≠ 0    (5.33)
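A NumPy sketch of equation 5.33 over a mini-batch (a small ε is added to the denominator for numerical stability, as most implementations do; in practice α and β are learned):

import numpy as np

def batch_norm(x, alpha=1.0, beta=0.0, eps=1e-5):
    # Normalize over the batch axis, then rescale and shift.
    mu = x.mean(axis=0)
    sigma = x.std(axis=0)
    return alpha * (x - mu) / (sigma + eps) + beta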

5.3.11 Dropout Rate

To limit over-fitting on the training data, dropout layers can be added [73, 74]. They inhibit a random fraction of the inputs by setting them to zero, forcing the network to rely on the overall important signals and not on specific ones. The randomly selected fraction is changed after every iteration.


5.4 Implementation

5.4.1 TensorFlow

The convolutional neural network is trained and evaluated using TensorFlow, a machine-learning framework developed by a team at Google [45]. Compared to NumPy [75] or other scientific computation and machine-learning libraries, the major difference is that the operations are not executed immediately where they are defined.
In NumPy/SciPy, the main building block is called an ndarray: a multi-dimensional array-like structure that can hold scalar, vector or matrix data. Many mathematical operations can then be performed on them. In TensorFlow, the equivalent is called a tensor, with a twist: it can have a partially defined shape due to the deferred execution mechanism.

import numpy as np

# computing directly y = a*x + b
a = 2.0
b = 0.5
x = np.array([0, 1, 2, 3, 4, 5])
y = a * x + b
print(y)  # [ 0.5 2.5 4.5 6.5 8.5 10.5]

Listing 1: Eager execution using NumPy

In TensorFlow, a computation graph needs to be created first; it can then be executed by feeding input data, and finally the expected output result is received.

import tensorflow as tf

# building a graph computing y = a*x + b
a = tf.constant(2.0)
b = tf.constant(0.5)
x = tf.placeholder(tf.float32, shape=[None])
y = a * x + b

# executing the graph with a particular x
with tf.Session() as s:
    result = s.run(y, {x: [0, 1, 2, 3, 4, 5]})
    print(result)  # [ 0.5 2.5 4.5 6.5 8.5 10.5]

Listing 2: Deferred execution using TensorFlow

This design has advantages and drawbacks. On the positive side, many optimization techniques can be applied because the computation graph is a DAG (Directed Acyclic Graph), for instance finding dependencies, enabling reordering and merging of operations. Also, the graph can easily be split into parallel


branches forking and joining along the computation flow, which can be dispatched to different devices for execution. A part of the graph could be executed on multiple CPU cores, while another is scheduled on a GPU. The graph computations can even be distributed over a network of hundreds of machines, without changing any line of code (except configuring at a high level of abstraction where the execution can be dispatched). On the negative side, algorithms and sequences of operations are much harder to write, in particular when handling predicates, loops and variables. Additionally, debugging is also difficult because the execution is deferred. To alleviate these issues, an optional eager execution mode was added recently to the framework, usually activated during development and disabled for large-scale computations.
What TensorFlow adds on top of standard mathematical operations is a complete suite of mathematical functions and algorithms applied to machine-learning and optimization problems in general: data pre- or post-processing, neural network layers, predefined activation and loss functions, optimizers, etc. Another core feature is the implementation of automatic differentiation on arbitrary operations: computing the gradient of a function is therefore trivial and efficient in TensorFlow.
Last but not least, the framework offers a web-based visualization tool called TensorBoard, reporting the training statistics and evaluation metrics in real time. Interestingly, user-defined metrics and values can easily be added to the dashboard.
Multiple options exist to define a neural network model in TensorFlow, including the optimizer for training and the metrics for evaluation. At the lowest level of abstraction, tensors can be manipulated using basic operations (assignments, additions, multiplications, etc). It is therefore possible to implement a complete neural network training logic with gradient descent by hand. At a medium level of abstraction, the Keras module can be used to build neural networks using components like Perceptron or convolutional layers. Some of the logic required to do training and evaluation is also implemented. At the highest level, the Estimator module offers most of the training and evaluation logic, model persistence on disk and even ready-made architectures. An Estimator can perform three tasks: train the model from a given input dataset, evaluate the performance of the model and make predictions using the learned model.
In this project, the Estimator module is used as a foundation, with a custom model built in part with Keras and a few low-level computations (e.g. to define custom loss functions). TensorBoard is used to report a complete set of metrics over time, such as the loss value, outputs after activation, input/output images, etc. A good reference used throughout this project is the O'REILLY book ”Hands-On Machine Learning with Scikit-Learn & TensorFlow” [76], dedicated to the implementation details of a machine-learning model in TensorFlow.


5.4.2 Topology Template

Because defining and connecting the layers of a neural network is a time-consuming and error-prone task, a specific configuration format has been defined to simplify the process. The proposed file format is based on YAML, a human-friendly format similar to JSON feature-wise. This topology definition is then instantiated as a trainable module with given inputs and outputs in a software implementation based on TensorFlow (see Topology Format).
Several network topologies can therefore be quickly instantiated, tested and compared. Each layer can be parametrized with at least: a unique identifier, the type of operation and its inputs. Optionally, multiple compatible inputs can be concatenated, operation parameters can be specified (e.g. the kernel size of a convolution) as well as the activation function and its normalization.
A default configuration is given at runtime so that a single topology can be tested using different optimizers, activation functions, kernel regularization and so on. But if a specific layer needs a particular configuration, it is always possible to override the default value.
The template used is archived in the trained model together with the exact runtime parameters, so it is possible to reproduce the exact same training conditions in the future.

5.4.3 Metrics

The segmentation problem is defined as a one-hot multi-class encoding (one-versus-others classification). In the case of binary segmentation, the output of the neural network is an image with two channels: the first representing the background, the second representing the foreground. The loss function then computes the error on each with respect to the associated label.
To evaluate the trained models, a set of metrics is computed at the end of each epoch and observed through time. To compute these metrics, the following definitions are necessary (in the scope of binary segmentation, but this reasoning can be expanded to multiple classes):

True positive (tp): number of positive pixels where the associated label is also positive.

False positive (fp): number of positive pixels where the associated label is negative (also called Type I errors in statistics).

True negative (tn): number of negative pixels where the associated label is also negative.

False negative (fn): number of negative pixels where the associated label is positive (also called Type II errors in statistics).


These statistics are usually stored in a confusion matrix:

                  Label
                  P    N
Prediction   P    tp   fp
             N    fn   tn

The precision score is defined for each class as the ratio of true positives versus true and false positives:

S_precision = tp / (tp + fp)    (5.34)

The recall score is defined for each class as the ratio of true positives versus true positives and false negatives:

S_recall = tp / (tp + fn)    (5.35)

Figure 5.24: Precision and recall (source: Wikipedia)


In the case of binary segmentation, a high precision score for a class (foreground or background) means that the predicted positive segmentation corresponds to the labeling. It doesn't mean that all the positive labels are detected, just that what has been detected as positive is correct. Conversely, a low precision score means the model predicted many positive regions which are in fact not positive in the associated label.
The recall, on the other hand, estimates how much of the positive segmentation was detected. A high score means that all the positive regions in the label are detected, but it doesn't mean the prediction is accurate: more positive regions may have been detected which are actually not positive in the label. A low score indicates many positive areas are missing from the prediction.
It is possible to trade precision for recall and vice-versa. But a good model needs both a high precision and a high recall score. In the case of tea leaf segmentation, it is desirable to have a high recall score on the foreground class even if its precision is a bit lower, because a second evaluation will be necessary at the marked locations to identify the type of pick (bud only, one bud and one leaf, etc.) and its pose relative to the camera (see 6). That second prediction could classify the area as ”not to pick” if it turns out that this first model made a mistake in its prediction.
It is possible to aggregate precision and recall in a single F1 score:

S_F1 = 2 · S_precision · S_recall / (S_precision + S_recall)    (5.36)

Finally, in the case of image segmentation, the Jaccard index, also called IOU (Intersection-Over-Union), is often used because it estimates well how close the segmentation is to the label (y is the prediction and t is the label):

S_jaccard = |y ∩ t| / |y ∪ t|    (5.37)

Figure 5.25: Intersection of two sets (source: Wikipedia)

Figure 5.26: Union of two sets (source: Wikipedia)
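These scores can be sketched in a few lines of NumPy for a pair of binary masks (boolean arrays; the zero-denominator cases are ignored in this sketch):

import numpy as np

def segmentation_scores(pred, label):
    tp = np.sum(pred & label)
    fp = np.sum(pred & ~label)
    fn = np.sum(~pred & label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou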

5.4.4 Training and Validation

In total, four applications have been implemented: two applications for training and evaluating a model, one for making a segmentation prediction on unseen images, and a fourth application to automatically schedule multiple training sessions with varying hyper-parameters.


The training application requires three mandatory parameters:

• The topology YAML to use to instantiate a model.

• A training dataset which is fed into the model.

• An evaluation dataset to report selected metrics at the end of each epoch.

The following table lists available options and their usage (non-exhaustive):

Option         Type    Default     Description
device         string  /cpu:0      TensorFlow device to use
ram            bool    false       Preload dataset in RAM
model          string  (random)    Model storage folder
batch          int     #-of-cpus   Mini-batch size
steps          int     -1          Number of steps per epoch
epochs         int     1           Number of epochs
optimizer      string  gd          Optimizer
learningrate   float   0.001       Initial learning rate
learningdecay  string  none        Learning rate decay strategy
loss           string  absdiff     Loss function
activation     string  sigmoid     Default activation function
dropoutrate    float   0.0         Default dropout rate
labelweight    float   0.0         Label extra weighting bias
initializer    string  none        Default kernel initializer
regularizer    string  none        Default kernel regularizer
constraint     string  none        Default kernel constraint
verbose        bool    false       Log more statistics during training

Table 5.1: Training options

The evaluation application requires the model storage folder and one or multiple evaluation dataset(s). As the hyper-parameters of the model are stored in the folder, they don't need to be specified again on the command-line.
The prediction application takes the same arguments as for evaluation, but can also accept an arbitrary image folder to make predictions on unseen, user-defined images. In this case, the raw images are divided into regions of 256x256 pixels at runtime.
The multi-training scheduler application uses another YAML input file, this time describing the hyper-parameters to use for each training session (see Training Parameters Format).

5.4.5 Scaling-up Using GPUs

Initially, the models are trained and evaluated on a local workstation. As the model topologies grow in complexity and many hyper-parameters need to be evaluated, using the Amazon EC2 computing cloud is the next logical step.


Virtual servers are rented by the minute to run training sessions, reading input datasets from and exporting models to the Amazon S3 storage cloud. The following server instances were used:

Name        CPUs  RAM    GPU           $/h   Usage
c5.4xlarge  16    32 GB  none          0.77  Building datasets
p2.xlarge   4     61 GB  K80 (12 GB)   0.97  Training medium models
p3.2xlarge  8     61 GB  V100 (16 GB)  3.31  Training large models

Table 5.2: Amazon EC2 instances

Note: the price per hour is indicative only (for the Ireland region in July 2018).
A set of shell scripts is implemented to automatically launch, configure and manage the instances. A simple task scheduler is also deployed on the servers to queue training tasks locally and monitor their completion. The scheduler shuts down the machine once all the tasks are processed, to keep costs under control.


5.5 Model Architectures and Parameters

5.5.1 High-level TensorFlow Graph

The general TensorFlow computation graph for the complete model looks like this:

Figure 5.27: TensorFlow model graph

The dataset is read and parsed by the input_fn block. It converts images into a floating-point representation, each channel in the range [0, 1], and applies an RGB image normalization technique. The topology block contains the instantiated components defined in the topology YAML file (see 5.5.3). The labels (segmentation masks) are transformed into two one-hot encoded classes by the labelling block: the first class being the background and the second the foreground. A per-pixel class index tensor is also constructed for the loss functions that require this information (index 0 is the background and index 1 the foreground). The normalization block adds a 2D convolution to map the output channels of the topology to the same cardinality as the N classes in the labels (2 in the case of binary segmentation). It also applies a softmax activation function to obtain the final predictions as a 2D array of most likely per-pixel class indices (equivalent to a bitmap for binary segmentation). Finally, loss_fn is the implementation of the loss function, taking the tensor after the 2D convolution from the normalization block.


5.5.2 Building Blocks

Hereafter is a list of a few useful topology blocks that can be combined to make complex convolutional neural networks:

CONV2D: 2D convolution to scale down the input or map input channels into another representation.

TR.CONV2D: transposed 2D convolution to scale up the input.

CONCAT: concatenate compatible inputs into a single tensor.

ACTIVATION: activation function h(y) (can be omitted if an identity function is desired).

LRN: local response normalization (optional).

BN: batch normalization (optional).

DROPOUT: dropout layer (optional).

The three diagrams hereunder are examples of combinations of these blocks. Many more possibilities exist.

Figure 5.28: Simple down-scaling module


Figure 5.29: Down-scaling module with varying kernel sizes

Figure 5.30: Up-scaling module with varying kernel sizes


5.5.3 Evaluated Topologies

Topologies of various depths have been defined and evaluated:

UNET1
Layer    Input(s)       Output      Parameters
inputs   -              256x256x3   -
down1_2  inputs         128x128x4   kernel: 7x7, strides: 2
down1b   down1_2        128x128x12  kernel: 1x1, strides: 1
up1_2    down1b         256x256x3   kernel: 7x7, strides: 2
up1b     up1_2+inputs   256x256x6   kernel: 1x1, strides: 1

Table 5.3: Topology unet1.yaml

UNET2
Layer    Input(s)        Output      Parameters
inputs   -               256x256x3   -
down1_2  inputs          128x128x4   kernel: 5x5, strides: 2
down1b   down1_2         128x128x12  kernel: 1x1, strides: 1
down2_2  down1b          64x64x12    kernel: 7x7, strides: 2
down2b   down2_2         64x64x36    kernel: 1x1, strides: 1
up2_2    down2b          128x128x12  kernel: 7x7, strides: 2
up2b     up2_2+down1b    128x128x12  kernel: 1x1, strides: 1
up1_2    down1b          256x256x3   kernel: 5x5, strides: 2
up1b     up1_2+inputs    256x256x6   kernel: 1x1, strides: 1

Table 5.4: Topology unet2.yaml

This pattern is repeated up to level five, where the network has 20 layers and down5b has an 8x8x324 tensor output.

Figure 5.31: Unet2 topology in TensorFlow


Additionally, here are variants using multiple kernel sizes at each depth-level:

UNET1B
Layer    Input(s)          Output      Parameters
inputs   -                 256x256x3   -
down1_1  inputs            128x128x4   kernel: 3x3, strides: 2
down1_2  inputs            128x128x4   kernel: 5x5, strides: 2
down1_3  inputs            128x128x4   kernel: 7x7, strides: 2
down1b   down1_1..3        128x128x12  kernel: 1x1, strides: 1
up1_1    down1b            256x256x3   kernel: 3x3, strides: 2
up1_2    down1b            256x256x3   kernel: 5x5, strides: 2
up1_3    down1b            256x256x3   kernel: 7x7, strides: 2
up1b     up1_1..3+inputs   256x256x12  kernel: 1x1, strides: 1

Table 5.5: Topology unet1b.yaml

UNET2B
Layer    Input(s)          Output      Parameters
inputs   -                 256x256x3   -
down1_1  inputs            128x128x4   kernel: 3x3, strides: 2
down1_2  inputs            128x128x4   kernel: 5x5, strides: 2
down1_3  inputs            128x128x4   kernel: 7x7, strides: 2
down1b   down1_1..3        128x128x12  kernel: 1x1, strides: 1
down2_1  inputs            64x64x12    kernel: 3x3, strides: 2
down2_2  inputs            64x64x12    kernel: 5x5, strides: 2
down2_3  inputs            64x64x12    kernel: 7x7, strides: 2
down2b   down2_1..3        64x64x36    kernel: 1x1, strides: 1
up2_1    down2b            128x128x12  kernel: 3x3, strides: 2
up2_2    down2b            128x128x12  kernel: 5x5, strides: 2
up2_3    down2b            128x128x12  kernel: 7x7, strides: 2
up2b     up2_1..3+down1b   128x128x12  kernel: 1x1, strides: 1
up1_1    down1b            256x256x3   kernel: 3x3, strides: 2
up1_2    down1b            256x256x3   kernel: 5x5, strides: 2
up1_3    down1b            256x256x3   kernel: 7x7, strides: 2
up1b     up1_1..3+inputs   256x256x12  kernel: 1x1, strides: 1

Table 5.6: Topology unet2b.yaml

Again, this pattern is repeated up to level five, where the network has 40 layers and down5b has an 8x8x972 tensor output.

Figure 5.32: Unet1b topology in TensorFlow


5.6 Results

5.6.1 Tested Hyper-Parameters

activation functions: none, sigmoid, tanh, softmax, softplus, softsign, relu, relu6, leaky-relu, selu, crelu, elu.

loss functions: absolute difference, mean squared error, huber, hinge, log-loss, sigmoid, softmax, sparse softmax, jaccard.

initializer functions: unit, glorot/xavier, he, lecun.

optimizers: gradient descent, momentum, nesterov, rmsprop, adam, nadam.

learning rates: constants, polynomial decay, exponential decay, cosine decay, linear cosine decay, noisy linear cosine decay.

normalization: with and without lrn, batch normalization, dropout.

topologies: unet1, unet1b, unet2, unet2b, ..., unet5, unet5b.

5.6.2 Remarks

In the tables hereafter, FG denotes the foreground class and BG the background class. The best result for a given metric is highlighted.

Over-fitting evaluation is performed at the end of each epoch, using the evaluation dataset set aside for this purpose. If the evaluation loss does not improve for a number of epochs, training is stopped and the model parameters from the epoch with the lowest loss are kept. An additional test set is available for the final evaluation: it is used only after hyper-parameter optimization is completed.

Unless otherwise specified, training and evaluation are done on an Amazon EC2 p2.xlarge instance equipped with an Nvidia Tesla K80 (12GB).

To evaluate performance on the flower dataset, only topologies are compared. The other hyper-parameters are chosen according to the selection process performed for the tea leaves dataset. Training on the tea leaves dataset is compared across the classes of hyper-parameters, and the best or most sensible choice is made for each. This is a limited exploration of the possible combinations of hyper-parameters; as will become clearer below, the topologies are clearly not optimal and will need further investigation in the future.

It should be noted that the variance of measurements is not estimated in this experiment, due to the limited budget available.
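The early-stopping rule can be summarized by the following sketch; the patience value and the two callbacks are illustrative assumptions, not the actual implementation used for these experiments.

def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=25, patience=5):
    # train_one_epoch() returns the current model parameters, evaluate(params)
    # returns the loss on the evaluation dataset (both callbacks are hypothetical).
    best_loss, best_epoch, best_params = float('inf'), 0, None
    for epoch in range(max_epochs):
        params = train_one_epoch()
        eval_loss = evaluate(params)
        if eval_loss < best_loss:
            best_loss, best_epoch, best_params = eval_loss, epoch, params
        elif epoch - best_epoch >= patience:
            break  # the evaluation loss stopped improving
    return best_params  # parameters of the epoch with the lowest evaluation loss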


5.6.3 Flower Dataset

Here are the test results after training on 21'960 different training examples, using an Amazon EC2 p3.2xlarge instance equipped with an Nvidia Tesla V100 (16GB):

Parameter             Value
batch size            50
steps                 25
max. epochs           50
optimizer             nadam
learning rate         0.001 (constant)
loss function         jaccard
activation function   lrelu
kernel initializer    glorot (normal)
lrn                   active
batch-norm            active
dropout               50%

Table 5.7: Flower: fixed hyper-parameters

Topology        T.T.(min.)  E.S.  Prec.FG  Prec.BG  Rec.FG  Rec.BG  F1 FG  F1 BG  IOU
unet1                    3    10    89.8%    89.0%   94.5%   80.5%   0.92   0.85  0.793
unet1b                   7    19    92.2%    90.1%   94.8%   85.3%   0.93   0.88  0.829
unet2                    3    11    87.4%    91.9%   96.4%   74.7%   0.92   0.82  0.774
unet2b                   3     6    92.3%    91.9%   95.8%   85.4%   0.94   0.89  0.841
unet3                    4    10    82.9%    97.1%   99.0%   62.9%   0.90   0.76  0.719
unet3b                   4     6    93.8%    90.3%   94.8%   88.6%   0.94   0.89  0.850
unet3c-lrnonly          16    11    95.1%    91.0%   95.1%   91.0%   0.95   0.91  0.871
unet4                    3     6    85.2%    96.1%   98.5%   68.9%   0.91   0.80  0.756
unet4b                   5     6    93.4%    91.6%   95.6%   87.7%   0.94   0.90  0.853
unet5                    7    11    85.2%    93.0%   97.1%   69.3%   0.91   0.79  0.745
unet5b                   6     6    90.2%    93.9%   97.1%   80.9%   0.94   0.87  0.824

Table 5.8: Flower: results by topology (T.T. is training time, E.S. is the early-stop epoch). Note: these results are evaluated on the final test set.

Good results are easily obtained on this dataset. Even the early unet1b topology achieves precision and recall of almost 90% on both classes. In theory, a more complex model is able to capture more detailed relationships between the source image and its segmentation. As reported above, the deeper and wider a topology is, the better the IOU performance, but at the cost of longer training due to a larger number of trainable parameters. At prediction time, a complex model also requires more RAM and more computation, which means it is not practical to increase depth and width indefinitely.

The provided labels (also called ground truth) are subjective, and in some cases the predicted segmentation is arguably better (e.g. holes within flowers where the background is visible). This means there is no chance of achieving 100% accuracy without sacrificing either precision or recall.


To illustrate the differences between topologies, some predictions are shown below. In order to better visualize the quality of the segmentation, a color pattern is defined: true positives are in green, true negatives in black, false positives in red and false negatives in magenta. An ideal result using this color scheme would therefore contain only green and black pixels.

This particular instance's segmentation is more difficult to predict than others, due to the background features. It is interesting to see, on a qualitative level, how a deeper network achieves better results. However, on unet5b (not shown), the segmentation is slightly worse than on unet4b, so there is a limit to how deep the network can be efficiently trained.

Figure 5.33: Left-to-right: input/label, unet1, unet3b, unet3c-lrnonly. Top-to-bottom (first column): input picture, given label, optimal segmentation according to label. Top-to-bottom (next columns): segmentation pattern of the foreground class (green: true positives, black: true negatives, red: false positives, magenta: false negatives), predicted mask, resulting segmentation applied to the source image.

Beyond unet4b, a different training strategy should probably be used: very deep networks need to be trained layer by layer, starting from the layers closest to the inputs. Once the active layer is trained, it is frozen together with the previous layers and training is started again on the next layer, until all of them are trained. Finally, a full training pass can be performed on the pre-trained topology. That kind of advanced training strategy is not put into practice in this project.
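For reference, layers can be frozen in TensorFlow by restricting the variable list passed to the optimizer; the two-layer network below is only an illustration of the mechanism, not the project's code.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 256, 256, 3])
y = tf.placeholder(tf.float32, [None, 256, 256, 2])

with tf.variable_scope('down1'):
    h = tf.layers.conv2d(x, 4, 5, strides=2, padding='same', activation=tf.nn.leaky_relu)
with tf.variable_scope('up1'):
    logits = tf.layers.conv2d_transpose(h, 2, 5, strides=2, padding='same')

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))

# Pre-train only the layers closest to the inputs (the rest stays frozen)...
down1_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='down1')
pretrain_step = tf.train.AdamOptimizer(1e-3).minimize(loss, var_list=down1_vars)

# ...then fine-tune the whole pre-trained topology.
finetune_step = tf.train.AdamOptimizer(1e-4).minimize(loss)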


5.6.4 Tea Leaves Dataset

Testing Activation Functions

Parameter             Value
batch size            50
steps                 25
max. epochs           25
dataset               medium
topology              unet3b
optimizer             nadam
learning rate         0.001 (constant)
loss function         jaccard
kernel initializer    he (normal)
lrn                   active
batch-norm            active
dropout               50%

Table 5.9: Tea leaves: fixed hyper-parameters (1)

Activation  E.S.  Prec.FG  Prec.BG  Rec.FG  Rec.BG  F1 FG  F1 BG  IOU
crelu          7     3.0%    99.7%   73.1%   81.0%   0.06   0.89  0.419
elu           16     3.0%    99.8%   75.8%   80.1%   0.06   0.89  0.415
lrelu          6     3.6%    99.6%   60.5%   86.6%   0.07   0.93  0.449
none          12     2.8%    99.7%   75.0%   78.7%   0.05   0.88  0.406
relu          12     2.5%    99.8%   79.6%   75.0%   0.05   0.86  0.387
relu6         12     2.7%    99.8%   78.9%   76.6%   0.05   0.87  0.396
selu          16     2.8%    99.8%   77.1%   78.3%   0.05   0.88  0.405
sigmoid        2     0.8%   100.0%   99.9%    4.2%   0.02   0.08  0.025
softmax        7     1.0%    99.9%   98.9%   17.4%   0.02   0.30  0.092
softplus       7     2.0%    99.8%   84.7%   65.4%   0.04   0.79  0.336
softsign       7     2.2%    99.8%   80.2%   71.3%   0.04   0.83  0.367
tanh          11     2.9%    99.7%   74.7%   79.9%   0.06   0.89  0.413

Table 5.10: Tea leaves: results by activation function (E.S. is the early-stop epoch).

The F1 score of both classes should be as high as possible, which excludes sigmoid and softmax. These activation functions could surely produce better results in a different topology or with other hyper-parameters. Some functions are discarded because they are not among the top choices: softplus and softsign. The tanh function is an interesting candidate but has a saturating output, which might be a problem in deeper networks; the same goes for relu6. The computational complexity of crelu is doubled, because it outputs twice as many channels, and the gain is not visible on this problem. Using no activation function at all is a valid option, but it converges slowly.

That leaves elu, leaky-relu, relu and selu. As leaky-relu can avoid the "Dying ReLU" problem, it is preferred over standard ReLU. It is also simpler to compute than elu or selu, and is therefore chosen as the layers' activation function.


Testing Loss Functions

Parameter             Value
batch size            50
steps                 25
max. epochs           25
dataset               medium
topology              unet3b
optimizer             nadam
learning rate         0.001 (constant)
activation function   leaky-relu
kernel initializer    he (normal)
lrn                   active
batch-norm            active
dropout               50%

Table 5.11: Tea leaves: fixed hyper-parameters (2)

Loss            E.S.  Prec.FG  Prec.BG  Rec.FG  Rec.BG  F1 FG  F1 BG  IOU
abs               19     1.0%   100.0%   99.3%   16.5%   0.02   0.28  0.087
hinge              7     2.1%    99.8%   86.3%   67.4%   0.04   0.80  0.347
huber              1     1.1%    99.4%   61.8%   53.0%   0.02   0.69  0.269
jaccard            5     3.1%    99.7%   65.5%   83.6%   0.06   0.91  0.432
log               14     1.7%    99.9%   90.5%   58.1%   0.03   0.73  0.299
mse               25     1.0%   100.0%   99.7%   17.3%   0.02   0.30  0.091
sigmoid            8     2.6%    99.8%   78.6%   76.2%   0.05   0.86  0.393
softmax            8     1.9%    99.8%   85.8%   64.5%   0.04   0.78  0.332
sparse softmax    14     1.8%    99.9%   89.0%   61.4%   0.04   0.76  0.316

Table 5.12: Tea leaves: results by loss function (E.S. is the early-stop epoch).

The absolute difference, mean squared error and huber loss functions are discarded due to poor performance. They also suffer from slower convergence towards the optimum than the other functions. The jaccard loss function is selected, because it performs better and is also, in theory, better adapted to image segmentation problems.
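For reference, a soft Jaccard loss can be derived directly from the IOU definition; the following is a minimal sketch (the smoothing constant and the axis convention are assumptions, and the exact formulation used in this project may differ).

import tensorflow as tf

def jaccard_loss(one_hot_labels, probabilities, smooth=1e-7):
    # Soft intersection-over-union computed per class and per sample, then averaged:
    # loss = 1 - |A intersect B| / |A union B|
    intersection = tf.reduce_sum(one_hot_labels * probabilities, axis=[1, 2])
    union = tf.reduce_sum(one_hot_labels + probabilities, axis=[1, 2]) - intersection
    iou = (intersection + smooth) / (union + smooth)
    return 1.0 - tf.reduce_mean(iou)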


Testing Kernel Initializers

Parameter             Value
batch size            50
steps                 25
max. epochs           25
dataset               medium
topology              unet3b
optimizer             nadam
learning rate         0.001 (constant)
loss function         jaccard
activation function   leaky-relu
lrn                   active
batch-norm            active
dropout               50%

Table 5.13: Tea leaves: fixed hyper-parameters (3)

Initializer        E.S.  Prec.FG  Prec.BG  Rec.FG  Rec.BG  F1 FG  F1 BG  IOU
glorot_normal         6     4.7%    99.6%   57.9%   90.5%   0.09   0.95  0.474
glorot_uniform        7     2.4%    99.8%   79.3%   73.3%   0.05   0.84  0.377
he_normal             7     2.5%    99.8%   81.9%   73.6%   0.05   0.85  0.380
he_uniform           12     2.7%    99.8%   76.8%   77.2%   0.05   0.87  0.399
lecun_normal          5     3.3%    99.7%   65.4%   84.4%   0.06   0.91  0.437
lecun_uniform         7     2.2%    99.8%   80.8%   70.7%   0.04   0.83  0.364
random_normal        19     1.0%   100.0%   99.8%   22.0%   0.02   0.36  0.115
random_uniform       25     1.0%   100.0%   99.5%   17.9%   0.02   0.30  0.094
truncated_normal     25     1.5%    99.9%   93.9%   51.4%   0.03   0.68  0.264

Table 5.14: Tea leaves: results by kernel initializers (E.S. is the early-stop epoch).

Among the kernel initialization strategies, both the lecun and glorot methods of picking values from a normal distribution produce good results. Alternatively, he could also be a good choice. As expected, the default initialization strategies perform poorly. Because glorot_normal has a slight advantage, it is selected as the main strategy for the final experiment.


Testing Normalization Strategies

Parameter             Value
batch size            50
steps                 25
max. epochs           25
dataset               medium
topology              unet3b
optimizer             nadam
loss function         jaccard
activation function   leaky-relu
kernel initializer    he (normal)

Table 5.15: Tea leaves: fixed hyper-parameters (4)

Strategy       E.S.  Prec.FG  Prec.BG  Rec.FG  Rec.BG  F1 FG  F1 BG  IOU
bn               13     1.9%    99.7%   70.2%   70.7%   0.04   0.83  0.362
bndropout         7     1.4%    99.7%   83.2%   52.1%   0.03   0.68  0.267
dropout          25     0.8%     0.0%  100.0%    0.0%   0.02   0.00  0.004
lrn              25     4.3%    99.9%   86.2%   84.3%   0.08   0.91  0.442
lrnbndropout      6     3.7%    99.7%   68.8%   85.4%   0.07   0.92  0.444
lrndropout       23     2.7%    99.8%   85.6%   75.3%   0.05   0.86  0.390
nonorm           23     4.2%    99.8%   84.0%   84.3%   0.08   0.91  0.442

Table 5.16: Tea leaves: results by normalization strategy (E.S. is the early-stop epoch).

Among the normalization strategies, LRN (Local Response Normalization) alone performs the best with respect to the IOU and F1 scores. LRN combined with BN (Batch Normalization) and dropout (50%) enhances the IOU score slightly. It should be noted that using no normalization at all also performs well on this topology. The LRN strategy (lrn) is selected, optionally with BN and dropout (lrnbndropout): these two options will be evaluated in the final experiment.
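As an illustration, the strategies compared above roughly amount to toggling the following operations after a convolution. This is a minimal sketch; the LRN constants are illustrative (AlexNet-style values), not necessarily those used in the project.

import tensorflow as tf

def normalize(x, training, use_lrn=True, use_bn=False, dropout_rate=0.0):
    # Optional local response normalization, batch normalization and dropout.
    if use_lrn:
        x = tf.nn.lrn(x, depth_radius=5, bias=1.0, alpha=1e-4, beta=0.75)
    if use_bn:
        x = tf.layers.batch_normalization(x, training=training)
    if dropout_rate > 0.0:
        x = tf.layers.dropout(x, rate=dropout_rate, training=training)
    return x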


Testing Optimizers

Parameter             Value
batch size            50
steps                 25
max. epochs           25
dataset               medium
topology              unet3b
learning rate         0.001 (constant)
loss function         jaccard
activation function   leaky-relu
kernel initializer    he (normal)
lrn                   active
batch-norm            active
dropout               50%

Table 5.17: Tea leaves: fixed hyper-parameters (5)

Optimizer   E.S.  Prec.FG  Prec.BG  Rec.FG  Rec.BG  F1 FG  F1 BG  IOU
adam           7     3.6%    99.7%   65.8%   85.8%   0.07   0.92  0.446
gd             1     0.6%    99.0%   28.1%   59.9%   0.01   0.75  0.301
momentum       1     0.6%    99.0%   28.5%   59.6%   0.01   0.74  0.299
nadam          7     3.4%    99.7%   68.1%   84.1%   0.06   0.91  0.436
nesterov       1     0.6%    99.0%   28.5%   59.5%   0.01   0.74  0.299
rmsprop       12     2.7%    99.8%   76.8%   77.8%   0.05   0.87  0.402

Table 5.18: Tea leaves: results by optimizer (E.S. is the early-stop epoch).

The adam and nadam optimizers are the fastest to converge towards the optimum. As expected, gd (gradient descent) has very limited success on this problem. The momentum and nesterov optimizers have an additional hyper-parameter, namely the momentum accumulation factor, which is set to 0.5 by default; it could be adapted to generate better results. The top two optimizers will be evaluated in the final experiment.


Testing Learning Rates

Parameter             Value
batch size            50
steps                 25
max. epochs           25
dataset               medium
topology              unet3b
optimizer             nadam
loss function         jaccard
activation function   leaky-relu
kernel initializer    he (normal)
lrn                   active
batch-norm            active
dropout               50%

Table 5.19: Tea leaves: fixed hyper-parameters (6)

L. rate            E.S.  Prec.FG  Prec.BG  Rec.FG  Rec.BG  F1 FG  F1 BG  IOU
cos_decay            25     1.4%    99.9%   91.4%   46.5%   0.03   0.63  0.239
const(0.1)            2     7.9%    99.5%   35.3%   96.7%   0.13   0.98  0.516
const(0.01)           1     2.5%    99.8%   80.2%   75.0%   0.05   0.86  0.387
const(0.001)          7     3.0%    99.7%   73.9%   80.7%   0.06   0.89  0.417
const(0.0001)        25     2.6%    99.7%   67.4%   79.3%   0.05   0.88  0.408
exp_decay             7     3.0%    99.7%   73.0%   80.9%   0.06   0.89  0.418
idt_decay             7     2.9%    99.8%   75.7%   79.0%   0.06   0.88  0.409
lincos_decay         16     3.8%    99.7%   65.9%   86.3%   0.07   0.93  0.449
natexp_decay          7     3.1%    99.7%   71.6%   81.9%   0.06   0.90  0.424
noisylincos_decay    16     3.7%    99.7%   66.7%   86.0%   0.07   0.92  0.447
poly_decay            7     2.9%    99.7%   74.0%   79.8%   0.06   0.89  0.412

Table 5.20: Tea leaves: results by learning rate (E.S. is the early-stop epoch).

This is probably the most difficult parameter to choose. The optimum might also vary according to which topology is used and to any of the other hyper-parameters. Even though the constant rate 0.1 has a higher foreground-class precision, its recall is poor. Decaying strategies have the added complexity of tweaking their respective hyper-parameters. The default constant learning rate of 0.001 produces conservative results while keeping the training time at a manageable level, and has the merit of being simple. Moreover, by using the adam optimizer, an adaptive learning-rate strategy is already included.
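As an example of a decaying schedule, an exponential decay can be attached to an optimizer as in the sketch below; the decay constants and the toy loss are illustrative only.

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()

# Start at 0.001 and multiply the rate by 0.9 every 1000 steps.
learning_rate = tf.train.exponential_decay(0.001, global_step,
                                           decay_steps=1000, decay_rate=0.9)

w = tf.Variable(1.0)   # toy parameter so the snippet is self-contained
loss = tf.square(w)    # in the project this is the segmentation loss

train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)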


Testing Topologies

Here are the test results after training on 5'940 different training examples, using an Amazon EC2 p3.2xlarge instance equipped with an Nvidia Tesla V100 (16GB):

Parameter             Value
batch size            50
steps                 25
max. epochs           25
dataset               medium
optimizer             nadam
learning rate         0.001 (constant)
loss function         jaccard
activation function   leaky-relu
kernel initializer    glorot (normal)
lrn                   active
batch-norm            active
dropout               50%

Table 5.21: Tea leaves: fixed hyper-parameters (7)

Topology  T.T.(min.)  E.S.  Prec.FG  Prec.BG  Rec.FG  Rec.BG  F1 FG  F1 BG  IOU
unet1              6    16     1.7%    99.8%   86.8%   58.4%   0.03   0.74  0.300
unet1b            14    23     3.0%    99.8%   80.0%   79.1%   0.06   0.88  0.410
unet2             14    17     2.8%    99.6%   61.4%   82.4%   0.05   0.90  0.425
unet2b             3     6     2.6%    99.7%   69.4%   78.7%   0.05   0.88  0.405
unet3              8    21     2.3%    99.5%   45.4%   84.4%   0.04   0.91  0.431
unet3b             3     4     3.7%    99.5%   48.2%   89.7%   0.07   0.94  0.464
unet4              3     3     1.1%    99.4%   46.6%   66.8%   0.02   0.80  0.338
unet4b             4     6     3.4%    99.7%   72.5%   83.2%   0.06   0.91  0.432
unet5             10    16     2.1%    99.7%   77.3%   70.3%   0.04   0.82  0.361
unet5b             4     4     2.2%    99.7%   76.4%   71.8%   0.04   0.84  0.369

Table 5.22: Tea leaves: results by topology (T.T. is training time, E.S. is the early-stop epoch).

Topologies have an effect on the overall performance, and only a few were tested. After choosing a proper activation, loss function and optimizer for the problem, the topology is the most effective change with respect to precision, recall and IOU scores. The multi-size kernel approach unet#b has an edge over the simpler architectures, and a deeper network tends to capture more complex patterns and achieves higher precision. With input images of 256x256 pixels, unet3b performs the best, having a minimum feature map resolution of 32x32.


Testing Topologies at Scale

This last experiment was done on an Amazon EC2 instance equipped with an Nvidia Tesla V100 GPU. The large dataset is used (55'440 different training instances), so the results are not directly comparable with the previous ones. In particular, the large dataset adds more rotations (+/- 5 and 10 degrees) as well as more random crops.
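A minimal sketch of such augmentations is shown below, assuming an HxWx3 image tensor and an HxWx1 mask tensor; the actual dataset generator of chapter 4 implements a larger set of variants.

import math
import tensorflow as tf

def augment(image, mask, angle_deg=5.0, crop_size=256):
    # Apply the same small rotation to image and mask (nearest neighbour for the mask).
    angle = math.radians(angle_deg)
    image = tf.contrib.image.rotate(image, angle, interpolation='BILINEAR')
    mask = tf.contrib.image.rotate(mask, angle, interpolation='NEAREST')

    # Stack image and mask so that a single random crop keeps them aligned.
    stacked = tf.concat([image, mask], axis=-1)
    stacked = tf.random_crop(stacked, [crop_size, crop_size, 4])
    return stacked[..., :3], stacked[..., 3:]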

Parameter             Value
batch size            50
steps                 100
max. epochs           100
dataset               large
optimizer             adam
learning rate         0.001 (constant)
loss function         jaccard
activation function   leaky-relu
kernel initializer    glorot (normal)
lrn                   active
batch-norm            inactive
dropout               inactive

Table 5.23: Tea leaves: fixed hyper-parameters (8)

Configuration  T.T.(min.)  E.S.  Prec.FG  Prec.BG  Rec.FG  Rec.BG  F1 FG  F1 BG  IOU
default                33    24     8.4%    99.9%   90.1%   88.3%   0.15   0.94  0.483
nadam                  27    17     8.4%    99.9%   89.3%   88.4%   0.15   0.94  0.483
lrnbndropout            4     1     4.5%    99.7%   80.5%   79.5%   0.08   0.88  0.418
seed1                  26    16     8.8%    99.8%   86.3%   89.3%   0.16   0.94  0.489
seed2                  10     5     6.4%    99.8%   89.1%   84.6%   0.12   0.92  0.454
seed3                  10     4     7.7%    99.8%   86.0%   87.7%   0.14   0.93  0.476
seed4                  26    16     8.5%    99.8%   84.4%   89.1%   0.15   0.94  0.486
unet3c                  9     9     7.0%    99.8%   87.1%   86.3%   0.13   0.93  0.466
noweights              17    10     0.0%    98.8%    0.0%  100.0%   0.00   0.99  0.494
large20n               13     6     7.4%    99.8%   83.4%   88.5%   0.14   0.94  0.478

Table 5.24: Tea leaves: final results (T.T. is training time, E.S. is the early-stop epoch).

The following additional experiments are done on the base configuration:

default: default hyper-parameters (no batch normalization nor dropout).

nadam: same as default, but uses the nadam optimizer instead of adam. This optimizer doesn't perform better than standard adam on this problem.

lrnbndropout: same as default, except batch normalization and 50% dropout are enabled. Adding batch normalization and dropout decreases performance, but it might produce a model that generalizes better to unknown future images.


seed1 to seed4: same as default, but with a different pseudo-random seed. Simply changing the initial sequence used for random initialization (kernels, etc.) has an effect on the final results. That is expected, because some particular initial values in the kernels are more suitable to the segmentation problem than others. Changing this parameter can be a way to estimate the variance of the measurements.

unet3c: a topology similar to unet3b, but with additional layers. This topology is closer to the concept of inception modules in GoogLeNet, but doesn't perform better in this setup.

noweights: disable per-pixel loss weighting. Without any weighting schema normalizing the distribution of foreground versus background pixel occurrences, the training simply focuses on the most frequent class, namely the background (a minimal sketch of such a weighting is given at the end of this section).

large20n: a different weighting schema, where borders have a lower weight than the center of areas. This weighting strategy doesn't improve the model.

After tweaking the hyper-parameters, the recall on both classes is around 85% and the precision on the background is almost 100%. The foreground per-pixel precision with respect to the given labels is between 7 and 9%. The best model so far is therefore able to find most young tea leaves in the pictures, though with a reduced precision.
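For reference, a simple class-balanced per-pixel weighting could be computed as in the sketch below. It only illustrates the idea of compensating the foreground/background imbalance; it is not the weighting schema actually used for these experiments, nor the "large20n" border variant.

import tensorflow as tf

def class_balanced_weights(class_index):
    # class_index: HxW tensor with 0 for background and 1 for foreground pixels.
    fg = tf.cast(class_index, tf.float32)
    fg_fraction = tf.reduce_mean(fg)
    fg_weight = 0.5 / tf.maximum(fg_fraction, 1e-6)
    bg_weight = 0.5 / tf.maximum(1.0 - fg_fraction, 1e-6)
    return fg * fg_weight + (1.0 - fg) * bg_weight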


Visualizing Activations

As described above, TensorFlow offers the possibility to record user-defined metrics and tensors during training, which can then be visualized in TensorBoard. A selection of reports from topology unet3b can be found below. In this first figure, the rgb image, segmentation mask, loss weighting and predictions of a sample in the mini-batch of step 671 can be visualized:

Figure 5.34: From left-to-right: rgb input, segmentation mask, predictions (top: background, bottom: foreground) and loss weighting.

In this case, the segmentation is interesting, highlighting many potential areas correctly. On a qualitative level, it seems the model is making good predictions. As can be seen in the colored segmentation below, even though many areas are marked in red (false positives), quite a few of these are arguably young tea leaf candidates which simply haven't been labeled as such.

Figure 5.35: Segmentation of the image above.
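Such reports are produced by registering TensorFlow summaries on the tensors of interest; hereafter is a minimal sketch of image and histogram summaries (the placeholder tensors and names are illustrative, not the exact ones used in this project).

import tensorflow as tf

rgb = tf.placeholder(tf.float32, [None, 256, 256, 3], name='rgb')
mask = tf.placeholder(tf.float32, [None, 256, 256, 1], name='mask')
predictions = tf.placeholder(tf.float32, [None, 256, 256, 2], name='predictions')

tf.summary.image('input_rgb', rgb, max_outputs=1)            # image reports as in figure 5.34
tf.summary.image('segmentation_mask', mask, max_outputs=1)
tf.summary.histogram('prediction_distribution', predictions)  # histograms as in figures 5.36/5.37

summary_op = tf.summary.merge_all()
writer = tf.summary.FileWriter('./logs')  # directory read by TensorBoard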


The third figure shows the distribution of these tensors' values over time. It is useful to check whether the prediction values stay bounded to some finite range and whether the inputs follow the expected distribution:

Figure 5.36: Distribution of values over time, from left-to-right: rgb input, segmentation mask, predictions and loss weighting. On the X-axis are step numbers and on the Y-axis the value range.

The fourth figure below displays the distribution of values found at the output of the activation function of each layer in the topology. Again, it is a very useful visualization to ensure the values stay within the expected range and do not diverge over time:

Figure 5.37: Distribution of values over time: all 24 layers in topology unet3b after the activation function.

The next sequence of figures shows specific outputs of a few layers after the activation function. All the layers can be visualized in the same way. This information is not extremely useful, as it is often difficult to understand what kind of operation the convolutions are performing. Nonetheless, it can be helpful to find out whether the deepest layers are producing interesting, contrasted signals or not.

Figure 5.38: Activations of down1_1, down1_2, down1_3 (each layer outputs four channels, eight in total due to the use of CReLU).

Figure 5.39: Activations of down2_1, down2_2, down2_3 (each layer outputs 12 channels, 24 in total due to the use of CReLU).

Figure 5.40: Activations of down3_1, down3_2, down3_3 (each layer outputs 36 channels, 72 in total due to the use of CReLU).


Figure 5.41: Activations of down3b (this layer outputs 108 channels, 216 in total due to the use of CReLU).


5.7 Conclusion

The problem of per-pixel image segmentation using a convolutional neural network has been studied in practice, on the basis of a flower and a tea leaf segmentation dataset. The theoretical foundations have been studied, as well as pre-existing architectures used in other contexts.

The predictions returned on the flower dataset are close to perfection without much effort. It is an easy task compared to young tea leaves, because flowers stand out in pictures simply by their textures and colors. Segmenting young tea leaves is much more challenging, and here are a couple of reasons which can explain why.

What defines young tea leaves suitable for harvest is a somewhat subjective criterion. Indeed, after a couple of hours of manual labor on tea fields, it becomes clear that not every human worker selects exactly the same type of leaves, nor with the same efficiency. That translates into difficulties to properly label the pictures. It can be expected that, given a somewhat ambiguous labeling, the machine-learning training is limited in the accuracy it can achieve, and in any case it will be impossible to reach anywhere near 100% precision and recall.

In addition to the labeling ambiguity, the prepared dataset has a high degree of freedom. The pictures are taken on different plant varieties and at different moments of the growth season. Comparing the performances on the medium and large datasets, a higher degree of freedom (e.g. flipping, scales, rotations, brightness, contrast, etc.) decreases the overall accuracy.

A human might use visual clues such as the size of leaves compared to the surrounding ones, or whether the leaves are at the periphery of the tea bush. The dataset doesn't encode this information well, or at all, in the way it is prepared. A human might also rely on the sense of touch to feel how supple the stem and leaves are. Again, that is not something the model can learn in this experiment.

It should also be noted that only a limited set of architecture patterns and hyper-parameters have been tried. The number of possible combinations is infinite, so a clever hyper-parameter optimization strategy should be developed.

It seems that early spring tea flushes would be easier to segment. The young leaves are pale yellow-green in color, which contrasts well with the dark green mature leaves from the previous year. Unfortunately, as mentioned in the previous chapter, this research started too late in spring to capture a usable dataset of these conditions.

To improve the segmentation accuracy in an industrial context, many constraints should be added, limiting the variance of parameters and therefore rendering the training much easier.


Chapter 6

Future Work

In this project, a lot of ground was covered on two fronts: first, the creation of a dataset specific to tea leaf segmentation in the context of harvesting, and second, the design of a machine-learning model to predict an image's segmentation. At the conclusion of this research, both need further investigation to refine the processes and solve the remaining issues.

Assuming young tea leaves can be segmented in the image, an estimation of their 3D coordinates relative to the camera could be computed in real-time, by knowing the camera intrinsics, the (u, v) coordinates in the image and the depth at these coordinates. The depth could be estimated in several ways, for instance by using a lidar or a depth-sensing camera.

A second machine-learning model could then be built, focused on the specific area where a harvestable leaf is likely to be found, to estimate the following variables: the type of pick, the pose of the leaves and stem, as well as the ideal cutting coordinates.

More research needs to be done on the efficiency of the models. They would run on embedded systems installed in the robot, therefore they should be energy-efficient and fit in the available CPU and RAM resources. On that topic, TensorFlow has a dedicated runtime called TensorFlow Lite which can execute predictions on resource-constrained hardware. That brings a couple of new challenges. First, it reduces the list of available operations that can be performed on tensors, so the models might need to be simplified to combine only supported operations. Secondly, the quantization of the neural network weights might reduce the accuracy of the models.

Last but not least, the difficulty of building a robot capable of precise actuation to cut stems at the right location is a complex problem on its own. It will probably need to rely on some kind of visual feedback loop to achieve stability and robustness. The latest work from OpenAI might be a good starting point [77].
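As an illustration of the back-projection step mentioned above, a pixel (u, v) with a known depth can be lifted to 3D camera coordinates using the pinhole model of chapter 4; fx, fy denote the focal lengths and cx, cy the principal point from the calibrated intrinsics.

def backproject(u, v, depth, fx, fy, cx, cy):
    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = depth.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z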


Appendices

OpenGL Shaders

#version 130

// pvmMatrix (projection * view * model) is assumed to be supplied by the host
// application; its declaration is not part of the original excerpt.
uniform mat4 pvmMatrix;

in vec3 vertex3;
in vec3 color3;
out vec4 color;

void main() {
    gl_Position = pvmMatrix * vec4(vertex3, 1.0);
    color = vec4(color3, 1.0);
}

Listing 3: GLSL 1.30 vertex shader sample

#version 130

in vec4 color;
out vec4 fragment;

void main() {
    fragment = color;
    gl_FragDepth = gl_FragCoord.z;
}

Listing 4: GLSL 1.30 fragment shader sample


MVE Stability Patch

double solve_undistorted_squared_radius (double const r2,
    double const k1, double const k2) {
    // Guess initial interval upper and lower bound
    double lbound = r2, ubound = r2;
    while (distort_squared_radius(lbound, k1, k2) > r2)
        lbound /= 1.05;
    while (distort_squared_radius(ubound, k1, k2) < r2)
        ubound *= 1.05;

    // Perform a bisection until epsilon accuracy is reached
    while (std::numeric_limits<double>::epsilon() < ubound - lbound) {
        double const mid = 0.5 * (lbound + ubound);
        if (distort_squared_radius(mid, k1, k2) > r2)
            ubound = mid;
        else
            lbound = mid;
    }
    return 0.5 * (lbound + ubound);
}

Listing 5: MVE: before patch

double solve_undistorted_squared_radius (double const r2,
    double const k1, double const k2) {
    // Guess initial interval upper and lower bound
    double lbound = r2, ubound = r2;
    while (distort_squared_radius(lbound, k1, k2) > r2) {
        ubound = lbound;
        lbound /= 1.05;
    }
    while (distort_squared_radius(ubound, k1, k2) < r2) {
        lbound = ubound;
        ubound *= 1.05;
    }

    // Perform a bisection until epsilon accuracy is reached
    double mid = 0.5 * (lbound + ubound);
    while (mid != lbound && mid != ubound) {
        if (distort_squared_radius(mid, k1, k2) > r2)
            ubound = mid;
        else
            lbound = mid;
        mid = 0.5 * (lbound + ubound);
    }
    return mid;
}

Listing 6: MVE: after patch


Topology Format

# square image size
input_size: 256

# rgb image input
input_channels: 3

# input layer name
input: x

# output layer name
output: up1

# layer definitions
layers:

  # 256x256x3 -> 128x128x4
  down1:
    input: x
    type: conv2d
    params:
      filters: 4
      kernel_size: 5
      strides: 2
      padding: same

  # 128x128x4 -> 256x256x2
  up1:
    input: down1
    type: conv2d_transpose
    params:
      filters: 2
      kernel_size: 5
      strides: 2
      padding: same

Listing 7: YAML network topology definition
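A minimal sketch of how such a file could be turned into TensorFlow layers is shown below. The loader is simplified: it assumes the YAML mapping preserves the layer order and that only conv2d and conv2d_transpose layer types appear; it is not the project's actual implementation.

import tensorflow as tf
import yaml

def build_from_yaml(path, inputs):
    # Instantiate the layers described in a topology YAML file, in order.
    with open(path) as f:
        topo = yaml.safe_load(f)

    tensors = {topo['input']: inputs}
    builders = {'conv2d': tf.layers.conv2d,
                'conv2d_transpose': tf.layers.conv2d_transpose}

    for name, layer in topo['layers'].items():
        build = builders[layer['type']]
        tensors[name] = build(tensors[layer['input']], name=name, **layer['params'])
    return tensors[topo['output']]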


Training Parameters Format

# name of configuration set
name: example

#########################
# Global defaults
global:
  # batch size
  batch_size: 10
  # training steps per epoch
  steps: 100
  # training epochs
  epochs: 1
  # global model topology
  topology: topologies/test.yaml
  # optimizer
  optimizer: gd
  # loss function
  loss: abs
  # global activation function:
  activation: relu
  # global kernel initializer function:
  initializer: he_normal

#########################
# Datasets used
dataset:
  # training dataset
  train: dataset/example-train.tfr
  # evaluation dataset
  eval: dataset/example-eval.tfr

#########################
# Sessions configuration
models:
  config1:
    # override any of global configuration:
    optimizer: adam
    loss: mse

  config2:
    optimizer: momentum:0.5
    loss: hinge

  # ...

Listing 8: YAML training session parametrization


Spring 2018 Bachelor session

INGÉNIERIE DES TECHNOLOGIES DE L'INFORMATION

ORIENTATION – INFORMATIQUE MATÉRIELLE

AUTOMATIC IDENTIFICATION OF TEA LEAVES USING CNN DEEP-LEARNING NETWORKS

Description:

The highest-quality green tea is produced from young tea leaves. The harvesting process is extremely laborious, because it must be done by hand in order to respect the selection criteria.

The objective of this project is to bring a new solution to the problem of detecting and localizing young tea leaves (Camellia Sinensis) with a computer system. Currently, no system exists that is able to autonomously harvest tea leaves in order to produce quality green tea. Within the scope of this project, a software approach based on modern computer-vision algorithms, namely deep-learning CNN networks, will be explored.

The work consists of qualifying the working environment of such a machine in the field, proposing and applying a methodology and the tools to acquire the data necessary for learning, studying CNN neural-network architectures and identifying the one best suited to the problem, implementing the chosen architecture, ingesting the training data to create a model, and finally evaluating its performance.

Required work:

* Understand the problem of tea leaf harvesting and propose a software solution.

* Starting from existing research, identify the limiting aspects and the possibilities for improvement.

* Acquire the labeled data necessary for training a neural network.

* Implement, train and validate a neural network architecture.

Candidate: M. DUCOMMUN DIT-BOUDRY Antony
Supervising professor(s): UPEGUI Andrés
Field of study: ITI
In collaboration with: -
Bachelor thesis subject to a company internship agreement: no
Bachelor thesis subject to a confidentiality agreement: no


Spring 2018 Bachelor session

Summary:

Harvesting tea leaves in China is laborious work, mostly performed by hand. Indeed, green tea production requires a particular selection of young leaves, which are then dried. A question raised today in China is how to automate this task without sacrificing the quality of the final product.

The objective of this research work is to explore the suitability of a "deep-learning" artificial-intelligence algorithm to segment the regions where these leaves are located. The project is divided into two main parts and was carried out over a period of five months in Hangzhou, Zhejiang, China.

First, a methodology is developed in the field to acquire a collection of images and their respective segmentations with a view to training a neural network.

On the right is an example of the segmentation applied to the image (pink mask).

Second, a state-of-the-art study was carried out on per-pixel image segmentation by a neural network. A network architecture was then selected and evaluated using a GPU compute accelerator (Nvidia Tesla K80 and V100).

The topology of the chosen model is based on the CNN architecture called "U-Net".

The results are compared with a known, previously explored problem, namely the segmentation of flowers in their natural environment (University of Oxford dataset).

On the left is shown one learning "layer", which is connected in sequence to build a deep neural network.

Candidate: M. DUCOMMUN DIT BOUDRY Antony
Supervising professor(s): DR. UPEGUI Andres
Field of study: ITI
In collaboration with: Zhejiang University
Bachelor thesis subject to a company internship agreement: no
Bachelor thesis subject to a confidentiality agreement: no


List of Tables

4.1  Ideal and calibrated intrinsics
4.2  MVE output files
4.3  MVE view meta-data
4.4  Generated datasets
5.1  Training options
5.2  Amazon EC2 instances
5.3  Topology unet1.yaml
5.4  Topology unet2.yaml
5.5  Topology unet1b.yaml
5.6  Topology unet2b.yaml
5.7  Flower: fixed hyper-parameters
5.8  Flower: results by topology
5.9  Tea leaves: fixed hyper-parameters (1)
5.10 Tea leaves: results by activation function
5.11 Tea leaves: fixed hyper-parameters (2)
5.12 Tea leaves: results by loss function
5.13 Tea leaves: fixed hyper-parameters (3)
5.14 Tea leaves: results by kernel initializers
5.15 Tea leaves: fixed hyper-parameters (4)
5.16 Tea leaves: results by normalization strategy
5.17 Tea leaves: fixed hyper-parameters (5)
5.18 Tea leaves: results by optimizer
5.19 Tea leaves: fixed hyper-parameters (6)
5.20 Tea leaves: results by learning rate
5.21 Tea leaves: fixed hyper-parameters (7)
5.22 Tea leaves: results by topology
5.23 Tea leaves: fixed hyper-parameters (8)
5.24 Tea leaves: final results


List of Figures

1.1  Camellia Sinensis
1.2  Chinese tea products
1.3  Freshly plucked leaves
1.4  Pan-firing equipment
1.5  Longjing green tea
2.1  Typical tea plantation
2.2  Plantation in rows
2.3  Plantation in circular bushes
3.1  Mechanical harvesting machine
3.2  Mechanical harvesting machine types
4.1  Segmentation dataset
4.2  Pinhole camera model
4.3  Calibration pattern on Sony RX100-II
4.4  Stereo vision: epipolar line
4.5  3D point-cloud reconstructed with OpenMVG
4.6  3D point-cloud reconstructed with MVE
4.7  OpenGL coordinate system
4.8  Illustration of a K-d tree
4.9  3D tagging: main window
4.10 3D tagging: view rendering
4.11 3D tagging: selection rendering
4.12 3D tagging: view rendering with picture
4.13 3D tagging: mask rendering
4.14 2D tagging: main window
4.15 2D tagging: before
4.16 2D tagging: after
4.17 Dataset: high-level TensorFlow graph
4.18 Dataset: TensorFlow graph for a single variant
4.19 Image variants
4.20 Flower dataset
5.1  Illustration of DeepLearning
5.2  Illustration of a digital neuron
5.3  Artificial neuronal network
5.4  2D convolution
5.5  Dilated 2D convolution
5.6  2D convolution with stride
5.7  Transposed 2D convolution
5.8  Convolutional Neural Network
5.9  Fully Convolutional Neural Network
5.10 Inception module architecture
5.11 ResNet architecture
5.12 U-Net architecture
5.13 Inverse time learning schedule
5.14 Exponential learning schedule
5.15 Linear cosine learning schedule
5.16 Rising exponential learning schedule
5.17 Learning rate heuristic
5.18 Heaviside activation
5.19 Sigmoid activation
5.20 Hyperbolic tangent activation
5.21 ReLU activation
5.22 Leaky-ReLU activation
5.23 CReLU activation
5.24 Precision and recall
5.25 Intersection of two sets
5.26 Union of two sets
5.27 TensorFlow model graph
5.28 Simple down-scaling module
5.29 Down-scaling module with varying kernel sizes
5.30 Up-scaling module with varying kernel sizes
5.31 Unet2 topology in TensorFlow
5.32 Unet1b topology in TensorFlow
5.33 TensorBoard: flower segmentation
5.34 TensorBoard: inputs and outputs
5.35 TensorBoard: segmentation
5.36 TensorBoard: input and output distributions
5.37 TensorBoard: layer activation distributions
5.38 TensorBoard: first layer activations
5.39 TensorBoard: second layer activations
5.40 TensorBoard: third layer activations
5.41 TensorBoard: third layer activations


Bibliography

[1] Mary Lou Heiss and Robert J. Heiss. The Tea Enthusiast's Handbook: A Guide to the World's Best Teas. Ten Speed Press, 2010. ISBN 978-1580088046.

[2] Wikipedia. Tea, July 2018. URL https://en.wikipedia.org/wiki/Tea.

[3] Wikipedia. Camellia sinensis, July 2018. URL https://en.wikipedia.org/wiki/Camellia_sinensis.

[4] Wikipedia. History of tea, July 2018. URL https://en.wikipedia.org/wiki/History_of_tea.

[5] Wikipedia. Chinese tea culture, July 2018. URL https://en.wikipedia.org/wiki/Chinese_tea_culture.

[6] Iris MacFarlane and Alan MacFarlane. The Empire of Tea. The Overlook Press, 2009. ISBN 978-1590201756.

[7] Allied Market Research. Tea market size, share, trends and industry analysis, July 2018. URL https://www.alliedmarketresearch.com/tea-market.

[8] Transparency Market Research. Tea market by product, end use and forecast, July 2018. URL https://www.transparencymarketresearch.com/global-tea-market.html.

[9] Max Tillberg. Zhejiang | international tea database, July 2018. URL http://www.teadatabase.com/listing/zhejiang/.

[10] Travel China Guide. Hangzhou weather, July 2018. URL https://www.travelchinaguide.com/climate/hangzhou.htm.

[11] National Bureau of Statistics of China. China statistical yearbook 2017, July 2018. URL http://www.stats.gov.cn/tjsj/ndsj/2017/indexeh.htm.

[12] National Bureau of Statistics of China. China statistical yearbook 2003, July 2018. URL http://www.stats.gov.cn/english/statisticaldata/yearlydata/yarbook2003_e.pdf.

[13] Adam N. Mayer. China regional urbanization trends: 2014 edition, July 2018. URL http://www.chinaurbandevelopment.com/regional-urbanization-trends-2014-edition/.


[14] Ministry of Education of the People's Republic of China. Gross enrolment rate of education by level, July 2018. URL http://www.moe.gov.cn/s78/A03/moe_560/jytjsj_2014/2014_qg/201509/t20150901_204903.html.

[15] Jiajia Wei. Researches on high-quality tea flushes identification for mechanical-plucking. 2012.

[16] Yu Han, Hongru Xiao, Guangming Qin, Zhiyu Song, Wenqin Ding, and Song Mei. Developing situations of tea plucking machine. 2014.

[17] Jun Chen, Yong Chen, Xiaojun Jin, Jun Che, Feng Gao, and Nan Li. Research on a parallel robot for green tea flushes plucking. 2015.

[18] A. Sureshkumar and S. Muruganand. Design and development of selective tea leaf plucking robot. 2014.

[19] Muhd Safarudin Chek Mat, Jezan Md Diah, Mokhtar Azizi Mohd Din, and Abd. Manan Samad. Data acquisition and representation of leaves using digital close range photogrammetry for species identification. 2014.

[20] W. Zhang, H. Wang, G. Zhou, and G. Yan. Corn 3d reconstruction with photogrammetry. 2008.

[21] M. A. Aguilar, J.L. Pozo, F.J. Aguilar, J. Sanchez-Hermosilla, F.C. Páez, and J. Negreiros. 3d surface modelling of tomato plants using close-range. 2008.

[22] Dionisio Andújar, Mikel Calle, and José Dorado. Three-dimensional modeling of weed plants using low-cost photogrammetry. 2018.

[23] Franck Golbach, Gert Kootstra, Sanja Damjanovic, Gerwoud Otten, and Rick van de Zedde. Validation of plant part measurements using a 3d reconstruction method suitable for high-throughput seedling phenotyping. 2015.

[24] Shubhra Aich and Ian Stavness. Leaf counting with deep convolutional and deconvolutional networks. 2017.

[25] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004. ISBN 978-0521540513.

[26] OpenCV. Camera calibration and 3D reconstruction, July 2018. URL https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html.

[27] OpenCV. Open source computer vision library, July 2018. URL https://opencv.org.

[28] MRPT. Camera-calib, July 2018. URL https://www.mrpt.org/list-of-mrpt-apps/application_camera-calib/.

[29] Wikipedia. Triangulation, July 2018. URL https://en.wikipedia.org/wiki/Triangulation_(computer_vision).


[30] David G. Lowe. Object recognition from local scale-invariant features. 1999.

[31] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded up robust features. 2008.

[32] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: an efficient alternative to SIFT or SURF. 2011.

[33] Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon. Bundle adjustment - a modern synthesis. 1999.

[34] Wikipedia. Bundle adjustment, July 2018. URL https://en.wikipedia.org/wiki/Bundle_adjustment.

[35] Valve Software. Advanced outdoors photogrammetry, July 2018. URL https://developer.valvesoftware.com/wiki/Destinations/Advanced_Outdoors_Photogrammetry.

[36] OpenMVG. OpenMVG documentation, July 2018. URL https://openmvg.readthedocs.io/en/latest/.

[37] EDF. CloudCompare, July 2018. URL https://www.cloudcompare.org.

[38] ISTI-CNR. MeshLab, July 2018. URL http://www.meshlab.net.

[39] Simon Fuhrmann, Fabian Langguth, and Michael Goesele. MVE - a multi-view reconstruction environment. 2014.

[40] TU Darmstadt. Multi-view environment, July 2018. URL https://www.gcc.tu-darmstadt.de/home/proj/mve/.

[41] Samir Aroudj, Patrick Seemann, Fabian Langguth, Stefan Guthe, and Michael Goesele. Visibility-consistent thin surface reconstruction using multi-scale kernels. 2017.

[42] Khronos. OpenGL gluPerspective, July 2018. URL https://www.khronos.org/registry/OpenGL-Refpages/gl2.1/xhtml/gluPerspective.xml.

[43] Khronos. OpenGL gluProject, July 2018. URL https://www.khronos.org/registry/OpenGL-Refpages/gl2.1/xhtml/gluProject.xml.

[44] Khronos. OpenGL gluUnProject4, July 2018. URL https://www.khronos.org/registry/OpenGL-Refpages/gl2.1/xhtml/gluUnProject4.xml.

[45] Google. TensorFlow, July 2018. URL https://www.tensorflow.org.

[46] Maria-Elena Nilsback and Andrew Zisserman. 102 category flower dataset, July 2018. URL http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html.

[47] Michael Nielsen. Neural networks and deep learning, July 2018. URL http://neuralnetworksanddeeplearning.com.


[48] Frank Rosenblatt. The perceptron - a perceiving and recognizing automaton. 1957.

[49] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. 1986.

[50] Yann LeCun et al. Backpropagation applied to handwritten zip code recognition. 1989.

[51] LISA lab. Convolution arithmetic tutorial, July 2018. URL http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html.

[52] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. 2014.

[53] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. 2015.

[54] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2015.

[55] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. 2015.

[56] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. 2018.

[57] Atiqur Rahman and Yang Wang. Optimizing intersection-over-union in deep neural networks for image segmentation. 2016.

[58] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. 1964.

[59] Yurii Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). 1983.

[60] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. 2010.

[61] Tijmen Tieleman and Geoffrey Hinton. Coursera slides on neural networks. 2012.

[62] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 2014.

[63] Timothy Dozat. Incorporating nesterov momentum into adam. 2015.

[64] Richard H.R. Hahnloser, Rahul Sarpeshkar, Misha A. Mahowald, Rodney J. Douglas, and H. Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. 2000.


[65] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. 2011.

[66] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. 2016.

[67] Yann LeCun, Leon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. 1998.

[68] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. 2015.

[69] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. 2010.

[70] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. 2013.

[71] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. 2017.

[72] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.

[73] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. 2012.

[74] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. 2014.

[75] NumPy developers. NumPy, July 2018. URL http://www.numpy.org.

[76] Aurélien Géron. Hands-On Machine Learning with Scikit-Learn & TensorFlow. O'Reilly, 2017. ISBN 978-1491962299.

[77] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. 2018.
