
SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN DESIGN CONCEPTS

by

Tuo Sun

B.Arch - South China University of Technology, China (2014)
M.Arch - University of California, Los Angeles (2015)

Submitted to the Department of Architecture and the

Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degrees of

Master of Science in Architecture Studies

and Master of Science in Electrical Engineering and Computer Science

at the Massachusetts Institute of Technology

February 2020

© 2020 Tuo Sun. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Signature of Author: ________________________________________________________________________
Department of Architecture
Department of Electrical Engineering and Computer Science
January 16, 2020

Certified by: ______________________________________________________________________________ Takehiko Nagakura

Associate Professor of Design and Computation Thesis Supervisor

Certified by: ______________________________________________________________________________ Terry Knight

William and Emma Rogers Professor of Design and Computation

Thesis Supervisor

Accepted by: ______________________________________________________________________________ Leslie K. Norford

Professor of Building Technology, Chair of the Department Committee on Graduate Students

Accepted by: ______________________________________________________________________________ Leslie A. Kolodziejski

Professor of Electrical Engineering and Computer Science, Chair of the Department Committee on Graduate Students


_____________________________________________________________________________________

Takehiko Nagakura Associate Professor of Design and Computation

Thesis Supervisor

_____________________________________________________________________________________ Terry Knight

William and Emma Rogers Professor of Design and Computation

Thesis Supervisor

_____________________________________________________________________________________ Justin Solomon

X-Window Consortium Career Development Assistant Professor of Electrical Engineering and Computer Science

Thesis Reader


SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN DESIGN CONCEPTS

by

Tuo Sun

Submitted to the Department of Architecture and the Department of Electrical Engineering and Computer Science

on January 16, 2020, in partial fulfillment of the requirements for the degrees of

Master of Science in Architecture Studies and Master of Science in Electrical Engineering and Computer Science

ABSTRACT

In the decision-making process of urban development projects, decision-makers and urban designers work collectively: a) decision-makers make decisions about urban development based on the evaluation of urban morphology; b) urban designers visualize the decisions given by decision-makers as 3D urban morphology and produce development proposals after a certain number of iterations. A proposal involves designing 3D urban morphology, i.e., the collection of building typologies (at the parcel level) on a specific site. Due to the high cost of manually visualizing massive building geometries, the current decision-making workflow does not allow adequate iteration before a proposal is implemented. To reduce designers' manual modeling work, rule-based approaches (like ESRI's CityEngine) generate 3D urban morphology from spatial geometries via rules. However, the limitations of creating rules are the bottleneck that keeps rule-based approaches from spreading in professional practice.

This research explores machine learning pipelines that synthesize novel 3D morphology from urban design precedents intuitively, solving the above bottleneck. The resulting pipeline learns spatial data and 2D rendering images in two major parts: 1) extracting 2D building typology images from an aerial rendering image of urban morphology, and 2) predicting spatial building data from an extracted image and a spatial parcel geometry. This pipeline streamlines the process of creating rules, allowing urban designers to create visualizations and decision-makers to evaluate urban development intuitively.

Thesis Supervisor: Takehiko Nagakura
Title: Associate Professor of Design and Computation

Thesis Supervisor: Terry Knight
Title: William and Emma Rogers Professor of Design and Computation


ACKNOWLEDGMENTS

The foundational motivation of this study comes from my long-term study and experience with rule-based generative design (e.g., CityEngine, Grasshopper) at the Now Institute, the Southern California Association of Governments (SCAG), the China Academy of Urban Planning and Design (CAUPD), and MIT. Special thanks to my committee, who brought great help and patience to this study: Professor Takehiko Nagakura, Professor Terry Knight, and Professor Justin Solomon, plus my friends Wenzhe Peng and Yuehan Wang. Thanks also for the general computational approaches provided by 3D-R2N2, Pixel2Mesh, and Mask R-CNN; this study implements their theories and extracts feasible methods for urban design machine learning tasks. The idea of how urban morphology can be defined computationally was inspired by precious conversations with specialists in various fields, such as urban design, urban planning, computational design, and computer graphics.


TABLE OF CONTENTS

1 CONTEXT
  1.1 Background
  1.2 Problems
  1.3 Possible approaches to improve

2 HYPOTHESIS

3 PRECEDENTS
  3.1 Collective decision-making tools
  3.2 Generative design tools
  3.3 Related machine learning works

4 SOLUTION AND METHODOLOGY
  4.1 Solution
  4.2 Dataset
  4.3 Extracting building typologies
  4.4 3D voxel/mesh reconstruction
  4.5 Spatial data reconstruction

5 RESULT AND EVALUATION
  5.1 Extracting building typologies
  5.2 3D reconstruction results
  5.3 Other evaluations

6 DELIVERABLE AND CONTRIBUTION
  6.1 Deliverable
  6.2 Contribution
  6.3 Future Works

7 APPENDICES

8 BIBLIOGRAPHY


1 CONTEXT

1.1 Background

City planning committees, which include elected officials (or their designees), specialists, and resident representatives, evaluate urban design proposals (made by urban designers) for development projects in specific locations. An urban design proposal involves a 3D urban morphology (the combination of 3D building envelopes on blocks) to visualize spatial demographic data (e.g., residents, jobs, transportation) and building typology (on parcels). A qualified urban proposal requires the multiple stakeholders on the city planning committee to negotiate their individual ideas, such as population strategy, landfill context, and building design. Urban designers collect and translate these ideas into a 3D urban morphology and visualize them.

Building a 3D urban morphology traditionally is time-consuming for urban designers due to the heavy manual modeling workload of massive building geometries. Therefore, it is hard to execute enough rounds of iteration before a proposal truly meets all the complicated requirements of the city planning committee. To avoid the manual work of deploying each building typology to every parcel of a site, some rule-based approaches (such as ESRI's CityEngine [1]) generate 3D urban morphology from a 2D site via rules. A generative rule is programmed with built-in functions, translating design language into a script involving parameters of building and zoning code. When the rule performs well, a rule-based approach can live up to its advertising. However, making rule creation intuitive for designers who do not have coding skills remains the bottleneck of rule-based systems, with the result that designers in the real world still avoid using them to make 3D urban morphology.

fig1 Some built-in generative functions provided by ESRI CityEngine


1.2 The problems of rule-based systems

This study aims to improve the rule-creating process with respect to four of its limitations: creativity, data-driven capacity, first-time time-consumption, and coding skill requirements.

1.2.1 Creativity

Although building masses can be generated by rules, they are hard to call "creative" because of the limited built-in functions (fig1) provided by rule-based systems (e.g., setback, subdivision, L/O/U-shape). It is difficult to simulate ideal urban morphologies made by master urban designers, as well as to make detailed adjustments to common ones (fig2). Both decision-makers and urban designers tend to avoid producing homogeneous morphologies in their designs, making them resistant to rule-based systems. An acceptable approach should be able to produce novel urban morphologies; creativity is important for market applications.

fig2 Urban morphologies in the existing urban context: ideal urban design, common urban design, and generative design

1.2.2 Data-driven capacity

Apart from making novel forms, rule-based systems cannot utilize all the data produced by urban studies, such as traffic congestion, job-resident balance, and population growth. Utilizing spatial data within an urban design proposal is called "GeoDesign," offered by ESRI, the same software producer as CityEngine. Ideally, this new decision-making workflow allows multiple users (who can be either decision-makers or urban designers) to contribute their data into a weighted overlapping system for visualization. It achieves only limited effects because of the complicated relationship between building typology and data. For instance (fig3), a tower building with a large ground square shares the same Floor Area Ratio (FAR) with a low warehouse occupying most of the parcel. A machine learning-empowered approach can play a part here, matching the


empirical outcome of building typology instead of organizing the role and factor of each data source in an ambiguous formula.

fig3 Building typologies with similar FARs

1.2.3 First-time time-consumption

As part of the discussion above, rule-based approaches require a large decision tree to simulate the shape grammar of a building typology. The first time designers use a rule-based approach in a proposal, its cost is always larger than that of manual modeling, because they need an excessive amount of time to construct project-specific rules. Urban designers are therefore unwilling to embrace rule-based approaches as part of their design process before they can enjoy the time saved by applying a rule recurrently.

1.2.4 Coding skill requirement

Creating rules for building typologies and spatial data requires an understanding of both programming and design. Since only a few urban designers have such a combination of skills, this requirement impedes broader application. Also, as the clients and end-users of urban development, planning committee members can contribute many considerations that are ignored by urban designers. A user-friendly approach that creates rules with low skill requirements and real-time visualization would bring opportunities to larger groups of participants and facilitate decision-making iterations.

1.3 Possible approaches to improve

To solve the bottleneck of creating rules, we need help from machine learning techniques, similar to how the bottleneck of the expert system (rule-based system) was broken through in the history of artificial intelligence. In recent years, 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies, replacing some algorithms in classic computational theories. It runs by minimizing the loss between 3D objects predicted from 2D images and their ground-truth 3D models. The multiple ways to realize 3D reconstruction from single images in the formats of voxel [2,3,4], point cloud [5,6], and mesh [6,7] inspired me to explore replacing a) creating rules with built-in functions by b) creating rules from image references, which is much more intuitive.


This thesis explores possible approaches to creating "rules" for generating building typologies (fig4). The first two approaches (voxel/mesh) derive from the general 3D reconstruction pipeline above and are introduced briefly as references. The last, spatial data approach is the final solution of this study.

fig4 Three types of 3D output

1.3.1 3D voxel/mesh reconstruction from a single 2D image

The first two approaches develop methodologies based on general 3D voxel/mesh reconstruction pipelines. The first utilizes a multi-view LSTM (Long-Short Term Memory) neural network (the 3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images. By training on 3D building models from GIS datasets and their 2D renderings, it can predict a 3D voxel model of urban morphology from a new aerial rendering image. However, the voxel outcome is not easily converted into clean building geometry or used in urban design modeling software. Due to the high computational cost of large 3D voxel data, improving its performance is very challenging.

To distribute the computational workload, the second approach proposes two machine learning neural networks: 1) one to translate an aerial rendering image of building typology to a 2D location map, and 2) one to produce a 3D mesh reconstruction from the rendering image and the 2D location map (the Pixel2Mesh [7] implementation). This multi-task approach offers urban designers flexible options for which parts of the manual workflow they want to replace. The outcome is a closed mesh object for every building geometry, although it is not simplified enough to be used directly in urban design modeling software.


1.3.2 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines, our final approach learns and produces spatial geometry data (GeoJSON-like) only for urban design usage. Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images toward the ground-truth label of spatial building data. Crucially, this spatial geometry construction is compatible with general GIS platforms. The optimization happens at the level of spatial geometry (2D with properties), bringing a much lighter computational cost. When 3D urban morphology can be constructed from image references while preserving the essential information (e.g., location, size, and building alignment), a "rule" for generating building envelopes from a parcel can be made intuitively by either decision-makers or urban designers, free of the limitations of current rule-based approaches (fig6).

fig6 rule-based pipeline vs ours for generating building typology from a single parcel


2 HYPOTHESIS

By creating rules for synthesizing 3D building typologies via a machine learning pipeline, 3D urban morphology can be generated from rules intuitively, facilitating the decision-making process in urban development. Key terms in this context:

Block: land separated by streets, or a collection of parcels owned by different landlords.

3D building typologies: 3D building envelopes on a parcel, visualizing size, alignment, and style.

Machine learning pipeline: an AI-empowered approach in the computer vision and computer graphics realm. There are also non-machine-learning computer graphics algorithms for data processing and feature extraction. These techniques are used for matching the features of a building parcel to those of image references.

Synthesizing: using linear interpolation in latent space to create novel outputs.

3D urban morphology: a collection of 3D building typologies. It represents the relationships (e.g., distances and other combinatorial rules) among buildings on blocks.

Rules: the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and the parameters of building/zoning code; the core algorithm in rule-based systems.

Decision-making process: a) urban designers use 3D urban morphology to visualize ideas from decision-makers; b) decision-makers make decisions about urban development based on the evaluation (e.g., FAR, ROI, transportation) of urban morphology. Decision-makers usually include elected officials (or their designees), specialists, and resident representatives.


3 PRECEDENTS

The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools, generative design, computer graphics, and machine learning.

3.1 Collective decision-making tools for urban development

3.1.1 CityMatrix

Since an urban design proposal always gathers a number of decision-makers to negotiate their ideas together, creating a collective workspace for decision-making has become the target of related studies. Yan Zhang [9], in his "CityMatrix," provided a collective decision-making platform using Lego toys as a tangible interface, where Lego blocks represent buildings. "CityMatrix" augmented the Lego interface via machine learning computation, delivering instant feedback about the socio-economic impacts of each change made by decision-makers. The Lego interface is friendly to users, especially the public without experience of working with professional CAD or GIS tools. However, since streets are not always regular, the modules and grids are only feasible in a small number of cases. The pick-and-drop process of the Lego interface also slows down the speed of decision making: a scenario of 16 blocks needs 40~60 minutes to be built. The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development.

3.1.2 Geoplanner for ArcGIS

Another participatory design attempt derived from a Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10]. Taking advantage of open-source spatial data, researchers can now analyze and visualize buildings and streets. In 2014, ESRI announced the product line of "GeoDesign," a new workflow allowing decision-makers to collaborate in urban development similar to using Google Docs (an online shared text editor). Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios. By integrating features from desktop GIS software, Geoplanner is a potential solution for making decisions about urban development on a one-stop platform. It proposes a transparent decision-making process absorbing massive inputs of data and ideas. The weighted system also inspires our idea of synthesizing novel outcomes from multiple design ideas.


3.2 Generative design tools

3.2.1 CityEngine

In 2008, ESRI purchased a computational design lab at ETH and polished their study into a generative urban design software platform called CityEngine. The software has geometric algorithms that migrated from ArcGIS and ETH computational studies. Its core is a shape grammar language called Computer Generated Architecture (CGA). CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry; it can also take user input as non-spatial parameters. This solution is broadly applied as procedural modeling in the animation industry (e.g., Big Hero 6, Zootopia) and in 3D video games (e.g., the Assassin's Creed series). As stated before, CityEngine can quickly update the 3D model via parameter manipulations. However, the problems of creativity, data-driven capacity, first-time time-consumption, and skill requirements still impede its application in the real-world decision-making process.

3.3 Related machine learning works

3.3.1 3D voxel reconstruction

To reconstruct 3D shapes from 2D pixels, 3D voxel data is one option for the data transition due to its compatibility with image algorithms. Jiajun Wu and his colleagues at the MIT Computer Vision Group contributed a series of studies on 3D reconstruction from 2D images. They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset. Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network, creating novel 3D outputs. 3D voxel reconstruction showed this study the possibility of 3D reconstruction: as they presented, a voxel model can transform smoothly into another by changing features in latent space linearly and gradually, which can create various novel outputs for design purposes. However, the output voxel format is not compatible with most design modeling software.

3.3.2 3D mesh reconstruction

In mesh reconstruction pipelines, the core idea is stretching (deforming) a basic sphere's control vertices and matching the stretched geometry with its ground-truth geometry by evaluating the chamfer distance [7] or the pixel differences of virtual renderings [8]. To produce a more precise outcome, additional loss functions such as facial normal and edge length are also used in Pixel2Mesh.


Mesh reconstruction outputs a feasible model for design modeling software. However, basic 3D mesh reconstruction can only reconstruct a single object per task, which is not compatible with our urban morphology case: there is usually more than one building on a parcel. Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations, as our data structure is also vectorized.

3.3.3 Object detection

To slice a multi-object task into single-object tasks, Mask R-CNN [14] is a potent way, extending Faster R-CNN [15] by adding a branch for predicting an object mask alongside the bounding box recognition. The segmentation branch convolutes the image to predict Regions of Interest (RoI) for classification and bounding box regression. Compared to object detection methods like DenseNet [16] or YOLO [17], Mask R-CNN keeps a balance between accuracy and prediction speed. Mask R-CNN has become popular in computational urban planning studies in recent years. Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting sports fields from satellite imagery. Meanwhile, the AI research group at ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19], providing more delicate visualization in 3D scenes.

3.3.4 3D model dataset

As Wang et al. [20] demonstrated, researchers can easily gather models from the Trimble 3D model warehouse, which is open-source and offers massive numbers of shapes for free download, especially famous single buildings. They generated 2D images taken by a surrounding virtual camera from different angles as the training set for 3D recognition or reconstruction studies. In this thesis, a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset. Wang's methodology of making the dataset inspired many later reconstruction studies, including this one.


4 SOLUTION AND METHODOLOGY

4.1 Solution

To establish an intuitive methodology for creating "rules" of building typologies, we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules.

In a typical workflow, designers first receive a project description with the requirements from clients/users and the constraints of a site (spatial boundary). Designers then usually start by searching for urban morphology or building typology precedents from their experience or from image references on search engines like Pinterest and Archdaily. Clients/users will also provide their favorite image references, ensuring that urban designers understand their preferences and requirements. Urban designers need to extract building typologies from the references and draw diagrams as the prototypes of building typologies. Afterward, they adjust these prototypes to the different parcels of their site in 3D modeling software (e.g., Rhinoceros, SketchUp), based on the parcel shapes and street orientations. Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code. After this work, urban designers assemble all adjusted building typologies on the site as a comprehensive 3D urban morphology for renderings and presentations. Once approved by decision-makers, this 3D urban morphology is archived as a geospatial dataset in GIS platforms for further urban data management.

As addressed in the introduction, in current rule-based approaches (e.g., CityEngine) urban designers can create rules that generate building typologies from parcels, avoiding the manual work of drawing 3D geometries in 3D modeling software. Urban designers need to translate a building typology into a rule, which is extracted from references. From a technical perspective, urban designers write the code of a big decision tree, organizing the corresponding built-in functions and parameters from the properties of geospatial data. The properties usually include area, perimeter, land-use, height limit, Floor Area Ratio, greenspace coverage, building coverage, etc. Therefore, urban designers have to consider many cases to adjust their building typologies to the parcels of their site. The efficiency promised by the rule-based system is thus harmed by the process of translating image references into rules. That is, rule-based systems require decision trees to link a) the features of image references to b) the features of the corresponding 3D building shapes.


| | Find building typologies from reference images | Translate design language to scripts | Adjust to a site | Output data as | Update (after collecting decision-makers' comments) |
|---|---|---|---|---|---|
| manual | Find by experience | Analyze the building typology | Draw building footprints and extrude them | Static 3D mesh/NURBS models; static 2D drawings | Draw again from building footprints |
| rule-based | Find by experience | Create a rule via built-in functions | Apply a rule onto parcel geometries | Generated 3D mesh models; geospatial data; rule script | Change parameters or apply onto new parcel geometries |
| spatial data reconstruction (ours) | Extract by computer | Use a pre-trained model | Predict spatial buildings for parcel geometries | Geospatial data; trained model | Modify or predict again from new parcel geometries |

tab1 the mechanisms of the manual, rule-based, and our approach

In contrast (tab1), this thesis study explores improved replacements by a) utilizing computational algorithms to extract features from reference images, parcel geometries, and building geometries, and b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees. After creating rules intuitively, designers can apply the rules to their site, enjoying the same advantages as rule-based systems in the following stages: achieving output data as geospatial data and updating geometries as groups.

4.2 Dataset

To allow a computer to learn 3D building typologies, a collection of 3D building models is necessary. Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2). Two raw datasets are required in the series of machine learning pipelines: a) parcel geometry with planning properties (e.g., land-use code, height limits) and b) building footprint geometry with height information. They are augmented by scripts and prepared for the different machine learning pipelines (see the corresponding sections).


| city | source (parcel/building) | parcel count | parcel properties | building count | building properties |
|---|---|---|---|---|---|
| Los Angeles | SCAG_county_zoning / Lariac 2008 building footprint | 2,376,370 | land-use zoning, height limit | 3,141,244 | height, elevation |

tab2 The list of raw data

4.3 Extracting building typologies

Because our resulting "rule" is intended to generate building typologies from parcel geometries, the 3D reconstruction pipelines should process data at the parcel level. Hence, this first network performs before 3D reconstruction, extracting 2D aerial images of building typologies from input images of urban morphology (at the block level).

The training/validating data is augmented from the raw spatial dataset. Given the 2D geometries of building footprints with height information, we extrude them via a Blender script and obtain a 3D model file (obj) of the 3D building envelopes on each block. The input images are RGB images taken by 24 virtual cameras surrounding each 3D model, also via a Blender script. These images are converted to binary images by masking each parcel of the block, and the multiple binary images are stacked as a multi-channel image for each view. Finally, a block model generates a) input data: 24 views of aerial images, and b) ground-truth labels: 24 multi-channel mask images, where N is the number of parcels in the block (formula1). Random center cropping and random horizontal flipping are applied during data loading to avoid overfitting. The augmented dataset is split into training and validating data at a ratio of 0.8.

$$Y = [\,m_1, m_2, \dots, m_N\,] \in \{0,1\}^{h \times w \times N}$$

formula1 The ground-truth label as a multi-channel image, where each $m_i$ is the binary mask of one parcel and N is the number of parcels in a block
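As a minimal sketch of how such a label could be assembled (the 256 x 256 resolution and the array names here are assumptions for illustration, not values stated in the thesis):

```python
import numpy as np

def stack_parcel_masks(parcel_masks):
    """Stack N binary parcel masks (H x W each) into one multi-channel
    ground-truth label (H x W x N), in the sense of formula1."""
    return np.stack(parcel_masks, axis=-1).astype(np.float32)

# Example: a block with 3 parcels rendered at 256 x 256.
masks = [np.zeros((256, 256), dtype=np.uint8) for _ in range(3)]
masks[0][40:120, 60:200] = 1  # footprint region of parcel 1
label = stack_parcel_masks(masks)
print(label.shape)  # (256, 256, 3)
```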

After loading the dataset, a Mask R-CNN network uses ResNet-101 as the base model, predicting masks as output from the input images (fig7). This part of the pipeline is an implementation of He's Mask R-CNN network [14], including the Region Proposal Network (RPN) and the prediction heads (classification, box, and binary mask prediction). For each Region of Interest (RoI), the loss function combines the classification, bounding-box, and mask losses. The extracted parcel rendering images output here serve as part of the input to the following 3D reconstruction approaches.


fig7 The pipeline of Mask R-CNN [14]
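A hedged sketch of setting up such a network with torchvision's off-the-shelf Mask R-CNN; note that it uses a ResNet-50 FPN backbone, whereas the thesis uses ResNet-101, and the two-class setup (background + building typology) is an assumption:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_extractor(num_classes=2):  # background + building typology
    # Mask R-CNN; pass weights="DEFAULT" instead of None for COCO weights.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(
        weights=None, weights_backbone=None)
    # Replace the box and mask heads for our class count.
    in_feat = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_feat, num_classes)
    in_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_mask, 256, num_classes)
    return model

model = build_extractor()
model.eval()
with torch.no_grad():
    pred = model([torch.rand(3, 256, 256)])  # one RGB aerial rendering
print(pred[0].keys())  # dict_keys(['boxes', 'labels', 'scores', 'masks'])
```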

4.4 3D reconstruction

In this thesis study, three reconstruction approaches were attempted on different types of inputs and ground-truth labels. The first two approaches are developed on general 3D reconstruction and serve as references. Comparisons of the approaches (tab3) and of 3D model formats (tab4) are shown below. While voxel and mesh models are popular in general machine learning studies, our final solution is built only for urban planning cases, absorbing the corresponding techniques from the first two approaches. Its GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms.

| Approach | Input | Ground truth label | Software platform for data processing |
|---|---|---|---|
| 1. 3D voxel reconstruction | Aerial image of building typologies (png) | 3D building model (voxel, mat) | ArcGIS Pro, CityEngine, Binvox, PyTorch |
| 2. 3D mesh reconstruction | Aerial image of building typologies (png); 2D bitmap (png) | 3D building model (point cloud, xyz) | ArcGIS Pro, Blender, Tensorflow (Keras) |
| 3. Geospatial data prediction | Aerial image of building typologies (png); parcel data (csv) | Building data (csv) | QGIS, Blender, Tensorflow (Keras) |

tab3 data structure of three approaches


| | Voxel | Point cloud | Mesh | NURBS | GeoJSON-like |
|---|---|---|---|---|---|
| Data loading | N x N x N x 1 (mass/void) | N x 3 (x, y, z) | N x (v1, v2, v3) | degree, control pts, weights, params | 2D geometry (long/lat, N x 2) + property (with height or more info) |
| Reconstruction | from 2D pixel | from 2D pixel | deform from a sphere/cube | translate from mesh; detect shape grammar | deform from 2D geometries |
| Evaluation | logical is-or-not; Intersection over Union (IoU) | Chamfer distance (CD); Earth Mover's Distance (EMD) | CD; silhouette rendering; logical is-or-not | CD; silhouette rendering; logical is-or-not | CD; EMD |
| Software | Minecraft… | 3D scanning… | SketchUp, GIS… | Rhino… | GIS |
| ML project | 3D-R2N2, MarrNet… | PointNet, PointNet++… | Pixel2Mesh, Neural Renderer | | |

tab4 the comparison of 3D geometry formats

4.4.1 3D voxel reconstruction

The first attempt implements the 3D-R2N2 approach, which was originally built to 3D-reconstruct the ten categories of furniture in its training dataset. To synthesize 3D urban morphology with the 3D-R2N2 approach, the input and ground-truth data were modified to feed 3D building models into the network, and the hyper-parameters of the 3D-R2N2 network were adjusted for 3D building models.

4.4.1.1 Data structure of the training/validating dataset

The training/validating set follows the ShapeNet structure, containing two parts derived from the 3D models: input rendering images and ground-truth labels (fig8).


fig8 examples of training/test dataset

The input aerial images are taken by 24 virtual cameras, the same ones used in the network for extracting building typologies. An ArcGIS Python script exports each block with its buildings to an ESRI Shapefile (shp); a CityEngine Python script then exports every block with 3D buildings to a mesh model (obj). Each ground-truth label is stored as voxel data, converted from a mesh model via Binvox into an nd-array. The entire dataset includes 177 models and 4240 rendered images, separated into training and validating datasets at a ratio of 0.8.

As a multi-view 3D reconstruction, the 3D-LSTM (Long-Short Term Memory) treats the multi-view images as time-sequential data, leading to a 5D input [view_id, batch_id, channel, width, height]. Inputs for each model are randomly picked from the 24 images, a random number of times (in a range from 1 to 5). The picked images are randomly center-cropped and randomly horizontally flipped to avoid overfitting during training. The labels, as voxel data, are constructed with five dimensions [batch_id, channel of masks, x-axis, y-axis, z-axis]; the channels represent original or masked objects (entity true or false).
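A small sketch of how the 5D multi-view input could be assembled (PyTorch; the 127 x 127 image size and the helper names are illustrative assumptions):

```python
import random
import torch

def multiview_batch(block_renderings, n_views):
    """Build the 5D input [view_id, batch_id, channel, width, height]
    fed to the 3D-LSTM: for each block, pick n_views of its 24 aerial
    renderings at random (1 <= n_views <= 5)."""
    picked = [torch.stack(random.sample(views, n_views))   # (V, C, H, W)
              for views in block_renderings]
    return torch.stack(picked, dim=1)                      # (V, B, C, H, W)

# Two blocks, 24 renderings each.
blocks = [[torch.rand(3, 127, 127) for _ in range(24)] for _ in range(2)]
x = multiview_batch(blocks, n_views=random.randint(1, 5))
print(x.shape)  # e.g. torch.Size([3, 2, 3, 127, 127])
```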

4.4.1.2 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D convolutional neural network (encoder), a 3D convolutional LSTM, and a 3D deconvolutional neural network (3D-DCNN). Given the encoded input, the set of proposed 3D convolutional LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate. At time step t, when a new input is received, the operation of an LSTM unit can be expressed as in formula2, where $i_t$, $o_t$, and $f_t$ refer to the input, output, and forget gates respectively, and $s_t$ and $h_t$ refer to the memory cell and the hidden state respectively.

$$f_t = \sigma\big(W_f\,\mathcal{T}(x_t) + U_f * h_{t-1} + b_f\big)$$
$$i_t = \sigma\big(W_i\,\mathcal{T}(x_t) + U_i * h_{t-1} + b_i\big)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh\big(W_s\,\mathcal{T}(x_t) + U_s * h_{t-1} + b_s\big)$$
$$h_t = \tanh(s_t)$$

formula2 3D-LSTM kernel: forget and update gates [2]
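For illustration, a minimal 3D convolutional LSTM cell in the spirit of formula2, following the 3D-R2N2 formulation (which folds the update into the forget/input gates); the 4^3 hidden grid and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class ConvLSTM3DCell(nn.Module):
    """Sketch of a 3D-LSTM unit: gates come from the encoded image
    feature x_t (mapped per cell) plus a 3D convolution over h_{t-1}."""
    def __init__(self, feat_dim, hidden, grid=4):
        super().__init__()
        self.grid, self.hidden = grid, hidden
        self.w = nn.Linear(feat_dim, 3 * hidden * grid ** 3)  # W * T(x_t)
        self.u = nn.Conv3d(hidden, 3 * hidden, 3, padding=1)  # U * h_{t-1}

    def forward(self, x, h, s):
        gates = (self.w(x).view(-1, 3 * self.hidden, *(self.grid,) * 3)
                 + self.u(h))
        f, i, g = gates.chunk(3, dim=1)
        s = torch.sigmoid(f) * s + torch.sigmoid(i) * torch.tanh(g)
        return torch.tanh(s), s   # h_t = tanh(s_t)

cell = ConvLSTM3DCell(feat_dim=1024, hidden=128, grid=4)
h = s = torch.zeros(2, 128, 4, 4, 4)
for x in torch.rand(5, 2, 1024):  # 5 views, batch of 2
    h, s = cell(x, h, s)
print(h.shape)  # torch.Size([2, 128, 4, 4, 4])
```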

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid; the prediction is the probability of the existence of each voxel cell, computed with a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

4.4.2 3D mesh reconstruction

A voxel model is difficult to clean and simplify into a feasible model in urban design modeling software. Also, the voxel model stores data inside closed objects, which is inefficient and grows increasingly large with scale. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy during training: since a building typology is influenced by properties like height limit or land-use, additional bitmaps of these properties can help our network in the training process. To distribute the computational workload, two machine learning neural networks are proposed: a) translating a 2D image of building typology to a top-view location map, and b) 3D mesh reconstruction.

4.4.2.1 Data structure of the training/validating dataset

At the parcel level, an aerial image is cropped from its parent block-level image and rescaled. Additional information like height limit and land-use is stored as gray-scale bitmaps on the 2D top view. Ground-truth labels are 3D point clouds with normals (6 dimensions in total) calculated from the mesh models (fig10).


fig10 examples of training/validating dataset

4.4.2.2 Network A: translating a 2D image of building typology to a top-view location map

Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of a target parcel. As mesh reconstruction does not preserve the location of 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by 24 surrounding cameras and cropped by the bounding box of the parcel; N, the number of views of a parcel, is chosen randomly in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space. The feature array is then merged into a column vector via Gated Recurrent Units (GRU).

$$v_{\text{mid}} = \left[\, v_{\text{img}} \;\|\; v_{\text{parcel}} \;\|\; v_{c_1} \;\|\; \cdots \;\|\; v_{c_k} \,\right]$$

formula3 The concatenated mid feature vector: the image feature, the parcel-mask feature, and the additional constraint features

Thus we get a feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula3) stores the feature of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g., the height-limit bitmap). The following parts of the network are a decoder and a generator, as in a classic GAN. In each iteration (formula4), we have the MSE loss of the 2D height map prediction; the least-squares reconstruction loss of a reconstructed rendering image (compared with all angles of renderings of this parcel, returning the minimum one); and the MSE losses of the additional constraints. The final combined loss is optimized by Adam with a learning rate of 0.001.

$$\mathcal{L} = \mathcal{L}_{\text{MSE}}\big(\hat{H}, H\big) \;+\; \min_{k}\, \mathcal{L}_{\text{LS}}\big(\hat{I}, I_k\big) \;+\; \sum_{i} \mathcal{L}_{\text{MSE}}\big(\hat{C}_i, C_i\big)$$

formula4 Combined loss: height-map MSE, least-squares rendering reconstruction (minimum over view angles k), and constraint MSEs

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typology with their ground-truth 3D models (stored as point clouds with normals) by the chamfer distance loss, the normal loss, and Laplacian regularization. In Pixel2Mesh, the chamfer loss constrains the location of mesh vertices; the normal loss enforces the consistency of surface normals; the Laplacian regularization maintains the relative locations of neighboring vertices during deformation; and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.
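A minimal sketch of the symmetric chamfer distance that this kind of pipeline evaluates (the point counts are illustrative):

```python
import torch

def chamfer_distance(p, q):
    """Symmetric chamfer distance between point sets p (N x 3) and
    q (M x 3): each point's squared distance to its nearest neighbor
    in the other set, averaged over both directions."""
    d = torch.cdist(p, q) ** 2          # (N, M) pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

pred = torch.rand(2466, 3)   # deformed mesh vertices (Pixel2Mesh's final count)
truth = torch.rand(8000, 3)  # sampled ground-truth point cloud
print(chamfer_distance(pred, truth).item())
```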

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are placed by the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too many computational resources (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding unnecessary computation on images and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of the training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential parcels with two buildings occupy the most), we scale the samples by land-use and building-number factors to 10216 samples (tab5). This prevents our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial rendering images of building typologies via their bounding boxes. Additionally, an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings, and 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information.

Spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry is reduced to 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. Each spatial parcel data input is stored as one row of a csv table.


The ground-truth label is the spatial building data corresponding to each parcel. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. Each csv table of spatial building data stores, in its properties, the number of buildings in the parcel, ten height values, and ten geometry records; the table thus includes 331 values (1 + 10 + 10 × 16 × 2).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
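A runnable sketch of what the fig14/fig15 pseudocode describes, under the assumption that interpolating/simplifying means resampling the boundary at equal arc-length intervals and smoothing means averaging each vertex with its neighbors:

```python
import numpy as np

def resample_polygon(vertices, n=16):
    """Interpolate/simplify a polygon to exactly n vertices placed at
    equal arc-length intervals along the boundary (cf. fig14)."""
    pts = np.asarray(vertices, dtype=float)
    closed = np.vstack([pts, pts[:1]])                    # close the ring
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])           # cumulative length
    samples = np.linspace(0.0, t[-1], n, endpoint=False)
    x = np.interp(samples, t, closed[:, 0])
    y = np.interp(samples, t, closed[:, 1])
    return np.stack([x, y], axis=1)

def smooth_polygon(vertices, alpha=0.5, iterations=2):
    """Laplacian-style smoothing: move each vertex toward the midpoint
    of its two neighbors (cf. fig15)."""
    pts = np.asarray(vertices, dtype=float)
    for _ in range(iterations):
        mid = 0.5 * (np.roll(pts, 1, axis=0) + np.roll(pts, -1, axis=0))
        pts = (1 - alpha) * pts + alpha * mid
    return pts

parcel = [(0, 0), (10, 0), (10, 6), (4, 6), (4, 9), (0, 9)]
print(smooth_polygon(resample_polygon(parcel)).shape)  # (16, 2)
```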


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, which can be learned by neural networks. After analyzing the dataset, we filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices), spreading the coordinate values across the interval.

Parcel data:

| Field name | | | | | |
|---|---|---|---|---|---|
| extent | -9980~9994 | -9988~9989 | 396~34567 | 83~761 | 0~100 |
| output extent | -1~1 | -1~1 | 0~1 | 0~1 | 0~1 |
| preserve | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

Building data:

| Field name | | | | |
|---|---|---|---|---|
| extent | -9928~9828 | -9934~9982 | 1~9 | 1~985 |
| output extent | -1~1 | -1~1 | 0~1 | 0~1 |
| preserve | 1.0 | 1.0 | 1.0 | 1.0 |

tab6 The conversion of the dataset (from the entire training/validating dataset)
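A sketch of the geometry normalization described above (function and variable names are illustrative):

```python
import numpy as np

def normalize_by_parcel(parcel, buildings):
    """Normalize parcel and building vertices by the parcel boundary's
    max extent, mapping coordinates into [-1, 1] around the parcel
    center, as described for tab6."""
    parcel = np.asarray(parcel, dtype=float)
    center = (parcel.max(axis=0) + parcel.min(axis=0)) / 2.0
    half_extent = (parcel.max(axis=0) - parcel.min(axis=0)).max() / 2.0
    scale = lambda g: (np.asarray(g, dtype=float) - center) / half_extent
    return scale(parcel), [scale(b) for b in buildings]

parcel = [(100.0, 50.0), (140.0, 50.0), (140.0, 90.0), (100.0, 90.0)]
house = [(110.0, 60.0), (120.0, 60.0), (120.0, 70.0), (110.0, 70.0)]
p, (b,) = normalize_by_parcel(parcel, [house])
print(p.min(), p.max())  # -1.0 1.0
```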


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). First, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the 3D-R2N2 implementation of the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (ResNet as in 3D-R2N2, or VGG16, as the base model) convolutes the input images (N views picked randomly per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into one column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Second, we convolute the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3) and dense (fully connect) them, along with a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, building heights, and building geometry as three separate outputs, with a convolution layer only in the geometry prediction.

fig16 geospatial prediction network structure
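A condensed sketch of the fig16 architecture in Keras, showing the VGG16-GRU variant; the branch widths follow the 1000/300 figures mentioned later for the latent plots, and everything else (activations, the fixed view count, the property count, the dense geometry decoder standing in for the convolutional one) is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_predictor(n_views=5, img=256, n_props=4):
    """Per-view CNN features merged by a GRU (FC1), a parcel-geometry
    branch (FC2), and a parcel-property branch (FC3), concatenated
    into a latent MID vector and decoded into three outputs."""
    imgs = layers.Input((n_views, img, img, 3))
    geom = layers.Input((16, 2))
    props = layers.Input((n_props,))

    vgg = tf.keras.applications.VGG16(include_top=False, weights=None,
                                      pooling="avg", input_shape=(img, img, 3))
    feat_seq = layers.TimeDistributed(vgg)(imgs)        # (B, V, 512): FC1_R
    fc1 = layers.GRU(1000)(feat_seq)                    # merged image feature
    fc2 = layers.Flatten()(layers.Conv1D(32, 3, activation="relu")(geom))
    fc3 = layers.Dense(64, activation="relu")(props)

    mid = layers.Concatenate()([layers.Dense(1000, activation="relu")(fc1),
                                layers.Dense(300, activation="relu")(fc2),
                                layers.Dense(300, activation="relu")(fc3)])
    n_bldg = layers.Dense(1, activation="sigmoid", name="n_buildings")(mid)
    heights = layers.Dense(10, activation="sigmoid", name="heights")(mid)
    g = layers.Dense(10 * 16 * 2, activation="tanh")(mid)  # geometry decoder
    geoms = layers.Reshape((10, 16, 2), name="geometries")(g)
    return Model([imgs, geom, props], [n_bldg, heights, geoms])

build_predictor().summary()
```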


$$\text{MID} = \left[\, \text{DC1} \;\|\; \text{DC2} \;\|\; \text{DC3} \,\right]$$

formula5 The concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings, the building heights, and the vertices of the building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel: for an instance of a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers multiple terms to let the neural network learn the geometries, including center, relative vertices, unit tangent vector, curvature, edge length, and others. They constrain different geometry characteristics (fig17):

Center: the location of building footprints
Relative vertices: the shape and size of each geometry
Unit tangent vector: the rotation angle of each geometry
Discrete curvature: the corners of each geometry
Edge length: the edge length of each geometry
Extent x and y: the horizontal and vertical scale of building footprints
Maximum discrete curvature: the maximum corner angle
Number of corners: the number of corners
Normalized coordinates: the shape and size of each geometry
Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction
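For concreteness, a sketch of how the geometric quantities behind these losses can be computed from a footprint's vertex array (a hypothetical helper, not code from the thesis):

```python
import numpy as np

def geometry_descriptors(v):
    """Derived quantities penalized by the fig17 losses, computed from a
    closed footprint polygon v (n x 2, normalized coordinates)."""
    edges = np.roll(v, -1, axis=0) - v            # consecutive edge vectors
    length = np.linalg.norm(edges, axis=1)        # edge-length term
    tangent = edges / length[:, None]             # unit tangent vectors
    nxt = np.roll(tangent, -1, axis=0)
    # Discrete curvature: signed turning angle between consecutive tangents.
    cross = tangent[:, 0] * nxt[:, 1] - tangent[:, 1] * nxt[:, 0]
    dot = (tangent * nxt).sum(axis=1)
    curvature = np.arctan2(cross, dot)
    center = v.mean(axis=0)                       # footprint location
    extent = v.max(axis=0) - v.min(axis=0)        # horizontal/vertical scale
    return center, v - center, tangent, curvature, length, extent

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
_, _, _, curv, lens, ext = geometry_descriptors(square)
print(np.degrees(curv))   # [90. 90. 90. 90.]: four right-angle corners
print(lens, ext)          # [1. 1. 1. 1.] [1. 1.]
```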


The final combined loss leans more on Loss3, as its terms are the crucial parts of learning the shape of building footprints. Since the coordinate-related losses have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for better loss organization. Thus all losses lie in the range 0 to 1, and they are balanced by the factors $\lambda_1, \dots, \lambda_7$, which consider the factors in Pixel2Mesh and are adjusted by experimental results (see the next section). In outline, with $n$ the ground-truth number of buildings, $h_i$ and $\hat{h}_i$ the ground-truth and predicted heights, and $v$ and $\hat{v}$ the ground-truth and predicted vertex arrays:

The number of buildings: $L_1 = (\hat{n} - n)^2$

Building heights (first $n$ of 10): $L_2 = \frac{1}{n}\sum_{i=1}^{n} (\hat{h}_i - h_i)^2$

Center coordinates: the L1 distance between predicted and ground-truth footprint centers, $L_3^{\text{center}} = \frac{1}{n}\sum_{i=1}^{n} \lVert \hat{c}_i - c_i \rVert_1$, where $c_i$ is the mean of the vertices of building $i$

Absolute coordinates: the L1 distance between predicted and ground-truth vertices

Relative coordinates: the L1 distance between vertices expressed relative to each geometry's center

Normalized, rolled, and batch-normalized coordinates: the same L1 distance computed after normalizing each geometry by its extent, after rolling the vertex order to the best alignment, and after batch normalization, respectively

Unit tangent vector: the L1 distance between the unit tangent vectors of corresponding edges

Discrete curvature: the L1 distance between the turning angles at corresponding vertices

Edge length: the L1 distance between corresponding edge lengths

Extent x and y: the L1 distance between the horizontal and vertical extents of the footprints

Maximum discrete curvature: the difference of the maximum corner angles

The number of corners: the difference of the corner counts

Combined loss: $L = \lambda_1 L_1 + \lambda_2 L_2 + \sum_{k} \lambda_{3k}\, L_{3k}$

formula6 Loss functions


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution for urban design usage only; hence its evaluations are tested by modifying design elements.

5.1 Extracting building typologies

The first Mask R-CNN model is trained for 80 epochs and achieves 0.11 bounding-box loss and 0.41 mask loss on the training set, and 0.15 bounding-box loss and 0.40 mask loss on the validating set (fig18). As we can see, the network performs well in bounding-box prediction but not as well in mask prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results

As shown in fig20, our final solution predicts the cleanest and most usable outputs as resulting 3D models. The voxel output is an ambiguous probability model that cannot be cleaned or simplified without algorithmic bias. Voxel models also become computationally heavier as their scale increases, much like the size comparison between rasterized and vectorized files, and voxel models are rare in urban design precisely because of their poor editability. However, we can see the image learning capacity from the different results predicted from high-rise and low-rise reference images.

Similarly to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network that is supposed to predict the location map generates similar results from various inputs; furthermore, it cannot predict the angle of the building footprint well, producing unsuccessful rules for building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they contain too many trivial mesh faces for most urban design model cases. Linking these two networks (location map prediction and 3D mesh reconstruction) cannot easily avoid the bias of the shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

fig20 Outputs from the three approaches (top to bottom: voxel, mesh, spatial data)

5.3 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined loss factors in Pixel2Mesh and our experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

$$L = 0.5\,L_1 + 0.1\,L_2 + L_{3A} + L_{3B} + L_{3C} + 10\,L_{3D} + L_{3E}$$

formula7 The factors of the loss functions (the weights shown in parentheses in tab5)


fig21 Loss plotting of the final solution

We also test the ablation effects of the various loss functions (tab5). The baseline test uses VGG16-GRU with all loss functions, training 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.

The last five columns each remove ("without the loss function of") one Loss3 term:

| loss (weight) | Baseline | 3D-R2N2-GRU | VGG16-LSTM | isolated dataset | Top-view | relative distance | unit tangent vector | curvature | center | edge length |
|---|---|---|---|---|---|---|---|---|---|---|
| combined loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416 |
| loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621 |
| loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.0660 |
| loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.0500 |
| loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.1840 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801 |
| loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636 |
| loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435 |
| loss 3E (1) | 0.0445 | 0.0450 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.0440 | 0.0424 | 0.0441 | 0.0514 |

tab5 losses of the ablation test

The isolated dataset learned with VGG16-GRU performs best among the reference groups, and the geometry losses demonstrate their effects on one another. The 2D top-view dataset is only slightly worse in heights/numbers but


better in the shape loss. In tab6 we can also see the constraints of shape, corner angle, and position under the different ablations of loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after T-SNE dimension reduction, representing learning capacities of reference images and geometries that cannot simply be determined from loss values. The isolated dataset still shows the best distribution visually.
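A sketch of how such plots are typically produced (scikit-learn and matplotlib; the feature array here is a random stand-in for the encoder outputs):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the encoded DC1/DC2 features to 2D with T-SNE, as in fig22-fig25.
features = np.random.rand(500, 1000)      # stand-in for DC1 image features
land_use = np.random.randint(0, 4, 500)   # illustrative color key
xy = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(xy[:, 0], xy[:, 1], c=land_use, s=5, cmap="tab10")
plt.title("T-SNE of encoded reference-image features (DC1)")
plt.show()
```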

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29, we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after encoding as latent feature vectors (1000 for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings, compared to the denser and taller ones in the larger parcel.
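A sketch of the linear latent-space interpolation described here (the encoded vectors are stand-ins; decoding is left as a comment):

```python
import numpy as np

def interpolate_latent(z_a, z_b, steps=5):
    """Linear interpolation between two encoded feature vectors
    (1000-d for reference images, 300-d for parcel inputs); each
    interpolant would then be run through the decoders to synthesize
    a novel prediction, as in fig26 to fig29."""
    return [(1 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, steps)]

z_one_building = np.random.rand(1000)   # stand-ins for encoded references
z_two_buildings = np.random.rand(1000)
for z in interpolate_latent(z_one_building, z_two_buildings):
    pass  # decoder(z) would predict counts/heights/geometries here
```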


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (fig30 and fig31). Using the same network and settings, we get better heights with the isolated dataset and a better size/number of buildings with the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution serves two functions that integrate rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like csv file containing the information of a 3D urban morphology. After selecting a parcel dataset, a building footprint dataset, and the alias names of the information fields (eg height, height limit, FAR), users can run a new training and validating process on their own computers. By using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry just as the shape grammar of a rule-based pipeline does.

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does
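As an illustration of this deliverable format, here is a minimal sketch (with hypothetical column names) that converts the predicted building csv into a GeoJSON FeatureCollection whose height property drives the extrusion:

```python
import csv
import json

# Hypothetical column layout: 16 footprint vertices stored as lon0/lat0 ...
# lon15/lat15, plus a height column used as the z-value for extrusion.
def row_to_feature(row):
    ring = [[float(row[f"lon{i}"]), float(row[f"lat{i}"])] for i in range(16)]
    ring.append(ring[0])  # GeoJSON polygon rings close on the first vertex
    return {"type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [ring]},
            "properties": {"height": float(row["height"])}}

with open("predicted_buildings.csv") as src:
    features = [row_to_feature(row) for row in csv.DictReader(src)]
with open("predicted_buildings.geojson", "w") as dst:
    json.dump({"type": "FeatureCollection", "features": features}, dst)
```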

6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortages of current rule-based approaches stated in the introduction: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved from the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language into scripts as building typologies. Our approach is an option that allows users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a larger number of data categories are involved together (eg FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data; it directly produces a matching rule from the given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking those properties to building typologies.

6.2.3 First-time time-consumption

As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply express their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of manual urban design or a rule-based system. It not only saves time but also makes decision-making more fluent through the instant comparison of two strategies or the accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the costs of licenses and of experimentation. Also, this study compares the differences among existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a lot of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future work.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape grammar scripts, rule-based systems can create massive training datasets in various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof type learning project [19]). After recognizing the types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than bare building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]): it can contribute to predicting the relationship between buildings and their auxiliaries, which means the possibility of predicting a detailed building output. Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) YOLO. https://pjreddie.com/darknet/yolo/
18) images-to-osm. https://github.com/jremillard/images-to-osm
19) "Reconstructing 3D buildings from aerial LiDAR with AI: details." https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., and Zhang, H. "SDM-NET: Deep Generative Network for Structured Deformable Mesh." 2019. https://arxiv.org/abs/1908.04520

55

Page 2: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

2

_____________________________________________________________________________________

Takehiko Nagakura Associate Professor of Design and Computation

Thesis Supervisor

_____________________________________________________________________________________ Terry Knight

William and Emma Rogers Professor Professor of Design and Computation

Thesis Supervisor

_____________________________________________________________________________________ Justin Solomon

X-Window Consortium Career Development Professor Assistant Professor of Electrical Engineering and Computer Science

Thesis Reader

3

4

SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN DESIGN CONCEPTS

by

Tuo Sun

Submitted to the Department of Architecture and the Department of Electrical Engineering and Computer Science

on January 16 2020 in Partial Fulfillment of the Requirements for the Degrees of

Master of Science in Architecture Studies and Master of Science in Electrical Engineering and Computer Science

ABSTRACT In the decision-making process of urban development projects decision-makers and urban designers work collectively as a) decision-makers make decisions of urban development based on the evaluation of urban morphology b) urban designers visualize design decisions given by decision-makers with 3D urban morphology and produce development proposals after certain rounds of iteration A proposal involves designing 3D urban morphology aka the collection of building typologies (parcel level) on a specific site Due to the high costs of visualizing massive building geometries manually the current decision-making workflow does not allow adequate iteration before the implementation of the proposal To reduce the cost of manual modeling work by designers rule-based approaches (like ESRIrsquos CityEngine) generate 3D urban morphology from spatial geometries via rules However the limitations of creating rules are the bottleneck of popularizing rule-based approaches in professional practice This research explores using machine learning pipelines to synthesize novel 3D morphology from urban design precedents intuitively solving the above bottleneck The resulting pipeline learns spatial data and 2D rendering images for two major parts 1) to extract 2D building typology images from an aerial rendering image of urban morphology and 2) to predict spatial building data from an extracted image and a spatial parcel geometry This pipeline promotes the process of creating rules allowing both urban designers to create visualization and decision-makers to evaluate urban development intuitively Thesis Supervisor Takehiko Nagakura Title Associate Professor of Design and Computation Thesis Supervisor Terry Knight Title William and Emma Rogers Professor Professor of Design and Computation

5

6

ACKNOWLEDGMENTS The foundational motivation of this study comes from my long-term study and experience with rule-based generative design (eg CityEngine Grasshopper) in the Now Institute Southern California Associate of Governments (SCAG) China Academy of Urban Planning and Design (CAUPD) and MIT Special thanks to my committees who bring great help and patience to this study Professor Takehiko Nagekura Professor Terry Knight and Professor Justin Solomon plus my friends Wenzhe Peng and Yuehan Wang Thanks for general computational approaches provided by 3D-R2N2 Pixel2Mesh and Mask R-CNN This study implements their theories and extracts feasible methods for urban design machine learning tasks The idea of how urban morphology can be defined computationally is inspired during precious communications with specialists in various fields such as urban design urban planning computational design and computer graphics

7

8

TABLE OF CONTENTS

1 CONTEXT 11 Background helliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(11) 12 Problemshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(12) 13 Possible approaches to improvehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(13)

2 HYPOTHESIShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(16)

3 PRECEDENTS

31 Collective decision-making toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(17) 32 Generative design toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18) 33 Related machine learning workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18)

4 SOLUTION AND METHODOLOGY

41 Solutionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(20) 42 Datasethelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(21) 43 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(22) 44 3D voxelmesh reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(23) 45 Spatial data reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(29)

5 RESULT AND EVALUATION

51 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 52 3D reconstruction resultshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 53 Other evaluationshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(40)

6 DELIVERABLE AND CONTRIBUTION

61 Deliverablehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(49) 62 Contributionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(50) 63 Future Workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(51)

7APPENDICEShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(52)

8BIBLIOGRAPHYhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(55)

9

10

1 CONTEXT 11 Background City planning committees which include elected officials (or their designees) specialists and resident representatives evaluate urban design proposals (made by urban designers) of development projects in specific locations An urban design proposal involves a 3D urban morphology (the combination of 3D building envelopes on blocks) to visualize spatial demographic data (eg residents jobs transportation) and building typology (on parcels) A qualified urban proposal requires multiple stakeholders in the city planning committee to negotiate their individual ideas such as population strategy landfill context and building design Urban designers collect and translate these ideas into a 3D urban morphology and visualize them Building a 3D urban morphology traditionally is time-consuming for urban designers due to the heavy manual modeling workload of massive building geometries Therefore itrsquos hard to execute enough rounds of iterations before a proposal truly meets all complicated requirements of the city planning committee To avoid the manual modeling work of deploying each building typology to every parcel of a site some rule-based approaches (such as ESRIrsquos CityEngine [1]) generate 3D urban morphology from a 2D site via rules The generative rule is programmed with built-in functions translating design language to a script involving parameters of building and zoning code When the rule performs well a rule-based approach can live up to its advertising However how to make creating rules intuitive for designers who donrsquot have coding skills is the bottleneck of rule-based systems with the result that designers in the real world still avoid using them for making 3D urban morphology

fig1 Some built-in generative functions provided by ESRI CityEngine

11

12 The problems of rule-based systems This study aims to improve the rule-creating process in terms of its limitations of creativity data-driven capacity first-time time-consumption and coding skill requirement 121 Creativity Although building masses can be generated by rules they are hard to be called ldquocreativerdquo because of the limited built-in functions (fig1) provided by rule-based systems (eg setback subdivision LOU-shape) It is difficult to simulate ideal urban morphologies made by master urban designers as well as make detailed adjustments in common ones (fig2) Both decision-makers and urban designers tend to avoid producing homogeneous morphologies in their design making them resistant to rule-based systems An acceptable approach should be able to evaluate novel urban morphologies Creativity is important for market applications

fig2 Urban morphologies in the existing urban context ideal urban design common urban design and

generative design 122 Data-driven capacity Apart from making novel forms rule-based systems cannot utilize all data produced by urban studies such as traffic congestion job-resident balance and population growth Utilizing spatial data within an urban design proposal is called ldquoGeoDesignrdquo offered by ESRI the same software producer of CityEngine Ideally this new decision-making workflow allows multiple users (who can either be decision-makers or urban designers) to contribute their data into a weighted overlapping system for visualizing It only achieves limited effects because of the complicated relationship between building typology and data For instance (fig3) a tower building with a large ground square shares the same Floor Area Ratio (FAR) with a low warehouse occupying most of the parcel A machine learning-empowered approach can play a part here matching the

12

empirical outcome of building typology instead of organizing the role and factor of each data in an ambiguous formula

fig3 Building typologies with similar FARs

123 First-time time-consumption As a part of the discussions above rule-based approaches require a large decision tree to simulate the shape grammar of a building typology The cost of using rule-based approaches is always larger than that of manual modeling work the first time when designers use it in a proposal because they need an excessive amount of time constructing specific rules for the project Urban designers therefore would not like to embrace rule-based approaches as a part of their design process before they can enjoy the saving of time from applying a rule recurrently 124 Coding skill requirement Creating rules for building typologies and spatial data requires an understanding of both programming and design Since only a few urban designers have such a combination of skills this requirement impedes its broader application Also as the clients and end-users of urban development planning committee members should be able to provide many aspects that are ignored by urban designers which means a user-friendly approach with the features of creating rules with low skill requirements and real-time visualization will bring more opportunities to larger groups of participants and facilitate decision-making iterations 13 Possible approaches to improve To solve the bottleneck of creating rules we need help from machine learning techniques similar to how the bottleneck of the expert system (rule-based system) had been broken through in the history of artificial intelligence In recent years 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies replacing some algorithms in classic computational theories It runs by minimizing the loss between 3D predicted objects from 2D images and their ground truth 3D models The multiple ways to realize 3D reconstruction from single images in the format of voxel [234] point cloud [56] and mesh [67] inspired me to explore the replacement of a) creating rules by built-in functions with b) creating rules by image references which is much more intuitive

13

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result

As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output consists of ambiguous probability models that cannot be cleaned or simplified without introducing the bias of the algorithm. Voxel models also demand heavier storage and computation as their scale increases, the same as the size comparison between vectorized and rasterized files. Again, voxel models are rare in urban design work because of their poor editability. However, we can see the image-learning capacity from the different results predicted from high-rise and low-rise reference images. Similar to the first approach, the second approach of adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network that aims to predict the location map generates similar results from various inputs. Furthermore, the network is not able to predict the rotation angle of building footprints either, learning unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most cases of urban design models. Linking these two networks (location-map prediction and 3D mesh reconstruction) also cannot easily avoid the bias of a shape grammar. Our final solution, spatial data reconstruction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________


Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches

53 Other evaluations

Recalling the combined loss function, we have seven factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, the factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions
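To illustrate how such balanced multi-output training might be wired up in Keras (the framework used in this study), here is a minimal sketch. The head names and the stand-in dense decoders are placeholders for the real decoders of fig16; the loss1/loss2 weights follow the factors listed in tab5, while the geometry head reuses the masked MSE only for brevity (the real Loss3 combines the weighted geometry terms).

```python
import tensorflow as tf
from tensorflow.keras import layers

def masked_mse(y_true, y_pred):
    # MSE over the available slots only; unavailable heights/vertices
    # are zero-padded, mirroring the masking described in section 4.5.
    mask = tf.cast(tf.not_equal(y_true, 0.0), tf.float32)
    return tf.reduce_sum(tf.square(y_true - y_pred) * mask) / \
           tf.maximum(tf.reduce_sum(mask), 1.0)

# Stand-in for the MID feature vector of fig16 (the real encoder is the
# GRU image branch plus the parcel geometry/property branches).
mid = inp = layers.Input(shape=(1024,))
outputs = {
    "num_buildings": layers.Dense(1, name="num_buildings")(mid),
    "heights": layers.Dense(10, name="heights")(mid),
    "footprints": layers.Dense(320, name="footprints")(mid),  # 10 x 16 x 2
}
model = tf.keras.Model(inp, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss={"num_buildings": "mse", "heights": masked_mse, "footprints": masked_mse},
    loss_weights={"num_buildings": 0.5, "heights": 0.1, "footprints": 1.0},
)
# model.fit(train_ds, validation_data=val_ds, epochs=50)  # as in fig21
```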


fig21 Loss plotting of the final solution

We also test the ablation effects of the various loss functions (see tab5). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.

| loss (weight) | Baseline | 3D-R2N2-GRU | VGG16-LSTM | Isolated dataset | Top-view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length |
| loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416 |
| loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621 |
| loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.0660 |
| loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.0500 |
| loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.1840 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801 |
| loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636 |
| loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435 |
| loss 3E (1) | 0.0445 | 0.0450 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.0440 | 0.0424 | 0.0441 | 0.0514 |

tab5 losses of ablation test (the "w/o" columns drop one loss function from training)

The learning on the isolated dataset with VGG16-GRU shows the best result among the reference groups, and the losses of the geometry terms demonstrate their effects on each other. The 2D top-view dataset performs only slightly worse in heights/numbers but better in the shape loss. In tab6 we can also see the constraints of shape, corner angle, and position under the different ablations of loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distributions of the DC1/DC2 layers after t-SNE dimensionality reduction, representing the learning capacities on reference images and geometries, which cannot simply be judged by loss values. The isolated dataset still shows the best distribution visually.
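A minimal sketch of how such a plot can be produced with scikit-learn, assuming the activations of a layer such as DC1 have been exported as a NumPy array; the file names and the coloring labels are placeholders:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# features: (n_samples, d) activations of an intermediate dense layer
# (e.g. DC1) exported from the trained network; labels: a per-sample
# category used only for coloring. Both file names are placeholders.
features = np.load("dc1_features.npy")
labels = np.load("dc1_labels.npy")

embedded = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=4, cmap="tab10")
plt.title("t-SNE of DC1 encoded features")
plt.show()
```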

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial to be able to manipulate the learning network to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after the inputs are encoded as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in the single-family residential prediction and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.
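The synthesizing step itself is linear interpolation in latent space (section 2). A sketch of that step is given below; encode() and decode() are dummy stand-ins for the encoder and decoder halves of the trained network, included only so the interpolation loop is runnable.

```python
import numpy as np

# Stand-ins for halves of the trained network: encode() would map an input
# (reference images plus parcel data) to its latent feature vector, and
# decode() would map a latent vector back to predicted building data.
def encode(sample):
    rng = np.random.default_rng(abs(hash(sample)) % (2**32))
    return rng.normal(size=1000)   # 1000-d latent, as for reference images

def decode(z):
    return z[:331]                 # dummy; the real decoder emits building data

z_a, z_b = encode("parcel_a"), encode("parcel_b")
for t in np.linspace(0.0, 1.0, num=5):
    z = (1.0 - t) * z_a + t * z_b  # linear interpolation in latent space
    prediction = decode(z)         # intermediate building prediction
```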


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped-image dataset, we also compare the results trained on the isolated-image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and a better size/number of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

61 Deliverable

After training and optimizing, the final solution can serve two functions integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like csv file including the information of 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (eg height, height limit, FAR), users can run a new training and validating process on their own computers. By using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry as a shape grammar does, separately. A sketch of the csv-to-GeoJSON conversion follows the figure below.

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does
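As an illustration of the lossless conversion described above, the sketch below turns one predicted parcel row into GeoJSON features carrying an extrusion height. The column layout (1 count + 10 heights + 10 sixteen-vertex geometries = 331 values) follows section 4.5.1; the exact column order and the file names are assumptions.

```python
import csv
import json

def row_to_features(row):
    """Convert one predicted parcel row (331 values, section 4.5.1) into
    GeoJSON features. Assumes the coordinates have already been scaled
    back to longitude/latitude from the parcel-extent normalization."""
    values = [float(v) for v in row]
    n = int(round(values[0]))
    heights = values[1:11]
    features = []
    for i in range(n):
        coords = values[11 + i * 32 : 11 + (i + 1) * 32]
        ring = [[coords[j], coords[j + 1]] for j in range(0, 32, 2)]
        ring.append(ring[0])  # close the polygon ring
        features.append({
            "type": "Feature",
            "properties": {"height": heights[i]},  # z-value read for extrusion
            "geometry": {"type": "Polygon", "coordinates": [ring]},
        })
    return features

# Hypothetical file names for input predictions and GeoJSON output.
with open("predicted_buildings.csv") as src, open("buildings.geojson", "w") as dst:
    feats = [ft for row in csv.reader(src) for ft in row_to_features(row)]
    json.dump({"type": "FeatureCollection", "features": feats}, dst)
```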


62 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortages of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved from the study.

621 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

622 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (eg FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data, directly producing a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

623 First-time time-consumption

As a part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of rule-based systems. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

624 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


625 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of manual urban design or rule-based systems. It not only saves time but also makes decision-making more fluent through an instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and experiments. This study also compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a lot of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future work.

63 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape grammar scripts, rule-based systems can create massive training datasets with various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It could contribute to predicting the relationship between buildings and auxiliaries, which means the possibility of predicting detailed building outputs. Mask R-CNN can also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 3: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

_____________________________________________________________________________________

Takehiko Nagakura Associate Professor of Design and Computation

Thesis Supervisor

_____________________________________________________________________________________ Terry Knight

William and Emma Rogers Professor Professor of Design and Computation

Thesis Supervisor

_____________________________________________________________________________________ Justin Solomon

X-Window Consortium Career Development Professor Assistant Professor of Electrical Engineering and Computer Science

Thesis Reader

3

4

SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN DESIGN CONCEPTS

by

Tuo Sun

Submitted to the Department of Architecture and the Department of Electrical Engineering and Computer Science

on January 16 2020 in Partial Fulfillment of the Requirements for the Degrees of

Master of Science in Architecture Studies and Master of Science in Electrical Engineering and Computer Science

ABSTRACT In the decision-making process of urban development projects decision-makers and urban designers work collectively as a) decision-makers make decisions of urban development based on the evaluation of urban morphology b) urban designers visualize design decisions given by decision-makers with 3D urban morphology and produce development proposals after certain rounds of iteration A proposal involves designing 3D urban morphology aka the collection of building typologies (parcel level) on a specific site Due to the high costs of visualizing massive building geometries manually the current decision-making workflow does not allow adequate iteration before the implementation of the proposal To reduce the cost of manual modeling work by designers rule-based approaches (like ESRIrsquos CityEngine) generate 3D urban morphology from spatial geometries via rules However the limitations of creating rules are the bottleneck of popularizing rule-based approaches in professional practice This research explores using machine learning pipelines to synthesize novel 3D morphology from urban design precedents intuitively solving the above bottleneck The resulting pipeline learns spatial data and 2D rendering images for two major parts 1) to extract 2D building typology images from an aerial rendering image of urban morphology and 2) to predict spatial building data from an extracted image and a spatial parcel geometry This pipeline promotes the process of creating rules allowing both urban designers to create visualization and decision-makers to evaluate urban development intuitively Thesis Supervisor Takehiko Nagakura Title Associate Professor of Design and Computation Thesis Supervisor Terry Knight Title William and Emma Rogers Professor Professor of Design and Computation

5

6

ACKNOWLEDGMENTS The foundational motivation of this study comes from my long-term study and experience with rule-based generative design (eg CityEngine Grasshopper) in the Now Institute Southern California Associate of Governments (SCAG) China Academy of Urban Planning and Design (CAUPD) and MIT Special thanks to my committees who bring great help and patience to this study Professor Takehiko Nagekura Professor Terry Knight and Professor Justin Solomon plus my friends Wenzhe Peng and Yuehan Wang Thanks for general computational approaches provided by 3D-R2N2 Pixel2Mesh and Mask R-CNN This study implements their theories and extracts feasible methods for urban design machine learning tasks The idea of how urban morphology can be defined computationally is inspired during precious communications with specialists in various fields such as urban design urban planning computational design and computer graphics

7

8

TABLE OF CONTENTS

1 CONTEXT 11 Background helliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(11) 12 Problemshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(12) 13 Possible approaches to improvehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(13)

2 HYPOTHESIShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(16)

3 PRECEDENTS

31 Collective decision-making toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(17) 32 Generative design toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18) 33 Related machine learning workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18)

4 SOLUTION AND METHODOLOGY

41 Solutionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(20) 42 Datasethelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(21) 43 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(22) 44 3D voxelmesh reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(23) 45 Spatial data reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(29)

5 RESULT AND EVALUATION

51 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 52 3D reconstruction resultshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 53 Other evaluationshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(40)

6 DELIVERABLE AND CONTRIBUTION

61 Deliverablehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(49) 62 Contributionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(50) 63 Future Workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(51)

7APPENDICEShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(52)

8BIBLIOGRAPHYhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(55)

9

10

1 CONTEXT 11 Background City planning committees which include elected officials (or their designees) specialists and resident representatives evaluate urban design proposals (made by urban designers) of development projects in specific locations An urban design proposal involves a 3D urban morphology (the combination of 3D building envelopes on blocks) to visualize spatial demographic data (eg residents jobs transportation) and building typology (on parcels) A qualified urban proposal requires multiple stakeholders in the city planning committee to negotiate their individual ideas such as population strategy landfill context and building design Urban designers collect and translate these ideas into a 3D urban morphology and visualize them Building a 3D urban morphology traditionally is time-consuming for urban designers due to the heavy manual modeling workload of massive building geometries Therefore itrsquos hard to execute enough rounds of iterations before a proposal truly meets all complicated requirements of the city planning committee To avoid the manual modeling work of deploying each building typology to every parcel of a site some rule-based approaches (such as ESRIrsquos CityEngine [1]) generate 3D urban morphology from a 2D site via rules The generative rule is programmed with built-in functions translating design language to a script involving parameters of building and zoning code When the rule performs well a rule-based approach can live up to its advertising However how to make creating rules intuitive for designers who donrsquot have coding skills is the bottleneck of rule-based systems with the result that designers in the real world still avoid using them for making 3D urban morphology

fig1 Some built-in generative functions provided by ESRI CityEngine

11

12 The problems of rule-based systems This study aims to improve the rule-creating process in terms of its limitations of creativity data-driven capacity first-time time-consumption and coding skill requirement 121 Creativity Although building masses can be generated by rules they are hard to be called ldquocreativerdquo because of the limited built-in functions (fig1) provided by rule-based systems (eg setback subdivision LOU-shape) It is difficult to simulate ideal urban morphologies made by master urban designers as well as make detailed adjustments in common ones (fig2) Both decision-makers and urban designers tend to avoid producing homogeneous morphologies in their design making them resistant to rule-based systems An acceptable approach should be able to evaluate novel urban morphologies Creativity is important for market applications

fig2 Urban morphologies in the existing urban context ideal urban design common urban design and

generative design 122 Data-driven capacity Apart from making novel forms rule-based systems cannot utilize all data produced by urban studies such as traffic congestion job-resident balance and population growth Utilizing spatial data within an urban design proposal is called ldquoGeoDesignrdquo offered by ESRI the same software producer of CityEngine Ideally this new decision-making workflow allows multiple users (who can either be decision-makers or urban designers) to contribute their data into a weighted overlapping system for visualizing It only achieves limited effects because of the complicated relationship between building typology and data For instance (fig3) a tower building with a large ground square shares the same Floor Area Ratio (FAR) with a low warehouse occupying most of the parcel A machine learning-empowered approach can play a part here matching the

12

empirical outcome of building typology instead of organizing the role and factor of each data in an ambiguous formula

fig3 Building typologies with similar FARs

123 First-time time-consumption As a part of the discussions above rule-based approaches require a large decision tree to simulate the shape grammar of a building typology The cost of using rule-based approaches is always larger than that of manual modeling work the first time when designers use it in a proposal because they need an excessive amount of time constructing specific rules for the project Urban designers therefore would not like to embrace rule-based approaches as a part of their design process before they can enjoy the saving of time from applying a rule recurrently 124 Coding skill requirement Creating rules for building typologies and spatial data requires an understanding of both programming and design Since only a few urban designers have such a combination of skills this requirement impedes its broader application Also as the clients and end-users of urban development planning committee members should be able to provide many aspects that are ignored by urban designers which means a user-friendly approach with the features of creating rules with low skill requirements and real-time visualization will bring more opportunities to larger groups of participants and facilitate decision-making iterations 13 Possible approaches to improve To solve the bottleneck of creating rules we need help from machine learning techniques similar to how the bottleneck of the expert system (rule-based system) had been broken through in the history of artificial intelligence In recent years 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies replacing some algorithms in classic computational theories It runs by minimizing the loss between 3D predicted objects from 2D images and their ground truth 3D models The multiple ways to realize 3D reconstruction from single images in the format of voxel [234] point cloud [56] and mesh [67] inspired me to explore the replacement of a) creating rules by built-in functions with b) creating rules by image references which is much more intuitive

13

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Parcel fields
Field name      vertex x       vertex y       area         perimeter   height limit
Extent          -9980~9994     -9988~9989     396~34567    83~761      0~100
Scale method    parcel extent  parcel extent  min-max      min-max     min-max
Output extent   -1~1           -1~1           0~1          0~1         0~1
Preserve        1.0            1.0            1.0          1.0         1.0

Building fields
Field name      vertex x       vertex y       number of buildings   height
Extent          -9928~9828     -9934~9982     1~9                   1~985
Scale method    parcel extent  parcel extent  min-max               min-max
Output extent   -1~1           -1~1           0~1                   0~1
Preserve        1.0            1.0            1.0                   1.0

tab6 The conversion of the dataset (from the entire training/validating dataset)
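A sketch of this per-parcel normalization, assuming NumPy arrays of (longitude, latitude) vertices and a dict of scalar properties whose dataset-wide maxima correspond to the extents in tab6.

import numpy as np

def normalize_parcel(parcel_pts, building_pts, props, prop_max):
    # center both geometries on the parcel and scale by its maximum extent
    center = (parcel_pts.max(axis=0) + parcel_pts.min(axis=0)) / 2.0
    half_extent = (parcel_pts.max(axis=0) - parcel_pts.min(axis=0)).max() / 2.0
    parcel_n = (parcel_pts - center) / half_extent        # in [-1, 1]
    buildings_n = (building_pts - center) / half_extent   # same frame as parcel
    # scalar properties (area, perimeter, height limit) go to [0, 1]
    props_n = {k: v / prop_max[k] for k, v in props.items()}
    return parcel_n, buildings_n, props_n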


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). First, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the 3D-R2N2 implementation of the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (ResNet as in 3D-R2N2, or VGG16, as the base model) convolves the input images (N views randomly picked per parcel, where N ranges from 1 to 5) into a time-sequential feature and uses the LSTM/GRU to merge them into a single column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Second, we convolve the input parcel geometry and fully connect the parcel property data as two feature vectors (FC2, FC3), and densify them, together with a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, the building heights, and the building geometry as three separate outputs, with a convolution layer used only in the geometry prediction.

fig16 geospatial prediction network structure
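The following is a minimal Keras sketch of this architecture, assuming a fixed five-view input and VGG16 as the base model; the layer widths follow the dimensions mentioned in the text (1024 for the merged image feature, 1000/300 for the dense features), while the geometry-decoder shape and the property count are illustrative.

import tensorflow as tf
from tensorflow.keras import layers

n_views, n_verts, max_b = 5, 16, 10

img_in = tf.keras.Input((n_views, 224, 224, 3))   # N aerial renderings
geo_in = tf.keras.Input((n_verts, 2))             # 16-vertex parcel polygon
prop_in = tf.keras.Input((4,))                    # area, perimeter, limit, land-use

cnn = tf.keras.applications.VGG16(include_top=False, pooling="avg")
feats = layers.TimeDistributed(cnn)(img_in)       # time-sequential image features
fc1 = layers.GRU(1024)(feats)                     # merge the views (3D-GRU role)

fc2 = layers.Flatten()(layers.Conv1D(32, 3, padding="same")(geo_in))
fc3 = layers.Dense(64, activation="relu")(prop_in)

dc1 = layers.Dense(1000, activation="relu")(fc1)
dc2 = layers.Dense(300, activation="relu")(fc2)
dc3 = layers.Dense(300, activation="relu")(fc3)
mid = layers.Concatenate()([dc1, dc2, dc3])       # formula5

n_out = layers.Dense(1, name="n_buildings")(mid)
h_out = layers.Dense(max_b, name="heights")(mid)
g = layers.Dense(max_b * n_verts * 2)(mid)
g = layers.Reshape((max_b * n_verts, 2))(g)
g_out = layers.Conv1D(2, 3, padding="same", name="footprints")(g)  # conv decoder

model = tf.keras.Model([img_in, geo_in, prop_in], [n_out, h_out, g_out])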


MID = concat(DC1, DC2, DC3)

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between the prediction and the ground-truth label in terms of the number of buildings, the building heights, and the vertices of the building footprints. In particular, we only consider the available values in Loss2 and Loss3, since the number of buildings varies in each parcel. For an instance of a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values (2 buildings × 16 vertices × 2) in Loss3. Loss3 considers several parts to let the neural network learn the geometries, including center, relative vertices, unit tangent vector, curvature, edge length, etc. They constrain different geometry characteristics (fig17):

Center: the location of building footprints
Relative vertices: the shape and size of each geometry
Unit tangent vector: the rotation angle of each geometry
Discrete curvature: the corners of each geometry
Edge length: the edge length of each geometry
Extent x and y: the horizontal and vertical scale of building footprints
Maximum discrete curvature: the maximum corner angle of each geometry
Number of corners: the number of corners of each geometry
Normalized coordinates: the shape and size of each geometry
Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction


The final combined loss leans more on Loss3, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for a better organization of the losses. Thus all losses lie in the range of 0 to 1, and they are balanced by weighting factors that follow the factors in Pixel2Mesh, adjusted by experimental results (see the next section). The terms of formula6 are:

The number of buildings
The 10 building heights
Center coordinates
Absolute coordinates
Relative coordinates
Normalized coordinates (computed on rolled vertices, with batch normalization)
Unit tangent vector
Discrete curvature
Edge length
Extent x and y
Maximum discrete curvature
The number of corners
Combined loss (the weighted sum of the terms above)

formula6 Loss functions
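The masking described above (counting only the first k of ten height slots, and the first k × 32 coordinate values) can be sketched in TensorFlow as follows, assuming zero-padded labels and per-sample building counts of shape (batch,); the normalizing constant 4 follows the L1-distance maximum mentioned above.

import tensorflow as tf

def masked_height_loss(y_true, y_pred, n_buildings):
    # mask shape (batch, 10): 1 for real height slots, 0 for padding
    mask = tf.sequence_mask(tf.cast(n_buildings, tf.int32), maxlen=10,
                            dtype=tf.float32)
    squared = tf.square(y_true - y_pred) * mask
    return tf.reduce_sum(squared) / tf.reduce_sum(mask)

def masked_coord_loss(y_true, y_pred, n_buildings, n_verts=16):
    # each building contributes 16 vertices x 2 = 32 coordinate values
    mask = tf.sequence_mask(tf.cast(n_buildings, tf.int32), maxlen=10,
                            dtype=tf.float32)
    mask = tf.repeat(mask, n_verts * 2, axis=-1)   # (batch, 320)
    absolute = tf.abs(y_true - y_pred) * mask      # L1 distance, max value 4
    return tf.reduce_sum(absolute) / (4.0 * tf.reduce_sum(mask))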


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence its evaluations are tested by modifying design elements.

5.1 Extracting building typologies

The first Mask R-CNN implementation model is trained for 80 epochs and achieves a 0.11 bounding-box loss and a 0.41 mask loss on the training set, and a 0.15 bounding-box loss and a 0.40 mask loss on the validating set (fig18). As we can see, the network performs well in bounding-box prediction but not as well in mask prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results

As shown in fig20, our final solution predicts the cleanest and most usable outputs among the resulting 3D models. The voxel output is an ambiguous probability model that cannot be cleaned or simplified without introducing the bias of a post-processing algorithm. Voxel models also become computationally heavier as their scale increases, analogous to the size comparison between vectorized and rasterized files. Moreover, voxel models are rarely used in urban design with respect to their editability. Nevertheless, we can see the image-learning capacity from the different results predicted from high-rise and low-rise reference images.

As in the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network intended to predict the location map generates similar results from various inputs. Furthermore, it is not able to predict the angle of building footprints well, learning unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most urban design use cases. Linking these two networks (location-map prediction and 3D mesh reconstruction) does not easily avoid the bias of shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________


Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches

5.3 Other evaluations

Recalling the combined loss function, we have seven weighting factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4); a sketch of this setup follows formula7.

formula7 The factors of Loss functions
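A sketch of this training setup, assuming the Keras model from the sketch in section 4.5.2 and hypothetical tf.data pipelines train_ds/val_ds; the per-output losses and weights are placeholders standing in for formula6 and formula7, not the thesis values.

import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # lr from the text
    loss={"n_buildings": "mse", "heights": "mse", "footprints": "mae"},
    loss_weights={"n_buildings": 0.5, "heights": 0.1, "footprints": 1.0})
history = model.fit(train_ds, validation_data=val_ds, epochs=50)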


fig21 Loss plotting of the final solution

We also test the ablation effects of various loss functions (see tab7). The baseline test uses VGG16-GRU with all loss functions, training 25 epochs on a cropped dataset. The output geometries are also shown as images in tab8. We use these test results for further adjustment of the multiple loss functions.

(the last five columns each remove one loss term)

                Base-   3D-R2N2-  VGG16-  isolated  Top-    w/o rel.  w/o unit  w/o        w/o     w/o edge
loss (weight)   line    GRU       LSTM    dataset   view    distance  tangent   curvature  center  length
loss            0.8842  0.9645    0.8745  0.8351    0.9704  0.8353    0.7166    0.7189     0.4403  0.8416
loss 1 (0.5)    0.0601  0.0665    0.0598  0.0473    0.1226  0.0638    0.0583    0.0577     0.0489  0.0621
loss 2 (0.1)    0.0658  0.0636    0.0658  0.0507    0.0949  0.0652    0.0647    0.0652     0.0584  0.0660
loss 3A (1)     0.0518  0.0519    0.0514  0.0527    0.0324  0.0589    0.0525    0.0478     0.0498  0.0500
loss 3B (1)     0.1796  0.1793    0.1794  0.1840    0.0947  0.1795    0.1947    0.1839     0.1754  0.1801
loss 3C (1)     0.1623  0.1614    0.1618  0.1632    0.1381  0.1616    0.1744    0.2392     0.1602  0.1636
loss 3D (10)    0.0433  0.0514    0.0437  0.0371    0.0592  0.0439    0.0433    0.0432     0.0598  0.0435
loss 3E (1)     0.0445  0.0450    0.0448  0.0408    0.0428  0.0427    0.0440    0.0424     0.0441  0.0514

tab7 losses of the ablation test

The learning of the isolated dataset with VGG16-GRU shows the best results among the reference groups, and the losses of the geometry terms demonstrate their mutual effects. The 2D top-view dataset performs only slightly worse in heights and building numbers, but better in the shape loss. In tab8 we can also see the constraints on shape, corner angle, and position in the different loss-function ablations.

tab8 spatial building predictions of the ablation test

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distributions of the DC1/DC2 layers after t-SNE dimension reduction, representing learning capacities for reference images and geometries that cannot simply be judged by loss values. The isolated dataset still shows the best distribution visually.

fig22 T-SNE mapping of reference images encoded features (DC1)
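These plots can be reproduced with scikit-learn and matplotlib, assuming the encoded features have been exported as a (num_samples, feature_dim) array named dc1_features (a hypothetical name); the perplexity is illustrative.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embedded = TSNE(n_components=2, perplexity=30,
                random_state=0).fit_transform(dc1_features)
plt.scatter(embedded[:, 0], embedded[:, 1], s=4)
plt.title("t-SNE of reference-image features (DC1)")
plt.show()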


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after the inputs are encoded as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties); a sketch of the procedure follows. In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate predictions between two parcel shapes. The land-use predictions show more buildings for the single-family residential parcel and larger single buildings for the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings, compared to the denser and taller ones in the larger parcel.
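A minimal sketch of the interpolation procedure, assuming the trained model has been split into hypothetical encoder and decoder sub-models:

import numpy as np

def interpolate_latent(encoder, decoder, input_a, input_b, steps=5):
    # encode both inputs, lerp in latent space, then decode each step
    z_a, z_b = encoder.predict(input_a), encoder.predict(input_b)
    outputs = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b   # linear interpolation
        outputs.append(decoder.predict(z))
    return outputs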


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and better sizes and numbers of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution serves two functions for integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like csv file including the information of 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR), users can run a new training and validating process on their own computers. By using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, achieving information-rich geospatial building geometries as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms; a conversion sketch follows fig32. By simply reading the z-value (extrusion height), the 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry just as a shape grammar does.

fig32 How weights learned from images generate geometries, as the rule-based pipeline does
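The conversion step can be sketched with GeoPandas as follows; the csv layout (16 vertex columns x0..x15/y0..y15 plus a height column) is hypothetical shorthand for the spatial building data described in section 4.5.1.

import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

df = pd.read_csv("predicted_buildings.csv")
footprints = [Polygon(zip(row[[f"x{i}" for i in range(16)]],
                          row[[f"y{i}" for i in range(16)]]))
              for _, row in df.iterrows()]
gdf = gpd.GeoDataFrame(df[["height"]], geometry=footprints, crs="EPSG:4326")
gdf.to_file("predicted_buildings.geojson", driver="GeoJSON")
# the z-value ("height") is then read as the extrusion height in GIS platforms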


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). This thesis thereby contributes to the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also result from the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating unique urban morphologies.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a larger number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple data sources, directly producing a matching rule from the given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time time-consumption

As a part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the very beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from those provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of manual urban design or a rule-based system. It not only saves time but also makes decision-making more fluent, through the instant comparison of two strategies or the accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and experimentation. Also, this study compares the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structures and data augmentation can provide experimental help to other related studies. The final solution also saves a great deal of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future work.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create massive training datasets with various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain wall or brick wall) as a part of building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than just building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationships between buildings and auxiliary structures, which opens the possibility of predicting detailed building outputs. The Mask R-CNN can also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) YOLO. https://pjreddie.com/darknet/yolo/
18) Images to OSM. https://github.com/jremillard/images-to-osm
19) Reconstructing 3D buildings from aerial LiDAR with AI. https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 4: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

4

SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN DESIGN CONCEPTS

by

Tuo Sun

Submitted to the Department of Architecture and the Department of Electrical Engineering and Computer Science

on January 16 2020 in Partial Fulfillment of the Requirements for the Degrees of

Master of Science in Architecture Studies and Master of Science in Electrical Engineering and Computer Science

ABSTRACT In the decision-making process of urban development projects decision-makers and urban designers work collectively as a) decision-makers make decisions of urban development based on the evaluation of urban morphology b) urban designers visualize design decisions given by decision-makers with 3D urban morphology and produce development proposals after certain rounds of iteration A proposal involves designing 3D urban morphology aka the collection of building typologies (parcel level) on a specific site Due to the high costs of visualizing massive building geometries manually the current decision-making workflow does not allow adequate iteration before the implementation of the proposal To reduce the cost of manual modeling work by designers rule-based approaches (like ESRIrsquos CityEngine) generate 3D urban morphology from spatial geometries via rules However the limitations of creating rules are the bottleneck of popularizing rule-based approaches in professional practice This research explores using machine learning pipelines to synthesize novel 3D morphology from urban design precedents intuitively solving the above bottleneck The resulting pipeline learns spatial data and 2D rendering images for two major parts 1) to extract 2D building typology images from an aerial rendering image of urban morphology and 2) to predict spatial building data from an extracted image and a spatial parcel geometry This pipeline promotes the process of creating rules allowing both urban designers to create visualization and decision-makers to evaluate urban development intuitively Thesis Supervisor Takehiko Nagakura Title Associate Professor of Design and Computation Thesis Supervisor Terry Knight Title William and Emma Rogers Professor Professor of Design and Computation

5

6

ACKNOWLEDGMENTS The foundational motivation of this study comes from my long-term study and experience with rule-based generative design (eg CityEngine Grasshopper) in the Now Institute Southern California Associate of Governments (SCAG) China Academy of Urban Planning and Design (CAUPD) and MIT Special thanks to my committees who bring great help and patience to this study Professor Takehiko Nagekura Professor Terry Knight and Professor Justin Solomon plus my friends Wenzhe Peng and Yuehan Wang Thanks for general computational approaches provided by 3D-R2N2 Pixel2Mesh and Mask R-CNN This study implements their theories and extracts feasible methods for urban design machine learning tasks The idea of how urban morphology can be defined computationally is inspired during precious communications with specialists in various fields such as urban design urban planning computational design and computer graphics

7

8

TABLE OF CONTENTS

1 CONTEXT 11 Background helliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(11) 12 Problemshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(12) 13 Possible approaches to improvehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(13)

2 HYPOTHESIShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(16)

3 PRECEDENTS

31 Collective decision-making toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(17) 32 Generative design toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18) 33 Related machine learning workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18)

4 SOLUTION AND METHODOLOGY

41 Solutionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(20) 42 Datasethelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(21) 43 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(22) 44 3D voxelmesh reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(23) 45 Spatial data reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(29)

5 RESULT AND EVALUATION

51 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 52 3D reconstruction resultshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 53 Other evaluationshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(40)

6 DELIVERABLE AND CONTRIBUTION

61 Deliverablehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(49) 62 Contributionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(50) 63 Future Workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(51)

7APPENDICEShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(52)

8BIBLIOGRAPHYhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(55)

9

10

1 CONTEXT 11 Background City planning committees which include elected officials (or their designees) specialists and resident representatives evaluate urban design proposals (made by urban designers) of development projects in specific locations An urban design proposal involves a 3D urban morphology (the combination of 3D building envelopes on blocks) to visualize spatial demographic data (eg residents jobs transportation) and building typology (on parcels) A qualified urban proposal requires multiple stakeholders in the city planning committee to negotiate their individual ideas such as population strategy landfill context and building design Urban designers collect and translate these ideas into a 3D urban morphology and visualize them Building a 3D urban morphology traditionally is time-consuming for urban designers due to the heavy manual modeling workload of massive building geometries Therefore itrsquos hard to execute enough rounds of iterations before a proposal truly meets all complicated requirements of the city planning committee To avoid the manual modeling work of deploying each building typology to every parcel of a site some rule-based approaches (such as ESRIrsquos CityEngine [1]) generate 3D urban morphology from a 2D site via rules The generative rule is programmed with built-in functions translating design language to a script involving parameters of building and zoning code When the rule performs well a rule-based approach can live up to its advertising However how to make creating rules intuitive for designers who donrsquot have coding skills is the bottleneck of rule-based systems with the result that designers in the real world still avoid using them for making 3D urban morphology

fig1 Some built-in generative functions provided by ESRI CityEngine

11

12 The problems of rule-based systems This study aims to improve the rule-creating process in terms of its limitations of creativity data-driven capacity first-time time-consumption and coding skill requirement 121 Creativity Although building masses can be generated by rules they are hard to be called ldquocreativerdquo because of the limited built-in functions (fig1) provided by rule-based systems (eg setback subdivision LOU-shape) It is difficult to simulate ideal urban morphologies made by master urban designers as well as make detailed adjustments in common ones (fig2) Both decision-makers and urban designers tend to avoid producing homogeneous morphologies in their design making them resistant to rule-based systems An acceptable approach should be able to evaluate novel urban morphologies Creativity is important for market applications

fig2 Urban morphologies in the existing urban context ideal urban design common urban design and

generative design 122 Data-driven capacity Apart from making novel forms rule-based systems cannot utilize all data produced by urban studies such as traffic congestion job-resident balance and population growth Utilizing spatial data within an urban design proposal is called ldquoGeoDesignrdquo offered by ESRI the same software producer of CityEngine Ideally this new decision-making workflow allows multiple users (who can either be decision-makers or urban designers) to contribute their data into a weighted overlapping system for visualizing It only achieves limited effects because of the complicated relationship between building typology and data For instance (fig3) a tower building with a large ground square shares the same Floor Area Ratio (FAR) with a low warehouse occupying most of the parcel A machine learning-empowered approach can play a part here matching the

12

empirical outcome of building typology instead of organizing the role and factor of each data in an ambiguous formula

fig3 Building typologies with similar FARs

123 First-time time-consumption As a part of the discussions above rule-based approaches require a large decision tree to simulate the shape grammar of a building typology The cost of using rule-based approaches is always larger than that of manual modeling work the first time when designers use it in a proposal because they need an excessive amount of time constructing specific rules for the project Urban designers therefore would not like to embrace rule-based approaches as a part of their design process before they can enjoy the saving of time from applying a rule recurrently 124 Coding skill requirement Creating rules for building typologies and spatial data requires an understanding of both programming and design Since only a few urban designers have such a combination of skills this requirement impedes its broader application Also as the clients and end-users of urban development planning committee members should be able to provide many aspects that are ignored by urban designers which means a user-friendly approach with the features of creating rules with low skill requirements and real-time visualization will bring more opportunities to larger groups of participants and facilitate decision-making iterations 13 Possible approaches to improve To solve the bottleneck of creating rules we need help from machine learning techniques similar to how the bottleneck of the expert system (rule-based system) had been broken through in the history of artificial intelligence In recent years 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies replacing some algorithms in classic computational theories It runs by minimizing the loss between 3D predicted objects from 2D images and their ground truth 3D models The multiple ways to realize 3D reconstruction from single images in the format of voxel [234] point cloud [56] and mesh [67] inspired me to explore the replacement of a) creating rules by built-in functions with b) creating rules by image references which is much more intuitive

13

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typology to their ground-truth 3D models (stored as point clouds with normals) via the chamfer distance loss, normal loss, and Laplacian regularization. In Pixel2Mesh, the chamfer loss constrains the location of mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.
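For reference, the chamfer distance between the predicted point set P and the ground-truth point cloud Q takes the standard bidirectional form used by Pixel2Mesh [7]:

```latex
d_{CD}(P, Q) = \sum_{p \in P} \min_{q \in Q} \lVert p - q \rVert_2^2
             + \sum_{q \in Q} \min_{p \in P} \lVert q - p \rVert_2^2
```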

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are placed by the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding unnecessary computation on image and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of the training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential parcels with two buildings occupy the most), we scale the samples by land-use and building-number factors to 10216 samples (tab5). This prevents our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample
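One simple way to implement this rebalancing is sketched below; the pandas column names (landuse_code, n_buildings) are hypothetical, and the exact per-group factors of this study may differ:

```python
# Duplicate under-represented (land-use, building-count) groups so the
# network does not overfit the dominant single-family two-building parcels.
import pandas as pd

def rebalance(samples: pd.DataFrame, target_per_group: int) -> pd.DataFrame:
    groups = samples.groupby(["landuse_code", "n_buildings"])
    scaled = [g.sample(target_per_group, replace=len(g) < target_per_group)
              for _, g in groups]
    return pd.concat(scaled, ignore_index=True)
```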

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from 24 aerial rendering images of building typologies via their bounding boxes. Additionally, in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the difference in learning results between 2D and 3D input information. Spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. The csv table of each spatial parcel data input follows the sample format in fig13.


The ground-truth label is the spatial building data corresponding to each parcel. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. For each csv table of spatial building data, its properties include the number of buildings in the parcel, ten height values, and ten geometry entries. The table thus includes 331 values (1 + 10 + 10 × 16 × 2).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry
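A Python sketch of one common way to implement fig14, resampling a polygon at equal arc-length spacing so every geometry ends up with the same vertex count (the exact method of the original pseudo code is an assumption):

```python
# Resample a closed polygon to a fixed number of vertices by interpolating
# points at equal arc-length intervals along its boundary.
import numpy as np

def resample_polygon(coords: np.ndarray, n_vertices: int = 16) -> np.ndarray:
    closed = np.vstack([coords, coords[:1]])                  # close the ring
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)     # edge lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])             # arc length
    targets = np.linspace(0.0, cum[-1], n_vertices, endpoint=False)
    out = []
    for t in targets:
        i = np.searchsorted(cum, t, side="right") - 1
        f = (t - cum[i]) / max(seg[i], 1e-12)
        out.append(closed[i] * (1 - f) + closed[i + 1] * f)
    return np.asarray(out)                                    # n_vertices x 2
```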

fig15 (pseudo code) Smoothing a geometry
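Similarly, a sketch of fig15's smoothing as a single neighbor-averaging pass (the averaging weight is an assumption):

```python
# Smooth corners by blending each vertex toward the mean of its neighbors.
import numpy as np

def smooth_polygon(coords: np.ndarray, weight: float = 0.5) -> np.ndarray:
    prev_pts = np.roll(coords, 1, axis=0)
    next_pts = np.roll(coords, -1, axis=0)
    return (1 - weight) * coords + weight * 0.5 * (prev_pts + next_pts)
```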


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, which can be learned by neural networks. After analyzing the dataset, we filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices), sparsifying the coordinate values.

Parcel data (geometry coordinates and properties):
  raw extent      -9980~9994 | -9988~9989 | 396~34567 | 83~761 | 0~100
  output extent   -1~1       | -1~1       | 0~1       | 0~1    | 0~1
  preserve        10         | 10         | 10        | 10     | 10

Building data:
  raw extent      -9928~9828 | -9934~9982 | 1~9 | 1~985
  output extent   -1~1       | -1~1       | 0~1 | 0~1
  preserve        10         | 10         | 10  | 10

tab6 The conversion of the dataset (from the entire training/validating dataset)
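A sketch of these conversions in Python; the min-max scaling applies per field, and the coordinate normalization follows the per-parcel rule described above (function names are illustrative):

```python
# Scale properties to [0, 1] and normalize coordinates to roughly [-1, 1]
# by the maximum extent of the owning parcel boundary.
import numpy as np

def scale_minmax(x, lo, hi):
    return (x - lo) / (hi - lo)                     # property -> [0, 1]

def normalize_coords(coords, parcel_coords):
    center = (parcel_coords.max(0) + parcel_coords.min(0)) / 2.0
    extent = (parcel_coords.max(0) - parcel_coords.min(0)).max() / 2.0
    return (coords - center) / extent               # geometry -> ~[-1, 1]
```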


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (the number of buildings, building heights, building geometry data). First, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units). This network (ResNet as in 3D-R2N2, or VGG16, as the base model) convolves the input images (N views are picked randomly per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Second, we convolve the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3), and densify them along with a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we obtain a feature vector (MID) in latent space. The sizes of the features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, building heights, and building geometry as three separate outputs, with a convolution layer only in the geometry prediction.

fig16 geospatial prediction network structure
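To make the structure concrete, the following is a compressed Keras sketch of fig16; the base model, layer sizes, and head shapes are assumptions for illustration, not the exact script of this study:

```python
# Three-input / three-output geospatial prediction network (sketch).
import tensorflow as tf
from tensorflow.keras import layers, Model

n_views, size, v = 5, 224, 16                      # views, image size, vertices
img_in  = layers.Input((n_views, size, size, 3))   # aerial renderings
geo_in  = layers.Input((v, 2))                     # parcel geometry
prop_in = layers.Input((4,))                       # area, perimeter, limit, use

vgg = tf.keras.applications.VGG16(include_top=False, pooling="avg",
                                  weights=None, input_shape=(size, size, 3))
fc1 = layers.GRU(1024)(layers.TimeDistributed(vgg)(img_in))          # FC1
fc2 = layers.Flatten()(layers.Conv1D(32, 3, padding="same",
                                     activation="relu")(geo_in))     # FC2
fc3 = layers.Dense(64, activation="relu")(prop_in)                   # FC3

mid = layers.Concatenate()([layers.Dense(1024, activation="relu")(fc1),  # DC1
                            layers.Dense(128, activation="relu")(fc2),   # DC2
                            layers.Dense(48, activation="relu")(fc3)])   # DC3

n_out = layers.Dense(10, activation="softmax", name="n_buildings")(mid)
h_out = layers.Dense(10, name="heights")(mid)
g = layers.Reshape((10 * v, 2))(layers.Dense(10 * v * 2)(mid))
g_out = layers.Conv1D(2, 3, padding="same", name="geometry")(g)  # conv head

model = Model([img_in, geo_in, prop_in], [n_out, h_out, g_out])
```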


formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings, the building heights, and the vertices of building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel. For an instance of a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 (2 × 16 × 2) coordinate values in Loss3. Loss3 considers several parts, including the center, relative vertices, unit tangent vector, curvature, and edge length, to let the neural network learn the geometries. They constrain different geometry characteristics (fig17):

- Center: the location of building footprints
- Relative vertices: the shape and size of each geometry
- Unit tangent vector: the rotation angle of each geometry
- Discrete curvature: the corners of each geometry
- Edge length: the edge length of each geometry
- Extent x and y: the horizontal and vertical scale of building footprints
- Maximum discrete curvature: the maximum angle of corners
- Number of corners: the count of corners
- Normalized coordinates: the shape and size of each geometry
- Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction


The final combined loss leans more heavily on Loss3, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for better loss organization. Thus all losses are in the range of 0 to 1, and they are balanced by weighting factors that consider the factors in Pixel2Mesh, adjusted by experimental results (see the next section). The terms of formula6 are:

- the number of buildings
- the 10 building heights
- center coordinates
- absolute coordinates
- relative coordinates
- normalized coordinates (with rolled vertices and a batch-normalized variant)
- unit tangent vector
- discrete curvature
- edge length
- extent x and y
- maximum discrete curvature
- the number of corners
- the combined loss

formula6 Loss functions
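A sketch of the masking rule shared by Loss2 and Loss3, written as a TensorFlow loss (the function and argument names are hypothetical; for a parcel with k buildings, only the first k of the 10 slots contribute):

```python
# MSE over the valid building slots only; padded slots are masked out.
import tensorflow as tf

def masked_mse(y_true, y_pred, n_buildings, slots=10):
    # mask shape (batch, slots): 1 for real buildings, 0 for padding
    mask = tf.sequence_mask(n_buildings, maxlen=slots, dtype=tf.float32)
    sq = tf.square(y_true - y_pred) * mask
    return tf.reduce_sum(sq) / tf.maximum(tf.reduce_sum(mask), 1.0)
```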


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. Since the final approach is a solution aimed solely at urban design usage, its evaluations are run by modifying design elements.

5.1 Extracting building typologies

The first Mask R-CNN implementation model is trained for 80 epochs and achieves a 0.11 bounding-box loss and a 0.41 mask loss on the training set, and a 0.15 bounding-box loss and a 0.40 mask loss on the validating set (fig18). As we can see, the network performs well in bounding-box prediction but less well in mask prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results

As shown in fig20, our final solution predicts the cleanest and most usable outputs as resulting 3D models. The voxel output consists of ambiguous probability models that cannot be cleaned or simplified without introducing algorithmic bias. Also, voxel models demand heavier space and computation as their scale increases, analogous to the size comparison between vectorized and rasterized files. Moreover, voxel models are rare in urban design with respect to their editability. However, we can see the image-learning capacity from the different results predicted from high-rise and low-rise reference images. Similarly to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network that aims to predict the location map generates similar results from various inputs. Furthermore, the network cannot predict the angle of building footprints well, producing unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain overly trivial mesh faces for most cases of urban design models. Linking these two networks (location-map prediction and 3D mesh reconstruction) makes it hard to avoid the bias of the shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel / Mesh / Spatial Data (the three output rows of fig20)

fig20 Outputs from three approaches

5.3 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and the experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

We also test the ablation effects of the various loss functions (see tab5 below). The baseline test uses VGG16-GRU with all loss functions, training 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustments of the multiple loss functions.

The last five columns each remove one loss function (relative distance, unit tangent vector, curvature, center, edge length) from the baseline.

loss (weight)   Baseline | 3D-R2N2-GRU | VGG16-LSTM | isolated | Top-view | w/o rel. dist. | w/o tangent | w/o curvature | w/o center | w/o edge len.
loss (combined) 0.8842   | 0.9645      | 0.8745     | 0.8351   | 0.9704   | 0.8353         | 0.7166      | 0.7189        | 0.4403     | 0.8416
loss 1 (0.5)    0.0601   | 0.0665      | 0.0598     | 0.0473   | 0.1226   | 0.0638         | 0.0583      | 0.0577        | 0.0489     | 0.0621
loss 2 (0.1)    0.0658   | 0.0636      | 0.0658     | 0.0507   | 0.0949   | 0.0652         | 0.0647      | 0.0652        | 0.0584     | 0.0660
loss 3A (1)     0.0518   | 0.0519      | 0.0514     | 0.0527   | 0.0324   | 0.0589         | 0.0525      | 0.0478        | 0.0498     | 0.0500
loss 3B (1)     0.1796   | 0.1793      | 0.1794     | 0.1840   | 0.0947   | 0.1795         | 0.1947      | 0.1839        | 0.1754     | 0.1801
loss 3C (1)     0.1623   | 0.1614      | 0.1618     | 0.1632   | 0.1381   | 0.1616         | 0.1744      | 0.2392        | 0.1602     | 0.1636
loss 3D (10)    0.0433   | 0.0514      | 0.0437     | 0.0371   | 0.0592   | 0.0439         | 0.0433      | 0.0432        | 0.0598     | 0.0435
loss 3E (1)     0.0445   | 0.0450      | 0.0448     | 0.0408   | 0.0428   | 0.0427         | 0.0440      | 0.0424        | 0.0441     | 0.0514

tab5 losses of the ablation test

The learning on the isolated dataset with VGG16-GRU performs best among the reference groups, and the losses of the geometries demonstrate their mutual effects on each other. The 2D top-view dataset performs only slightly worse in heights/numbers but better in the shape loss. In tab6 we can also see the constraints on shape, corner angle, and position in the different loss-function ablations.

tab6 spatial building predictions of ablation test

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after T-SNE dimension reduction, representing learning capacities on reference images and geometries that cannot simply be judged by loss values. The isolated dataset still shows the best distribution visually.
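These plots can be produced with a standard t-SNE projection, e.g. via scikit-learn (an assumption about the exact tooling used here):

```python
# Project DC1/DC2 activations (n_samples x n_features) to 2D for plotting.
from sklearn.manifold import TSNE

def tsne_embed(features, perplexity=30):
    return TSNE(n_components=2, perplexity=perplexity).fit_transform(features)
```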

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after encoding as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings, compared to the denser and taller ones in the larger parcel.
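Each such test reduces to linearly blending two encoded latent vectors before decoding, e.g.:

```python
# Linear interpolation between two latent feature vectors (sketch).
import numpy as np

def interpolate_latent(z_a, z_b, steps=5):
    return [z_a + (z_b - z_a) * t for t in np.linspace(0.0, 1.0, steps)]
```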


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights with the isolated dataset and a better size/number of buildings with the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution can serve two functions integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like csv file including the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR, etc.), users can run a new training and validating process on their own computers. By using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries, rich in information, as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms; a conversion sketch follows fig32. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry as a shape grammar does, separately.

fig32 How weights learned from images work to generate geometries, as the rule-based pipeline does
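The conversion mentioned above is sketched here; the csv column names (coordinates, height) are assumptions about the output schema:

```python
# Convert predicted building rows into GeoJSON features whose "height"
# property carries the extrusion (z) value for 3D visualization in GIS.
import csv, json

def csv_to_geojson(csv_path, out_path):
    features = []
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            coords = json.loads(row["coordinates"])   # 16 lon/lat pairs
            features.append({
                "type": "Feature",
                "geometry": {"type": "Polygon", "coordinates": [coords]},
                "properties": {"height": float(row["height"])},
            })
    with open(out_path, "w") as f:
        json.dump({"type": "FeatureCollection", "features": features}, f)
```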


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved by the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language into a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when involving a large number of data categories together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data, directly producing a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time time-consumption

As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can operate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from those provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of manual urban design or a rule-based system. It not only saves time but also makes decision-making more fluent, via an instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the cost of licenses and of experimentation. This study also compares the differences between existing 3D reconstruction approaches with respect to urban design usage, and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a great deal of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future works.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. First, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create massive training datasets with various styles, feeding these future elements. Second, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof-type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]); it can contribute to predicting the relationships between buildings and auxiliaries, which means the possibility of predicting detailed building outputs. The Mask R-CNN can also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website: https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI: https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page: https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) YOLO: https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520


Page 5: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN DESIGN CONCEPTS

by

Tuo Sun

Submitted to the Department of Architecture and the Department of Electrical Engineering and Computer Science

on January 16 2020 in Partial Fulfillment of the Requirements for the Degrees of

Master of Science in Architecture Studies and Master of Science in Electrical Engineering and Computer Science

ABSTRACT In the decision-making process of urban development projects decision-makers and urban designers work collectively as a) decision-makers make decisions of urban development based on the evaluation of urban morphology b) urban designers visualize design decisions given by decision-makers with 3D urban morphology and produce development proposals after certain rounds of iteration A proposal involves designing 3D urban morphology aka the collection of building typologies (parcel level) on a specific site Due to the high costs of visualizing massive building geometries manually the current decision-making workflow does not allow adequate iteration before the implementation of the proposal To reduce the cost of manual modeling work by designers rule-based approaches (like ESRIrsquos CityEngine) generate 3D urban morphology from spatial geometries via rules However the limitations of creating rules are the bottleneck of popularizing rule-based approaches in professional practice This research explores using machine learning pipelines to synthesize novel 3D morphology from urban design precedents intuitively solving the above bottleneck The resulting pipeline learns spatial data and 2D rendering images for two major parts 1) to extract 2D building typology images from an aerial rendering image of urban morphology and 2) to predict spatial building data from an extracted image and a spatial parcel geometry This pipeline promotes the process of creating rules allowing both urban designers to create visualization and decision-makers to evaluate urban development intuitively Thesis Supervisor Takehiko Nagakura Title Associate Professor of Design and Computation Thesis Supervisor Terry Knight Title William and Emma Rogers Professor Professor of Design and Computation

5

6

ACKNOWLEDGMENTS The foundational motivation of this study comes from my long-term study and experience with rule-based generative design (eg CityEngine Grasshopper) in the Now Institute Southern California Associate of Governments (SCAG) China Academy of Urban Planning and Design (CAUPD) and MIT Special thanks to my committees who bring great help and patience to this study Professor Takehiko Nagekura Professor Terry Knight and Professor Justin Solomon plus my friends Wenzhe Peng and Yuehan Wang Thanks for general computational approaches provided by 3D-R2N2 Pixel2Mesh and Mask R-CNN This study implements their theories and extracts feasible methods for urban design machine learning tasks The idea of how urban morphology can be defined computationally is inspired during precious communications with specialists in various fields such as urban design urban planning computational design and computer graphics

7

8

TABLE OF CONTENTS

1 CONTEXT 11 Background helliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(11) 12 Problemshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(12) 13 Possible approaches to improvehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(13)

2 HYPOTHESIShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(16)

3 PRECEDENTS

31 Collective decision-making toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(17) 32 Generative design toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18) 33 Related machine learning workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18)

4 SOLUTION AND METHODOLOGY

41 Solutionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(20) 42 Datasethelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(21) 43 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(22) 44 3D voxelmesh reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(23) 45 Spatial data reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(29)

5 RESULT AND EVALUATION

51 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 52 3D reconstruction resultshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 53 Other evaluationshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(40)

6 DELIVERABLE AND CONTRIBUTION

61 Deliverablehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(49) 62 Contributionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(50) 63 Future Workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(51)

7APPENDICEShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(52)

8BIBLIOGRAPHYhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(55)

9

10

1 CONTEXT 11 Background City planning committees which include elected officials (or their designees) specialists and resident representatives evaluate urban design proposals (made by urban designers) of development projects in specific locations An urban design proposal involves a 3D urban morphology (the combination of 3D building envelopes on blocks) to visualize spatial demographic data (eg residents jobs transportation) and building typology (on parcels) A qualified urban proposal requires multiple stakeholders in the city planning committee to negotiate their individual ideas such as population strategy landfill context and building design Urban designers collect and translate these ideas into a 3D urban morphology and visualize them Building a 3D urban morphology traditionally is time-consuming for urban designers due to the heavy manual modeling workload of massive building geometries Therefore itrsquos hard to execute enough rounds of iterations before a proposal truly meets all complicated requirements of the city planning committee To avoid the manual modeling work of deploying each building typology to every parcel of a site some rule-based approaches (such as ESRIrsquos CityEngine [1]) generate 3D urban morphology from a 2D site via rules The generative rule is programmed with built-in functions translating design language to a script involving parameters of building and zoning code When the rule performs well a rule-based approach can live up to its advertising However how to make creating rules intuitive for designers who donrsquot have coding skills is the bottleneck of rule-based systems with the result that designers in the real world still avoid using them for making 3D urban morphology

fig1 Some built-in generative functions provided by ESRI CityEngine

11

12 The problems of rule-based systems This study aims to improve the rule-creating process in terms of its limitations of creativity data-driven capacity first-time time-consumption and coding skill requirement 121 Creativity Although building masses can be generated by rules they are hard to be called ldquocreativerdquo because of the limited built-in functions (fig1) provided by rule-based systems (eg setback subdivision LOU-shape) It is difficult to simulate ideal urban morphologies made by master urban designers as well as make detailed adjustments in common ones (fig2) Both decision-makers and urban designers tend to avoid producing homogeneous morphologies in their design making them resistant to rule-based systems An acceptable approach should be able to evaluate novel urban morphologies Creativity is important for market applications

fig2 Urban morphologies in the existing urban context ideal urban design common urban design and

generative design 122 Data-driven capacity Apart from making novel forms rule-based systems cannot utilize all data produced by urban studies such as traffic congestion job-resident balance and population growth Utilizing spatial data within an urban design proposal is called ldquoGeoDesignrdquo offered by ESRI the same software producer of CityEngine Ideally this new decision-making workflow allows multiple users (who can either be decision-makers or urban designers) to contribute their data into a weighted overlapping system for visualizing It only achieves limited effects because of the complicated relationship between building typology and data For instance (fig3) a tower building with a large ground square shares the same Floor Area Ratio (FAR) with a low warehouse occupying most of the parcel A machine learning-empowered approach can play a part here matching the

12

empirical outcome of building typology instead of organizing the role and factor of each data in an ambiguous formula

fig3 Building typologies with similar FARs

123 First-time time-consumption As a part of the discussions above rule-based approaches require a large decision tree to simulate the shape grammar of a building typology The cost of using rule-based approaches is always larger than that of manual modeling work the first time when designers use it in a proposal because they need an excessive amount of time constructing specific rules for the project Urban designers therefore would not like to embrace rule-based approaches as a part of their design process before they can enjoy the saving of time from applying a rule recurrently 124 Coding skill requirement Creating rules for building typologies and spatial data requires an understanding of both programming and design Since only a few urban designers have such a combination of skills this requirement impedes its broader application Also as the clients and end-users of urban development planning committee members should be able to provide many aspects that are ignored by urban designers which means a user-friendly approach with the features of creating rules with low skill requirements and real-time visualization will bring more opportunities to larger groups of participants and facilitate decision-making iterations 13 Possible approaches to improve To solve the bottleneck of creating rules we need help from machine learning techniques similar to how the bottleneck of the expert system (rule-based system) had been broken through in the history of artificial intelligence In recent years 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies replacing some algorithms in classic computational theories It runs by minimizing the loss between 3D predicted objects from 2D images and their ground truth 3D models The multiple ways to realize 3D reconstruction from single images in the format of voxel [234] point cloud [56] and mesh [67] inspired me to explore the replacement of a) creating rules by built-in functions with b) creating rules by image references which is much more intuitive

13

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN). Given the encoded input, a set of proposed 3D Convolutional LSTM (3D-LSTM) units either update their cell states or retain them by closing the input gate.


At time step t, when a new input is received, the operation of an LSTM unit can be expressed as in formula2, where i_t, o_t, and f_t refer to the input gate, the output gate, and the forget gate respectively, and c_t and h_t refer to the memory cell and the hidden state respectively.

formula2 3D-LSTM kernel forget and update gate [2]
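For reference, a standard convolutional LSTM gate formulation reads as below; note this is a hedged reconstruction, and the exact 3D-R2N2 variant in [2] omits the output gate and sets $h_t = \tanh(c_t)$. Here $*$ denotes a convolution and $\odot$ the element-wise product:

$$
\begin{aligned}
i_t &= \sigma\left(W_i * x_t + U_i * h_{t-1} + b_i\right)\\
f_t &= \sigma\left(W_f * x_t + U_f * h_{t-1} + b_f\right)\\
o_t &= \sigma\left(W_o * x_t + U_o * h_{t-1} + b_o\right)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_c * x_t + U_c * h_{t-1} + b_c\right)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$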

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability of the existence of each voxel cell, computed with a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

442 3D mesh reconstruction
A voxel model is difficult to clean and simplify into a feasible model for urban design modeling software. A voxel model also stores data inside closed objects, which is inefficient and becomes increasingly large after scaling. Therefore the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy in the training process. Since a building typology is influenced by properties like height limit or land-use, additional bitmaps of these properties can help our network during training. To distribute the computational workload, two machine learning neural networks are proposed: a) one translating a 2D image of building typology to a top-view location map, and b) one performing 3D mesh reconstruction.

4421 Data structure of training/validating dataset
At the parcel level, an aerial image is cropped from its parent block-level image and scaled to a fixed resolution. Additional information like height limit and land-use is stored as gray-scale images on the 2D top view. Ground truth labels are 3D point clouds with normals (6 dimensions in total) calculated from mesh models (fig10).


fig10 examples of the training/validating dataset

4422 Network A: translating a 2D image of building typology to a top-view location map
Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of a target parcel. As mesh reconstruction does not preserve the location of 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by 24 surrounding cameras and cropped by the bounding box of this parcel; N, the number of views per parcel, is randomly picked in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing one feature vector in the latent space per view. The feature array is merged into a single column vector via Gated Recurrent Units (GRU).

formula3 the concatenated mid feature vector

Thus we get one feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula3) stores the feature of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g. the height-limit bitmap). The following parts of the network are the decoder and generator of a classic GAN. In each iteration (formula4) we have the MSE loss of the 2D height-map prediction, the least-squares reconstruction loss of a reconstructed rendering image (compared with all angles of renderings of this parcel, returning the minimum one), and the MSE losses of the additional constraints. The final combined loss is optimized by Adam with a learning rate of 0.001.

formula4 combined loss
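A hedged TensorFlow sketch of this combined loss follows; the weights and tensor shapes are illustrative, not the thesis code:

```python
import tensorflow as tf

def combined_gan_loss(hmap_true, hmap_pred, render_views, render_pred,
                      cons_true, cons_pred, w=(1.0, 1.0, 1.0)):
    """Sketch of formula4: MSE height-map loss, least-squares reconstruction
    loss against the closest of the parcel's rendered views, and an MSE loss
    on the additional constraint bitmaps."""
    l_h = tf.reduce_mean(tf.square(hmap_true - hmap_pred))
    # render_views: [B, V, H, W, C]; render_pred: [B, H, W, C]
    per_view = tf.reduce_mean(
        tf.square(render_views - render_pred[:, tf.newaxis]), axis=[2, 3, 4])
    l_r = tf.reduce_mean(tf.reduce_min(per_view, axis=1))  # best-matching view
    l_c = tf.reduce_mean(tf.square(cons_true - cons_pred))
    return w[0] * l_h + w[1] * l_r + w[2] * l_c
```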

4423 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typology with their ground truth 3D models (stored as point clouds with normals) via three losses: the Chamfer distance loss, the normal loss, and the Laplacian regularization.


In Pixel2Mesh, the Chamfer loss constrains the location of mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.
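Since the Chamfer distance recurs in several of our evaluations, a minimal TensorFlow sketch of the symmetric variant is shown here (illustrative, not the thesis code):

```python
import tensorflow as tf

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p: [B, N, 3] and q: [B, M, 3]."""
    # Pairwise squared distances between every point pair: [B, N, M]
    diff = tf.expand_dims(p, 2) - tf.expand_dims(q, 1)
    d = tf.reduce_sum(tf.square(diff), axis=-1)
    # For each point, the distance to its nearest neighbor in the other set
    p_to_q = tf.reduce_min(d, axis=2)   # [B, N]
    q_to_p = tf.reduce_min(d, axis=1)   # [B, M]
    return tf.reduce_mean(p_to_q, axis=1) + tf.reduce_mean(q_to_p, axis=1)
```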

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are placed by the location maps predicted by Network A.

45 Spatial data prediction
In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding unnecessary computation on image and mesh data. In this approach we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

451 Data structure of training/validating dataset
The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential parcels with two buildings occupy the most), we scale it by land-use and number factors to 10216 samples (tab5). This prevents our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from 24 aerial renderings of building typologies via their bounding boxes. Additionally, an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information. Spatial parcel data are csv files with geometries and properties. After interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. These fields form the csv table of each spatial parcel data input.


The ground-truth label is the spatial building data corresponding to each parcel. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. For each csv table of spatial building data, the property stores the number of buildings in the parcel, ten height values, and ten geometry slots. The table thus includes 331 values (1 + 10 + 10 x 16 x 2).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
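Since fig14 and fig15 present this pseudo code as figures, a hedged Python equivalent is sketched here (using Shapely and NumPy; the tolerance, vertex count, and smoothing weight are illustrative):

```python
import numpy as np
from shapely.geometry import Polygon

def resample_ring(poly: Polygon, n=16, tol=1.0):
    """fig14 sketch: simplify a parcel polygon, then interpolate n
    evenly spaced vertices along its exterior ring."""
    ring = poly.simplify(tol).exterior           # Douglas-Peucker simplification
    steps = np.linspace(0, ring.length, n, endpoint=False)
    pts = [ring.interpolate(s) for s in steps]   # evenly spaced on the perimeter
    return np.array([[p.x, p.y] for p in pts])   # [n, 2] (long, lat)

def smooth_ring(ring, alpha=0.5):
    """fig15 sketch: smooth a closed ring by blending each vertex with the
    midpoint of its two neighbors, preserving the vertex count."""
    prev_pts = np.roll(ring, 1, axis=0)
    next_pts = np.roll(ring, -1, axis=0)
    return alpha * ring + (1 - alpha) * 0.5 * (prev_pts + next_pts)
```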


Before feeding the dataset into the network, we need to scale the data from the urban scale to a [-1, 1] or [0, 1] interval that can be learned by neural networks. After analyzing the dataset, we filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices), spreading the coordinate values across the interval. A sketch of this normalization follows the table below.

Parcel data:

| extent | -9980~9994 | -9988~9989 | 396~34567 | 83~761 | 0~100 |
| Output extent | -1~1 | -1~1 | 0~1 | 0~1 | 0~1 |
| Preserve | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

Building data:

| extent | -9928~9828 | -9934~9982 | 1~9 | 1~985 |
| Output extent | -1~1 | -1~1 | 0~1 | 0~1 |
| Preserve | 1.0 | 1.0 | 1.0 | 1.0 |

tab6 The conversion of the dataset (from the entire training/validating dataset)
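As described above, here is a minimal sketch of the per-parcel normalization; the array shapes and the property-extent argument are illustrative assumptions:

```python
import numpy as np

def normalize_parcel(parcel_xy, building_xy, props, prop_extent):
    """Normalize geometries by the max extent of the parcel boundary
    (coordinates -> [-1, 1]) and scale property values to [0, 1]."""
    center = parcel_xy.mean(axis=0)
    half_extent = np.abs(parcel_xy - center).max()     # max extent of the parcel bbox
    parcel_n = (parcel_xy - center) / half_extent
    building_n = (building_xy - center) / half_extent  # same frame as the parcel
    lo, hi = prop_extent                               # e.g. height limit 0..100
    props_n = (np.asarray(props, dtype=float) - lo) / (hi - lo)
    return parcel_n, building_n, props_n
```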


452 Network
The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). First, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units) as used in Pixel2Mesh. This network (with ResNet as in 3D-R2N2, or VGG16, as the base model) convolutes the input images (N views are randomly picked per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into one column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Second, we convolute the input parcel geometry and fully-connect the parcel property data into two feature vectors (FC2, FC3), and densify them, along with a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, building heights, and building geometry as three separate outputs, with a convolution layer only in the geometry prediction (a hedged sketch follows fig16).

fig16 geospatial prediction network structure
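The following Keras sketch illustrates this three-input, three-output structure; the layer widths, the property count, and the fixed view count are illustrative assumptions, and the real model follows fig16:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_VIEWS, IMG = 5, 127                                 # illustrative sizes
vgg = tf.keras.applications.VGG16(include_top=False, pooling="avg",
                                  input_shape=(IMG, IMG, 3))

img_seq = layers.Input((N_VIEWS, IMG, IMG, 3))        # multi-view renderings
parcel_geo = layers.Input((16, 2))                    # 16 parcel vertices (long, lat)
parcel_prop = layers.Input((4,))                      # area, perimeter, limit, land-use

fc1_r = layers.TimeDistributed(vgg)(img_seq)          # per-view features (FC1_R)
fc1 = layers.GRU(1024)(fc1_r)                         # merged image feature (FC1)
fc2 = layers.Flatten()(layers.Conv1D(32, 3, padding="same")(parcel_geo))
fc3 = layers.Dense(64, activation="relu")(parcel_prop)

dc1 = layers.Dense(512, activation="relu")(fc1)       # dense features DC1..DC3
dc2 = layers.Dense(256, activation="relu")(fc2)
dc3 = layers.Dense(256, activation="relu")(fc3)
mid = layers.Concatenate()([dc1, dc2, dc3])           # latent feature (MID)

n_buildings = layers.Dense(1, name="count")(mid)      # number of buildings
heights = layers.Dense(10, name="heights")(mid)       # ten height slots
geo = layers.Conv2D(2, 3, padding="same", name="geometry")(
    layers.Reshape((10, 16, 2))(layers.Dense(10 * 16 * 2)(mid)))  # ten footprints

model = Model([img_seq, parcel_geo, parcel_prop], [n_buildings, heights, geo])
```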


formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the differences between prediction and ground-truth label in terms of the number of buildings, the building heights, and the vertices of building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel. For instance, for a two-building parcel we only calculate the first two of the ten height values in Loss2 and the first 64 coordinate values in Loss3 (a masked-loss sketch follows the list below). Loss3 considers several parts to let the neural network learn the geometries, including the center, relative vertices, unit tangent vector, curvature, edge length, and so on. They constrain different geometry characteristics as follows (fig17):

- Center: the location of building footprints
- Relative vertices: the shape and size of each geometry
- Unit tangent vector: the rotation angle of each geometry
- Discrete curvature: the corners of each geometry
- Edge length: the edge length of each geometry
- Extent x and y: the horizontal and vertical scale of building footprints
- Maximum discrete curvature: the maximum angle of corners
- Number of corners: the number of corners of each geometry
- Normalized coordinates: the shape and size of each geometry
- Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction
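As referenced above, a hedged TensorFlow sketch of masking the unavailable slots (here for Loss2; Loss3 masks coordinates the same way, e.g. 2 x 16 x 2 = 64 values for a two-building parcel):

```python
import tensorflow as tf

def masked_height_loss(y_true, y_pred, n_buildings):
    """Loss2 sketch: MSE over only the first n height slots of each parcel;
    slots beyond the building count are masked out."""
    slots = tf.range(10, dtype=tf.float32)[tf.newaxis, :]                 # [1, 10]
    mask = tf.cast(slots < tf.cast(n_buildings, tf.float32)[:, tf.newaxis],
                   tf.float32)                                            # [B, 10]
    sq = tf.square(y_true - y_pred) * mask             # zero out unused slots
    return tf.reduce_sum(sq, axis=1) / tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)
```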


The final combined loss weights Loss3 more heavily, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for better loss organization. Thus all losses are in the range of 0 to 1, and they are balanced by weighting factors that follow the factors in Pixel2Mesh, adjusted by experimental results (see the next section).

[formula6, typeset as equations in the original, defines each loss term with its weight: the number of buildings; the ten building heights; center coordinates; absolute coordinates; relative coordinates; normalized coordinates; rolled vertices; batch-normalized coordinates; unit tangent vector; discrete curvature; edge length; extent x and y; maximum discrete curvature; the number of corners; and the combined loss.]

formula6 Loss functions


5 RESULT AND EVALUATION
As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution for urban design usage only; hence its evaluations are tested by modifying design elements.

51 Extracting building typologies
The first Mask R-CNN model is trained for 80 epochs and achieves a 0.11 bounding-box loss and 0.41 mask loss on the training set, and a 0.15 bounding-box loss and 0.40 mask loss on the validating set (fig18). As we can see, the network performs well in bounding-box prediction but not as well in mask prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental results
As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output is an ambiguous probability model that cannot be cleaned or simplified without introducing the bias of the cleaning algorithm. Voxel models also demand increasingly heavy storage and computation as their scale grows, analogous to the size comparison between vectorized and rasterized files. Moreover, voxel models are rare in urban design with respect to their editability. However, we can still see the image-learning capacity in the different results predicted from high-rise and low-rise reference images.


Similarly to the first approach, adjusting the current mesh construction cannot produce acceptable urban morphology models. As we can see, the network intended to predict the location map generates similar results from various inputs. Furthermore, it cannot predict the rotation angle of building footprints well, yielding unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most cases of urban design models. Linking these two networks (location-map prediction and 3D mesh reconstruction) cannot easily avoid the bias of the shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

fig20 Outputs from three approaches (panels: Voxel, Mesh, Spatial Data)

53 Other evaluations
Recalling the combined loss function, we have seven factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, the factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

We also test the ablation effect of the various loss functions (see tab5 below). The baseline test uses VGG16-GRU with all loss functions, training 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.

| loss (weight) | Baseline | 3D-R2N2-GRU | VGG16-LSTM | isolated dataset | Top-view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length |
|---|---|---|---|---|---|---|---|---|---|---|
| combined loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416 |
| loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621 |
| loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.066 |
| loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.05 |
| loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.184 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801 |
| loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636 |
| loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435 |
| loss 3E (1) | 0.0445 | 0.045 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.044 | 0.0424 | 0.0441 | 0.0514 |

tab5 losses of the ablation test (the last five columns drop the named loss function)

The learning on the isolated dataset with VGG16-GRU shows the best results among the reference groups, and the losses of the geometries demonstrate their mutual effects on each other. The 2D top-view dataset shows only slightly worse results in heights/numbers but better shape losses.


In tab6 we can also see the constraints of shape, corner angle, and position under the different ablations of the loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distributions of the DC1/DC2 layers after t-SNE dimension reduction, representing the learning capacities for reference images and geometries, which cannot simply be determined by loss values. The isolated dataset still shows the best distribution visually.
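A hedged sketch of producing such a plot with scikit-learn (the feature file name is hypothetical; the exported array holds the DC1 activations):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# dc1: [n_samples, feature_dim] activations of the DC1 layer, exported
# from the trained model (hypothetical file; see fig22, fig24, fig25)
dc1 = np.load("dc1_features.npy")
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(dc1)
plt.scatter(emb[:, 0], emb[:, 1], s=4)
plt.title("T-SNE mapping of reference image features (DC1)")
plt.show()
```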

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29, we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated on the encoded latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties); see the sketch below. In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate predictions between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. The smaller parcel also predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.
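As referenced above, the interpolation itself is a simple linear blend in latent space; the encoder/decoder names in the usage comment are hypothetical:

```python
import numpy as np

def interpolate_latent(z_a, z_b, steps=5):
    """Linear interpolation between two encoded feature vectors
    (e.g. 1000-d image features or 300-d parcel features)."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1 - t) * z_a + t * z_b for t in ts]

# Hypothetical usage: encode two references, then decode each blend
# z_a, z_b = encoder(img_a), encoder(img_b)
# predictions = [decoder(z) for z in interpolate_latent(z_a, z_b)]
```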


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and a better size/number of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION
61 Deliverable
After training and optimization, the final solution can serve two functions integrating rule-based approaches into the decision-making process of urban development: a) create a customized "rule" intuitively, and b) apply a "rule" and obtain a GeoJSON-like csv file including the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g. height, height limit, FAR), users can run a new training and validating process on their own computers. Using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (e.g. one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g. shapefile, topojson, geojson) for most GIS platforms; a conversion sketch follows fig32. By simply reading the z-value (extrusion height), the 3D urban morphology can be visualized on most platforms. In other words (fig32), the network applies a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry as a shape grammar does, separately.

fig32 How weights learned from images generate geometries, as the rule-based pipeline does
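As referenced above, the csv-to-GeoJSON conversion can be a few lines of GeoPandas; the column layout here is a hypothetical example, not the thesis schema:

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical column layout: one row per predicted building, with
# 16 vertex pairs (x0, y0 ... x15, y15) and a z-value (extrusion height)
df = pd.read_csv("predicted_buildings.csv")
polys = [Polygon(zip(row[[f"x{i}" for i in range(16)]],
                     row[[f"y{i}" for i in range(16)]]))
         for _, row in df.iterrows()]
gdf = gpd.GeoDataFrame(df[["height"]], geometry=polys, crs="EPSG:4326")
gdf.to_file("morphology.geojson", driver="GeoJSON")   # or an ESRI Shapefile
```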


62 Contribution
Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). This thesis thereby addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also emerge from the study.

621 Creativity
Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

622 Data-driven capacity
Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g. FAR, coverage, height limit). Our approach does not require decision trees to handle and combine multiple sources of data; it directly produces a matching rule from given data to an outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

623 First-time time-consumption
As discussed above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It erodes the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the very beginning of visualizing urban morphologies.

624 Skill requirement
As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


625 Other contributions
The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of manual urban design or a rule-based system. It not only saves time but also makes decision-making more fluent, with instant comparisons of two strategies or accurate translations of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and TensorFlow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the cost of licenses and of experimentation. This study also compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental reference for other related studies. The final solution also saves a lot of computational energy when we compare batch sizes (the number of parcels calculated per iteration): on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future work.

63 Future works
With the fundamental pipeline built in this study, we can imagine the future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more scripts of shape grammar, rule-based systems can create massive training datasets in various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as a part of the building properties (see the roof type learning project [19]). After recognizing these types from image references, the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than mere building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationships between buildings and auxiliaries, which opens the possibility of predicting detailed building outputs. The Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website: https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI: https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page: https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 6: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

6

ACKNOWLEDGMENTS The foundational motivation of this study comes from my long-term study and experience with rule-based generative design (eg CityEngine Grasshopper) in the Now Institute Southern California Associate of Governments (SCAG) China Academy of Urban Planning and Design (CAUPD) and MIT Special thanks to my committees who bring great help and patience to this study Professor Takehiko Nagekura Professor Terry Knight and Professor Justin Solomon plus my friends Wenzhe Peng and Yuehan Wang Thanks for general computational approaches provided by 3D-R2N2 Pixel2Mesh and Mask R-CNN This study implements their theories and extracts feasible methods for urban design machine learning tasks The idea of how urban morphology can be defined computationally is inspired during precious communications with specialists in various fields such as urban design urban planning computational design and computer graphics

7

8

TABLE OF CONTENTS

1 CONTEXT 11 Background helliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(11) 12 Problemshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(12) 13 Possible approaches to improvehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(13)

2 HYPOTHESIShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(16)

3 PRECEDENTS

31 Collective decision-making toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(17) 32 Generative design toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18) 33 Related machine learning workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18)

4 SOLUTION AND METHODOLOGY

41 Solutionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(20) 42 Datasethelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(21) 43 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(22) 44 3D voxelmesh reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(23) 45 Spatial data reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(29)

5 RESULT AND EVALUATION

51 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 52 3D reconstruction resultshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 53 Other evaluationshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(40)

6 DELIVERABLE AND CONTRIBUTION

61 Deliverablehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(49) 62 Contributionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(50) 63 Future Workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(51)

7APPENDICEShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(52)

8BIBLIOGRAPHYhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(55)

9

10

1 CONTEXT 11 Background City planning committees which include elected officials (or their designees) specialists and resident representatives evaluate urban design proposals (made by urban designers) of development projects in specific locations An urban design proposal involves a 3D urban morphology (the combination of 3D building envelopes on blocks) to visualize spatial demographic data (eg residents jobs transportation) and building typology (on parcels) A qualified urban proposal requires multiple stakeholders in the city planning committee to negotiate their individual ideas such as population strategy landfill context and building design Urban designers collect and translate these ideas into a 3D urban morphology and visualize them Building a 3D urban morphology traditionally is time-consuming for urban designers due to the heavy manual modeling workload of massive building geometries Therefore itrsquos hard to execute enough rounds of iterations before a proposal truly meets all complicated requirements of the city planning committee To avoid the manual modeling work of deploying each building typology to every parcel of a site some rule-based approaches (such as ESRIrsquos CityEngine [1]) generate 3D urban morphology from a 2D site via rules The generative rule is programmed with built-in functions translating design language to a script involving parameters of building and zoning code When the rule performs well a rule-based approach can live up to its advertising However how to make creating rules intuitive for designers who donrsquot have coding skills is the bottleneck of rule-based systems with the result that designers in the real world still avoid using them for making 3D urban morphology

fig1 Some built-in generative functions provided by ESRI CityEngine

11

12 The problems of rule-based systems This study aims to improve the rule-creating process in terms of its limitations of creativity data-driven capacity first-time time-consumption and coding skill requirement 121 Creativity Although building masses can be generated by rules they are hard to be called ldquocreativerdquo because of the limited built-in functions (fig1) provided by rule-based systems (eg setback subdivision LOU-shape) It is difficult to simulate ideal urban morphologies made by master urban designers as well as make detailed adjustments in common ones (fig2) Both decision-makers and urban designers tend to avoid producing homogeneous morphologies in their design making them resistant to rule-based systems An acceptable approach should be able to evaluate novel urban morphologies Creativity is important for market applications

fig2 Urban morphologies in the existing urban context ideal urban design common urban design and

generative design 122 Data-driven capacity Apart from making novel forms rule-based systems cannot utilize all data produced by urban studies such as traffic congestion job-resident balance and population growth Utilizing spatial data within an urban design proposal is called ldquoGeoDesignrdquo offered by ESRI the same software producer of CityEngine Ideally this new decision-making workflow allows multiple users (who can either be decision-makers or urban designers) to contribute their data into a weighted overlapping system for visualizing It only achieves limited effects because of the complicated relationship between building typology and data For instance (fig3) a tower building with a large ground square shares the same Floor Area Ratio (FAR) with a low warehouse occupying most of the parcel A machine learning-empowered approach can play a part here matching the

12

empirical outcome of building typology instead of organizing the role and factor of each data in an ambiguous formula

fig3 Building typologies with similar FARs

123 First-time time-consumption As a part of the discussions above rule-based approaches require a large decision tree to simulate the shape grammar of a building typology The cost of using rule-based approaches is always larger than that of manual modeling work the first time when designers use it in a proposal because they need an excessive amount of time constructing specific rules for the project Urban designers therefore would not like to embrace rule-based approaches as a part of their design process before they can enjoy the saving of time from applying a rule recurrently 124 Coding skill requirement Creating rules for building typologies and spatial data requires an understanding of both programming and design Since only a few urban designers have such a combination of skills this requirement impedes its broader application Also as the clients and end-users of urban development planning committee members should be able to provide many aspects that are ignored by urban designers which means a user-friendly approach with the features of creating rules with low skill requirements and real-time visualization will bring more opportunities to larger groups of participants and facilitate decision-making iterations 13 Possible approaches to improve To solve the bottleneck of creating rules we need help from machine learning techniques similar to how the bottleneck of the expert system (rule-based system) had been broken through in the history of artificial intelligence In recent years 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies replacing some algorithms in classic computational theories It runs by minimizing the loss between 3D predicted objects from 2D images and their ground truth 3D models The multiple ways to realize 3D reconstruction from single images in the format of voxel [234] point cloud [56] and mesh [67] inspired me to explore the replacement of a) creating rules by built-in functions with b) creating rules by image references which is much more intuitive

13

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

| | Voxel | Point cloud | Mesh | NURBS | GeoJSON-like |
|---|---|---|---|---|---|
| Data loading | N x N x N x 1 (mass/void) | N x 3 (x, y, z) | N x (v1, v2, v3) | Degree, control pts, weights, params | 2D geometry (long/lat, N x 2) + property (with height or more info) |
| Reconstruction | from 2D pixels | from 2D pixels | Deform from a sphere/cube | Translate from mesh; detect shape grammar | Deform from 2D geometries |
| Evaluation | Logical is-or-not, Intersection over Union (IoU) | Chamfer distance (CD), Earth Mover's Distance (EMD) | Chamfer distance (CD), silhouette rendering, logical is-or-not | Chamfer distance (CD), silhouette rendering, logical is-or-not | Chamfer distance (CD), Earth Mover's Distance (EMD) |
| Software | Minecraft... | 3D scanning... | SketchUp, GIS... | Rhino... | GIS |
| ML project | 3D-R2N2, MarrNet... | PointNet, PointNet++... | Pixel2Mesh, Neural Renderer | | |

tab4 The comparison of 3D geometry formats

4.4.1 3D voxel reconstruction

The first attempt implements the 3D-R2N2 approach, which was originally designed to reconstruct ten categories of furniture from its training dataset in 3D. In our implementation of synthesizing 3D urban morphology with 3D-R2N2, the input and ground-truth data have been modified to feed 3D building models into the network, and the hyper-parameters of the 3D-R2N2 network have been adjusted for 3D building models.

4.4.1.1 Data structure of training/validating dataset

The training/validating set follows the ShapeNet structure, which contains two parts derived from 3D models: input rendering images and ground-truth labels (fig8).


fig8 Examples of the training/test dataset

The input aerial images are taken by 24 virtual cameras, the same ones used in the network for extracting building typologies. An ArcGIS Python script exports each block with the buildings on it to an ESRI Shapefile (shp). Then a CityEngine Python script exports every block with its 3D buildings to a mesh model (obj). Each ground-truth label is stored as voxel data, converted from a mesh model via Binvox into an nd-array. The entire dataset includes 177 models and 4240 rendered images, which are separated into training and validating datasets with a ratio of 0.8.

As a multi-view 3D reconstruction, a 3D-LSTM (Long-Short Term Memory) treats the multi-view images as time-sequential data, leading to a 5D input [view_id, batch_id, channel, width, height]. Inputs for each model are randomly picked from the 24 images a random number of times (in a range from 1 to 5). The picked images are randomly center cropped and randomly horizontally flipped to avoid overfitting during training. The labels as voxel data are constructed with five dimensions [batch_id, the channel of masks, x-axis, y-axis, z-axis]; the channels represent original or masked objects (entity true or false).
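A minimal sketch of assembling such a 5D batch is below; the helper names, the 127 x 127 crop size (the 3D-R2N2 default), and the use of torchvision transforms are assumptions for illustration, not the thesis script:

```python
# Sketch: build the 5D input [view_id, batch_id, channel, width, height]
# used by 3D-R2N2 -- per block model, pick a random number (1-5) of the 24
# renderings and apply crop/flip augmentation to each picked view.
import random
import torch
import torchvision.transforms as T

augment = T.Compose([
    T.RandomCrop(127),           # approximates the random center crop
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def make_batch(models):
    """models: list of lists of 24 PIL aerial renderings, one list per block."""
    n_views = random.randint(1, 5)
    views = []
    for _ in range(n_views):
        # one randomly picked, augmented view per model in the batch
        views.append(torch.stack([augment(random.choice(m)) for m in models]))
    return torch.stack(views)   # shape: (n_views, batch, 3, 127, 127)
```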

4.4.1.2 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D convolutional neural network (encoder), a 3D convolutional LSTM, and a 3D deconvolutional neural network (3D-DCNN). Given the encoded input, a set of proposed 3D convolutional LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate. At time step t, when a new input is received, the operation of an LSTM unit can be expressed as formula2; its terms refer to the input gate, the output gate, and the forget gate, and to the memory cell and the hidden state, respectively.

formula2 3D-LSTM kernel forget and update gate [2]
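The equation is an image in the original; for reference, the 3D-LSTM update as given in the 3D-R2N2 paper [2] is (note that this variant has no separate output gate, since $h_t = \tanh(s_t)$):

$$f_t = \sigma\!\big(W_f \mathcal{T}(x_t) + U_f * h_{t-1} + b_f\big)$$
$$i_t = \sigma\!\big(W_i \mathcal{T}(x_t) + U_i * h_{t-1} + b_i\big)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh\!\big(W_s \mathcal{T}(x_t) + U_s * h_{t-1} + b_s\big)$$
$$h_t = \tanh(s_t)$$

where $*$ denotes convolution, $\odot$ the element-wise product, and $\mathcal{T}(x_t)$ the encoded input.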

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability of the existence of each voxel cell, computed with a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

4.4.2 3D mesh reconstruction

A voxel model is difficult to clean and simplify into a feasible model for urban design modeling software. Also, a voxel model stores data inside closed objects, which is inefficient and becomes increasingly large after scaling. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy during training. Since a building typology is influenced by properties like height limit or land-use, additional bitmaps of these properties can help our network in the training process. To distribute the computational workload, two machine learning neural networks are proposed: a) translating a 2D image of a building typology to a top-view location map, and b) 3D mesh reconstruction.

4.4.2.1 Data structure of training/validating dataset

At the parcel level, an aerial image is cropped from its parent block-level image and scaled to a fixed resolution. Additional information like height limit and land-use is stored as gray-scale images on the 2D top view. Ground-truth labels are 3D point clouds with normals (six dimensions in total) calculated from mesh models (fig10).


fig10 Examples of the training/validating dataset

4.4.2.2 Network A: translating a 2D image of building typology to a top-view location map

Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of the target parcel. Since mesh reconstruction does not preserve the location of the 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by the 24 surrounding cameras and cropped by the bounding box of the parcel; N, the number of views per parcel, is chosen randomly in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space. The feature array is merged into a column vector via Gated Recurrent Units (GRU).

formula3 the concatenated mid feature vector

Thus, we get a feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula3) stores the feature of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g., the height-limit bitmap). The following parts of the network are the decoder and generator, as in a classic GAN. Per iteration (formula4), we compute the MSE loss of the 2D height-map prediction, the least-squares reconstruction loss of a reconstructed rendering image (compared with all angles of renderings of this parcel, returning the minimum), and the MSE losses of the additional constraints. The final combined loss is optimized by Adam with a learning rate of 0.001.

formula4 combined loss
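The original equation is an image; one plausible form consistent with the description above (the weights $\lambda$ and symbols are assumptions) is:

$$L = \lambda_{h}\,\mathrm{MSE}\big(\hat{H}, H\big) + \lambda_{r} \min_{v \in \{1,\dots,24\}} \big\lVert \hat{I} - I_{v} \big\rVert_{2}^{2} + \sum_{k} \lambda_{k}\,\mathrm{MSE}\big(\hat{C}_{k}, C_{k}\big)$$

where $\hat{H}$ is the predicted height map, $\hat{I}$ the reconstructed rendering, and $\hat{C}_{k}$ the reconstructed constraint bitmaps.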

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typologies with their ground-truth 3D models (stored as point clouds with normals) through the chamfer distance loss, normal loss, and Laplacian regularization. In Pixel2Mesh, the chamfer loss constrains the locations of mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are placed according to the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding unnecessary computation on images and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential parcels with two buildings occupy the most), we scale the samples by land-use and building-count factors to 10216 samples (tab5), as sketched below. This prevents our network from overfitting to the bias of the raw dataset.
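A hedged sketch of this rebalancing is below; the file path, column names, and per-group target are hypothetical, not the thesis script:

```python
# Sketch: rebalance the parcel dataset by (land-use type, building count) so
# over-represented categories (e.g. single-family residential with two
# buildings) stop dominating training. Under-represented groups are
# oversampled with replacement; over-represented ones are downsampled.
import pandas as pd

parcels = pd.read_csv("parcels.csv")   # hypothetical raw-data export
target_per_group = 200                 # assumed balancing target per group

groups = parcels.groupby(["land_use", "n_buildings"])
balanced = pd.concat(
    g.sample(n=target_per_group, replace=len(g) < target_per_group, random_state=0)
    for _, g in groups
)
print(len(balanced), "samples after rebalancing")
```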


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial renderings of building typologies via their bounding boxes. Additionally, in this solution, an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information. Spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15) each parcel geometry (a code sketch follows the figures below), each geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. The csv table of each spatial parcel data input thus stores these 16 x 2 geometry coordinates together with the property fields.

The ground-truth label is the spatial building data corresponding to the parcels. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. For each csv table of spatial building data, the property fields contain the number of buildings in the parcel, ten height values, and ten geometry records. The table thus includes 331 values (1 building count + 10 heights + 10 x 16 x 2 coordinates).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
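A hedged Python sketch of the two operations above is below; it is not the thesis pseudo code (shown as images in fig14 and fig15), and the smoothing kernel is an assumption:

```python
# Sketch: resample a parcel polygon to a fixed 16 vertices by arc-length
# interpolation along its boundary, then smooth with a simple moving-average
# kernel that preserves the vertex count.
from shapely.geometry import Polygon

def resample_polygon(poly: Polygon, n: int = 16):
    ring = poly.exterior
    # n points at equal arc-length spacing along the boundary
    return [ring.interpolate(i / n, normalized=True).coords[0] for i in range(n)]

def smooth(points, iterations: int = 1):
    n = len(points)
    for _ in range(iterations):
        points = [
            tuple(0.25 * points[i - 1][k] + 0.5 * points[i][k]
                  + 0.25 * points[(i + 1) % n][k] for k in (0, 1))
            for i in range(n)
        ]
    return points

vertices = smooth(resample_polygon(Polygon([(0, 0), (4, 0), (4, 3), (0, 3)])))
print(len(vertices))  # 16
```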

31

Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, which can be learned by neural networks. After analyzing the dataset, we can filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices), spreading the coordinate values across the interval.

Parcel fields:

| Field name | | | | | |
|---|---|---|---|---|---|
| extent | -9980~9994 | -9988~9989 | 396~34567 | 83~761 | 0~100 |
| Scale method | | | | | |
| Output extent | -1~1 | -1~1 | 0~1 | 0~1 | 0~1 |
| Preserve | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

Building fields:

| Field name | | | | |
|---|---|---|---|---|
| extent | -9928~9828 | -9934~9982 | 1~9 | 1~985 |
| Scale method | | | | |
| Output extent | -1~1 | -1~1 | 0~1 | 0~1 |
| Preserve | 1.0 | 1.0 | 1.0 | 1.0 |

tab6 The conversion of the dataset (from the entire training/validating dataset)


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). Firstly, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (ResNet as in 3D-R2N2, or VGG16, as the base model) convolves the input images (N views are randomly picked per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Secondly, we convolve the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3), and densely connect them, as well as a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we obtain a feature vector (MID) in the latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, the building heights, and the building geometry as three separate outputs, with a convolution layer only in the geometry prediction. A hedged sketch follows below.
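The sketch below illustrates this three-input, three-output layout in Keras; all layer sizes, input shapes, and variable names are assumptions for illustration, since the text only partially specifies the thesis settings:

```python
# Sketch: image features via CNN + GRU over views, parcel geometry via 1D
# convolution, parcel properties via dense layers; concatenated into a latent
# MID vector that three heads decode into count, heights, and footprints.
import tensorflow as tf
from tensorflow.keras import layers, Model

views = layers.Input(shape=(None, 127, 127, 3))   # N views per parcel (assumed size)
parcel_geom = layers.Input(shape=(16, 2))          # 16 (long, lat) vertices
parcel_props = layers.Input(shape=(4,))            # area, perimeter, limit, land-use

cnn = tf.keras.applications.VGG16(include_top=False, pooling="avg")
fc1 = layers.GRU(1024)(layers.TimeDistributed(cnn)(views))   # merge view sequence
fc2 = layers.Flatten()(layers.Conv1D(32, 3, activation="relu")(parcel_geom))
fc3 = layers.Dense(64, activation="relu")(parcel_props)

mid = layers.concatenate([layers.Dense(512)(fc1),
                          layers.Dense(256)(fc2),
                          layers.Dense(128)(fc3)])            # latent MID vector

n_buildings = layers.Dense(1, name="count")(mid)
heights = layers.Dense(10, name="heights")(mid)
geoms = layers.Reshape((10, 16, 2))(layers.Dense(10 * 16 * 2)(mid))

model = Model([views, parcel_geom, parcel_props], [n_buildings, heights, geoms])
model.summary()
```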

fig16 geospatial prediction network structure


formula5 the concatenated mid feature vector
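The original formula is an image; the concatenation it describes is simply

$$\mathrm{MID} = \big[\, \mathrm{DC1} \;\Vert\; \mathrm{DC2} \;\Vert\; \mathrm{DC3} \,\big]$$

where $\Vert$ denotes vector concatenation.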

Three loss functions (formula6) calculate the difference between the prediction and the ground-truth label in terms of the number of buildings, the building heights, and the vertices of the building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel: for a two-building parcel, we only calculate the first two of the ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers several parts to let the neural network learn the geometries, including center, relative vertices, unit tangent vector, curvature, edge length, and so on. They constrain different geometry characteristics as follows (fig17):

- Center: the location of building footprints
- Relative vertices: the shape and size of each geometry
- Unit tangent vector: the rotation angle of each geometry
- Discrete curvature: the corners of each geometry
- Edge length: the edge length of each geometry
- Extent x and y: the horizontal and vertical scale of building footprints
- Maximum discrete curvature: the maximum angle of corners
- Number of corners: the number of corners of each geometry
- Normalized coordinates: the shape and size of each geometry
- Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction


The final combined loss weights Loss3 more heavily, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for better loss organization. Thus, all losses are in the range of 0 to 1, and they are balanced by weighting factors that follow the factors in Pixel2Mesh, adjusted by experimental results (see the next section). Formula6 defines one term for each of the following (the equations appear as images in the original):

- the number of buildings
- the 10 building heights
- center coordinates
- absolute coordinates
- relative coordinates
- normalized coordinates (with rolled vertices and batch normalization)
- unit tangent vector
- discrete curvature
- edge length
- extent x and y
- maximum discrete curvature
- the number of corners
- the combined loss

formula6 Loss functions
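As an illustration of the masking described above (the notation is assumed, since the individual equations are images in the original), the height loss for a parcel with $n$ buildings only counts the first $n$ of the 10 height slots:

$$\mathrm{Loss}_2 = \frac{1}{n} \sum_{i=1}^{n} \big(h_i - \hat{h}_i\big)^2$$

and the $\mathrm{Loss}_3$ terms similarly sum over only the first $32n$ coordinate values (16 vertices $\times$ 2 coordinates per building).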


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence, its evaluations are tested by modifying design elements.

5.1 Extracting building typologies

The first Mask R-CNN model is trained for 80 epochs and achieves a 0.11 bounding-box loss and 0.41 mask loss on the training set, and a 0.15 bounding-box loss and 0.40 mask loss on the validation set (fig18). As we can see, the network performs well in bounding-box prediction but not as well in mask prediction (fig19).

fig18 Training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results

As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output yields ambiguous probability models that cannot be cleaned or simplified without introducing the bias of the algorithm. Voxel models also require increasingly heavy storage and computation as their scale grows, analogous to the size comparison between vectorized and rasterized files. Moreover, voxel models are rare in urban design practice because of their poor editability. However, we can see the image-learning capacity from the different results predicted from high-rise and low-rise reference images. Similarly to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network intended to predict the location map generates similar results from various inputs. Furthermore, the network cannot predict the angle of building footprints well, producing unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most urban design use cases. Linking these two networks (location-map prediction and 3D mesh reconstruction) makes it hard to avoid the bias of a shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________


Spatial Data ______________________________________________________________________

fig20 Outputs from the three approaches

5.3 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, the factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

Also, we test the ablation effect of the various loss functions (see tab5). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustments of the multiple loss functions.

| loss (weight) | Baseline | 3D-R2N2-GRU | VGG16-LSTM | Isolated dataset | Top-view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length |
|---|---|---|---|---|---|---|---|---|---|---|
| loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416 |
| loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621 |
| loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.066 |
| loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.05 |
| loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.184 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801 |
| loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636 |
| loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435 |
| loss 3E (1) | 0.0445 | 0.045 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.044 | 0.0424 | 0.0441 | 0.0514 |

tab5 Losses of the ablation test

Learning on the isolated dataset with VGG16-GRU performs best among the reference groups, and the geometry losses demonstrate their effects on each other. The 2D top-view dataset performs only slightly worse in heights and building counts, but better in the shape loss. In tab6 we can also see the constraints of shape, corner angle, and position in the different ablations of the loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after t-SNE dimension reduction, representing learning capacities of reference images and geometries that cannot simply be judged from loss values. The isolated dataset still shows the best distribution visually.
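A minimal sketch of producing such a plot is below; the feature file name and dimensionality are hypothetical:

```python
# Sketch: project the encoded DC1 image features to 2D with t-SNE and plot
# them, as done for the latent-space figures.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

dc1 = np.load("dc1_features.npy")   # hypothetical (n_samples, 512) array
xy = TSNE(n_components=2, random_state=0).fit_transform(dc1)

plt.scatter(xy[:, 0], xy[:, 1], s=4)
plt.title("t-SNE of reference-image features (DC1)")
plt.show()
```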

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial to be able to manipulate the learning network to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after encoding as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties); a sketch follows below. In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in the single-family residential prediction and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and a better size and number of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution serves two functions for integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to obtain a GeoJSON-like csv file including the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR, etc.), users can run a new training and validating process on their own computers. By using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms; a conversion sketch follows fig32. By simply reading the z-value (extrusion height), the 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry as a shape grammar does, separately.

fig32 How weights from images work to generate geometries, as the rule-based pipeline does
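A hedged sketch of the csv-to-GeoJSON conversion is below; the file names and column layout are assumptions for illustration:

```python
# Sketch: convert the predicted csv of building geometries into GeoJSON for
# GIS platforms, keeping the height as a property for z-value extrusion.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

rows = pd.read_csv("predicted_buildings.csv")   # hypothetical output file
records = []
for _, r in rows.iterrows():
    # assumed layout: lon/lat columns per vertex plus a "height" column
    coords = list(zip(r.filter(like="lon"), r.filter(like="lat")))
    records.append({"geometry": Polygon(coords), "height": r["height"]})

gdf = gpd.GeoDataFrame(records, geometry="geometry", crs="EPSG:4326")
gdf.to_file("predicted_buildings.geojson", driver="GeoJSON")
```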


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also emerged from the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a larger number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and combine multiple sources of data, directly producing a matching rule from the given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time time-consumption

As discussed above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology to a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can operate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of manual urban design or rule-based systems. It not only saves time but also makes decision-making more fluent through the instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and TensorFlow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the cost of licenses and of experimentation. This study also compares the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structures and data augmentation can provide experimental help to other related studies. Finally, the final solution saves a great deal of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future work.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented with more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape grammar scripts, rule-based systems can create massive training datasets with various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain wall or brick wall) as part of the building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationship between buildings and auxiliary structures, opening the possibility of predicting detailed building outputs. Mask R-CNN can also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 7: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

ACKNOWLEDGMENTS The foundational motivation of this study comes from my long-term study and experience with rule-based generative design (eg CityEngine Grasshopper) in the Now Institute Southern California Associate of Governments (SCAG) China Academy of Urban Planning and Design (CAUPD) and MIT Special thanks to my committees who bring great help and patience to this study Professor Takehiko Nagekura Professor Terry Knight and Professor Justin Solomon plus my friends Wenzhe Peng and Yuehan Wang Thanks for general computational approaches provided by 3D-R2N2 Pixel2Mesh and Mask R-CNN This study implements their theories and extracts feasible methods for urban design machine learning tasks The idea of how urban morphology can be defined computationally is inspired during precious communications with specialists in various fields such as urban design urban planning computational design and computer graphics

7

8

TABLE OF CONTENTS

1 CONTEXT 11 Background helliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(11) 12 Problemshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(12) 13 Possible approaches to improvehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(13)

2 HYPOTHESIShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(16)

3 PRECEDENTS

31 Collective decision-making toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(17) 32 Generative design toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18) 33 Related machine learning workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18)

4 SOLUTION AND METHODOLOGY

41 Solutionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(20) 42 Datasethelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(21) 43 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(22) 44 3D voxelmesh reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(23) 45 Spatial data reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(29)

5 RESULT AND EVALUATION

51 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 52 3D reconstruction resultshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 53 Other evaluationshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(40)

6 DELIVERABLE AND CONTRIBUTION

61 Deliverablehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(49) 62 Contributionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(50) 63 Future Workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(51)

7APPENDICEShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(52)

8BIBLIOGRAPHYhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(55)

9

10

1 CONTEXT 11 Background City planning committees which include elected officials (or their designees) specialists and resident representatives evaluate urban design proposals (made by urban designers) of development projects in specific locations An urban design proposal involves a 3D urban morphology (the combination of 3D building envelopes on blocks) to visualize spatial demographic data (eg residents jobs transportation) and building typology (on parcels) A qualified urban proposal requires multiple stakeholders in the city planning committee to negotiate their individual ideas such as population strategy landfill context and building design Urban designers collect and translate these ideas into a 3D urban morphology and visualize them Building a 3D urban morphology traditionally is time-consuming for urban designers due to the heavy manual modeling workload of massive building geometries Therefore itrsquos hard to execute enough rounds of iterations before a proposal truly meets all complicated requirements of the city planning committee To avoid the manual modeling work of deploying each building typology to every parcel of a site some rule-based approaches (such as ESRIrsquos CityEngine [1]) generate 3D urban morphology from a 2D site via rules The generative rule is programmed with built-in functions translating design language to a script involving parameters of building and zoning code When the rule performs well a rule-based approach can live up to its advertising However how to make creating rules intuitive for designers who donrsquot have coding skills is the bottleneck of rule-based systems with the result that designers in the real world still avoid using them for making 3D urban morphology

fig1 Some built-in generative functions provided by ESRI CityEngine

11

12 The problems of rule-based systems This study aims to improve the rule-creating process in terms of its limitations of creativity data-driven capacity first-time time-consumption and coding skill requirement 121 Creativity Although building masses can be generated by rules they are hard to be called ldquocreativerdquo because of the limited built-in functions (fig1) provided by rule-based systems (eg setback subdivision LOU-shape) It is difficult to simulate ideal urban morphologies made by master urban designers as well as make detailed adjustments in common ones (fig2) Both decision-makers and urban designers tend to avoid producing homogeneous morphologies in their design making them resistant to rule-based systems An acceptable approach should be able to evaluate novel urban morphologies Creativity is important for market applications

fig2 Urban morphologies in the existing urban context ideal urban design common urban design and

generative design 122 Data-driven capacity Apart from making novel forms rule-based systems cannot utilize all data produced by urban studies such as traffic congestion job-resident balance and population growth Utilizing spatial data within an urban design proposal is called ldquoGeoDesignrdquo offered by ESRI the same software producer of CityEngine Ideally this new decision-making workflow allows multiple users (who can either be decision-makers or urban designers) to contribute their data into a weighted overlapping system for visualizing It only achieves limited effects because of the complicated relationship between building typology and data For instance (fig3) a tower building with a large ground square shares the same Floor Area Ratio (FAR) with a low warehouse occupying most of the parcel A machine learning-empowered approach can play a part here matching the

12

empirical outcome of building typology instead of organizing the role and factor of each data in an ambiguous formula

fig3 Building typologies with similar FARs

123 First-time time-consumption As a part of the discussions above rule-based approaches require a large decision tree to simulate the shape grammar of a building typology The cost of using rule-based approaches is always larger than that of manual modeling work the first time when designers use it in a proposal because they need an excessive amount of time constructing specific rules for the project Urban designers therefore would not like to embrace rule-based approaches as a part of their design process before they can enjoy the saving of time from applying a rule recurrently 124 Coding skill requirement Creating rules for building typologies and spatial data requires an understanding of both programming and design Since only a few urban designers have such a combination of skills this requirement impedes its broader application Also as the clients and end-users of urban development planning committee members should be able to provide many aspects that are ignored by urban designers which means a user-friendly approach with the features of creating rules with low skill requirements and real-time visualization will bring more opportunities to larger groups of participants and facilitate decision-making iterations 13 Possible approaches to improve To solve the bottleneck of creating rules we need help from machine learning techniques similar to how the bottleneck of the expert system (rule-based system) had been broken through in the history of artificial intelligence In recent years 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies replacing some algorithms in classic computational theories It runs by minimizing the loss between 3D predicted objects from 2D images and their ground truth 3D models The multiple ways to realize 3D reconstruction from single images in the format of voxel [234] point cloud [56] and mesh [67] inspired me to explore the replacement of a) creating rules by built-in functions with b) creating rules by image references which is much more intuitive

13

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image, where N is the number of parcels in a block
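As a concrete illustration of the augmentation described above, the following is a minimal NumPy/PIL sketch (not the thesis script) that stacks the per-parcel binary masks of one view into a multi-channel label and applies the random cropping and flipping; the file layout and crop size are hypothetical.

```python
import glob
import numpy as np
from PIL import Image

def load_sample(view_dir, crop=512):
    """Build one (image, multi-channel mask) training pair for a block view."""
    image = np.asarray(Image.open(f"{view_dir}/render.png").convert("RGB"))
    # one binary mask per parcel of this block, stacked along the channel axis
    masks = [np.asarray(Image.open(p).convert("1"), dtype=np.uint8)
             for p in sorted(glob.glob(f"{view_dir}/parcel_*.png"))]
    label = np.stack(masks, axis=-1)                     # (H, W, N)

    # random crop, shared by image and label
    h, w = image.shape[:2]
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    image = image[top:top + crop, left:left + crop]
    label = label[top:top + crop, left:left + crop]

    # random horizontal flip
    if np.random.rand() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]
    return image, label
```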

After loading the dataset, a Mask R-CNN network uses ResNet-101 as the base model, predicting masks as output from the input images (fig7). This part of the pipeline is an implementation of He's Mask R-CNN network [14], including the Region Proposal Network (RPN) and the prediction heads (classification, box, and binary mask prediction). For each Region of Interest (RoI), the loss function is constructed from the losses of classification, bounding box, and mask. The extracted parcel rendering images output here will serve as a part of the input in the following 3D reconstruction approaches.


fig7 The pipeline of Mask R-CNN [14]
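For reference, here is a hedged sketch of such an extraction network built on torchvision's Mask R-CNN wrapper with a ResNet-101 FPN backbone; the thesis training code is not reproduced, and the torchvision API shown may differ slightly between versions.

```python
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.models.detection.mask_rcnn import MaskRCNN

backbone = resnet_fpn_backbone("resnet101", pretrained=True)
model = MaskRCNN(backbone, num_classes=2)   # background + building typology

# dummy training step: one aerial view with one annotated parcel
images = [torch.rand(3, 512, 512)]
targets = [{
    "boxes":  torch.tensor([[50., 60., 200., 220.]]),
    "labels": torch.tensor([1]),
    "masks":  torch.zeros(1, 512, 512, dtype=torch.uint8),
}]
model.train()
loss_dict = model(images, targets)   # classification, box and mask losses per RoI
total_loss = sum(loss_dict.values())
```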

4.4 3D reconstruction

In this thesis study, three reconstruction approaches were attempted on different types of inputs and ground-truth labels. The first two approaches are developed on general 3D reconstruction and serve as references. The comparisons of the approaches (tab3) and of the 3D model formats (tab4) are shown below. While voxel and mesh models are popular in general machine learning studies, our final solution is built only for urban planning cases, absorbing the corresponding techniques from the first two approaches. Its GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms.

| Approach | Input | Ground truth label | Software platform for data processing |
| --- | --- | --- | --- |
| 1. 3D voxel reconstruction | Aerial image of building typologies (png) | 3D building model (voxel, mat) | ArcGIS Pro; CityEngine; Binvox; PyTorch |
| 2. 3D mesh reconstruction | Aerial image of building typologies (png); 2D bitmap (png) | 3D building model (point cloud, xyz) | ArcGIS Pro; Blender; Tensorflow (Keras) |
| 3. Geospatial data prediction | Aerial image of building typologies (png); parcel data (csv) | Building data (csv) | QGIS; Blender; Tensorflow (Keras) |

tab3 data structure of the three approaches


| | Voxel | Point cloud | Mesh | Nurbs | GeoJSON-like |
| --- | --- | --- | --- | --- | --- |
| Data loading | N x N x N x 1 (mass/void) | N x 3 (x, y, z) | N x (v1, v2, v3) | Degree / control pts / weights / params | 2D geometry (long/lat, N x 2) + property (with height or more info) |
| Reconstruction | from 2D pixel | from 2D pixel | Deform from a sphere/cube | Translate from mesh; detect shape grammar | Deform from 2D geometries |
| Evaluation | Logical is-or-not; Intersection over Union (IoU) | Chamfer distance (CD); Earth Mover's Distance (EMD) | Chamfer distance (CD); silhouette rendering; logical is-or-not | Chamfer distance (CD); silhouette rendering; logical is-or-not | Chamfer distance (CD); Earth Mover's Distance (EMD) |
| Software | Minecraft... | 3D scanning... | SketchUp, GIS... | Rhino... | GIS |
| ML project | 3D-R2N2, MarrNet... | PointNet, PointNet++... | Pixel2Mesh, Neural Renderer... | — | — |

tab4 the comparison of 3D geometry formats

4.4.1 3D voxel reconstruction

The first attempt implements the 3D-R2N2 approach, which was originally designed to 3D-reconstruct the ten categories of furniture in its training dataset. In this implementation of synthesizing 3D urban morphology with the 3D-R2N2 approach, the input and ground-truth data have been modified to feed 3D building models into the network, and the hyper-parameters of the 3D-R2N2 network have been adjusted for 3D building models.

4.4.1.1 Data structure of the training/validating dataset

The training/validating set follows the ShapeNet structure, which contains two parts derived from 3D models: input rendering images and ground-truth labels (fig8).


fig8 examples of the training/test dataset

The input aerial images are taken by the same 24 virtual cameras used in the network for extracting building typologies. An ArcGIS Python script exports each block with the buildings on it to an ESRI Shapefile (.shp); then a CityEngine Python script exports every block with its 3D buildings to a mesh model (.obj). Each ground-truth label is stored as voxel data, converted from a mesh model via Binvox into an nd-array. The entire dataset includes 177 models and 4,240 rendered images, separated into training and validating datasets by a ratio of 0.8. As a multi-view 3D reconstruction, a 3D-LSTM (Long-Short Term Memory) treats the multi-view images as time-sequential data, leading to a 5D input [view_id, batch_id, channel, width, height]. The inputs for each model are picked randomly from the 24 images, a random number of times (in a range from 1 to 5). The picked images are randomly center cropped and randomly horizontally flipped to avoid overfitting during the training process. The labels, as voxel data, are constructed with five dimensions [batch_id, the channel of masks, x-axis, y-axis, z-axis], where the channels represent original or masked objects (entity true or false).

4.4.1.2 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN).


Given the encoded input, a set of the proposed 3D Convolutional LSTM (3D-LSTM) units either update their cell states or retain their states by closing the input gate. At time step t, when a new input is received, the operation of an LSTM unit can be expressed as in formula2, where $i_t$, $o_t$, and $f_t$ refer to the input gate, the output gate, and the forget gate respectively, and $s_t$ and $h_t$ refer to the memory cell and the hidden state respectively.

$$f_t = \sigma\big(W_f \mathcal{T}(x_t) + U_f * h_{t-1} + b_f\big)$$
$$i_t = \sigma\big(W_i \mathcal{T}(x_t) + U_i * h_{t-1} + b_i\big)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh\big(W_s \mathcal{T}(x_t) + U_s * h_{t-1} + b_s\big)$$
$$h_t = \tanh(s_t)$$

formula2 the 3D-LSTM forget and update gates, as given in [2]
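To make the gate update concrete, here is a minimal PyTorch sketch of a single 3D convolutional LSTM step in the spirit of formula2; the hidden grid size and feature dimension are illustrative, not the thesis configuration.

```python
import torch
import torch.nn as nn

class ConvLSTM3DCell(nn.Module):
    def __init__(self, feat_dim=1024, hidden_ch=128, grid=4):
        super().__init__()
        n = hidden_ch * grid ** 3
        # W T(x_t): fully connected transforms of the encoded image feature
        self.Wf = nn.Linear(feat_dim, n)
        self.Wi = nn.Linear(feat_dim, n)
        self.Ws = nn.Linear(feat_dim, n)
        # U * h_{t-1}: 3D convolutions over the hidden state grid
        self.Uf = nn.Conv3d(hidden_ch, hidden_ch, 3, padding=1)
        self.Ui = nn.Conv3d(hidden_ch, hidden_ch, 3, padding=1)
        self.Us = nn.Conv3d(hidden_ch, hidden_ch, 3, padding=1)
        self.shape = (hidden_ch, grid, grid, grid)

    def forward(self, x, h, s):
        grid = lambda t: t.view(x.size(0), *self.shape)
        f = torch.sigmoid(grid(self.Wf(x)) + self.Uf(h))            # forget gate
        i = torch.sigmoid(grid(self.Wi(x)) + self.Ui(h))            # input gate
        s = f * s + i * torch.tanh(grid(self.Ws(x)) + self.Us(h))   # memory cell
        h = torch.tanh(s)                                           # hidden state
        return h, s
```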

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability of the existence of each voxel cell, computed by a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

4.4.2 3D mesh reconstruction

A voxel model is difficult to clean and simplify into a feasible model in urban design modeling software. A voxel model also stores data inside closed objects, which is inefficient and becomes increasingly large after scaling. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As MarrNet showed, additional features can improve accuracy during training. Since a building typology is influenced by properties like the height limit or land use, additional bitmaps of these properties can help our network in the training process. To distribute the computational workload, two machine learning neural networks are proposed: a) one to translate a 2D image of a building typology to a top-view location map, and b) one for 3D mesh reconstruction.

4.4.2.1 Data structure of the training/validating dataset

At the parcel level, an aerial image is cropped from its parent block-level image and rescaled. Additional information like the height limit and land use is stored as gray-scale bitmaps on the 2D top view. The ground-truth labels are 3D point clouds with normals (six dimensions in total) calculated from the mesh models (fig10).


fig10 examples of the training/validating dataset

4.4.2.2 Network A: translating a 2D image of a building typology to a top-view location map

Network A takes multiple inputs, including the extracted building typology images, a 2D parcel shape image, and additional bitmaps (like the height limit and land use), to predict a 2D building location map that fits the shape of a target parcel. As mesh reconstruction does not preserve the location of the 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by the 24 surrounding cameras and cropped by the bounding box of this parcel; N, the number of views per parcel, is picked randomly in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space. The feature array is then merged into a single column vector via Gated Recurrent Units (GRU).

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula3) stores the features of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g., the height-limit bitmap). The following parts of the network are the decoder and generator of a classic GAN. In each iteration (formula4), we have the MSE loss of the 2D height map prediction, the least-squared reconstruction loss of a reconstructed rendering image (compared against all angles of renderings of this parcel, returning the minimum one), and the MSE losses of the additional constraints, as multiple losses. The final combined loss is optimized by Adam with a learning rate of 0.001.

formula4 combined loss
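A hedged Tensorflow sketch of this combined objective is shown below; the tensor layouts and the weights w are assumptions for illustration, not the trained values.

```python
import tensorflow as tf

def combined_loss(pred_map, true_map, pred_render, renders,
                  pred_cons, true_cons, w=(1.0, 1.0, 1.0)):
    # MSE of the predicted 2D building location/height map
    l_map = tf.reduce_mean(tf.square(pred_map - true_map))
    # least-squared reconstruction loss against all K renderings of the
    # parcel; renders: (B, K, H, W, C), pred_render: (B, H, W, C)
    per_view = tf.reduce_mean(tf.square(renders - pred_render[:, None]),
                              axis=[2, 3, 4])                # (B, K)
    l_rec = tf.reduce_mean(tf.reduce_min(per_view, axis=1))  # keep the minimum
    # MSE of each additional constraint bitmap (e.g. height limit, land use)
    l_con = tf.add_n([tf.reduce_mean(tf.square(p - t))
                      for p, t in zip(pred_cons, true_cons)])
    return w[0] * l_map + w[1] * l_rec + w[2] * l_con

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)  # lr 0.001 as above
```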

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of a building typology with its ground-truth 3D model (stored as point clouds with normals) through the chamfer distance loss, a normal loss, and Laplacian regularization. In Pixel2Mesh, the chamfer loss constrains the locations of the mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.
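Since the chamfer distance also recurs in the final approach, a minimal NumPy sketch of its definition may help; production pipelines use batched GPU implementations, but the computation is the same.

```python
import numpy as np

def chamfer_distance(p, q):
    """p: (N, 3) predicted points; q: (M, 3) ground-truth points."""
    d = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
    # nearest-neighbor distance in both directions, averaged
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```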

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2,466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The resulting 3D mesh models are placed according to the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding the unnecessary computational load of images and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of the training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential, two-building parcels occupy the most), we scale it by land-use and number factors to 10,216 samples (tab5). This keeps our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial rendering images of building typologies via their bounding boxes. Additionally, in this solution, an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset, to test the different learning results between 2D and 3D input information. The spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. Each spatial parcel data input is stored as a csv table.


The ground-truth label is the spatial building data corresponding to the parcels. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. For each csv table of spatial building data, the properties hold the number of buildings in the parcel, ten height values, and ten geometry records. The table thus includes 331 values (1 + 10 + 10 x 16 x 2).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
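The pseudo-code bodies of fig14 and fig15 are not reproduced here; the following is a hedged shapely-based sketch of the same two steps — resampling a ring to a fixed vertex count and smoothing it by neighbor averaging — with a hypothetical sample polygon.

```python
import numpy as np
from shapely.geometry import Polygon

def resample_ring(poly, n=16):
    """Interpolate n evenly spaced vertices along the polygon boundary."""
    ring = poly.exterior
    pts = [ring.interpolate(t, normalized=True)
           for t in np.linspace(0.0, 1.0, n, endpoint=False)]
    return np.array([[p.x, p.y] for p in pts])      # (n, 2) lon/lat pairs

def smooth_ring(verts, iterations=1):
    """Average each vertex with its two neighbors on the closed ring."""
    for _ in range(iterations):
        verts = (np.roll(verts, 1, axis=0) + verts
                 + np.roll(verts, -1, axis=0)) / 3.0
    return verts

parcel = Polygon([(0, 0), (10, 0), (10, 6), (4, 9), (0, 6)])
verts16 = smooth_ring(resample_ring(parcel, n=16))
```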


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, which can be learned by neural networks. After analyzing the dataset, we filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the maximum extent of its parcel boundary (vertices), spreading out the coordinate values.

Parcel fields:

| Field name | extent | Output extent | Preserve |
| --- | --- | --- | --- |
| | -9980~9994 | -1~1 | 10 |
| | -9988~9989 | -1~1 | 10 |
| | 396~34567 | 0~1 | 10 |
| | 83~761 | 0~1 | 10 |
| | 0~100 | 0~1 | 10 |

Building fields:

| Field name | extent | Output extent | Preserve |
| --- | --- | --- | --- |
| | -9928~9828 | -1~1 | 10 |
| | -9934~9982 | -1~1 | 10 |
| | 1~9 | 0~1 | 10 |
| | 1~985 | 0~1 | 10 |

tab6 The conversion of the dataset (from the entire training/validating dataset)
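A small sketch of this per-parcel normalization is shown below, under the assumption that geometries are scaled into roughly [-1, 1] by the maximum extent of the parcel boundary and scalar fields are min-max scaled into [0, 1]; the numeric extents are placeholders.

```python
import numpy as np

def normalize_parcel(parcel_xy, building_xy):
    """Scale parcel and building vertices by the parcel's max extent."""
    center = (parcel_xy.max(axis=0) + parcel_xy.min(axis=0)) / 2.0
    extent = (parcel_xy.max(axis=0) - parcel_xy.min(axis=0)).max() / 2.0
    scale = lambda xy: (xy - center) / extent        # -> roughly [-1, 1]
    return scale(parcel_xy), scale(building_xy)

def minmax(x, lo, hi):
    """Min-max scale a scalar field (e.g. height limit) into [0, 1]."""
    return (x - lo) / (hi - lo)
```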


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (the number of buildings, building heights, building geometry data). Firstly, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (with ResNet, as in 3D-R2N2, or VGG16 as the base model) convolutes the input images (N views are picked randomly per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a single column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Secondly, we convolute the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3), and dense (fully connect) them, as well as a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, building heights, and building geometry as three separate outputs, with a convolution layer only in the geometry prediction.

fig16 geospatial prediction network structure


formula5 the concatenated mid feature vector
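A simplified Keras functional sketch of this architecture follows; the layer sizes are illustrative and the geometry head is reduced to a dense layer for brevity, but the three outputs add up to the 331 values (1 + 10 + 10 x 16 x 2) described in section 4.5.1.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

views = layers.Input(shape=(None, 224, 224, 3))   # 1-5 views per parcel
geom = layers.Input(shape=(16, 2))                # 16 parcel vertices
props = layers.Input(shape=(4,))                  # area, perimeter, ...

# time-distributed CNN encoder merged by a GRU (FC1)
cnn = tf.keras.applications.VGG16(include_top=False, pooling="avg")
fc1 = layers.GRU(1024)(layers.TimeDistributed(cnn)(views))
# convoluted geometry feature (FC2) and dense property feature (FC3)
fc2 = layers.Flatten()(layers.Conv1D(32, 3, activation="relu")(geom))
fc3 = layers.Dense(64, activation="relu")(props)

# concatenated dense features (DC1, DC2, DC3) -> latent vector (MID)
mid = layers.Concatenate()([layers.Dense(512, activation="relu")(fc1),
                            layers.Dense(256, activation="relu")(fc2),
                            layers.Dense(64, activation="relu")(fc3)])

count = layers.Dense(1, name="n_buildings")(mid)
heights = layers.Dense(10, name="heights")(mid)
footprints = layers.Reshape((10, 16, 2), name="footprints")(
    layers.Dense(10 * 16 * 2)(mid))

model = Model([views, geom, props], [count, heights, footprints])
```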

Three loss functions (formula6) calculate the difference between the prediction and the ground-truth label in terms of the number of buildings, the building heights, and the vertices of the building footprints. In particular, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel: for an instance of a two-building parcel, we only calculate the first two of the ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers several parts to let the neural network learn the geometries, including the center, relative vertices, unit tangent vector, curvature, and edge length. They constrain different geometry characteristics as follows (fig17):

- Center: the location of the building footprints
- Relative vertices: the shape and size of each geometry
- Unit tangent vector: the rotation angle of each geometry
- Discrete curvature: the corners of each geometry
- Edge length: the edge length of each geometry
- Extent x and y: the horizontal and vertical scale of the building footprints
- Maximum discrete curvature: the maximum corner angle
- Number of corners: the count of corners
- Normalized coordinates: the shape and size of each geometry
- Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction
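The masking described above can be sketched as follows (a hedged Tensorflow illustration, not the thesis code), where only the slots of existing buildings contribute to Loss2 and Loss3.

```python
import tensorflow as tf

def masked_height_loss(true_h, pred_h, n_buildings):
    # true_h, pred_h: (B, 10); n_buildings: (B,) integer counts per parcel
    mask = tf.sequence_mask(n_buildings, maxlen=10, dtype=tf.float32)
    se = tf.square(true_h - pred_h) * mask       # zero out the unused slots
    return tf.reduce_sum(se) / tf.reduce_sum(mask)

def masked_vertex_loss(true_v, pred_v, n_buildings):
    # true_v, pred_v: (B, 10, 16, 2) footprint vertices; 16 x 2 = 32 per slot
    mask = tf.sequence_mask(n_buildings, maxlen=10,
                            dtype=tf.float32)[:, :, None, None]
    ae = tf.abs(true_v - pred_v) * mask          # L1 distance on valid slots
    return tf.reduce_sum(ae) / (tf.reduce_sum(mask) * 32.0)
```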


The final combined loss leans more on Loss3, as its terms are the crucial parts of learning the shape of the building footprints. Since the losses related to coordinates have maximum values of 2 (the difference of edge lengths) and 4 (the L1 distance of two coordinates), they are normalized respectively for a better organization of the losses. Thus all losses fall in the range of 0 to 1, and they are balanced by weights that consider the factors in Pixel2Mesh and are adjusted by experimental results (see the next section). The combined loss of formula6 sums the following terms: the number of buildings (Loss1); the 10 building heights (Loss2); and, within Loss3, the center coordinates, absolute coordinates, relative coordinates, normalized coordinates (computed on rolled and batch-normalized vertices), unit tangent vectors, discrete curvature, edge lengths, the extents in x and y, the maximum discrete curvature, and the number of corners.

formula6 Loss functions


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence, its evaluations are tested by modifying design elements.

5.1 Extracting building typologies

The first Mask R-CNN implementation is trained for 80 epochs and achieves a 0.11 bounding-box loss and 0.41 mask loss on the training set, and a 0.15 bounding-box loss and 0.40 mask loss on the validating set (fig18). As we can see, the network does not perform well in the mask prediction but performs well in the bounding-box prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results

As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output produces ambiguous probability models that cannot be cleaned or simplified without introducing the bias of the algorithm. Voxel models also demand increasingly heavy storage and computation as their scale grows, analogous to the size comparison between vectorized and rasterized files. Moreover, voxel models are rare in urban design models with respect to their editability. However, we can see the image learning capacity from the different results predicted from high-rise and low-rise reference images. Similarly to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network that aims to predict the location map generates similar results from various inputs. Furthermore, the network is not able to predict the angle of the building footprints well, producing unsuccessful rules of building envelopes. The 3D mesh reconstruction results from Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain overly trivial mesh faces for most cases of urban design models. Linking these two networks (location-map prediction and 3D mesh reconstruction) cannot easily avoid the bias of the shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel

Mesh

Spatial Data

fig20 Outputs from the three approaches

5.3 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, the factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

We also test the ablation effect of the various loss functions (see tab5). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustments of the multiple loss functions.

| loss (weight) | Baseline | 3D-R2N2-GRU | VGG16-LSTM | isolated dataset | Top-view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416 |
| loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621 |
| loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.066 |
| loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.05 |
| loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.184 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801 |
| loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636 |
| loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435 |
| loss 3E (1) | 0.0445 | 0.045 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.044 | 0.0424 | 0.0441 | 0.0514 |

tab5 losses of the ablation test (the rightmost five columns train without the named loss function)

The learning on the isolated dataset with VGG16-GRU shows the best results among the reference groups, and the losses of the geometries demonstrate their mutual effects on each other. The 2D top-view dataset performs only slightly worse in heights/numbers but better in the shape loss. In tab6, we can also see the constraints on shape, corner, angle, and position in the different ablations of the loss functions.

tab6 spatial building predictions of the ablation test

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after T-SNE dimension reduction, representing the learning capacities for reference images and geometries that cannot simply be determined by loss values. The isolated dataset still shows the best distribution visually.

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29, we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after encoding into latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate predictions between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.
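A minimal sketch of this latent-space interpolation is shown below, where encoder and decoder stand for hypothetical sub-models split out of the trained network.

```python
import numpy as np

def interpolate(encoder, decoder, image_a, image_b, steps=5):
    """Lerp between two encoded references and decode each step."""
    za = encoder.predict(image_a)            # e.g. a (1, 1000) latent vector
    zb = encoder.predict(image_b)
    outputs = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * za + t * zb          # linear interpolation in latent space
        outputs.append(decoder.predict(z))   # predicted buildings at this step
    return outputs
```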


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and a better size/number of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution serves two functions for integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to obtain a GeoJSON-like csv file including the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR, etc.), users can run a new training and validating process on their own computers. By using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries with height information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized in most platforms. In other words (fig32), the network predicts through a translation matrix of weights, learned from the training set, generating building footprint geometries from a parcel geometry as a shape grammar does separately.

fig32 How weights from images work to generate geometries, as the rule-based pipeline does
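A hedged geopandas sketch of this conversion is shown below; the csv column names are hypothetical.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

df = pd.read_csv("predicted_buildings.csv")
# assume each row stores 16 vertices (x0, y0, ..., x15, y15) plus a height
geoms = [Polygon(zip(row[[f"x{i}" for i in range(16)]],
                     row[[f"y{i}" for i in range(16)]]))
         for _, row in df.iterrows()]
gdf = gpd.GeoDataFrame(df[["height"]], geometry=geoms, crs="EPSG:4326")
gdf.to_file("morphology.geojson", driver="GeoJSON")  # or an ESRI Shapefile
# extruding each footprint by its height (z-value) yields the 3D morphology
```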


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also follow from the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language into a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when involving a larger number of data categories together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and combine multiple sources of data, directly producing a matching rule from the given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time time consumption

As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology to a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply communicate their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of the manual urban design or rule-based workflows. It not only saves time but also makes decision-making more fluent, through instant comparison of two strategies or accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and of experimentation. This study also compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration): on one TITAN RTX GPU with 24GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future work.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving the relevant AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create massive training datasets with various styles to feed these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof-type learning project [19]). After recognizing the types from image references, the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationships between buildings and auxiliaries, which means the possibility of predicting detailed building outputs. The Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 8: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

8

TABLE OF CONTENTS

1 CONTEXT 11 Background helliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(11) 12 Problemshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(12) 13 Possible approaches to improvehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(13)

2 HYPOTHESIShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(16)

3 PRECEDENTS

31 Collective decision-making toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(17) 32 Generative design toolshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18) 33 Related machine learning workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(18)

4 SOLUTION AND METHODOLOGY

41 Solutionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(20) 42 Datasethelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(21) 43 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(22) 44 3D voxelmesh reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(23) 45 Spatial data reconstructionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(29)

5 RESULT AND EVALUATION

51 Extracting building typologieshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 52 3D reconstruction resultshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(38) 53 Other evaluationshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(40)

6 DELIVERABLE AND CONTRIBUTION

61 Deliverablehelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(49) 62 Contributionhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(50) 63 Future Workshelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(51)

7APPENDICEShelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(52)

8BIBLIOGRAPHYhelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphelliphellip(55)

9

10

1 CONTEXT 11 Background City planning committees which include elected officials (or their designees) specialists and resident representatives evaluate urban design proposals (made by urban designers) of development projects in specific locations An urban design proposal involves a 3D urban morphology (the combination of 3D building envelopes on blocks) to visualize spatial demographic data (eg residents jobs transportation) and building typology (on parcels) A qualified urban proposal requires multiple stakeholders in the city planning committee to negotiate their individual ideas such as population strategy landfill context and building design Urban designers collect and translate these ideas into a 3D urban morphology and visualize them Building a 3D urban morphology traditionally is time-consuming for urban designers due to the heavy manual modeling workload of massive building geometries Therefore itrsquos hard to execute enough rounds of iterations before a proposal truly meets all complicated requirements of the city planning committee To avoid the manual modeling work of deploying each building typology to every parcel of a site some rule-based approaches (such as ESRIrsquos CityEngine [1]) generate 3D urban morphology from a 2D site via rules The generative rule is programmed with built-in functions translating design language to a script involving parameters of building and zoning code When the rule performs well a rule-based approach can live up to its advertising However how to make creating rules intuitive for designers who donrsquot have coding skills is the bottleneck of rule-based systems with the result that designers in the real world still avoid using them for making 3D urban morphology

fig1 Some built-in generative functions provided by ESRI CityEngine

11

12 The problems of rule-based systems This study aims to improve the rule-creating process in terms of its limitations of creativity data-driven capacity first-time time-consumption and coding skill requirement 121 Creativity Although building masses can be generated by rules they are hard to be called ldquocreativerdquo because of the limited built-in functions (fig1) provided by rule-based systems (eg setback subdivision LOU-shape) It is difficult to simulate ideal urban morphologies made by master urban designers as well as make detailed adjustments in common ones (fig2) Both decision-makers and urban designers tend to avoid producing homogeneous morphologies in their design making them resistant to rule-based systems An acceptable approach should be able to evaluate novel urban morphologies Creativity is important for market applications

fig2 Urban morphologies in the existing urban context ideal urban design common urban design and

generative design 122 Data-driven capacity Apart from making novel forms rule-based systems cannot utilize all data produced by urban studies such as traffic congestion job-resident balance and population growth Utilizing spatial data within an urban design proposal is called ldquoGeoDesignrdquo offered by ESRI the same software producer of CityEngine Ideally this new decision-making workflow allows multiple users (who can either be decision-makers or urban designers) to contribute their data into a weighted overlapping system for visualizing It only achieves limited effects because of the complicated relationship between building typology and data For instance (fig3) a tower building with a large ground square shares the same Floor Area Ratio (FAR) with a low warehouse occupying most of the parcel A machine learning-empowered approach can play a part here matching the

12

empirical outcome of building typology instead of organizing the role and factor of each data in an ambiguous formula

fig3 Building typologies with similar FARs

123 First-time time-consumption As a part of the discussions above rule-based approaches require a large decision tree to simulate the shape grammar of a building typology The cost of using rule-based approaches is always larger than that of manual modeling work the first time when designers use it in a proposal because they need an excessive amount of time constructing specific rules for the project Urban designers therefore would not like to embrace rule-based approaches as a part of their design process before they can enjoy the saving of time from applying a rule recurrently 124 Coding skill requirement Creating rules for building typologies and spatial data requires an understanding of both programming and design Since only a few urban designers have such a combination of skills this requirement impedes its broader application Also as the clients and end-users of urban development planning committee members should be able to provide many aspects that are ignored by urban designers which means a user-friendly approach with the features of creating rules with low skill requirements and real-time visualization will bring more opportunities to larger groups of participants and facilitate decision-making iterations 13 Possible approaches to improve To solve the bottleneck of creating rules we need help from machine learning techniques similar to how the bottleneck of the expert system (rule-based system) had been broken through in the history of artificial intelligence In recent years 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies replacing some algorithms in classic computational theories It runs by minimizing the loss between 3D predicted objects from 2D images and their ground truth 3D models The multiple ways to realize 3D reconstruction from single images in the format of voxel [234] point cloud [56] and mesh [67] inspired me to explore the replacement of a) creating rules by built-in functions with b) creating rules by image references which is much more intuitive

13

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

| | Find building typologies from reference images | Translate design language to scripts | Adjust to a site | Output data as | Update (after collecting decision-makers' comments) |
|---|---|---|---|---|---|
| manual | Find by experience | Analyze the building typology | Draw building footprints and extrude them | Static 3D mesh/NURBS models; static 2D drawings | Draw again from building footprints |
| rule-based | Find by experience | Create a rule via built-in functions | Apply a rule onto parcel geometries | Generated 3D mesh models; geospatial data; rule script | Change parameters or apply onto new parcel geometries |
| spatial data reconstruction (ours) | Extract by computer | Use a pre-trained model | Predict spatial buildings for parcel geometries | Geospatial data; trained model | Modify or predict again from new parcel geometries |

tab1 the mechanisms of the manual, rule-based, and our approaches

In contrast (tab1), this thesis study explores improved replacements by a) utilizing computational algorithms to extract features from reference images, parcel geometries, and building geometries, and b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees. After creating rules intuitively, designers can apply the rules to their site, enjoying the same advantages as rule-based systems in the following stages: achieving output data as geospatial data and updating geometries as groups.

4.2 Dataset

To allow a computer to learn 3D building typologies, a collection of 3D building models is necessary. Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2). Two raw datasets are required in the series of machine learning pipelines: a) parcel geometry with planning properties (e.g., land-use code, height limits), and b) building footprint geometry with height information. They are augmented by scripts and prepared for the different machine learning pipelines (see the corresponding sections).

| city | source (parcel/building) | Parcel count | Parcel properties | Building count | Building properties |
|---|---|---|---|---|---|
| Los Angeles | SCAG_county_zoning / Lariac 2008 building footprint | 2,376,370 | Land-use zoning; height limit | 3,141,244 | Height; elevation |

tab2 The list of raw data

4.3 Extracting building typologies

Because our resulting "rule" is intended to generate building typologies from parcel geometries, the 3D reconstruction pipelines should process at the parcel level. Hence this first network runs before 3D reconstruction, extracting 2D aerial images of building typologies from input images of urban morphology (at the block level). The training/validating data is augmented from the raw spatial dataset. Given the 2D geometries of building footprints with height information, we extrude them via a Blender script into a 3D model file (.obj) of the 3D building envelopes on each block. The input images are RGB images taken by 24 virtual cameras surrounding each 3D model, again via a Blender script. These images are converted into binary images by masking each parcel of the block, and the multiple binary images are stacked as a multi-channel image for each view. Finally, a block model generates a) input data, 24 views of aerial images, and b) ground truth labels, 24 multi-channel mask images, where N is the number of parcels in the block (formula1). Random center cropping and random horizontal flipping serve as augmentation during data loading to avoid overfitting. The augmented dataset is separated into training and validating data at a ratio of 0.8.

formula1 the ground-truth label as a multi-channel image, where N is the number of parcels in a block
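As a minimal sketch of assembling such a label (assuming the binary parcel masks for one view have already been rendered; names are illustrative):

```python
import numpy as np

def build_label(parcel_masks):
    """Stack N binary parcel masks (H x W each) into one H x W x N
    multi-channel label image, as described by formula1."""
    # Each channel marks the pixels belonging to one parcel of the block.
    return np.stack([m.astype(np.uint8) for m in parcel_masks], axis=-1)

# One block yields 24 views, each paired with its own multi-channel label:
# labels = [build_label(masks_for_view(v)) for v in range(24)]
```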

After loading the dataset, a Mask R-CNN network uses ResNet-101 as the base model, predicting masks as output from the input images (fig7). This part of the pipeline is an implementation of He's Mask R-CNN network [14], including the Region Proposal Network (RPN) and the predictions (classification, box, and binary mask prediction). For each Region of Interest (RoI), the loss function is constructed from the losses of classification, bounding box, and mask. The output extracted parcel rendering images serve as a part of the input to the following 3D reconstruction approaches.
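Written out, the per-RoI multi-task loss follows He et al. [14]:

$$L_{RoI} = L_{cls} + L_{box} + L_{mask}$$

where $L_{mask}$ is the average binary cross-entropy over the predicted per-class mask.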


fig7 The pipeline of Mask R-CNN [14]

4.4 3D reconstruction

In this thesis study, three reconstruction approaches were attempted on different types of inputs and ground-truth labels. The first two approaches build on general 3D reconstruction and serve as references. The comparisons of the approaches (tab3) and of the 3D model formats (tab4) are shown below. While voxel and mesh models are popular in general machine learning studies, our final solution is built only for urban planning cases, absorbing the corresponding techniques from the first two approaches. Its GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used on most GIS platforms.

| Approach | Input | Ground truth label | Software platform for data processing |
|---|---|---|---|
| 1. 3D voxel reconstruction | Aerial image of building typologies (png) | 3D building model (voxel, .mat) | ArcGIS Pro, CityEngine, Binvox, PyTorch |
| 2. 3D mesh reconstruction | Aerial image of building typologies (png); 2D bitmap (png) | 3D building model (point cloud, .xyz) | ArcGIS Pro, Blender, Tensorflow (Keras) |
| 3. Geospatial data prediction | Aerial image of building typologies (png); parcel data (csv) | Building data (csv) | QGIS, Blender, Tensorflow (Keras) |

tab3 the data structures of the three approaches

| | Voxel | Point cloud | Mesh | NURBS | GeoJSON-like |
|---|---|---|---|---|---|
| Data loading | N × N × N × 1 (mass/void) | N × 3 (x, y, z) | N × (v1, v2, v3) | degree, control pts, weights, params | 2D geometry (lon/lat, N × 2) plus properties (height or more info) |
| Reconstruction | from 2D pixels | from 2D pixels | deform from a sphere/cube | translate from mesh; detect shape grammar | deform from 2D geometries |
| Evaluation | logical is-or-not; Intersection over Union (IoU) | Chamfer distance (CD); Earth Mover's Distance (EMD) | CD; silhouette rendering; logical is-or-not | CD; silhouette rendering; logical is-or-not | CD; EMD |
| Software | Minecraft, ... | 3D scanning, ... | SketchUp, GIS, ... | Rhino, ... | GIS |
| ML projects | 3D-R2N2, MarrNet, ... | PointNet, PointNet++, ... | Pixel2Mesh, Neural Renderer | | |

tab4 the comparison of 3D geometry formats

4.4.1 3D voxel reconstruction

The first attempt implements the 3D-R2N2 approach, which was originally built to 3D-reconstruct the ten categories of furniture in its training dataset. In our implementation of synthesizing 3D urban morphology with the 3D-R2N2 approach, the input and ground truth data are modified to feed 3D building models into the network, and the hyper-parameters of the 3D-R2N2 network are adjusted for 3D building models.

4.4.1.1 Data structure of the training/validating dataset

The training/validating set follows the ShapeNet structure, which contains two parts derived from 3D models: input rendering images and ground truth labels (fig8).

fig8 examples of the training/test dataset

The input aerial images are taken by 24 virtual cameras, the same as the ones used in the network for extracting building typologies (compared with the settings of the original 3D-R2N2). An ArcGIS Python script exports each block with the buildings on it to an ESRI Shapefile (.shp). Then a CityEngine Python script exports every block with its 3D buildings to a mesh model (.obj). Each ground truth label is stored as voxel data, converted from a mesh model via Binvox into an nd-array (as in the original 3D-R2N2). The entire dataset includes 177 models and 4,240 rendered images, separated into training and validating datasets at a ratio of 0.8. As a multi-view 3D reconstruction, a 3D-LSTM (Long-Short Term Memory) treats the multi-view images as time-sequential data, leading to a 5D input [view_id, batch_id, channel, width, height]. The inputs for each model are randomly picked from its 24 images, a random number of times (in a range from 1 to 5). The picked images are randomly center cropped and randomly horizontally flipped to avoid overfitting in the training process. The labels, as voxel data, are constructed with five dimensions [batch_id, the channel of masks, x-axis, y-axis, z-axis]. The channels represent the original or masked objects (entity true or false).
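As a sketch of the random view sampling and augmentation described above (the crop size is an assumption; the thesis does not state it):

```python
import random
import numpy as np

def sample_views(view_images, crop=127):
    """Pick 1-5 of the 24 rendered views of a block and apply a random
    crop and a random horizontal flip, as in the loading step above."""
    picked = random.sample(view_images, random.randint(1, 5))
    batch = []
    for img in picked:                      # img: H x W x 3 uint8 array
        h, w, _ = img.shape
        top = random.randint(0, h - crop)   # random crop window
        left = random.randint(0, w - crop)
        patch = img[top:top + crop, left:left + crop]
        if random.random() < 0.5:           # random horizontal flip
            patch = patch[:, ::-1]
        batch.append(patch)
    return np.stack(batch)                  # [n_views, crop, crop, 3]
```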

4.4.1.2 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN). Given the encoded input, a set of proposed 3D Convolutional LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate. At time step t, when a new input is received, the operation of an LSTM unit can be expressed as below, where i_t, o_t, and f_t refer to the input gate, the output gate, and the forget gate respectively, and s_t and h_t refer to the memory cell and the hidden state respectively (formula2).

$$f_t = \sigma(W_f \mathcal{T}(x_t) + U_f * h_{t-1} + b_f)$$
$$i_t = \sigma(W_i \mathcal{T}(x_t) + U_i * h_{t-1} + b_i)$$
$$o_t = \sigma(W_o \mathcal{T}(x_t) + U_o * h_{t-1} + b_o)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s \mathcal{T}(x_t) + U_s * h_{t-1} + b_s)$$
$$h_t = o_t \odot \tanh(s_t)$$

formula2 3D-LSTM kernel, forget, and update gates [2]

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability of the existence of each voxel cell, computed by a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

4.4.2 3D mesh reconstruction

A voxel model is difficult to clean up and simplify into a feasible model in urban design modeling software. Also, the voxel model stores data inside closed objects, which is inefficient and becomes increasingly large after scaling. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy in the training process. Since a building typology is influenced by properties like height limit or land-use, additional bitmaps of these properties can help our network during training. To distribute the computational workload, two machine learning neural networks are proposed: a) one translating a 2D image of building typology to a top-view location map, and b) one performing 3D mesh reconstruction.

4.4.2.1 Data structure of the training/validating dataset

At the parcel level, an aerial image is cropped from its parent block-level image and scaled to a fixed resolution. Additional information like height limit and land-use is stored as gray-scale images on the 2D top view. Ground truth labels are 3D point clouds with normals (6 dimensions in total) calculated from the mesh models (fig10).


fig10 examples of the training/validating dataset

4.4.2.2 Network A: translating a 2D image of building typology to a top-view location map

Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of a target parcel. As mesh reconstruction does not preserve the location of 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by 24 surrounding cameras and cropped by the bounding box of this parcel. N, the number of views of a parcel, is picked randomly in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space for each view. The feature array then merges into a single column vector via Gated Recurrent Units (GRU).

$$F_{mid} = [\,f_{typology};\ f_{parcel};\ f_{c_1};\ \dots;\ f_{c_k}\,]$$

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula3) stores the feature of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g., the height-limit bitmap). The following parts of the network are a decoder and a generator, as in a classic GAN. In each iteration (formula4) we combine the MSE loss of the 2D height map prediction, the least-squares reconstruction loss of a reconstructed rendering image (compared to all angles of renderings of this parcel, returning the minimum one), and the MSE losses of the additional constraints. The final combined loss is optimized by Adam with a learning rate of 0.001.

$$L = L_{MSE}^{hm} + \min_{v} L_{LS}^{(v)} + \sum_{k} L_{MSE}^{c_k}$$

formula4 combined loss

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typology with their ground truth 3D models (stored as point clouds with normals) through the Chamfer distance loss, the normal loss, and Laplacian regularization. In Pixel2Mesh, the Chamfer loss constrains the locations of mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.
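The Chamfer distance between a predicted vertex set $P$ and a ground-truth point set $Q$ takes its standard form:

$$d_{CD}(P,Q) = \sum_{p \in P} \min_{q \in Q} \lVert p-q \rVert_2^2 + \sum_{q \in Q} \min_{p \in P} \lVert p-q \rVert_2^2$$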

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are then placed using the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding the unnecessary computational load of images and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of the training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential parcels with two buildings occupy the most), we scale it by land-use and building-number factors to 10,216 samples (tab5). This prevents our network from overfitting to the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA samples

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial rendering images of building typologies via their bounding boxes. Additionally, in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information. Spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry is reduced to 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. The csv table of each spatial parcel data input therefore holds the 32 coordinate values (16 vertices × 2) plus these property fields.


The ground-truth label is the spatial building data corresponding to the parcels. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. Each csv row of spatial building data holds the number of buildings in the parcel, ten height values, and ten geometry slots. The table thus includes 331 values (1 + 10 + 10 × 16 × 2).
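A minimal sketch of packing one parcel's label into this 331-value row (the field order is an assumption consistent with the counts above):

```python
import numpy as np

def pack_label(n_buildings, heights, footprints):
    """Flatten one parcel's ground truth into the 331-value row:
    1 (building count) + 10 (heights) + 10 x 16 x 2 (footprint vertices)."""
    row = np.zeros(331, dtype=float)
    row[0] = n_buildings
    row[1:11] = heights                        # ten height slots
    row[11:] = np.asarray(footprints).ravel()  # [10, 16, 2] -> 320 values
    return row
```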

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
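The pseudo code of fig14 and fig15 reduces each geometry to 16 evenly spaced, slightly smoothed vertices. A possible shapely-based sketch of those two steps (the simplification tolerance and the smoothing weight are assumptions):

```python
from shapely.geometry import Polygon

def resample_to_16(poly: Polygon, n=16, tol=1.0):
    """Simplify a parcel/building polygon and resample its boundary to a
    fixed number of vertices (cf. fig14)."""
    ring = poly.simplify(tol).exterior       # drop redundant vertices
    pts = [ring.interpolate(i / n, normalized=True) for i in range(n)]
    return [(p.x, p.y) for p in pts]         # n (longitude, latitude) pairs

def smooth(vertices, alpha=0.5):
    """One pass of neighbor averaging over a closed ring (cf. fig15)."""
    n = len(vertices)
    out = []
    for i, (x, y) in enumerate(vertices):
        px, py = vertices[i - 1]             # previous vertex (wraps around)
        nx, ny = vertices[(i + 1) % n]       # next vertex
        out.append((alpha * x + (1 - alpha) * (px + nx) / 2,
                    alpha * y + (1 - alpha) * (py + ny) / 2))
    return out
```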


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval so that it can be learned by neural networks. After analyzing the dataset, we can filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices), sparsifying the coordinate values.

Parcel fields:

| Field name | longitude | latitude | area | perimeter | height limit |
|---|---|---|---|---|---|
| extent | -9980~9994 | -9988~9989 | 396~34567 | 83~761 | 0~100 |
| Scale method | min-max | min-max | min-max | min-max | min-max |
| Output extent | -1~1 | -1~1 | 0~1 | 0~1 | 0~1 |
| Preserve | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

Building fields:

| Field name | longitude | latitude | number of buildings | height |
|---|---|---|---|---|
| extent | -9928~9828 | -9934~9982 | 1~9 | 1~985 |
| Scale method | min-max | min-max | min-max | min-max |
| Output extent | -1~1 | -1~1 | 0~1 | 0~1 |
| Preserve | 1.0 | 1.0 | 1.0 | 1.0 |

tab6 The conversion of the dataset (extents measured over the entire training/validating dataset)
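A minimal sketch of the geometry normalization described above (scaling every coordinate by the max extent of its parent parcel):

```python
import numpy as np

def normalize_by_parcel(verts, parcel_verts):
    """Map geometry coordinates into [-1, 1] using the center and max
    extent of the parent parcel boundary."""
    parcel = np.asarray(parcel_verts, dtype=float)        # [16, 2] lon/lat
    center = (parcel.max(axis=0) + parcel.min(axis=0)) / 2
    half_extent = (parcel.max(axis=0) - parcel.min(axis=0)).max() / 2
    return (np.asarray(verts, dtype=float) - center) / half_extent
```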


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). First, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (ResNet as in 3D-R2N2, or VGG16, as the base model) convolutes the input images (N views are picked randomly per parcel, where N is in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a single column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Second, we convolute the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3), and densify them along with a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, the building heights, and the building geometry as three separate outputs, with a convolution layer only in the geometry prediction.

fig16 geospatial prediction network structure
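The following is a minimal Keras sketch of this architecture, not the exact implementation of the thesis: the 1024/1000/300 feature sizes follow the numbers mentioned in the text, while the input resolutions, activations, and the remaining layer widths are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumed shapes: up to 5 views of 127x127 RGB renderings per parcel,
# a 16-vertex parcel geometry, and 4 scalar parcel properties.
img_in = layers.Input((5, 127, 127, 3), name="views")
geo_in = layers.Input((16, 2), name="parcel_geometry")
prop_in = layers.Input((4,), name="parcel_properties")

# Image branch: a shared VGG16 over the view sequence, merged by a GRU (FC1).
vgg = tf.keras.applications.VGG16(include_top=False, weights=None,
                                  input_shape=(127, 127, 3))
x = layers.TimeDistributed(vgg)(img_in)
x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)
fc1 = layers.GRU(1024)(x)                        # time-sequential merge

# Geometry branch (FC2): 1D convolution over the 16 parcel vertices.
g = layers.Conv1D(32, 3, padding="same", activation="relu")(geo_in)
fc2 = layers.Flatten()(g)

# Property branch (FC3).
fc3 = layers.Dense(64, activation="relu")(prop_in)

# Dense features and the concatenated latent vector (MID), cf. formula5.
dc1 = layers.Dense(1000, activation="relu")(fc1)
dc2 = layers.Dense(300, activation="relu")(fc2)
dc3 = layers.Dense(300, activation="relu")(fc3)
mid = layers.Concatenate(name="MID")([dc1, dc2, dc3])

# Three decoders: building count, ten heights, and 10 x 16 x 2 footprints
# (only the geometry decoder uses a convolution layer, as described above).
n_out = layers.Dense(1, activation="relu", name="num_buildings")(mid)
h_out = layers.Dense(10, activation="sigmoid", name="heights")(mid)
y = layers.Dense(10 * 16 * 2, activation="tanh")(mid)
y = layers.Reshape((10, 16, 2))(y)
geo_out = layers.Conv2D(2, 3, padding="same", activation="tanh",
                        name="footprints")(y)

model = Model([img_in, geo_in, prop_in], [n_out, h_out, geo_out])
```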


$$\mathrm{MID} = [\,\mathrm{DC1};\ \mathrm{DC2};\ \mathrm{DC3}\,]$$

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the differences between the prediction and the ground-truth label in terms of the number of buildings, the building heights, and the vertices of the building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies from parcel to parcel. For an instance of a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values (2 × 16 × 2) in Loss3. Loss3 considers several parts (center, relative vertices, unit tangent vector, curvature, edge length, and more) to let the neural network learn the geometries. They constrain different geometry characteristics, as listed in fig17:

- Center: the location of building footprints
- Relative vertices: the shape and size of each geometry
- Unit tangent vector: the rotation angle of each geometry
- Discrete curvature: the corners of each geometry
- Edge length: the edge length of each geometry
- Extent x and y: the horizontal and vertical scale of building footprints
- Maximum discrete curvature: the maximum corner angle of each geometry
- Number of corners: the number of corners of each geometry
- Normalized coordinates: the shape and size of each geometry
- Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction


The final combined loss leans more toward Loss3, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for a better organization of losses. Thus all losses are in the range of 0 to 1, and they are balanced by weight factors that follow the factors in Pixel2Mesh and are adjusted by experimental results (see the next section).

The loss terms of formula6 are as follows.

Loss1, the number of buildings:
$$L_1 = (\hat{n} - n)^2$$

Loss2, the ten building heights, averaged over the n valid slots:
$$L_2 = \frac{1}{n}\sum_{i=1}^{n} (\hat{h}_i - h_i)^2$$

Loss3, the building footprint geometries, sums ten terms over the n valid footprints:

- Center coordinates: the L1 distance between the predicted and true footprint centers
- Absolute coordinates: the L1 distance between the predicted and true vertices
- Relative coordinates: the L1 distance between the vertices after subtracting the footprint center
- Normalized coordinates: the L1 distance after rolling the vertices to a canonical start and batch-normalizing the coordinates
- Unit tangent vector: the distance between the unit tangent vectors of corresponding edges
- Discrete curvature: the difference between the turning angles at corresponding vertices
- Edge length: the difference between corresponding edge lengths
- Extent x and y: the difference between the horizontal and vertical extents
- Maximum discrete curvature: the difference between the maximum corner angles
- The number of corners: the difference between the corner counts

Combined loss:
$$L = w_1 L_1 + w_2 L_2 + \sum_{k} w_{3,k} L_{3,k}$$

formula6 Loss functions
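Because the number of buildings varies per parcel, Loss2 and Loss3 must mask the unused slots, as described above. A minimal numpy sketch of this masking (array shapes follow the 331-value rows of section 4.5.1):

```python
import numpy as np

def masked_losses(pred_h, true_h, pred_xy, true_xy, n_buildings):
    """Masked height and coordinate errors for one parcel.
    pred_h/true_h: [10] heights; pred_xy/true_xy: [10, 16, 2] footprints."""
    n = int(n_buildings)                 # only the first n slots are valid
    loss2 = np.mean((pred_h[:n] - true_h[:n]) ** 2)
    # L1 distance over the first n footprints (n * 16 * 2 coordinates),
    # e.g. the first 64 values for a two-building parcel
    loss3 = np.mean(np.abs(pred_xy[:n] - true_xy[:n]))
    return loss2, loss3
```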


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence its evaluations are tested by modifying design elements.

5.1 Extracting building typologies

The first Mask R-CNN model is trained for 80 epochs and achieves a 0.11 bounding-box loss and a 0.41 mask loss on the training set, and a 0.15 bounding-box loss and a 0.40 mask loss on the validating set (fig18). As we can see, the network performs well in the bounding-box prediction but not as well in the mask prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results

As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output produces ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm. Also, voxel models demand increasingly heavy storage and computation as their scale grows, the same as the size comparison between vectorized and rasterized files. Moreover, voxel models are rare in urban design models with respect to their editability. Still, we can see the image learning capacity from the different results predicted from high-rise and low-rise reference images. Similarly to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network intended to predict the location map generates similar results from various inputs. Furthermore, the network cannot predict the angle of the building footprint well, producing unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain overly trivial mesh faces for most cases of urban design models. Linking these two networks (predicting the location map and 3D mesh reconstruction) makes it hard to avoid the bias of shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________


Spatial Data ______________________________________________________________________

fig20 Outputs from the three approaches

5.3 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined loss factors in Pixel2Mesh and the experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

We also test the ablation effects of the various loss functions (see tab7). The baseline test uses VGG16-GRU with all loss functions, training for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab8. We use these test results for further adjustments of the multiple loss functions.

| loss (weight) | Baseline | 3D-R2N2-GRU | VGG16-LSTM | Isolated dataset | Top-view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length |
|---|---|---|---|---|---|---|---|---|---|---|
| loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416 |
| loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621 |
| loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.066 |
| loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.05 |
| loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.184 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801 |
| loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636 |
| loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435 |
| loss 3E (1) | 0.0445 | 0.045 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.044 | 0.0424 | 0.0441 | 0.0514 |

tab7 losses of the ablation tests (the "w/o" columns each drop one loss term from the baseline)

The learning on the isolated dataset with VGG16-GRU shows the best results among the reference groups, and the geometry losses demonstrate their mutual effects. The 2D top-view dataset is only slightly worse in heights/numbers but better in the shape losses. In tab8 we can also see the constraints of shape, corner angle, and position in the different ablations of the loss functions.

tab8 spatial building predictions of the ablation tests

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent space plots (fig22 to fig25) show the distributions of the DC1/DC2 layers after t-SNE dimension reduction, representing the learning capacities for reference images and geometries that cannot simply be determined by loss values. The isolated dataset still shows the best distribution visually.
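A sketch of the plotting procedure (the feature dump and its file name are hypothetical; the DC1 width follows the 1000-dimension figure mentioned in the next paragraph):

```python
import numpy as np
from sklearn.manifold import TSNE

dc1 = np.load("dc1_features.npy")   # [n_samples, 1000] encoded DC1 features
xy = TSNE(n_components=2, perplexity=30).fit_transform(dc1)
# xy can then be scattered and colored by land-use, parcel size, etc.
```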

fig22 T-SNE mapping of the encoded features of reference images (DC1)


fig23 T-SNE mapping of the encoded features of parcel geometries (DC2)

fig24 T-SNE mapping of the encoded features of reference images, isolated dataset (DC1)


fig25 T-SNE mapping of the encoded features of reference images, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after encoding into latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained from the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights with the isolated dataset and a better size/number of buildings with the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution serves two functions, integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like csv file including the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR, etc.), users can run a new training and validating process on their own computers. By using the pre-trained model provided by this study or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, achieving the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts, through weights learned from the training set, a translation that generates building footprint geometries from a parcel geometry, as shape grammar does, for each parcel separately.

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does
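A sketch of the csv-to-GeoJSON conversion mentioned above (the csv column names are assumptions about the deliverable format):

```python
import csv
import json

def csv_to_geojson(src, dst):
    """Convert predicted spatial building rows into a GeoJSON
    FeatureCollection readable by most GIS platforms."""
    features = []
    with open(src) as f:
        for row in csv.DictReader(f):
            coords = json.loads(row["footprint"])   # [[lon, lat] * 16]
            features.append({
                "type": "Feature",
                "geometry": {"type": "Polygon", "coordinates": [coords]},
                "properties": {"height": float(row["height"])},  # z-value
            })
    with open(dst, "w") as f:
        json.dump({"type": "FeatureCollection", "features": features}, f)
```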


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Accordingly, this thesis addresses the four corresponding shortages of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also emerge from the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a larger number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data; it directly produces a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time time-consumption

As a part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the very beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach instead offers this opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of the manual urban design or rule-based workflow. It not only saves time but also makes decision-making more fluent, through an instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and experiments. Also, this study compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a great deal of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future works.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine the future steps for improving related AI-empowered design-assist pipelines. First, the input rendering images can be augmented to more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more scripts of shape grammar, rule-based systems can create massive training datasets with various styles, feeding these future elements. Second, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as a part of the building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationships between buildings and auxiliaries, which implies the possibility of predicting detailed building outputs. The Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520



4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

 | Find building typologies from reference images | Translate design language to scripts | Adjust to a site | Output data as | Update (after collecting decision-makers' comments)
manual | Find by experience | Analyze the building typology | Draw building footprints and extrude them | Static 3D mesh/NURBS models; static 2D drawings | Draw again from building footprints
rule-based | Find by experience | Create a rule via built-in functions | Apply a rule onto parcel geometries | Generated 3D mesh models; geospatial data; rule script | Change parameters or apply onto new parcel geometries
spatial data reconstruction (ours) | Extract by computer | Use a pre-trained model | Predict spatial buildings for parcel geometries | Geospatial data; trained model | Modify or predict again from new parcel geometries

tab1 The mechanisms of the manual, rule-based, and our approaches

In contrast (tab1), this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images, parcel geometries, and building geometries, and b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees. After creating rules intuitively, designers can apply rules to their site, enjoying the same advantages of rule-based systems in the following stages: achieving output data as geospatial data and updating geometries as groups.

42 Dataset
To allow a computer to learn 3D building typologies, a collection of 3D building models is necessary. Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2). Two raw datasets are required in the series of machine learning pipelines: a) parcel geometry with planning properties (e.g., land-use code, height limits) and b) building footprint geometry with height information. They will be augmented by scripts and prepared for the different machine learning pipelines (see the corresponding sections).


city | source (parcel/building) | Parcel count | Parcel properties | Building count | Building properties
Los Angeles | SCAG_county_zoning / Lariac 2008 building footprint | 2376370 | Land-use zoning, height limit | 3141244 | Height, elevation

tab2 The list of raw data

43 Extracting building typologies
Because our resulting "rule" is purposed to generate building typologies from parcel geometries, the 3D reconstruction pipelines should process at the parcel level. Hence, this first network performs before 3D reconstruction, extracting 2D aerial images of building typologies from input images of urban morphology (at the block level). The training/validating data is augmented from the raw spatial dataset. Given the 2D geometries of building footprints with height information, we extract them via a Blender script and achieve a 3D model file (obj) of the 3D building envelopes on each block. The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via a Blender script. These images are converted to binary images by masking each parcel of the block, and the multiple binary images are stacked as a multi-channel image for each view. Finally, a block model generates a) input data: 24 views of aerial images, and b) ground truth labels: 24 multi-channel mask images, where N is the number of parcels in this block (formula1). Random center cropping and random horizontal flipping serve during data loading to avoid overfitting. The augmented dataset is separated into training and validating data by a ratio of 0.8.

formula1 The ground-truth label as a multi-channel image; N is the number of parcels in a block
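A minimal Python sketch of this label construction and augmentation, assuming the renderings and masks are numpy arrays and using an illustrative crop size (not the exact value of this study):

import random
import numpy as np

def make_label(parcel_masks):
    # stack the N binary parcel masks (each H x W) of one view into a multi-channel image
    return np.stack(parcel_masks, axis=-1)

def augment(image, label, crop=384):  # crop size is an assumption
    h, w = image.shape[:2]
    top, left = random.randint(0, h - crop), random.randint(0, w - crop)
    image = image[top:top + crop, left:left + crop]
    label = label[top:top + crop, left:left + crop]
    if random.random() < 0.5:         # random horizontal flip, applied identically to both
        image, label = image[:, ::-1], label[:, ::-1]
    return image, label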

After loading the dataset, a Mask R-CNN network uses ResNet-101 as the base model, predicting masks as outputs from the input images (fig7). This part of the pipeline is an implementation of He's Mask R-CNN network [14], including the Region Proposal Network (RPN) and the prediction heads (classification, box, and binary mask prediction). For each Region of Interest (RoI), the loss function is constructed from the losses of classification, bounding box, and mask. The extracted parcel rendering images output here serve as a part of the input in the following 3D reconstruction approaches.


fig7 The pipeline of Mask R-CNN [14]
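At this stage the training loop reduces to summing the RoI losses the detection model returns. A minimal Python sketch with torchvision, as an illustration only: the stock ResNet-50 FPN variant stands in for the ResNet-101 base model used here, and num_classes covers background plus parcel:

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(num_classes=2)   # background + parcel
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, targets):
    # targets: list of dicts with 'boxes', 'labels', 'masks' per image
    loss_dict = model(images, targets)         # classification, box, and mask losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)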

44 3D reconstruction
In this thesis study, three reconstruction approaches were attempted on different types of inputs and ground-truth labels. The first two approaches are developed from general 3D reconstruction and serve as references. The comparisons of the approaches (tab3) and of 3D model formats (tab4) are shown below. While voxel and mesh models are popular in general machine learning studies, our final solution is built only for urban planning cases, absorbing the corresponding techniques from the first two approaches. Its GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms.

Approach | Input | Ground truth label | Software platform for data processing
1 3D voxel reconstruction | Aerial image of building typologies (png) | 3D building model (voxel, mat) | ArcGIS Pro, CityEngine, Binvox, PyTorch
2 3D mesh reconstruction | Aerial image of building typologies (png); 2D bitmap (png) | 3D building model (point cloud, xyz) | ArcGIS Pro, Blender, Tensorflow (Keras)
3 Geospatial data prediction | Aerial image of building typologies (png); parcel data (csv) | Building data (csv) | QGIS, Blender, Tensorflow (Keras)

tab3 Data structure of the three approaches


 | Voxel | Point cloud | Mesh | Nurbs | GeoJSON-like
Data loading | N x N x N x 1 (mass/void) | N x 3 (x, y, z) | N x (v1, v2, v3) | Degree, control pts, weights, params | 2D geometry (long/lat, N x 2) + property (with height or more info)
Reconstruction | from 2D pixels | from 2D pixels | Deform from a sphere/cube | Translate from mesh; detect shape grammar | Deform from 2D geometries
Evaluation | Logical is-or-not; Intersection over Union (IoU) | Chamfer distance (CD); Earth Mover's Distance (EMD) | Chamfer distance (CD); silhouette rendering; logical is-or-not | Chamfer distance (CD); silhouette rendering; logical is-or-not | Chamfer distance (CD); Earth Mover's Distance (EMD)
Software | Minecraft... | 3D scanning... | SketchUp, GIS... | Rhino... | GIS
ML project | 3D-R2N2, MarrNet... | PointNet, PointNet++... | Pixel2Mesh, Neural Renderer | - | -

tab4 The comparison of 3D geometry formats

441 3D voxel reconstruction
The first attempt implements the 3D-R2N2 approach, which is originally able to 3D-reconstruct the ten categories of furniture in its training dataset. In our implementation of synthesizing 3D urban morphology with the 3D-R2N2 approach, the input and ground truth data have been modified for feeding 3D building models into the network. Also, the hyper-parameters of the 3D-R2N2 network have been adjusted for 3D building models.

4411 Data structure of training/validating dataset
The training/validating set follows the ShapeNet structure, which contains two parts derived from 3D models: input rendering images and ground truth labels (fig8).


fig8 examples of training/test dataset

The input aerial images are taken by 24 virtual cameras, the same as the ones used in the network for extracting building typologies (as in the original 3D-R2N2). An ArcGIS Python script exports each block with the buildings on it to an ESRI Shapefile (shp). Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj). Each ground truth label is stored as voxel data, converted from a mesh model via Binvox into an nd-array (as in the original 3D-R2N2). The entire dataset includes 177 models and 4240 rendered images, separated into training and validating datasets by a ratio of 0.8. As a multi-view 3D reconstruction, a 3D-LSTM (Long-Short Term Memory) treats the multi-view images as time-sequential data, leading to a 5D input [view_id, batch_id, channel, width, height]. The inputs for each model are randomly picked from its 24 images a random number of times (in a range from 1 to 5). The picked images are randomly center cropped and randomly horizontally flipped to avoid overfitting in the training process. The labels as voxel data are constructed with five dimensions [batch_id, the channel of masks, x-axis, y-axis, z-axis], where the channels represent original or masked objects (entity true or false).
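A small Python sketch of assembling this 5D multi-view input (the array shapes and the zero-padding of shorter view sequences are assumptions for illustration; the original implementation follows 3D-R2N2):

import random
import numpy as np

def sample_views(renderings):
    # renderings: (24, C, H, W) aerial views of one block model
    n = random.randint(1, 5)              # random number of views per sample
    return renderings[random.sample(range(24), n)]

def collate(samples):
    # stack samples into the 5D tensor [view_id, batch_id, channel, width, height]
    n_max = max(s.shape[0] for s in samples)
    batch = np.zeros((n_max, len(samples)) + samples[0].shape[1:], dtype=np.float32)
    for b, s in enumerate(samples):
        batch[:s.shape[0], b] = s
    return batch

4412 Network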

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN). Given the encoded input, the proposed 3D Convolutional LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate. At time step t, when a new input x_t is received, the operation of an LSTM unit can be expressed as in formula2 (reconstructed here as the standard LSTM gate equations, which 3D-R2N2 applies convolutionally in 3D), where i_t, o_t, and f_t refer to the input gate, the output gate, and the forget gate respectively, and c_t and h_t refer to the memory cell and the hidden state respectively.

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t \odot \tanh(c_t)

formula2 3D-LSTM kernel, forget, and update gates [2]

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability of the existence of each voxel cell, computed by a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

442 3D mesh reconstruction
A voxel model is difficult to clean and simplify into a feasible model for urban design modeling software. Also, a voxel model stores data inside closed objects, which is inefficient and becomes increasingly large after scaling. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy in the training process. Since a building typology is influenced by properties like height limit or land-use, additional bitmaps of these properties can help our network during training. To distribute the computational workload, two machine learning neural networks are proposed: a) translating a 2D image of building typology to a top-view location map, and b) 3D mesh reconstruction.

4421 Data structure of training/validating dataset
At the parcel level, an aerial image is cropped from its parent block-level image and rescaled. Additional information like height limit and land-use is stored as gray-scale images on the 2D top view. Ground truth labels are 3D point clouds with normals (6 dimensions in total) calculated from mesh models (fig10).


fig10 examples of training/validating dataset

4422 Network A: translating a 2D image of building typology to a top-view location map
Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of the target parcel. As mesh reconstruction does not preserve the location of 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by the 24 surrounding cameras and cropped by the bounding box of the parcel; N, the number of views per parcel, is randomly chosen in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space. The feature array is merged into a column vector via Gated Recurrent Units (GRU).

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula3) stores the features of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g., the height-limit bitmap). The following parts of the network are a decoder and a generator, as in a classic GAN. In each iteration (formula4), we have the MSE loss of the 2D height map prediction, the least-squares reconstruction loss of a reconstructed rendering image (compared against all angles of renderings of this parcel, returning the minimum one), and the MSE losses of the additional constraints as multiple losses. The final combined loss is optimized by Adam with a learning rate of 0.001.

formula4 combined loss
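A hedged Tensorflow sketch of this combined loss; the equal weights and tensor layouts here are assumptions, not the tuned values of this study:

import tensorflow as tf

def combined_loss(pred_map, true_map, fake_render, renderings, pred_cons, true_cons):
    # MSE of the predicted 2D height/location map
    l_map = tf.reduce_mean(tf.square(pred_map - true_map))
    # least-squares reconstruction loss against the closest of the parcel's renderings
    per_view = tf.reduce_mean(tf.square(renderings - fake_render[tf.newaxis]), axis=[1, 2, 3])
    l_rec = tf.reduce_min(per_view)
    # MSE of each additional constraint bitmap (e.g. the height-limit map)
    l_con = tf.add_n([tf.reduce_mean(tf.square(p - t)) for p, t in zip(pred_cons, true_cons)])
    return l_map + l_rec + l_con   # equal weights assumed for illustration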

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typology with their ground truth 3D models (stored as point clouds with normals) through the chamfer distance loss, the normal loss, and Laplacian regularization. In Pixel2Mesh, the chamfer loss constrains the locations of mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and the final mesh.
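For reference, the chamfer distance between two point sets can be sketched in a few lines of Python (a brute-force numpy version; real pipelines use accelerated implementations):

import numpy as np

def chamfer_distance(p, q):
    # p: (N, 3) predicted points, q: (M, 3) ground-truth points
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)   # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()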

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are then placed according to the location maps predicted by Network A.

45 Spatial data prediction
In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding unnecessary computational usage on image and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

451 Data structure of training/validating dataset
The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential two-building parcels occupy the most), we scale it by land-use and number factors to 10216 samples (tab5). This avoids our network overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial rendering images of building typologies via their bounding boxes. Additionally, in this solution, an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the difference in learning results between 2D and 3D input information. Spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry comes to contain 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. The csv table of each spatial parcel data input is organized accordingly.


The ground-truth label is the spatial building data corresponding to each parcel. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. For each csv table of spatial building data, the properties include the number of buildings in the parcel, ten height values, and ten geometry records. The table thus includes 331 values (1 + 10 + 10 x 16 x 2).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
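A compact Python sketch of the interpolation/simplification and smoothing steps of fig14 and fig15, using shapely; the neighbor-averaging smoother is an assumption standing in for the exact method:

import numpy as np
from shapely.geometry import Polygon

def resample(poly, n=16):
    # interpolate/simplify a parcel ring to exactly n vertices, evenly spaced by arc length
    ring = poly.exterior
    return np.asarray([ring.interpolate(i / n, normalized=True).coords[0] for i in range(n)])

def smooth(verts, iterations=1):
    # simple neighbor-averaging smoothing of a closed ring of vertices (assumed variant)
    for _ in range(iterations):
        verts = (np.roll(verts, 1, axis=0) + verts + np.roll(verts, -1, axis=0)) / 3.0
    return verts

coords16 = smooth(resample(Polygon([(0, 0), (4, 0), (5, 3), (0, 4)])))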


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, which can be learned by neural networks. After analyzing the dataset, we filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices), spreading the coordinate values.

Parcel data (per field):
extent:        -9980~9994 | -9988~9989 | 396~34567 | 83~761 | 0~100
output extent: -1~1 | -1~1 | 0~1 | 0~1 | 0~1
preserve:      1.0 | 1.0 | 1.0 | 1.0 | 1.0

Building data (per field):
extent:        -9928~9828 | -9934~9982 | 1~9 | 1~985
output extent: -1~1 | -1~1 | 0~1 | 0~1
preserve:      1.0 | 1.0 | 1.0 | 1.0

tab6 The conversion of the dataset (from the entire training/validating dataset)
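A short Python sketch of the per-parcel normalization described above, assuming the vertices are numpy arrays of lon/lat pairs:

import numpy as np

def normalize_by_parcel(verts, parcel_verts):
    # scale any geometry by the max extent of its parcel boundary, mapping into [-1, 1]
    vmin, vmax = parcel_verts.min(axis=0), parcel_verts.max(axis=0)
    center = (vmin + vmax) / 2.0
    half = (vmax - vmin).max() / 2.0
    return (verts - center) / half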


452 Network
The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). Firstly, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (ResNet as in 3D-R2N2, or VGG16 as the base model) convolutes the input images (N views are randomly picked per parcel, where N is in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Secondly, we convolute the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3) and densify them, as well as a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in the latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, the building heights, and the building geometry as three separate outputs, with a convolution layer only in the geometry prediction.

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector
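A hedged Keras sketch of this three-input, three-output structure; the layer widths, the property count, and the image size are assumptions, and only the overall wiring follows the description above:

import tensorflow as tf
from tensorflow.keras import layers, Model

views = layers.Input(shape=(None, 224, 224, 3))  # N aerial renderings per parcel (N = 1..5)
geom  = layers.Input(shape=(16, 2))              # 16 parcel vertices (normalized lon/lat)
props = layers.Input(shape=(4,))                 # e.g. area, perimeter, height limit, land-use

cnn = tf.keras.applications.VGG16(include_top=False, pooling='avg')   # base model
fc1_r = layers.TimeDistributed(cnn)(views)       # time-sequential image features (FC1_R)
fc1 = layers.GRU(1024)(fc1_r)                    # merge the view sequence (FC1)
fc2 = layers.Flatten()(layers.Conv1D(32, 3, activation='relu')(geom)) # geometry feature
fc3 = layers.Dense(64, activation='relu')(props)                      # property feature
dc1 = layers.Dense(512, activation='relu')(fc1)
dc2 = layers.Dense(256, activation='relu')(fc2)
dc3 = layers.Dense(64, activation='relu')(fc3)
mid = layers.Concatenate()([dc1, dc2, dc3])      # latent feature vector (MID)

count = layers.Dense(1, name='count')(mid)                            # number of buildings
heights = layers.Dense(10, name='heights')(mid)                       # 10 height slots
geo = layers.Reshape((10, 16, 2), name='geometry')(layers.Dense(10 * 16 * 2)(mid))
model = Model([views, geom, props], [count, heights, geo])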

Three loss functions (formula6) calculate the difference between the prediction and the ground-truth label in terms of the number of buildings, the building heights, and the vertices of the building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel: for an instance of a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers several parts to let the neural network learn the geometries, including center, relative vertices, unit tangent vector, curvature, edge length, and more. They constrain different geometry characteristics (fig17):

Center: the location of building footprints
Relative vertices: the shape and size of each geometry
Unit tangent vector: the rotation angle of each geometry
Discrete curvature: the corners of each geometry
Edge length: the edge length of each geometry
Extent x and y: the horizontal and vertical scale of building footprints
Maximum discrete curvature: the maximum corner angle of each geometry
Number of corners: the count of corners of each geometry
Normalized coordinates: the shape and size of each geometry
Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction


The final combined loss weights Loss3 more heavily, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for better loss organization. Thus all losses are in the range of 0 to 1, and they are balanced by weighting factors that follow the factors in Pixel2Mesh and are adjusted by experimental results (see the next section).

formula6 Loss functions (individual terms for: the number of buildings; the 10 building heights; center, absolute, relative, and normalized coordinates; rolled, normalized, and batch-normalized vertices; unit tangent vector; discrete curvature; edge length; extent x and y; maximum discrete curvature; the number of corners; and the combined loss)
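A minimal Python sketch of the masking idea in Loss2 and Loss3, where only the slots of real buildings contribute; L1 is used here for illustration:

import numpy as np

def masked_losses(n_true, h_true, h_pred, v_true, v_pred):
    # a two-building parcel uses 2 of the 10 height slots
    # and 2 x 16 x 2 = 64 of the coordinate values
    k = int(n_true)
    loss2 = np.abs(h_true[:k] - h_pred[:k]).mean()
    loss3 = np.abs(v_true[:k] - v_pred[:k]).mean()   # v_*: (10, 16, 2) vertex arrays
    return loss2, loss3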


5 RESULT AND EVALUATION
As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence its evaluations are tested by modifying design elements.

51 Extracting building typologies
The first Mask R-CNN model is trained for 80 epochs and achieves a 0.11 bounding box loss and 0.41 mask loss on the training set, and a 0.15 bounding box loss and 0.40 mask loss on the validating set (fig18). As we can see, the network performs well in the bounding-box prediction but not as well in the mask prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental results
As shown in fig20, our final solution predicts the cleanest and most usable outputs as resulting 3D models. The voxel output is an ambiguous probabilistic model that cannot be cleaned or simplified without introducing the bias of the algorithm. Voxel models also require heavier space computation as their scale increases, much like the size comparison between vectorized and rasterized files. Moreover, voxel models are rare in urban design modeling with respect to their editability. Still, we can see the image learning capacity from the different results predicted from high-rise and low-rise reference images. Similarly to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network intended to predict the location map generates similar results from various inputs. Furthermore, the network cannot predict the angles of building footprints well, producing unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most cases of urban design models. Linking these two networks (location map prediction and 3D mesh reconstruction) cannot easily avoid the bias of shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

fig20 Outputs from the three approaches (Voxel, Mesh, and Spatial Data panels)

53 Other evaluations
Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

We also test the ablation effects of the various loss functions (see tab7). The baseline test uses VGG16-GRU with all loss functions, training 25 epochs on the cropped dataset. The output geometries are also shown as images in tab8. We use these test results for further adjustments of the multiple loss functions.

loss (weight) | Baseline | 3D-R2N2-GRU | VGG16-LSTM | isolated dataset | Top-view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length
loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416
loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621
loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.0660
loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.0500
loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.1840 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801
loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636
loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435
loss 3E (1) | 0.0445 | 0.0450 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.0440 | 0.0424 | 0.0441 | 0.0514

tab7 Losses of the ablation test

The learning of the isolated dataset with VGG16-GRU shows the best results among the reference groups, and the losses of the geometries demonstrate their mutual effects. The 2D top-view dataset shows only slightly worse heights/numbers but a better shape loss. In tab8 we can also see the constraints of shape, corner, angle, and position in the different ablations of the loss functions.

tab8 Spatial building predictions of the ablation test

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent space plots (fig22 to fig25) show the distributions of the DC1/DC2 layers after T-SNE dimension reduction, representing learning capacities of the reference images and geometries that cannot simply be determined by loss values. The isolated dataset still shows the best distribution visually.
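A typical way to produce such plots, sketched in Python with scikit-learn; encoder and reference_images are hypothetical stand-ins for the trained DC1 branch and its inputs:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = encoder.predict(reference_images)   # hypothetical encoder; DC1 activations
xy = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(xy[:, 0], xy[:, 1], s=4)           # 2D map of the latent distribution
plt.show()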

fig22 T-SNE mapping of reference images' encoded features (DC1)


fig23 T-SNE mapping of parcel geometries' encoded features (DC2)

fig24 T-SNE mapping of reference images' encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images' encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after inputs are encoded as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the denser, taller ones in the larger parcel.
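A sketch of this latent space interpolation in Python; image_encoder, decoder, and parcel_feature are hypothetical stand-ins for the trained sub-networks and inputs:

import numpy as np

z_a = image_encoder.predict(img_a[None])[0]   # hypothetical encoder: 1000-d image feature
z_b = image_encoder.predict(img_b[None])[0]
for t in np.linspace(0.0, 1.0, 5):
    z = (1 - t) * z_a + t * z_b               # linear interpolation in the latent space
    buildings = decoder.predict([z[None], parcel_feature])   # novel intermediate prediction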


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-uses of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained from the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and a better size/number of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

61 Deliverable
After training and optimization, the final solution can serve two functions integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like csv file including the information of 3D urban morphology. After selecting a parcel dataset, a building footprint dataset, and the information alias names (e.g., height, height limit, FAR, etc.), users can run a new training and validating process on their own computers. By using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, achieving the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized in most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry just as shape grammar does, but separately.

fig32 How weights learned from images generate geometries, as the rule-based pipeline does
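A hedged Python sketch of this conversion with geopandas; the file name and column layout are assumptions, and any layout carrying 16 lon/lat pairs and a height per building works the same way:

import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

df = pd.read_csv('predicted_buildings.csv')        # hypothetical output file name
# hypothetical layout: columns lon0..lon15 / lat0..lat15 per footprint plus a height field
polys = [Polygon(zip([r[f'lon{i}'] for i in range(16)],
                     [r[f'lat{i}'] for i in range(16)])) for _, r in df.iterrows()]
gdf = gpd.GeoDataFrame(df[['height']], geometry=polys, crs='EPSG:4326')
gdf.to_file('morphology.geojson', driver='GeoJSON')  # read height as the extrusion z-value in a GIS viewer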


62 Contribution
Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology into a script). Because of that, this thesis addresses the four corresponding shortages of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved by the study.

621 Creativity
Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language into a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, liberating the possibility of creating a unique urban morphology.

622 Data-driven capacity
Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when involving a large number of data categories together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and cooperate with multiple sources of data, directly producing a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

623 First-time time-consumption
As part of the discussion above, the first-time cost of translating a building typology (whether a precedent from image references or a new idea from a designer's mind) is a pain of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the very beginning of visualizing urban morphologies.

624 Skill requirement
As writing a rule requires both programming and urban design skill, very few users can manipulate a rule-based system. A bigger opportunity provided by our approach is delivered to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


625 Other contributions
The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without a gap period, unlike the manual urban design or rule-based workflow. It not only saves time but also makes decision-making more fluent, through instant comparison of two strategies or accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the cost of licenses and experiments. Also, this study compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a great deal of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future work.

63 Future works
With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create massive training datasets with various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It could contribute to predicting the relationship between buildings and auxiliaries, which means the possibility of predicting detailed building outputs. The Mask R-CNN could also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES
More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520



The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)


| city | source (parcel/building) | parcel count | parcel properties | building count | building properties |
| Los Angeles | SCAG_county_zoning / Lariac 2008 building footprint | 2376370 | land-use zoning; height limit | 3141244 | height; elevation |

tab2 The list of raw data

4.3 Extracting building typologies

Since our resulting "rule" is intended to generate building typologies from parcel geometries, the 3D reconstruction pipelines should process data at the parcel level. Hence, this first network performs before 3D reconstruction, extracting 2D aerial images of building typologies from input images of urban morphology (at the block level). The training/validating data is augmented from the raw spatial dataset. Given the 2D geometries of building footprints with height information, we extrude them via a Blender script and achieve a 3D model file (.obj) of the 3D building envelopes on each block. The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via a Blender script. These images are converted into binary images by masking each parcel of the block, and the multiple binary images are stacked as a multi-channel image for each view. Finally, a block model generates a) input data: 24 views of aerial images, and b) ground truth labels: 24 multi-channel mask images, where N is the number of parcels in the block (formula1). Random center cropping and random horizontal flipping serve during data loading to avoid overfitting. The augmented dataset is separated into training and validating data by a ratio of 0.8.

$$Y = [M_1, M_2, \ldots, M_N] \in \{0,1\}^{H \times W \times N}$$

formula1 The ground-truth label as a multi-channel image, where N is the number of parcels in a block
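To make the stacking step concrete, here is a minimal NumPy sketch of assembling the multi-channel ground-truth label from per-parcel binary masks; the mask arrays themselves would come from the Blender masking step, and the function name is an illustrative assumption, not this study's actual script.

```python
import numpy as np

def build_block_label(parcel_masks):
    """Stack N binary parcel masks (H x W each) into one H x W x N label.

    parcel_masks: list of 2D arrays with values in {0, 1}, one mask per
    parcel of the block, all rendered from the same camera view.
    """
    label = np.stack(parcel_masks, axis=-1).astype(np.uint8)
    return label  # shape (H, W, N); channel i masks parcel i

# Example with two dummy 4x4 parcel masks
m1 = np.zeros((4, 4), dtype=np.uint8); m1[:2, :2] = 1
m2 = np.zeros((4, 4), dtype=np.uint8); m2[2:, 2:] = 1
y = build_block_label([m1, m2])
print(y.shape)  # (4, 4, 2)
```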

After loading the dataset, a Mask R-CNN network uses ResNet 101 as the base model, predicting masks as output from the input images (fig7). This part of the pipeline is an implementation of He's Mask R-CNN network [14], including the Region Proposal Network (RPN) and the predictions (classification, box, and binary mask prediction). For each Region of Interest (RoI), the loss function is constructed from the losses of classification, bounding box, and mask. The output extracted parcel rendering images serve as a part of the input in the following 3D reconstruction approaches.


fig7 The pipeline of Mask R-CNN [14]
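For reference, a minimal PyTorch setup of such a network might look like the sketch below. It uses torchvision's ResNet-50-FPN Mask R-CNN as a stand-in backbone (this study uses ResNet 101), and the class count of two (parcel building mass vs. background) is also an assumption.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_mask_rcnn(num_classes=2):
    # Mask R-CNN with a ResNet-50-FPN backbone as a stand-in
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=False)
    # Replace the box classification head for our class count
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Replace the mask prediction head likewise
    in_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_mask, 256, num_classes)
    return model

model = build_mask_rcnn()
model.eval()
with torch.no_grad():
    out = model([torch.rand(3, 256, 256)])  # one RGB aerial view
print(out[0].keys())  # boxes, labels, scores, masks
```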

4.4 3D reconstruction

In this thesis study, three reconstruction approaches have been attempted on different types of inputs and ground-truth labels. The first two approaches are developed on general 3D reconstruction and serve as references. The comparisons of the approaches (tab3) and of the 3D model formats (tab4) are shown below. While voxel and mesh models are popular in general machine learning studies, our final solution is built only for urban planning cases, absorbing the corresponding techniques from the first two approaches. Its GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms.

| approach | input | ground truth label | software platform for data processing |
| 1. 3D voxel reconstruction | aerial image of building typologies (png) | 3D building model (voxel, .mat) | ArcGIS Pro, CityEngine, Binvox, PyTorch |
| 2. 3D mesh reconstruction | aerial image of building typologies (png); 2D bitmap (png) | 3D building model (point cloud, .xyz) | ArcGIS Pro, Blender, Tensorflow (Keras) |
| 3. geospatial data prediction | aerial image of building typologies (png); parcel data (csv) | building data (csv) | QGIS, Blender, Tensorflow (Keras) |

tab3 Data structures of the three approaches


| | Voxel | Point cloud | Mesh | NURBS | GeoJSON-like |
| Data loading | N x N x N x 1 (mass/void) | N x 3 (x, y, z) | N x (v1, v2, v3) | degree, control pts, weights, params | 2D geometry (long/lat, N x 2) + property (with height or more info) |
| Reconstruction | from 2D pixels | from 2D pixels | deform from a sphere/cube | translate from mesh; detect shape grammar | deform from 2D geometries |
| Evaluation | logical is-or-not; Intersection over Union (IoU) | Chamfer distance (CD); Earth Mover's Distance (EMD) | CD; silhouette rendering; logical is-or-not | CD; silhouette rendering; logical is-or-not | CD; EMD |
| Software | Minecraft... | 3D scanning... | SketchUp, GIS... | Rhino... | GIS |
| ML project | 3D-R2N2, MarrNet... | PointNet, PointNet++... | Pixel2Mesh, Neural Renderer | n/a | n/a |

tab4 The comparison of 3D geometry formats

4.4.1 3D voxel reconstruction

The first attempt implements the 3D-R2N2 approach, which is originally able to 3D-reconstruct ten categories of furniture from its training dataset. In this implementation of synthesizing 3D urban morphology with the 3D-R2N2 approach, the input and ground truth data have been modified for feeding 3D building models into the network. Also, the hyper-parameters of the 3D-R2N2 network have been adjusted for 3D building models.

4.4.1.1 Data structure of the training/validating dataset

The training/validating set follows the ShapeNet structure, which contains two parts derived from 3D models: input rendering images and ground truth labels (fig8).


fig8 Examples of the training/test dataset

The input aerial images are taken by the same 24 virtual cameras used in the network for extracting building typologies (as in the original 3D-R2N2). An ArcGIS Python script exports each block with the buildings on it to an ESRI Shapefile (.shp). Then a CityEngine Python script exports every block with its 3D buildings to a mesh model (.obj). Each ground truth label is stored as voxel data, converted from a mesh model via Binvox into an nd-array (as in the original 3D-R2N2). The entire dataset includes 177 models and 4240 rendered images, separated into training and validating datasets by a ratio of 0.8. As a multi-view 3D reconstruction, a 3D-LSTM (Long-Short Term Memory) treats the multi-view images as time-sequential data, leading to a 5D input [view_id, batch_id, channel, width, height]. The inputs for each model are randomly picked from the 24 images a random number of times (in a range from 1 to 5). The picked images are randomly center-cropped and randomly horizontally flipped to avoid overfitting in the training process. The labels as voxel data are constructed with five dimensions [batch_id, the channel of masks, x-axis, y-axis, z-axis]. The channels represent the original or masked objects (entity true or false).
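A minimal sketch of this view-sampling and augmentation step is shown below, assuming the 24 views are already rendered into a NumPy array; the names and crop size are illustrative assumptions rather than the actual loader.

```python
import random
import numpy as np

def sample_views(render_views, min_views=1, max_views=5, crop=128):
    """Pick 1-5 of the 24 views, each randomly cropped and flipped.

    render_views: array of shape (24, channel, height, width).
    Returns an array of shape (n_views, channel, crop, crop).
    """
    n = random.randint(min_views, max_views)
    picked = random.sample(range(len(render_views)), n)
    batch = []
    for i in picked:
        img = render_views[i]
        _, h, w = img.shape
        top = random.randint(0, h - crop)      # jittered near-center crop
        left = random.randint(0, w - crop)
        img = img[:, top:top + crop, left:left + crop]
        if random.random() < 0.5:              # random horizontal flip
            img = img[:, :, ::-1]
        batch.append(img)
    return np.stack(batch)                     # (view_id, channel, crop, crop)

views = np.random.rand(24, 3, 137, 137).astype(np.float32)
print(sample_views(views).shape)               # e.g. (3, 3, 128, 128)
```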

4.4.1.2 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN). Given the encoded input, a set of proposed 3D Convolutional LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate. At time step t, when a new input x_t is received, the operation of an LSTM unit can be expressed as follows (formula2, following [2]), where i_t, o_t, and f_t refer to the input gate, the output gate, and the forget gate respectively, and s_t and h_t refer to the memory cell and the hidden state respectively.

$$f_t = \sigma(W_f \mathcal{T}(x_t) + U_f * h_{t-1} + b_f)$$
$$i_t = \sigma(W_i \mathcal{T}(x_t) + U_i * h_{t-1} + b_i)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s \mathcal{T}(x_t) + U_s * h_{t-1} + b_s)$$
$$h_t = \tanh(s_t)$$

formula2 3D-LSTM kernel: forget and update gates [2]

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability of the existence of each voxel cell, computed by a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

4.4.2 3D mesh reconstruction

A voxel model is difficult to clean and simplify into a feasible model for urban design modeling software. Also, a voxel model stores data inside closed objects, which is inefficient and becomes increasingly large after scaling. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy in the training process. Since a building typology is influenced by properties like the height limit or land-use, additional bitmaps of these properties can help our network in the training process. To distribute the computational workload, two machine learning neural networks have been proposed: a) translating a 2D image of a building typology to a top-view location map, and b) 3D mesh reconstruction.

4.4.2.1 Data structure of the training/validating dataset

At the parcel level, an aerial image is cropped from its parent block-level image and scaled. Additional information like the height limit and land-use is stored as gray-scale images on the 2D top view. The ground truth labels are 3D point clouds with normals (6 dimensions in total) calculated from the mesh models (fig10).


fig10 Examples of the training/validating dataset

4.4.2.2 Network A: translating a 2D image of a building typology to a top-view location map

Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of the target parcel. As mesh reconstruction does not preserve the location of 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by the 24 surrounding cameras and cropped by the bounding box of this parcel, where the number of views N per parcel is picked randomly in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space for each view. The feature array is merged into one column vector via Gated Recurrent Units (GRU). Thus we get a single feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula3) stores the features of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g., the height-limit bitmap).

formula3 The concatenated mid feature vector

The following parts of the network are the decoder and generator, as in a classic GAN. In each iteration (formula4), we have the MSE loss of the 2D height map prediction; the least-squares reconstruction loss of a reconstructed rendering image (compared to all angles of renderings of this parcel, returning the minimum one); and the MSE losses of the additional constraints. The final combined loss is optimized by Adam with a learning rate of 0.001.

formula4 Combined loss
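As a rough illustration of this multi-view encoding, the sketch below runs each rendering through a shared ResNet50 and merges the per-view features with a GRU; the image size and feature dimensions are placeholder assumptions, not the exact dimensions used in the thesis.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50

def build_view_encoder(max_views=5, img_size=224, feat_dim=1024):
    # Shared CNN encoder applied to every view (TimeDistributed)
    base = ResNet50(include_top=False, weights=None, pooling="avg",
                    input_shape=(img_size, img_size, 3))
    views = layers.Input(shape=(max_views, img_size, img_size, 3))
    per_view = layers.TimeDistributed(base)(views)   # (batch, views, 2048)
    merged = layers.GRU(feat_dim)(per_view)          # one vector per parcel
    return Model(views, merged)

encoder = build_view_encoder()
x = tf.random.uniform((2, 5, 224, 224, 3))
print(encoder(x).shape)  # (2, 1024)
```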

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typologies with their ground truth 3D models (stored as point clouds with normals) via the chamfer distance loss, the normal loss, and Laplacian regularization. In Pixel2Mesh, the chamfer loss constrains the locations of mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight to both the intermediate and final meshes.
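For concreteness, a plain NumPy version of the (squared) chamfer distance between two point sets might look like this; it is an illustration of the loss named above, not Pixel2Mesh's own implementation.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric squared chamfer distance between two point sets.

    p: (N, 3) predicted points; q: (M, 3) ground-truth points.
    """
    # Pairwise squared distances, shape (N, M)
    d = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pred = np.random.rand(128, 3)
gt = np.random.rand(256, 3)
print(chamfer_distance(pred, gt))
```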

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes input images and an initial ellipsoid with 156 vertices and 462 edges. The resulting 3D mesh models are then placed by the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too many computational resources (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding the unnecessary computational load of images and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches for urban design usage.

4.5.1 Data structure of the training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential, two-building parcels occupy the most), we scale the samples by land-use and building-number factors to 10216 samples (tab5). This prevents our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial rendering images of building typologies via their bounding boxes. Additionally, in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information. The spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. Each spatial parcel data input is thus stored as a csv table of these geometry and property values.


The ground-truth label is the spatial building data corresponding to the parcels. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. For each csv table of spatial building data, its properties include the number of buildings in the parcel, ten height values, and ten geometry records. The table thus includes 331 values (1 + 10 + 10 x 16 x 2).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
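As a hedged illustration of the two preprocessing steps named in fig14 and fig15, the sketch below resamples a polygon boundary to exactly 16 vertices by arc-length interpolation and then applies Chaikin corner cutting for smoothing; both function bodies are assumptions consistent with the captions, not the original pseudocode.

```python
import numpy as np

def resample_polygon(coords, n_vertices=16):
    """Interpolate and simplify a closed polygon to n_vertices points,
    spaced evenly by arc length along the boundary (cf. fig14)."""
    pts = np.asarray(coords, dtype=float)
    closed = np.vstack([pts, pts[:1]])                 # close the ring
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])        # arc-length stations
    targets = np.linspace(0.0, s[-1], n_vertices, endpoint=False)
    x = np.interp(targets, s, closed[:, 0])
    y = np.interp(targets, s, closed[:, 1])
    return np.stack([x, y], axis=1)

def chaikin_smooth(pts, iterations=1):
    """Smooth a closed polygon by Chaikin corner cutting (cf. fig15)."""
    pts = np.asarray(pts, dtype=float)
    for _ in range(iterations):
        nxt = np.roll(pts, -1, axis=0)
        q = 0.75 * pts + 0.25 * nxt
        r = 0.25 * pts + 0.75 * nxt
        pts = np.empty((len(q) * 2, 2))
        pts[0::2], pts[1::2] = q, r
    return pts

square = [(0, 0), (4, 0), (4, 3), (0, 3)]
poly16 = resample_polygon(square)                      # 16 even vertices
print(poly16.shape, chaikin_smooth(poly16).shape)      # (16, 2) (32, 2)
```

In practice, the smoothed ring would be resampled back to 16 vertices so that every geometry shares one fixed length.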


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval so that it can be learned by neural networks. After analyzing the dataset, we filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the maximum extent of its parcel boundary (vertices), sparsifying the coordinate values; a sketch of this normalization follows tab6.

Parcel fields:
| extent | -9980~9994 | -9988~9989 | 396~34567 | 83~761 | 0~100 |
| output extent | -1~1 | -1~1 | 0~1 | 0~1 | 0~1 |
| preserve | 10 | 10 | 10 | 10 | 10 |

Building fields:
| extent | -9928~9828 | -9934~9982 | 1~9 | 1~985 |
| output extent | -1~1 | -1~1 | 0~1 | 0~1 |
| preserve | 10 | 10 | 10 | 10 |

tab6 The conversion of the dataset (from the entire training/validating dataset)
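A minimal sketch of the per-parcel normalization described above, assuming simple min-max scaling by the parcel's own bounding box:

```python
import numpy as np

def normalize_by_parcel(parcel_xy, building_xy):
    """Scale parcel and building vertices to [-1, 1] using the max
    extent of the parcel boundary, keeping both in one frame."""
    lo = parcel_xy.min(axis=0)
    hi = parcel_xy.max(axis=0)
    center = (lo + hi) / 2.0
    half = (hi - lo).max() / 2.0       # max extent keeps the aspect ratio
    return (parcel_xy - center) / half, (building_xy - center) / half

parcel = np.array([[0.0, 0.0], [40.0, 0.0], [40.0, 30.0], [0.0, 30.0]])
bldg = np.array([[10.0, 10.0], [30.0, 10.0], [30.0, 20.0], [10.0, 20.0]])
p, b = normalize_by_parcel(parcel, bldg)
print(p.min(), p.max(), b.min(), b.max())
```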


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). Firstly, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (ResNet as in 3D-R2N2, or VGG16, as the base model) convolutes the input images (N views are randomly picked per parcel, where N is in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Secondly, we convolute the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3), and densify them as well as a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in the latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, the building heights, and the building geometry as three separate outputs, with a convolution layer only in the geometry prediction.

fig16 geospatial prediction network structure
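A condensed Keras sketch of this three-input/three-output layout is shown below; the 16-vertex geometries and ten-building cap follow the text, while the remaining layer sizes and the four-property input are placeholder assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_VIEWS, IMG, MAX_BLDG, NV = 5, 128, 10, 16

def build_predictor():
    # Input 1: up to five aerial renderings per parcel
    imgs = layers.Input((MAX_VIEWS, IMG, IMG, 3))
    cnn = tf.keras.applications.VGG16(include_top=False, weights=None,
                                      pooling="avg", input_shape=(IMG, IMG, 3))
    fc1 = layers.GRU(1024)(layers.TimeDistributed(cnn)(imgs))      # FC1

    # Input 2: parcel geometry (16 vertices of long/lat)
    geom = layers.Input((NV, 2))
    fc2 = layers.Flatten()(layers.Conv1D(32, 3, padding="same")(geom))

    # Input 3: parcel properties (e.g. area, perimeter, height limit, land-use)
    prop = layers.Input((4,))
    fc3 = layers.Dense(64, activation="relu")(prop)

    mid = layers.concatenate([layers.Dense(512)(fc1),              # DC1
                              layers.Dense(256)(fc2),              # DC2
                              layers.Dense(64)(fc3)])              # DC3 -> MID

    n_out = layers.Dense(1, name="num_buildings")(mid)
    h_out = layers.Dense(MAX_BLDG, name="heights")(mid)
    g = layers.Dense(MAX_BLDG * NV * 2)(mid)
    g_out = layers.Reshape((MAX_BLDG, NV, 2), name="geometry")(g)
    return Model([imgs, geom, prop], [n_out, h_out, g_out])

model = build_predictor()
model.summary(line_length=80)
```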


formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the differences between the prediction and the ground-truth label in terms of the number of buildings, the building heights, and the vertices of the building footprints. In particular, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel. For an instance of a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values (2 buildings x 16 vertices x 2 coordinates) in Loss3. Loss3 considers several parts to let the neural network learn the geometries, including the center, relative vertices, unit tangent vector, curvature, edge length, and more. They constrain different geometry characteristics as follows (fig17):

- Center: the location of building footprints
- Relative vertices: the shape and size of each geometry
- Unit tangent vector: the rotation angle of each geometry
- Discrete curvature: the corners of each geometry
- Edge length: the edge length of each geometry
- Extent x and y: the horizontal and vertical scale of building footprints
- Maximum discrete curvature: the maximum corner angle
- Number of corners: the number of corners
- Normalized coordinates: the shape and size of each geometry
- Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction


The final combined loss leans more on Loss3, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for a better organization of the losses. Thus all losses are in the range of 0 to 1, and they are balanced by weight factors that consider the factors in Pixel2Mesh and are adjusted by experimental results (see the next section). The loss terms of formula6 are:

- Loss1: the number of buildings
- Loss2: the ten building heights (only the values of the available buildings are counted)
- Loss3: the building footprint geometries, composed of the terms of fig17: center coordinates, absolute coordinates, relative coordinates, normalized coordinates (also computed on rolled vertices and batch-normalized coordinates), unit tangent vector, discrete curvature, edge length, extent x and y, maximum discrete curvature, and the number of corners
- Combined loss: the weighted sum of Loss1, Loss2, and the Loss3 terms

formula6 Loss functions
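The masking of unavailable values in Loss2 and Loss3 can be illustrated with a small NumPy sketch; the MSE form here is an assumption consistent with the text, not the exact formula6.

```python
import numpy as np

def masked_height_loss(pred_h, true_h, n_buildings, max_bldg=10):
    """MSE over only the first n_buildings of the ten height slots."""
    mask = np.arange(max_bldg) < n_buildings    # e.g. [T, T, F, ..., F]
    diff = (pred_h - true_h) ** 2
    return diff[mask].mean()

pred = np.random.rand(10)
true = np.zeros(10); true[:2] = [12.0, 9.0]     # a two-building parcel
print(masked_height_loss(pred, true, n_buildings=2))
```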


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence its evaluations are tested by modifying design elements.

5.1 Extracting building typologies

The first Mask R-CNN model is trained for 80 epochs and achieves a 0.11 bounding-box loss and a 0.41 mask loss on the training set, and a 0.15 bounding-box loss and a 0.40 mask loss on the validating set (fig18). As we can see, the network performs well in the bounding-box prediction but not as well in the mask prediction (fig19).

fig18 Training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results

As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output is an ambiguous probability model that cannot be cleaned or simplified without biasing the result toward the cleaning algorithm. Also, voxel models demand increasingly heavy space and computation as their scale grows, analogous to the size comparison between vectorized and rasterized files. Moreover, voxel models are rare in urban design practice with respect to their editability. However, we can see the image-learning capacity from the different results predicted from high-rise and low-rise reference images.

Similar to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network purposed to predict the location map generates similar results from various inputs. Furthermore, the network is not able to predict the angle of the building footprint well, producing unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most cases of urban design models. Linking these two networks (location map prediction and 3D mesh reconstruction) cannot easily avoid the bias of the shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

(fig20 rows: Voxel, Mesh, Spatial Data)

fig20 Outputs from the three approaches

5.3 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and the experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of the loss functions


fig21 Loss plotting of the final solution

We also test the ablation effect of the various loss functions (see tab7). The baseline test uses VGG16-GRU with all loss functions, training 25 epochs on the cropped dataset. The output geometries are also shown as images in tab8. We use these test results for further adjustments of the multiple loss functions.

| loss (weight) | baseline | 3D-R2N2-GRU | VGG16-LSTM | isolated dataset | top-view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length |
| loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416 |
| loss1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621 |
| loss2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.066 |
| loss3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.05 |
| loss3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.184 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801 |
| loss3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636 |
| loss3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435 |
| loss3E (1) | 0.0445 | 0.045 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.044 | 0.0424 | 0.0441 | 0.0514 |

tab7 Losses of the ablation tests

Training on the isolated dataset with VGG16-GRU performs best among the reference groups, and the geometry losses demonstrate their mutual effects on each other. The 2D top-view dataset shows only slightly worse results in heights/numbers but better results in the shape loss. In tab8 we can also see the constraints of shape, corner, angle, and position in the different ablations of the loss functions.

tab8 Spatial building predictions of the ablation tests

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent space plots (fig22 to fig25) show the distributions of the DC1/DC2 layers after T-SNE dimension reduction, representing the learning capacities for reference images and geometries that cannot simply be judged by loss values. The isolated dataset still shows the best distribution visually.
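Such plots can be produced along the lines of the sketch below, using scikit-learn's T-SNE on the extracted feature vectors; the feature dimension is a placeholder assumption.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.rand(500, 512)   # DC1 activations, one row per parcel
xy = TSNE(n_components=2, perplexity=30, init="random",
          random_state=0).fit_transform(features)

plt.scatter(xy[:, 0], xy[:, 1], s=4)
plt.title("T-SNE of encoded reference-image features (DC1)")
plt.savefig("tsne_dc1.png", dpi=150)
```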

fig22 T-SNE mapping of reference images' encoded features (DC1)


fig23 T-SNE mapping of parcel geometries' encoded features (DC2)

fig24 T-SNE mapping of reference images' encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images' encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial to be able to manipulate the learning network to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after encoding as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.
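Latent interpolation of this kind reduces to a linear blend between two encoded vectors before decoding, as in the sketch below; `encode` and `decode` stand in for the trained sub-networks and are not defined here.

```python
import numpy as np

def interpolate_latent(z_a, z_b, steps=5):
    """Linearly blend two latent vectors into a sequence of
    intermediate codes, one per interpolation step."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1.0 - t) * z_a + t * z_b for t in ts]

# z_a, z_b would come from the trained encoder; here: stand-ins
z_a = np.random.rand(1000)   # latent code of reference image A
z_b = np.random.rand(1000)   # latent code of reference image B
for z in interpolate_latent(z_a, z_b):
    pass  # decode(z) -> predicted buildings for the fixed parcel
print(len(interpolate_latent(z_a, z_b)))
```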


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and better sizes/numbers of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution serves two functions, integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like csv file including the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR, etc.), users can run a new training and validating process on their computers. By using the pre-trained model provided by this study or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, achieving the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix, with weights learned from the training set, generating building footprint geometries from a parcel geometry separately, as a shape grammar does.

fig32 How weights learned from images work to generate geometries, as the rule-based pipeline does
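For instance, converting a predicted footprint into a standard GIS format with its extrusion height could look like the sketch below, using GeoPandas and Shapely; the file names, column layout, and coordinate system are assumptions.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Assumed output: one predicted footprint with its height value
footprint = Polygon([(0.0, 0.0), (12.0, 0.0), (12.0, 8.0), (0.0, 8.0)])
gdf = gpd.GeoDataFrame({"height": [24.0]}, geometry=[footprint],
                       crs="EPSG:4326")

gdf.to_file("predicted_buildings.geojson", driver="GeoJSON")  # lossless
gdf.to_file("predicted_buildings.shp")                        # for desktop GIS
# Any 3D-capable GIS viewer can extrude the footprint by "height"
```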


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortages of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved by the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can translate only a limited design language into a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and cooperate with multiple sources of data, directly producing a matching rule from the given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time time-consumption

As a part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when first applying a building typology to a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without a gap period, unlike the manual urban design or rule-based workflows. It not only saves time but also makes decision-making more fluent through the instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and experiments. Also, this study compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The explorations of the data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a lot of computational energy, as a comparison of the batch size (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future work.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented to more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape grammar scripts, rule-based systems can create massive training datasets with various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of the building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationship between buildings and auxiliaries, which means the possibility of predicting a detailed building output. The Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES More experimental results




8 BIBLIOGRAPHY

1) CityEngine official website: https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI: https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page: https://www.google.com/docs/about
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 11: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

1 CONTEXT 11 Background City planning committees which include elected officials (or their designees) specialists and resident representatives evaluate urban design proposals (made by urban designers) of development projects in specific locations An urban design proposal involves a 3D urban morphology (the combination of 3D building envelopes on blocks) to visualize spatial demographic data (eg residents jobs transportation) and building typology (on parcels) A qualified urban proposal requires multiple stakeholders in the city planning committee to negotiate their individual ideas such as population strategy landfill context and building design Urban designers collect and translate these ideas into a 3D urban morphology and visualize them Building a 3D urban morphology traditionally is time-consuming for urban designers due to the heavy manual modeling workload of massive building geometries Therefore itrsquos hard to execute enough rounds of iterations before a proposal truly meets all complicated requirements of the city planning committee To avoid the manual modeling work of deploying each building typology to every parcel of a site some rule-based approaches (such as ESRIrsquos CityEngine [1]) generate 3D urban morphology from a 2D site via rules The generative rule is programmed with built-in functions translating design language to a script involving parameters of building and zoning code When the rule performs well a rule-based approach can live up to its advertising However how to make creating rules intuitive for designers who donrsquot have coding skills is the bottleneck of rule-based systems with the result that designers in the real world still avoid using them for making 3D urban morphology

fig1 Some built-in generative functions provided by ESRI CityEngine

11

12 The problems of rule-based systems This study aims to improve the rule-creating process in terms of its limitations of creativity data-driven capacity first-time time-consumption and coding skill requirement 121 Creativity Although building masses can be generated by rules they are hard to be called ldquocreativerdquo because of the limited built-in functions (fig1) provided by rule-based systems (eg setback subdivision LOU-shape) It is difficult to simulate ideal urban morphologies made by master urban designers as well as make detailed adjustments in common ones (fig2) Both decision-makers and urban designers tend to avoid producing homogeneous morphologies in their design making them resistant to rule-based systems An acceptable approach should be able to evaluate novel urban morphologies Creativity is important for market applications

fig2 Urban morphologies in the existing urban context ideal urban design common urban design and

generative design 122 Data-driven capacity Apart from making novel forms rule-based systems cannot utilize all data produced by urban studies such as traffic congestion job-resident balance and population growth Utilizing spatial data within an urban design proposal is called ldquoGeoDesignrdquo offered by ESRI the same software producer of CityEngine Ideally this new decision-making workflow allows multiple users (who can either be decision-makers or urban designers) to contribute their data into a weighted overlapping system for visualizing It only achieves limited effects because of the complicated relationship between building typology and data For instance (fig3) a tower building with a large ground square shares the same Floor Area Ratio (FAR) with a low warehouse occupying most of the parcel A machine learning-empowered approach can play a part here matching the

12

empirical outcome of building typology instead of organizing the role and factor of each data in an ambiguous formula

fig3 Building typologies with similar FARs

123 First-time time-consumption As a part of the discussions above rule-based approaches require a large decision tree to simulate the shape grammar of a building typology The cost of using rule-based approaches is always larger than that of manual modeling work the first time when designers use it in a proposal because they need an excessive amount of time constructing specific rules for the project Urban designers therefore would not like to embrace rule-based approaches as a part of their design process before they can enjoy the saving of time from applying a rule recurrently 124 Coding skill requirement Creating rules for building typologies and spatial data requires an understanding of both programming and design Since only a few urban designers have such a combination of skills this requirement impedes its broader application Also as the clients and end-users of urban development planning committee members should be able to provide many aspects that are ignored by urban designers which means a user-friendly approach with the features of creating rules with low skill requirements and real-time visualization will bring more opportunities to larger groups of participants and facilitate decision-making iterations 13 Possible approaches to improve To solve the bottleneck of creating rules we need help from machine learning techniques similar to how the bottleneck of the expert system (rule-based system) had been broken through in the history of artificial intelligence In recent years 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies replacing some algorithms in classic computational theories It runs by minimizing the loss between 3D predicted objects from 2D images and their ground truth 3D models The multiple ways to realize 3D reconstruction from single images in the format of voxel [234] point cloud [56] and mesh [67] inspired me to explore the replacement of a) creating rules by built-in functions with b) creating rules by image references which is much more intuitive

13

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1: $Y = [M_1, M_2, \dots, M_N] \in \{0, 1\}^{H \times W \times N}$ — the ground-truth label as a multi-channel image, where N is the number of parcels in a block and $M_i$ is the binary mask of parcel i.
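A minimal sketch of this augmentation step, assuming a hypothetical helper load_parcel_mask that rasterizes one parcel's binary mask (the thesis's own Blender/loader scripts are not reproduced here):

import random
import numpy as np

def make_block_sample(aerial_img, parcel_ids, load_parcel_mask, crop=192):
    # Stack one binary mask per parcel -> an (H, W, N) multi-channel label.
    masks = np.stack([load_parcel_mask(pid) for pid in parcel_ids], axis=-1)
    # Random crop, applied identically to the image and its masks.
    h, w = aerial_img.shape[:2]
    top, left = random.randint(0, h - crop), random.randint(0, w - crop)
    img = aerial_img[top:top + crop, left:left + crop]
    masks = masks[top:top + crop, left:left + crop]
    # Random horizontal flip, again applied to both.
    if random.random() < 0.5:
        img, masks = img[:, ::-1], masks[:, ::-1]
    return img, masks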

After loading the dataset, a Mask R-CNN network with ResNet-101 as the base model predicts masks from the input images (fig7). This part of the pipeline is an implementation of He's Mask R-CNN network [14], including the Region Proposal Network (RPN) and the prediction heads (classification, box, and binary mask prediction). For each Region of Interest (RoI), the loss function is constructed from the classification, bounding-box, and mask losses. The extracted parcel rendering images output here serve as part of the input to the following 3D reconstruction approaches.
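For illustration, the same per-RoI loss structure can be exercised with torchvision's off-the-shelf Mask R-CNN; note this stock model uses a ResNet-50 FPN backbone rather than the ResNet-101 base used in the thesis, so it is a hedged stand-in rather than the exact network:

import torch
import torchvision

# Two classes: background + building mass on a parcel.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    weights=None, num_classes=2)

images = [torch.rand(3, 512, 512)]
targets = [{
    "boxes": torch.tensor([[32.0, 40.0, 180.0, 200.0]]),
    "labels": torch.tensor([1]),
    "masks": torch.zeros(1, 512, 512, dtype=torch.uint8),
}]

model.train()
loss_dict = model(images, targets)  # classification, box, mask, RPN losses
total_loss = sum(loss_dict.values())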


fig7 The pipeline of Mask R-CNN [14]

4.4 3D reconstruction

In this thesis study, three reconstruction approaches were attempted on different types of inputs and ground-truth labels. The first two approaches build on general 3D reconstruction and serve as references. Comparisons of the approaches (tab3) and of the 3D model formats (tab4) are shown below. While voxel and mesh models are popular in general machine learning studies, our final solution is built only for urban planning cases, absorbing the corresponding techniques from the first two approaches. Its GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms.

- Approach 1, 3D voxel reconstruction — input: aerial image of building typologies (png); ground truth label: 3D building model (voxel, .mat); software platform for data processing: ArcGIS Pro, CityEngine, Binvox, PyTorch.
- Approach 2, 3D mesh reconstruction — input: aerial image of building typologies (png) and 2D bitmap (png); ground truth label: 3D building model (point cloud, .xyz); software platform: ArcGIS Pro, Blender, TensorFlow (Keras).
- Approach 3, geospatial data prediction — input: aerial image of building typologies (png) and parcel data (csv); ground truth label: building data (csv); software platform: QGIS, Blender, TensorFlow (Keras).

tab3. The data structure of the three approaches.


- Voxel — data loading: N × N × N × 1 (mass/void); reconstruction: from 2D pixels; evaluation: logical is-or-not, Intersection over Union (IoU); software: Minecraft, etc.; ML projects: 3D-R2N2, MarrNet, etc.
- Point cloud — data loading: N × 3 (x, y, z); reconstruction: from 2D pixels; evaluation: Chamfer distance (CD), Earth Mover's Distance (EMD); software: 3D scanning, etc.; ML projects: PointNet, PointNet++, etc.
- Mesh — data loading: N × (v1, v2, v3); reconstruction: deform from a sphere/cube; evaluation: Chamfer distance (CD), silhouette rendering, logical is-or-not; software: SketchUp, GIS, etc.; ML projects: Pixel2Mesh, Neural Renderer.
- NURBS — data loading: degree/control points/weights/params; reconstruction: translate from mesh, detect shape grammar; evaluation: Chamfer distance (CD), silhouette rendering, logical is-or-not; software: Rhino, etc.
- GeoJSON-like — data loading: 2D geometry (long/lat, N × 2) with properties (height or more info); reconstruction: deform from 2D geometries; evaluation: Chamfer distance (CD), Earth Mover's Distance (EMD); software: GIS.

tab4. The comparison of 3D geometry formats.

4.4.1 3D voxel reconstruction

The first attempt implements the 3D-R2N2 approach, which was originally designed to 3D-reconstruct the ten categories of furniture in its training dataset. In our implementation of synthesizing 3D urban morphology with 3D-R2N2, the input and ground truth data have been modified so that 3D building models can be fed into the network, and the hyper-parameters of the 3D-R2N2 network have been adjusted for 3D building models.

4.4.1.1 Data structure of the training/validating dataset

The training/validating set follows the ShapeNet structure, containing two parts derived from 3D models: input rendering images and ground truth labels (fig8).


fig8 examples of the training/test dataset

The input aerial images are taken by 24 virtual cameras, the same as the ones used in the network for extracting building typologies (cf. the original 3D-R2N2). An ArcGIS Python script exports each block and the buildings on it to an ESRI Shapefile (.shp). Then a CityEngine Python script exports every block with its 3D buildings to a mesh model (.obj). Each ground truth label is stored as voxel data, converted from a mesh model via Binvox into an nd-array (cf. the original 3D-R2N2). The entire dataset includes 177 models and 4,240 rendered images, separated into training and validating datasets by a ratio of 0.8. As a multi-view 3D reconstruction, the 3D-LSTM (Long-Short Term Memory) treats the multi-view images as time-sequential data, leading to a 5D input [view_id, batch_id, channel, width, height]. The inputs for each model are randomly picked from its 24 images a random number of times (in a range from 1 to 5). The picked images are randomly center cropped and randomly horizontally flipped to avoid overfitting during training. The labels, as voxel data, are constructed with five dimensions [batch_id, the channel of masks, x-axis, y-axis, z-axis], where the channels represent original or masked objects (entity true or false).

4.4.1.2 Network
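A sketch of assembling that 5D multi-view input; the file layout and the load_rgb helper are hypothetical, and zero-padding along the view axis is one plausible way to batch sequences of unequal length:

import random
import numpy as np

def sample_views(render_paths, load_rgb, n_min=1, n_max=5):
    # Pick 1-5 of the 24 renderings of one block as a view sequence.
    picked = random.sample(render_paths, random.randint(n_min, n_max))
    return np.stack([load_rgb(p) for p in picked])       # (n, C, W, H)

def make_batch(blocks, load_rgb):
    views = [sample_views(b, load_rgb) for b in blocks]
    t = max(v.shape[0] for v in views)                   # longest sequence
    padded = [np.concatenate([v, np.zeros((t - v.shape[0],) + v.shape[1:])])
              if v.shape[0] < t else v for v in views]
    return np.stack(padded, axis=1)                      # (view, batch, C, W, H)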

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (the encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN). Given the encoded input, a set of proposed 3D Convolutional


LSTM (3D-LSTM) units either update their cell states or retain them by closing the input gate. At time step t, when a new input $x_t$ is received, the operation of an LSTM unit can be expressed as in formula2, where $i_t$, $o_t$, and $f_t$ refer to the input gate, the output gate, and the forget gate respectively, and $s_t$ and $h_t$ refer to the memory cell and the hidden state respectively.

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s x_t + U_s h_{t-1} + b_s)$
$h_t = o_t \odot \tanh(s_t)$

formula2. The 3D-LSTM kernel: forget and update gates [2].

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability of the existence of each voxel cell, computed with a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

4.4.2 3D mesh reconstruction

A voxel model is difficult to clean and simplify into a feasible model in urban design modeling software. Also, a voxel model stores data inside closed objects, which is inefficient and becomes increasingly large after scaling. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy in the training process. Since a building typology is influenced by properties like the height limit or land-use, additional bitmaps of these properties can help our network during training. To distribute the computational workload, two machine learning neural networks have been proposed: a) one translating a 2D image of a building typology to a top-view location map, and b) one performing 3D mesh reconstruction.

4.4.2.1 Data structure of the training/validating dataset

At the parcel level, an aerial image is cropped from its parent block-level image and rescaled. Additional information like the height limit and land-use is stored as gray-scale bitmaps on the 2D top view. Ground truth labels are 3D point clouds with normals (6 dimensions in total) calculated from the mesh models (fig10).


fig10 examples of the training/validating dataset

4.4.2.2 Network A: translating a 2D image of building typology to a top-view location map

Network A takes multiple inputs, including the extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of a target parcel. Since mesh reconstruction does not preserve the locations of the 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by the 24 surrounding cameras and cropped by the bounding box of this parcel; N, the number of views per parcel, is randomly chosen in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space, and the feature array is merged into a column vector via Gated Recurrent Units (GRU).

formula3. The concatenated mid feature vector.

Thus we obtain one feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula3) stores the features of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g., the height-limit bitmap). The following parts of the network are the decoder and the generator of a classic GAN. In each iteration (formula4), we compute the MSE loss of the 2D height map prediction, the least-squares reconstruction loss of a reconstructed rendering image (compared with all angles of renderings of this parcel, returning the minimum one), and the MSE losses of the additional constraints. The final combined loss is optimized by Adam with a learning rate of 0.001.

$L = L_{MSE}^{height} + L_{LS}^{render} + \sum_{k} L_{MSE}^{constraint_k}$

formula4. The combined loss.
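A hedged Keras sketch of Network A's encoder as described above (per-view ResNet50 features merged across the N views by a GRU); the input resolution and the 1024-dimension merged vector are illustrative assumptions, not the thesis's exact settings:

import tensorflow as tf
from tensorflow.keras import layers, Model

views = layers.Input(shape=(None, 224, 224, 3))   # (batch, N views, H, W, C)
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights=None, pooling="avg")

# Encode every view with the shared CNN, treating the view axis as "time",
# then merge the view sequence into one column vector with a GRU.
per_view = layers.TimeDistributed(backbone)(views)     # (batch, N, 2048)
merged = layers.GRU(1024)(per_view)

encoder = Model(views, merged)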

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typologies with their ground truth 3D models (stored as point clouds with normals) through the Chamfer distance loss, the normal loss, and Laplacian regularization. In Pixel2Mesh, the Chamfer loss constrains the locations of the mesh vertices, the normal loss enforces the consistency of the surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and the final mesh.
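A minimal numpy sketch of the (squared) Chamfer distance between two point sets P and Q, given as float arrays of shape (N, 3) and (M, 3):

import numpy as np

def chamfer_distance(P, Q):
    # Pairwise squared distances between every p in P and q in Q.
    d = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # (N, M)
    # Average distance from each point to its nearest neighbor in the other set.
    return d.min(axis=1).mean() + d.min(axis=0).mean()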

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2,466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The resulting 3D mesh models are placed according to the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding the unnecessary computation on images and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of the training/validating dataset

The training/validating datasets are made by QGIS/Blender/OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential parcels with two buildings occupy the most), we rescale it by land-use and building-number factors to 10,216 samples (tab5). This prevents our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial rendering images of building typologies via their bounding boxes. Additionally, in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information. The spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry is reduced to 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. The csv table of each spatial parcel data input thus stores these 16 vertex coordinates together with the property values.


The ground-truth label is the spatial building data corresponding to each parcel. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. Each csv table of spatial building data holds, as properties, the number of buildings on the parcel, ten height values, and ten geometry records. The table thus includes 331 values (1 + 10 + 10 × 16 × 2).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry
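The pseudo code of fig14 is not reproduced in this transcript; the following is a hedged Python sketch of the described step, resampling a closed parcel ring to a fixed 16 vertices by arc-length interpolation:

import numpy as np

def resample_ring(coords, n=16):
    # Interpolate/simplify a closed (lon, lat) ring to exactly n vertices.
    pts = np.asarray(coords, dtype=float)
    ring = np.vstack([pts, pts[:1]])                     # close the ring
    seg = np.linalg.norm(np.diff(ring, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])          # cumulative arc length
    t = np.linspace(0.0, s[-1], n, endpoint=False)       # n evenly spaced stops
    lon = np.interp(t, s, ring[:, 0])
    lat = np.interp(t, s, ring[:, 1])
    return np.stack([lon, lat], axis=1)                  # (16, 2)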

fig15 (pseudo code) Smoothing a geometry
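Likewise, a hedged sketch for fig15's smoothing step, pulling each vertex toward the midpoint of its two neighbors while preserving the vertex count:

import numpy as np

def smooth_ring(ring, alpha=0.5, iterations=2):
    r = np.asarray(ring, dtype=float).copy()
    for _ in range(iterations):
        mid = 0.5 * (np.roll(r, 1, axis=0) + np.roll(r, -1, axis=0))
        r = (1 - alpha) * r + alpha * mid    # blend each vertex toward midpoint
    return r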


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, which can be learned by neural networks. After analyzing the dataset, we filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the maximum extent of its parcel boundary (vertices), sparsifying the coordinate values.
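A hedged sketch of that per-parcel normalization, mapping parcel and building vertices into [-1, 1] around the parcel center:

import numpy as np

def normalize_by_parcel(parcel_ring, building_rings):
    center = parcel_ring.mean(axis=0)
    half_extent = (parcel_ring.max(axis=0) - parcel_ring.min(axis=0)).max() / 2
    scale = lambda r: (r - center) / half_extent
    return scale(parcel_ring), [scale(b) for b in building_rings]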

Parcel fields — extent: -9980~9994; -9988~9989; 396~34567; 83~761; 0~100. Output extent after scaling: -1~1; -1~1; 0~1; 0~1; 0~1. Preserved: 1.0 for every field.

Building fields — extent: -9928~9828; -9934~9982; 1~9; 1~985. Output extent after scaling: -1~1; -1~1; 0~1; 0~1. Preserved: 1.0 for every field.

tab6. The conversion of the dataset (from the entire training/validating dataset).


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (the number of buildings, the building heights, and the building geometry data). First, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units) as used in Pixel2Mesh. This network (ResNet, as in 3D-R2N2, or VGG16 as the base model) convolves the input images (N views randomly picked per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Second, we convolve the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3) and densify them, along with a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in the latent space. The feature sizes are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, the building heights, and the building geometry as three separate outputs, with convolution layers used only in the geometry prediction.

fig16 geospatial prediction network structure


formula5: MID = concat(DC1, DC2, DC3) — the concatenated mid feature vector.
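A hedged Keras sketch of this three-input/three-output structure; the 1024-dimension merged image feature follows the text, while the other layer sizes, the input resolution, and the four property fields are illustrative assumptions:

import tensorflow as tf
from tensorflow.keras import layers, Model

imgs = layers.Input(shape=(None, 224, 224, 3))   # N views per parcel
geom = layers.Input(shape=(16, 2))               # 16 parcel vertices
props = layers.Input(shape=(4,))                 # area, perimeter, height limit, land-use

cnn = tf.keras.applications.VGG16(include_top=False, weights=None, pooling="avg")
fc1 = layers.GRU(1024)(layers.TimeDistributed(cnn)(imgs))              # FC1
fc2 = layers.Flatten()(layers.Conv1D(32, 3, activation="relu")(geom))  # FC2
fc3 = layers.Dense(64, activation="relu")(props)                       # FC3

mid = layers.Concatenate()([layers.Dense(512)(fc1),   # DC1
                            layers.Dense(256)(fc2),   # DC2
                            layers.Dense(256)(fc3)])  # DC3 -> MID

n_bldg = layers.Dense(1, name="num_buildings")(mid)
heights = layers.Dense(10, name="heights")(mid)
geoms = layers.Reshape((10, 16, 2))(layers.Dense(10 * 16 * 2)(mid))

model = Model([imgs, geom, props], [n_bldg, heights, geoms])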

Three loss functions (formula6) calculate the differences between the predictions and the ground-truth labels in terms of the number of buildings, the building heights, and the vertices of the building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel: for an instance of a two-building parcel, we only calculate the first two out of the ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers several parts to let the neural network learn the geometries, including the center, relative vertices, unit tangent vector, curvature, edge length, etc. They constrain different geometry characteristics (fig17):

- Center: the location of the building footprints
- Relative vertices: the shape and size of each geometry
- Unit tangent vector: the rotation angle of each geometry
- Discrete curvature: the corners of each geometry
- Edge length: the edge length of each geometry
- Extent x and y: the horizontal and vertical scale of the building footprints
- Maximum discrete curvature: the maximum corner angle of each geometry
- Number of corners: the number of corners of each geometry
- Normalized coordinates: the shape and size of each geometry
- Absolute coordinates: the shape and location of each geometry

fig17. Ten losses of building geometry prediction.


The final combined loss leans more on Loss3, since its terms are the crucial parts of learning the shapes of the building footprints. Because the losses related to coordinates have maximum values of 2 (the difference of edge lengths) and 4 (the L1 distance between two coordinates), they are normalized respectively for a better-organized loss. Thus all losses lie in the range of 0 to 1, and they are balanced by weights that consider the factors in Pixel2Mesh and are adjusted by experimental results (see the next section).

formula6 defines each loss term over the n available buildings of a parcel: the number of buildings; the 10 building heights; the center coordinates; the absolute coordinates; the relative coordinates; the normalized coordinates (computed on rolled vertices and with batch normalization); the unit tangent vectors; the discrete curvature; the edge lengths; the extents in x and y; the maximum discrete curvature; the number of corners; and the combined loss as their weighted sum.

formula6. Loss functions.
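A hedged sketch of the "available values only" masking described above, assuming a parcel with n buildings, 10 height slots, and 10 footprint slots of 16 vertices each:

import numpy as np

def masked_losses(pred_h, true_h, pred_xy, true_xy, n):
    loss2 = np.mean((pred_h[:n] - true_h[:n]) ** 2)     # first n heights, MSE
    loss3 = np.mean(np.abs(pred_xy[:n] - true_xy[:n]))  # first n footprints, L1
    return loss2, loss3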


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. Since the final approach is a solution intended only for urban design usage, its evaluations are conducted by modifying design elements.

5.1 Extracting building typologies

The first Mask R-CNN model is trained for 80 epochs, achieving a 0.11 bounding-box loss and a 0.41 mask loss on the training set, and a 0.15 bounding-box loss and a 0.40 mask loss on the validating set (fig18). As we can see, the network does not perform well in mask prediction but performs well in bounding-box prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results

As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output is an ambiguous probabilistic model that cannot be cleaned or simplified without the bias of an algorithm. Voxel models also become heavier in storage and computation as their scale increases, much like the size comparison between vectorized and rasterized files, and voxel models are rare in urban design with respect to their editability. Nevertheless, we can see the image-learning capacity from the different results predicted from high-rise and low-rise reference images. Similarly to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network that is supposed to predict the location map generates similar results from various inputs; furthermore, it cannot predict the angles of building footprints well, producing unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most cases of urban design models. Linking these two networks (location map prediction and 3D mesh reconstruction) cannot easily avoid the bias of the shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________


Spatial Data ______________________________________________________________________

fig20. Outputs from the three approaches.

5.3 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, the factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

We also test the ablation effects of the various loss functions (tab5). The baseline test uses VGG16-GRU with all loss functions, training 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustments of the multiple loss functions.

The first five columns are model/dataset variants (Baseline; 3D-R2N2-GRU; VGG16-LSTM; isolated dataset; top view); the last five each drop one loss function (relative distance; unit tangent vector; curvature; center; edge length).

loss (factor)  Baseline  3D-R2N2-GRU  VGG16-LSTM  isolated  top-view  -rel.dist.  -tangent  -curvature  -center  -edge len.
total loss     0.8842    0.9645       0.8745      0.8351    0.9704    0.8353      0.7166    0.7189      0.4403   0.8416
loss 1 (0.5)   0.0601    0.0665       0.0598      0.0473    0.1226    0.0638      0.0583    0.0577      0.0489   0.0621
loss 2 (0.1)   0.0658    0.0636       0.0658      0.0507    0.0949    0.0652      0.0647    0.0652      0.0584   0.0660
loss 3A (1)    0.0518    0.0519       0.0514      0.0527    0.0324    0.0589      0.0525    0.0478      0.0498   0.0500
loss 3B (1)    0.1796    0.1793       0.1794      0.1840    0.0947    0.1795      0.1947    0.1839      0.1754   0.1801
loss 3C (1)    0.1623    0.1614       0.1618      0.1632    0.1381    0.1616      0.1744    0.2392      0.1602   0.1636
loss 3D (10)   0.0433    0.0514       0.0437      0.0371    0.0592    0.0439      0.0433    0.0432      0.0598   0.0435
loss 3E (1)    0.0445    0.0450       0.0448      0.0408    0.0428    0.0427      0.0440    0.0424      0.0441   0.0514

tab5. Losses of the ablation test.

The learning on the isolated dataset with VGG16-GRU performs best among the reference groups, and the losses of the geometries demonstrate their mutual effects on one another. The 2D top-view dataset performs only slightly worse in heights/numbers but better in the shape loss. In tab6 we can also see the constraints of shape, corner angle, and position in the different ablations of the loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent space plots (fig22 to fig25) show the distributions of the DC1/DC2 layers after t-SNE dimension reduction, representing the learning capacities for reference images and geometries that cannot simply be determined from the loss values. The isolated dataset still shows the best distribution visually.

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after encoding the inputs as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.
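A hedged sketch of the latent-space interpolation behind fig26 to fig29: two inputs are encoded, blended linearly, and decoded; encoder and decoder stand for the trained sub-models and are illustrative names:

import numpy as np

def interpolate(encoder, decoder, input_a, input_b, steps=5):
    za, zb = encoder.predict(input_a), encoder.predict(input_b)
    outputs = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * za + t * zb          # linear blend in latent space
        outputs.append(decoder.predict(z))   # novel intermediate prediction
    return outputs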


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (fig30 and fig31). Using the same network and settings, we get better heights with the isolated dataset and a better size and number of buildings with the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution serves two functions that integrate rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to obtain a GeoJSON-like csv file including the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR, etc.), users can run a new training and validating process on their own computers. By using the pre-trained model provided by my study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network uses a matrix of weights learned from the training set to generate building footprint geometries from a parcel geometry, as a shape grammar does separately.

fig32 How the weights learned from images work to generate geometries, as the rule-based pipeline does


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology into a script). Accordingly, this thesis addresses the four shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved by the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language into a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data, directly producing a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking those properties to building typologies.

6.2.3 First-time time consumption

As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the very beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from those provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of manual urban design or a rule-based system. It not only saves time but also makes decision-making more fluent through the instant comparison of two strategies or the accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and TensorFlow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the cost of licenses and experiments. This study also compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a great deal of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future work.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines. First, the input rendering images can be augmented with more styles, such as photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes; by involving more shape-grammar scripts, rule-based systems can create massive training datasets with various styles, feeding these future elements. Second, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline of this study (like the general approach of [21]). It could contribute to predicting the relationships between buildings and auxiliaries, opening the possibility of predicting detailed building outputs. Mask R-CNN could also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website: https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI: https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page: https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520


Page 12: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

12 The problems of rule-based systems This study aims to improve the rule-creating process in terms of its limitations of creativity data-driven capacity first-time time-consumption and coding skill requirement 121 Creativity Although building masses can be generated by rules they are hard to be called ldquocreativerdquo because of the limited built-in functions (fig1) provided by rule-based systems (eg setback subdivision LOU-shape) It is difficult to simulate ideal urban morphologies made by master urban designers as well as make detailed adjustments in common ones (fig2) Both decision-makers and urban designers tend to avoid producing homogeneous morphologies in their design making them resistant to rule-based systems An acceptable approach should be able to evaluate novel urban morphologies Creativity is important for market applications

fig2 Urban morphologies in the existing urban context ideal urban design common urban design and

generative design 122 Data-driven capacity Apart from making novel forms rule-based systems cannot utilize all data produced by urban studies such as traffic congestion job-resident balance and population growth Utilizing spatial data within an urban design proposal is called ldquoGeoDesignrdquo offered by ESRI the same software producer of CityEngine Ideally this new decision-making workflow allows multiple users (who can either be decision-makers or urban designers) to contribute their data into a weighted overlapping system for visualizing It only achieves limited effects because of the complicated relationship between building typology and data For instance (fig3) a tower building with a large ground square shares the same Floor Area Ratio (FAR) with a low warehouse occupying most of the parcel A machine learning-empowered approach can play a part here matching the

12

empirical outcome of building typology instead of organizing the role and factor of each data in an ambiguous formula

fig3 Building typologies with similar FARs

123 First-time time-consumption As a part of the discussions above rule-based approaches require a large decision tree to simulate the shape grammar of a building typology The cost of using rule-based approaches is always larger than that of manual modeling work the first time when designers use it in a proposal because they need an excessive amount of time constructing specific rules for the project Urban designers therefore would not like to embrace rule-based approaches as a part of their design process before they can enjoy the saving of time from applying a rule recurrently 124 Coding skill requirement Creating rules for building typologies and spatial data requires an understanding of both programming and design Since only a few urban designers have such a combination of skills this requirement impedes its broader application Also as the clients and end-users of urban development planning committee members should be able to provide many aspects that are ignored by urban designers which means a user-friendly approach with the features of creating rules with low skill requirements and real-time visualization will bring more opportunities to larger groups of participants and facilitate decision-making iterations 13 Possible approaches to improve To solve the bottleneck of creating rules we need help from machine learning techniques similar to how the bottleneck of the expert system (rule-based system) had been broken through in the history of artificial intelligence In recent years 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies replacing some algorithms in classic computational theories It runs by minimizing the loss between 3D predicted objects from 2D images and their ground truth 3D models The multiple ways to realize 3D reconstruction from single images in the format of voxel [234] point cloud [56] and mesh [67] inspired me to explore the replacement of a) creating rules by built-in functions with b) creating rules by image references which is much more intuitive

13

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)


fig8 examples of training/test dataset

The input aerial images are taken by 24 virtual cameras, the same as those used in the network for extracting building typologies (the original 3D-R2N2 uses 127 x 127 renderings). An ArcGIS Python script exports each block with the buildings on it to an ESRI Shapefile (shp); a CityEngine Python script then exports every block with 3D buildings to a mesh model (obj). Each ground truth label is stored as voxel data, converted from a mesh model via Binvox into an nd-array (the original 3D-R2N2 uses a 32 x 32 x 32 grid). The entire dataset includes 177 models and 4240 rendered images, separated into training and validating datasets by a ratio of 0.8. As a multi-view 3D reconstruction, a 3D-LSTM (Long-Short Term Memory) treats the multi-view images as time-sequential data, leading to 5D input data [view_id, batch_id, channel, width, height]. Inputs for each model are randomly picked from the 24 images, a random number of times (in a range from 1 to 5). The picked images are randomly center cropped and randomly horizontally flipped to avoid overfitting during training. The labels as voxel data are constructed with five dimensions [batch_id, the channel of masks, x-axis, y-axis, z-axis]; the channels represent original or masked objects (entity true or false). A data-loading sketch is shown below.
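The loading logic described above can be sketched as a PyTorch dataset; the file layout, crop size, and jitter range here are illustrative assumptions.

```python
# Sketch of the multi-view data loading: per sample, pick 1-5 of the 24 views,
# apply a random center crop and random horizontal flip, and pair the views
# with the voxel label. Paths and sizes are illustrative assumptions.
import random
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image

class BlockVoxelDataset(Dataset):
    def __init__(self, samples, crop=127):
        self.samples = samples  # list of (list_of_24_image_paths, voxel_npy_path)
        self.crop = crop

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        paths, voxel_path = self.samples[idx]
        views = []
        for p in random.sample(paths, k=random.randint(1, 5)):
            img = Image.open(p).convert("RGB")
            # random center crop: jitter the crop window around the image center
            w, h = img.size
            dx, dy = random.randint(-8, 8), random.randint(-8, 8)
            left = (w - self.crop) // 2 + dx
            top = (h - self.crop) // 2 + dy
            img = img.crop((left, top, left + self.crop, top + self.crop))
            if random.random() < 0.5:  # random horizontal flip
                img = img.transpose(Image.FLIP_LEFT_RIGHT)
            views.append(torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float() / 255.0)
        voxel = torch.from_numpy(np.load(voxel_path))  # [mask_channels, x, y, z]
        # views: [n_views, 3, crop, crop]; variable n_views needs a custom collate_fn
        return torch.stack(views), voxel
```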

4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN). Given the encoded input, a set of proposed 3D Convolutional LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate. At time step t, when a new input is received, the operation of an LSTM unit can be expressed as in formula2, where i_t, o_t, and f_t refer to the input gate, the output gate, and the forget gate respectively, and s_t and h_t refer to the memory cell and the hidden state respectively.

formula2 3D-LSTM kernel forget and update gate [2]
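For reference, a hedged paraphrase of the gate updates from [2] (whose 3D-LSTM variant drops the separate output gate) is given below; T(x_t) denotes the encoded input, the W and U terms are learned weights, * is the 3D convolution over the hidden grid, and the circled dot is the element-wise product. This is a reconstruction from the cited paper, not a verbatim copy of the thesis figure.

```latex
f_t = \sigma\left(W_f\, T(x_t) + U_f * h_{t-1} + b_f\right)
i_t = \sigma\left(W_i\, T(x_t) + U_i * h_{t-1} + b_i\right)
s_t = f_t \odot s_{t-1} + i_t \odot \tanh\left(W_s\, T(x_t) + U_s * h_{t-1} + b_s\right)
h_t = \tanh(s_t)
```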

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability of the existence of each voxel cell, computed with a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

442 3D mesh reconstruction

A voxel model is difficult to clean and simplify into a feasible model for urban design modeling software. Also, a voxel model stores data inside closed objects, which is inefficient and becomes increasingly large after scaling. Therefore the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy in the training process. Since a building typology is influenced by properties like height limit or land-use, additional bitmaps of these properties can help our network during training. To distribute the computational workload, two machine learning neural networks are proposed: a) one to translate a 2D image of building typology to a top-view location map, and b) one for 3D mesh reconstruction.

4421 Data structure of training/validating dataset

At the parcel level, an aerial image is cropped from its parent block-level image and scaled to a fixed resolution. Additional information like height limit and land-use is stored as gray-scale images on the 2D top view. Ground truth labels are 3D point clouds with normals (6 dimensions in total) calculated from mesh models (fig10).


fig10 examples of training/validating dataset

4422 Network A: translate a 2D image of building typology to a top-view location map

Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of a target parcel. As mesh reconstruction does not preserve the location of 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by the 24 surrounding cameras and cropped by the bounding box of the parcel; N, the number of views per parcel, is picked randomly in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing one feature vector per view in the latent space. The feature array is then merged into a single column vector via Gated Recurrent Units (GRU).

formula3 the concatenated mid feature vector

Thus we get one feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as additional columns. The concatenated column vector (formula3) stores the feature of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g., the height-limit bitmap). The following parts of the network are a decoder and a generator, as in a classic GAN. In each iteration (formula4) we have the MSE loss of the 2D height map prediction, the least-squares reconstruction loss of a reconstructed rendering image (compared to all angles of renderings of this parcel, returning the minimum one), and the MSE losses of the additional constraints, as multiple losses. The final combined loss is optimized by Adam with a learning rate of 0.001. An encoder sketch follows formula4.

formula4 combined loss
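The encoder half of Network A can be sketched in Keras as follows; the input resolution (224 x 224) and the merged latent width (1024) are illustrative assumptions, not the thesis's exact values.

```python
# Sketch of Network A's image encoder: a shared ResNet50 runs on each view and
# a GRU merges the view sequence into one latent column vector.
# Input size and latent width are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

views = layers.Input(shape=(None, 224, 224, 3))            # [batch, n_views, H, W, 3]
backbone = tf.keras.applications.ResNet50(include_top=False,
                                          weights="imagenet",
                                          pooling="avg")    # 2048-d vector per view
per_view = layers.TimeDistributed(backbone)(views)          # [batch, n_views, 2048]
latent = layers.GRU(1024)(per_view)                         # merge views -> [batch, 1024]
encoder = Model(views, latent, name="network_a_encoder")
```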

4423 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typology with their ground truth 3D models (stored as point clouds with normals) through the chamfer distance loss, normal loss, and Laplacian regularization. In Pixel2Mesh the chamfer loss constrains the location of mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative location between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are placed according to the location maps predicted by Network A.

45 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding unnecessary computation on image and mesh data. In this approach we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

451 Data structure of training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential, two-building parcels occupy the most), we scale it by land-use and building-number factors to 10216 samples (tab5). This prevents our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial renderings of building typologies via their bounding boxes. Additionally, an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings, and 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information. Spatial parcel data are csv files with geometries and properties. After interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry in the raw dataset); a geometry-resampling sketch follows fig15. In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. These geometry and property values form the csv table of each spatial parcel data input.


The ground-truth label is spatial building data corresponding to the parcels. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. For each csv table of spatial building data, the property fields contain the number of buildings in the parcel, ten height values, and ten geometry entries. The table thus includes 331 values (1 building count + 10 heights + 10 x 16 x 2 coordinates).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
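One way to realize the fig14/fig15 steps is sketched below, assuming shapely; the 16-vertex target matches the text, while the smoothing scheme (simple neighbor averaging) is an illustrative stand-in for the thesis's own pseudo code.

```python
# Sketch of normalizing a parcel/building polygon to 16 vertices (cf. fig14)
# and smoothing it (cf. fig15). Neighbor-averaging smoothing is an
# illustrative assumption, not the exact script used in the thesis.
import numpy as np
from shapely.geometry import Polygon

def resample_polygon(poly: Polygon, n: int = 16) -> np.ndarray:
    """Interpolate n evenly spaced vertices along the exterior ring."""
    ring = poly.exterior
    pts = [ring.interpolate(ring.length * i / n) for i in range(n)]
    return np.array([[p.x, p.y] for p in pts])

def smooth_polygon(coords: np.ndarray, iterations: int = 2) -> np.ndarray:
    """Average each vertex with its two neighbors (closed ring)."""
    for _ in range(iterations):
        coords = (np.roll(coords, 1, axis=0) + coords + np.roll(coords, -1, axis=0)) / 3.0
    return coords

parcel = Polygon([(0, 0), (40, 0), (40, 25), (18, 25), (18, 40), (0, 40)])
vertices = smooth_polygon(resample_polygon(parcel, 16))  # 16 x 2 (long, lat)
```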


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, which neural networks can learn. After analyzing the dataset, we filter the raw dataset and scale the values by the corresponding extents, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the maximum extent of its parcel boundary (vertices), so that the coordinate values spread over the full range.
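The per-parcel geometry normalization can be sketched as scaling by the parcel's bounding box; the field-by-field scalers in tab6 below are the thesis's own, and this is only an illustrative version.

```python
# Sketch of per-parcel geometry normalization: map parcel and building vertices
# into [-1, 1] using the parcel boundary's maximum extent (illustrative).
import numpy as np

def normalize_by_parcel(parcel_xy: np.ndarray, building_xy: np.ndarray):
    center = (parcel_xy.max(axis=0) + parcel_xy.min(axis=0)) / 2.0
    half_extent = (parcel_xy.max(axis=0) - parcel_xy.min(axis=0)).max() / 2.0
    scale = lambda xy: (xy - center) / half_extent
    return scale(parcel_xy), scale(building_xy)
```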

Parcel fields:
extent | -0.9980~0.9994 | -0.9988~0.9989 | 396~34567 | 83~761 | 0~100
Output extent | -1~1 | -1~1 | 0~1 | 0~1 | 0~1
Preserve | 1.0 | 1.0 | 1.0 | 1.0 | 1.0

Building fields:
extent | -0.9928~0.9828 | -0.9934~0.9982 | 1~9 | 1~985
Output extent | -1~1 | -1~1 | 0~1 | 0~1
Preserve | 1.0 | 1.0 | 1.0 | 1.0

tab6 The conversion of the dataset (from the entire training/validating dataset)


452 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). Firstly, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (ResNet as in 3D-R2N2, or VGG16, as the base model) convolutes the input images (N views are randomly picked per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a single column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Secondly, we convolute the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3), and densely connect them, as well as a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in the latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, building heights, and building geometry as three separate outputs, with convolution layers used only in the geometry prediction. A sketch of this fusion follows formula5.

fig16 geospatial prediction network structure


formula5 the concatenated mid feature vector
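A compact Keras sketch of this fusion is given below; the layer widths, input shapes, and property count are illustrative assumptions, with the text's FC/DC names kept as comments. The multi-view image encoder follows the same pattern as the one sketched for Network A.

```python
# Sketch of the geospatial prediction network: three inputs (multi-view images,
# parcel geometry, parcel properties) fused into one latent vector (MID), then
# three decoders (building count, heights, footprint geometry).
# Layer widths are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

images = layers.Input(shape=(None, 224, 224, 3))    # N views per parcel
geometry = layers.Input(shape=(16, 2))              # 16 parcel vertices (long, lat)
properties = layers.Input(shape=(4,))               # area, perimeter, height limit, land-use

vgg = tf.keras.applications.VGG16(include_top=False, pooling="avg")
fc1 = layers.GRU(1024)(layers.TimeDistributed(vgg)(images))     # FC1: merged view feature

fc2 = layers.Flatten()(layers.Conv1D(32, 3, activation="relu")(geometry))  # FC2
fc3 = layers.Dense(64, activation="relu")(properties)                      # FC3

dc1 = layers.Dense(512, activation="relu")(fc1)     # DC1
dc2 = layers.Dense(256, activation="relu")(fc2)     # DC2
dc3 = layers.Dense(128, activation="relu")(fc3)     # DC3
mid = layers.Concatenate()([dc1, dc2, dc3])         # MID latent vector

count = layers.Dense(1, name="num_buildings")(mid)
heights = layers.Dense(10, name="heights")(mid)
geo = layers.Dense(10 * 16 * 2)(mid)
geo = layers.Reshape((10, 16, 2), name="footprints")(geo)

model = Model([images, geometry, properties], [count, heights, geo])
```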

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings, the building heights, and the vertices of the building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel: for an instance of a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers multiple parts to let the neural network learn the geometries, including center, relative vertices, unit tangent vector, curvature, edge length, and more. They constrain different geometry characteristics (fig17):

Center: the location of building footprints
Relative vertices: the shape and size of each geometry
Unit tangent vector: the rotation angle of each geometry
Discrete curvature: the corners of each geometry
Edge length: the edge length of each geometry
Extent x and y: the horizontal and vertical scale of building footprints
Maximum discrete curvature: the maximum corner angle of each geometry
Number of corners: the number of corners of each geometry
Normalized coordinates: the shape and size of each geometry
Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction


The final combined loss leans more heavily on Loss3, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for a better loss organization. Thus all losses fall in the range of 0 to 1, and they are balanced by weighting factors, which consider the factors in Pixel2Mesh and are adjusted by experimental results (see the next section). The individual terms are listed below.

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions
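As an illustration of the masked pattern described above, representative terms could take roughly the following form; this is a hedged sketch with assumed notation (n the true building count, hats for predictions, w_k the weights of formula7), not the thesis's exact formulas.

```latex
\mathcal{L}_{1} = \left(n - \hat{n}\right)^{2}, \qquad
\mathcal{L}_{2} = \frac{1}{n}\sum_{i=1}^{n}\left(h_i - \hat{h}_i\right)^{2}, \qquad
\mathcal{L}_{3,\mathrm{center}} = \frac{1}{n}\sum_{i=1}^{n}\left\lVert c_i - \hat{c}_i\right\rVert_{1},
\qquad
\mathcal{L} = \sum_{k} w_k\, \mathcal{L}_{k}
```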


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence its evaluations are tested by modifying design elements.

51 Extracting building typologies

The first Mask R-CNN model is trained for 80 epochs and achieves a 0.11 bounding-box loss and 0.41 mask loss on the training set, and a 0.15 bounding-box loss and 0.40 mask loss on the validating set (fig18). As we can see, the network does not perform well in mask prediction but performs well in bounding-box prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental results

As shown in fig20, our final solution predicts the cleanest and most usable outputs among the resulting 3D models. The voxel output is an ambiguous probability model that cannot be cleaned or simplified without introducing the bias of the algorithm. Voxel models also demand increasingly heavy storage and computation as their scale grows, much like the size comparison between vectorized and rasterized files, and they are rare in urban design with respect to their editability. However, we can see the image learning capacity from the different results predicted from high-rise and low-rise reference images. Similarly to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network that is supposed to predict the location map generates similar results from various inputs; furthermore, it cannot predict the angle of the building footprint well, yielding unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they contain too many trivial mesh faces for most urban design modeling cases, and linking the two networks (location map prediction and 3D mesh reconstruction) cannot easily avoid the bias of the shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches

53 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, the factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

We also test the ablation effects of the various loss functions (see tab5). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.

loss (weight) | Baseline | 3D-R2N2-GRU | VGG16-LSTM | isolated dataset | Top-view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length
loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416
loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621
loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.066
loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.05
loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.184 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801
loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636
loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435
loss 3E (1) | 0.0445 | 0.045 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.044 | 0.0424 | 0.0441 | 0.0514

tab5 losses of ablation test

The isolated dataset with VGG16-GRU learns best among the reference groups, and the geometry losses demonstrate their interrelated effects. The 2D top-view dataset performs only slightly worse in heights and building numbers but better in the shape loss. In tab6 we can also see the constraints of shape, corner angle, and position in the different ablations of the loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after t-SNE dimension reduction, representing learning capacities for the reference images and geometries that cannot simply be determined by loss values. The isolated dataset still shows the best distribution visually.
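Such plots can be reproduced with scikit-learn's t-SNE; below is a minimal sketch, assuming the encoded DC1 features have been exported as a numpy array (the file name is an assumption).

```python
# Sketch of the latent-space visualization: reduce encoded DC1/DC2 features to
# 2D with t-SNE and scatter-plot them. Feature file name is an assumption.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.load("dc1_features.npy")        # [num_parcels, feature_dim]
embedded = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], s=4)
plt.title("t-SNE of encoded reference-image features (DC1)")
plt.savefig("tsne_dc1.png", dpi=200)
```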

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial to be able to steer the learning network to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after the inputs are encoded as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the dense and taller ones in the larger parcel.
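The interpolation itself is a linear blend in latent space; below is a sketch, assuming the encoder and decoder halves of the trained model have been split out as separate Keras models (all names here are assumptions).

```python
# Sketch of latent-space interpolation between two reference designs:
# encode both, blend linearly, and decode each intermediate vector.
# `encoder`, `decoder`, `inputs_a`, and `inputs_b` are assumed to exist.
import numpy as np

z_a = encoder.predict(inputs_a)   # latent vector of design A
z_b = encoder.predict(inputs_b)   # latent vector of design B
for t in np.linspace(0.0, 1.0, num=5):
    z = (1.0 - t) * z_a + t * z_b              # linear interpolation in latent space
    count, heights, footprints = decoder.predict(z)
```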


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained from the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and a better size and number of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

61 Deliverable

After training and optimization, the final solution can serve two functions, integrating rule-based approaches into the decision-making process of urban development: a) create a customized "rule" intuitively, and b) apply a "rule" and obtain a GeoJSON-like csv file including the information of the 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR), users can run a new training and validating process on their own computers. By using the pre-trained model provided by my study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized in most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry, much as a shape grammar does.
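Converting the predicted csv into a GIS-ready file takes only a few lines with geopandas; below is a sketch, assuming columns as described in section 451 (16 vertex pairs plus a height per building; all column names are illustrative assumptions).

```python
# Sketch of converting predicted building rows (vertices + height) into a
# GeoJSON file that any GIS platform can extrude by the height attribute.
# File and column names are illustrative assumptions.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

df = pd.read_csv("predicted_buildings.csv")
records = []
for _, row in df.iterrows():
    coords = [(row[f"x{i}"], row[f"y{i}"]) for i in range(16)]
    records.append({"height": row["height"], "geometry": Polygon(coords)})

gdf = gpd.GeoDataFrame(records, crs="EPSG:4326")
gdf.to_file("predicted_buildings.geojson", driver="GeoJSON")
```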

fig32 How weights learned from images work to generate geometries, as the rule-based pipeline does


62 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Accordingly, this thesis addresses the four corresponding shortages of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also come out of the study.

621 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language into a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, liberating the possibility of creating a unique urban morphology.

622 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data; it directly produces a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

623 First-time time-consumption

As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology to a site. Our approach breaks this limit and brings a better experience to users from the very beginning of visualizing urban morphologies.

624 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


625 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It can visualize decisions, strategies, and ideas on a site without the gap period of the manual urban design or rule-based workflow. It not only saves time but also makes decision-making more fluent through instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the cost of licenses and of experimentation. This study also compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a lot of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24 GB graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves plenty of room for improvement in future work.

63 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented with more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes; by involving more shape-grammar scripts, rule-based systems can create mass training datasets with various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof-type learning project [19]); after recognizing types from image references, the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than bare building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]): it can contribute to predicting the relationship between buildings and auxiliaries, which opens the possibility of predicting detailed building outputs. The Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website: https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D Shape Reconstruction via 2.5D Sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D Convolutional Neural Network for Real-Time Object Class Recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A Point Set Generation Network for 3D Object Reconstruction from a Single Image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D Mesh Renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI: https://doc.arcgis.com/en/geoplanner
11) Google Docs product about page: https://www.google.com/docs/about
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, Rohit, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, Kaiming, et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing Efficient ConvNet Descriptor Pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520


Page 13: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

empirical outcome of building typology instead of organizing the role and factor of each data in an ambiguous formula

fig3 Building typologies with similar FARs

123 First-time time-consumption As a part of the discussions above rule-based approaches require a large decision tree to simulate the shape grammar of a building typology The cost of using rule-based approaches is always larger than that of manual modeling work the first time when designers use it in a proposal because they need an excessive amount of time constructing specific rules for the project Urban designers therefore would not like to embrace rule-based approaches as a part of their design process before they can enjoy the saving of time from applying a rule recurrently 124 Coding skill requirement Creating rules for building typologies and spatial data requires an understanding of both programming and design Since only a few urban designers have such a combination of skills this requirement impedes its broader application Also as the clients and end-users of urban development planning committee members should be able to provide many aspects that are ignored by urban designers which means a user-friendly approach with the features of creating rules with low skill requirements and real-time visualization will bring more opportunities to larger groups of participants and facilitate decision-making iterations 13 Possible approaches to improve To solve the bottleneck of creating rules we need help from machine learning techniques similar to how the bottleneck of the expert system (rule-based system) had been broken through in the history of artificial intelligence In recent years 3D reconstruction from 2D images has become popular in computer graphics (CG) and computer vision (CV) studies replacing some algorithms in classic computational theories It runs by minimizing the loss between 3D predicted objects from 2D images and their ground truth 3D models The multiple ways to realize 3D reconstruction from single images in the format of voxel [234] point cloud [56] and mesh [67] inspired me to explore the replacement of a) creating rules by built-in functions with b) creating rules by image references which is much more intuitive

13

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typology with their ground truth 3D models (stored as point clouds with normals) via the chamfer distance loss, normal loss, and Laplacian regularization. In Pixel2Mesh, the chamfer loss constrains the location of mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative location between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.
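For reference, the chamfer term can be sketched as follows; a minimal numpy illustration of the symmetric chamfer distance, not Pixel2Mesh's actual tensor implementation:

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """p: (N, 3) predicted points, q: (M, 3) ground-truth points."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1) ** 2  # (N, M) squared distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()  # nearest neighbors in both directions
```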

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are placed by the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding unnecessary computation on images and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential two-building parcels occupy the most), we scale it by land-use and building-number factors to 10216 samples (tab5). This prevents our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample
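The rebalancing can be sketched as follows; a minimal pandas sketch where the column names are assumptions, not the thesis script:

```python
import pandas as pd

def rebalance(df: pd.DataFrame, target_per_group: int) -> pd.DataFrame:
    """Oversample under-represented land-use / building-count combinations so the
    network does not overfit the dominant single-family, two-building parcels."""
    groups = df.groupby(["land_use", "n_buildings"])
    return pd.concat(
        [g.sample(target_per_group, replace=len(g) < target_per_group)
         for _, g in groups],
        ignore_index=True)
```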

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from 24 aerial rendering images of building typologies via their bounding boxes. Additionally, in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information. Spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. The csv table of each spatial parcel data input thus stores the 32 coordinate values of the geometry together with these properties.


The ground-truth label is the spatial building data corresponding to each parcel. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. The property columns of each csv table of spatial building data hold the number of buildings in the parcel, ten height values, and ten geometry records. The table thus includes 331 values (1 + 10 + 10 x 16 x 2).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
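A minimal shapely sketch of the interpolate/simplify/smooth idea behind fig14 and fig15: equal-arc-length resampling to a 16-vertex ring, then one neighbor-averaging pass. The details are an assumption and differ from the original scripts:

```python
import numpy as np
from shapely.geometry import Polygon

def resample_ring(poly: Polygon, n: int = 16) -> np.ndarray:
    """Interpolate n points at equal arc-length spacing along the parcel boundary."""
    ring = poly.exterior
    return np.array([ring.interpolate(d, normalized=True).coords[0]
                     for d in np.linspace(0.0, 1.0, n, endpoint=False)])

def smooth_ring(pts: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """One pass of neighbor averaging, a simple smoothing stand-in."""
    return (1 - alpha) * pts + alpha * 0.5 * (np.roll(pts, 1, axis=0) +
                                              np.roll(pts, -1, axis=0))
```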


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, which can be learned by neural networks. After analyzing the dataset, we filter the raw dataset and scale each field via its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices), spreading out the coordinate values.
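A sketch of this scaling; the field layout is assumed, not the thesis script:

```python
import numpy as np

def normalize_parcel(parcel_xy: np.ndarray, building_xy: np.ndarray):
    """Center both geometries on the parcel and scale by the parcel's max extent,
    so every coordinate lands in [-1, 1]."""
    center = (parcel_xy.max(axis=0) + parcel_xy.min(axis=0)) / 2.0
    half_extent = (parcel_xy.max(axis=0) - parcel_xy.min(axis=0)).max() / 2.0
    return (parcel_xy - center) / half_extent, (building_xy - center) / half_extent

def minmax(x: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Min-max scale scalar fields (heights, area, perimeter) to [0, 1]."""
    return (x - lo) / (hi - lo)
```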

Parcel fields:

Field name      vertices (lng)    vertices (lat)    area        perimeter   height limit
extent          -0.9980~0.9994    -0.9988~0.9989    396~34567   83~761      0~100
Scale method    by parcel extent  by parcel extent  min-max     min-max     min-max
Output extent   -1~1              -1~1              0~1         0~1         0~1
Preserve        1.0               1.0               1.0         1.0         1.0

Building fields:

Field name      vertices (lng)    vertices (lat)    number of buildings   heights
extent          -0.9928~0.9828    -0.9934~0.9982    1~9                   1~985
Scale method    by parcel extent  by parcel extent  min-max               min-max
Output extent   -1~1              -1~1              0~1                   0~1
Preserve        1.0               1.0               1.0                   1.0

tab6 The conversion of the dataset (from the entire training/validating dataset)


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). Firstly, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (with ResNet, as in 3D-R2N2, or VGG16 as the base model) convolutes the input images (N views are picked randomly per parcel, where N is in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 in the first approach. Secondly, we convolute the input parcel geometry and fully-connect the parcel property data into two feature vectors (FC2, FC3) and densify (fully-connect) them, as well as a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in latent space. The number of features is inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, building heights, and building geometry as three separate outputs, with a convolution layer only in the geometry prediction.

fig16 geospatial prediction network structure


formula5 the concatenated mid feature vector
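The overall layout can be sketched in Keras as follows. The sizes come from the text where stated (a GRU of 1024 as in 3D-R2N2, dense features of 1000/300/300, ten building slots of 16 vertices); everything else is an assumption rather than the thesis code:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

imgs  = layers.Input(shape=(None, 224, 224, 3))   # N aerial views per parcel
geom  = layers.Input(shape=(16, 2))               # parcel geometry, 16 vertices
props = layers.Input(shape=(4,))                  # area, perimeter, height limit, land-use

cnn = tf.keras.applications.VGG16(include_top=False, pooling="avg", weights=None)
fc1 = layers.GRU(1024)(layers.TimeDistributed(cnn)(imgs))              # FC1
fc2 = layers.Flatten()(layers.Conv1D(32, 3, activation="relu")(geom))  # FC2
fc3 = layers.Dense(64, activation="relu")(props)                       # FC3

mid = layers.Concatenate()([layers.Dense(1000, activation="relu")(fc1),  # DC1
                            layers.Dense(300, activation="relu")(fc2),   # DC2
                            layers.Dense(300, activation="relu")(fc3)])  # DC3 -> MID

count   = layers.Dense(1, name="count")(mid)       # number of buildings
heights = layers.Dense(10, name="heights")(mid)    # ten height slots
# the thesis uses a convolutional decoder for geometry; a dense head stands in here
geometry = layers.Reshape((10, 16, 2))(layers.Dense(10 * 16 * 2, name="geometry")(mid))
model = Model([imgs, geom, props], [count, heights, geometry])
```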

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings, the building heights, and the vertices of building footprints. In particular, we only consider the available values in Loss2 and Loss3, since the number of buildings varies across parcels. For instance, for a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers five parts to let the neural network learn the geometries, including center, relative vertices, unit tangent vector, curvature, and edge length. The candidate losses constrain different geometry characteristics as follows (fig17):

Center: the location of building footprints
Relative vertices: the shape and size of each geometry
Unit tangent vector: the rotation angle of each geometry
Discrete curvature: the corners of each geometry
Edge length: the edge length of each geometry
Extent x and y: the horizontal and vertical scale of building footprints
Maximum discrete curvature: the maximum corner angle of each geometry
Number of corners: the number of corners of each geometry
Normalized coordinates: the shape and size of each geometry
Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction
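The masking idea can be sketched as follows; an illustrative numpy version, while the training itself implements Loss2 and Loss3 on tensors:

```python
import numpy as np

def masked_height_loss(pred_h, true_h, n_buildings):
    """pred_h, true_h: (10,) height slots; n_buildings: actual count (<= 10)."""
    k = int(n_buildings)
    return np.mean((pred_h[:k] - true_h[:k]) ** 2)     # MSE over valid slots only

def masked_coord_loss(pred_xy, true_xy, n_buildings):
    """pred_xy, true_xy: (10, 16, 2) footprint slots (32 values per building)."""
    k = int(n_buildings)
    return np.mean(np.abs(pred_xy[:k] - true_xy[:k]))  # L1 over the first k footprints
```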


The final combined loss leans more on Loss3, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for better loss organization. Thus all losses are in the range of 0 to 1, and they are balanced by weight factors that consider the factors in Pixel2Mesh and are adjusted by experimental results (see the next section).

formula6 defines the individual loss terms in order: the number of buildings; the 10 building heights; center coordinates; absolute coordinates; relative coordinates; normalized coordinates (computed after rolling vertices, with batch-normalized coordinates); unit tangent vector; discrete curvature; edge length; extent x and y; maximum discrete curvature; the number of corners; and the combined loss.

formula6 Loss functions


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence its evaluations are tested by modifying design elements.

5.1 Extracting building typologies

The first network, a Mask R-CNN implementation, is trained for 80 epochs, achieving a 0.11 bounding-box loss and a 0.41 mask loss on the training set, and a 0.15 bounding-box loss and a 0.40 mask loss on the validating set (fig18). As we can see, the network performs well in the bounding-box prediction but not as well in the mask prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental result

As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output is an ambiguous probability model that cannot be cleaned or simplified without introducing the bias of the algorithm. Also, voxel models demand increasingly heavy storage and computation as their scale grows, the same as the size comparison between vectorized and rasterized files. Moreover, voxel models are rare in urban design models with respect to their editability. However, we can see the image learning capacity from the different results predicted from high-rise and low-rise reference images. Similar to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network intended to predict the location map generates similar results from various inputs. Furthermore, the network is not able to predict the angle of the building footprint well, predicting unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most cases of urban design models. Linking these two networks (predicting the location map and 3D mesh reconstruction) makes it hard to avoid the bias of a shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________


Spatial Data ______________________________________________________________________

fig20 Outputs from the three approaches

5.3 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined loss factors in Pixel2Mesh and the experimental results, our factors are finally set as formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

We also test the ablation effect of the various loss functions (see tab7). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab8. We use these test results for further adjustments of the multiple loss functions.

Variants: (1) baseline, (2) 3D-R2N2-GRU, (3) VGG16-LSTM, (4) isolated dataset, (5) 2D top-view dataset; ablations, each trained without one loss function: (6) relative distance, (7) unit tangent vector, (8) curvature, (9) center, (10) edge length.

loss (weight)  (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)
combined loss  0.8842  0.9645  0.8745  0.8351  0.9704  0.8353  0.7166  0.7189  0.4403  0.8416
loss1 (0.5)    0.0601  0.0665  0.0598  0.0473  0.1226  0.0638  0.0583  0.0577  0.0489  0.0621
loss2 (0.1)    0.0658  0.0636  0.0658  0.0507  0.0949  0.0652  0.0647  0.0652  0.0584  0.0660
loss3A (1)     0.0518  0.0519  0.0514  0.0527  0.0324  0.0589  0.0525  0.0478  0.0498  0.0500
loss3B (1)     0.1796  0.1793  0.1794  0.1840  0.0947  0.1795  0.1947  0.1839  0.1754  0.1801
loss3C (1)     0.1623  0.1614  0.1618  0.1632  0.1381  0.1616  0.1744  0.2392  0.1602  0.1636
loss3D (10)    0.0433  0.0514  0.0437  0.0371  0.0592  0.0439  0.0433  0.0432  0.0598  0.0435
loss3E (1)     0.0445  0.0450  0.0448  0.0408  0.0428  0.0427  0.0440  0.0424  0.0441  0.0514

tab7 losses of the ablation test

The learning on the isolated dataset with VGG16-GRU shows the best results among the reference groups, and the losses of the geometries demonstrate their relative effects on each other. The 2D top-view dataset performs only slightly worse in heights/numbers but


better in the shape losses. In tab8 we can also see the constraints on shape, corner angle, and position in the different ablations of the loss functions.

tab8 spatial building predictions of the ablation test

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after t-SNE dimension reduction, representing the learning capacities on reference images and geometries that cannot simply be determined by loss values. The isolated dataset still shows the best distribution visually.
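The projection step is a standard t-SNE run, sketched here assuming scikit-learn:

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(features: np.ndarray) -> np.ndarray:
    """features: (n_samples, dim) DC1 or DC2 activations; returns 2D points to plot."""
    return TSNE(n_components=2, perplexity=30).fit_transform(features)
```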

fig22 t-SNE mapping of reference images' encoded features (DC1)


fig23 t-SNE mapping of parcel geometries' encoded features (DC2)

fig24 t-SNE mapping of reference images' encoded features, isolated dataset (DC1)


fig25 t-SNE mapping of reference images' encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29, we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after encoding as latent feature vectors (1000 for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction of two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. The smaller parcel also predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.
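The interpolation itself is linear blending in latent space; a minimal sketch where the encoder/decoder names are illustrative:

```python
import numpy as np

def interpolate_latent(z_a: np.ndarray, z_b: np.ndarray, steps: int = 5):
    """z_a, z_b: latent feature vectors (e.g. 1000-d for reference images)."""
    return [(1 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, steps)]

# usage sketch:
# z_a, z_b = encoder.predict(img_a), encoder.predict(img_b)
# predictions = [decoder.predict(z) for z in interpolate_latent(z_a, z_b)]
```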


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and a better size/number of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimizing, the final solution serves two functions integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like csv file including the information of 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (eg height, height limit, FAR, etc), users can run a new training and validating process on their computers. By using the pre-trained model provided by this study or a customized one, users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies, achieving the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry just as a shape grammar does.

fig32 How weights from images work to generate geometries, as a rule-based pipeline does
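As an illustration of the conversion path, assuming geopandas/shapely and a hypothetical column layout rather than the thesis's exact csv schema:

```python
import pandas as pd
import geopandas as gpd
from shapely import wkt

rows = pd.read_csv("predicted_buildings.csv")               # hypothetical output file
rows["geometry"] = rows["footprint_wkt"].apply(wkt.loads)   # one polygon per building
gdf = gpd.GeoDataFrame(rows[["height"]], geometry=rows["geometry"], crs="EPSG:4326")
gdf.to_file("morphology.geojson", driver="GeoJSON")         # or "ESRI Shapefile"
```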


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved by the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (eg FAR, coverage, height limit). Our approach does not require decision trees to handle and cooperate with multiple sources of data, directly producing a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time time-consumption

As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker. They can simply tell their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of the manual urban design or rule-based workflow. It not only saves time but also makes decision-making more fluent through the instant comparison of two strategies or the accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts of QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and of experimentation. Also, this study compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration): on one TITAN RTX GPU with 24GB graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future works.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine the future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented to more styles like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more scripts of shape grammar, rule-based systems can create massive training datasets with various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationship between buildings and auxiliaries, which means the possibility of predicting a detailed building output. The Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner
11) Google Docs product about page. https://www.google.com/docs/about
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 14: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

This thesis explores the possible approaches to create ldquorulesrdquo for generating building typologies (fig4) The first two approaches (voxelmesh) derive from the above general 3D reconstruction pipeline and will be introduced briefly as references The last spatial data approach is the final solution to this study

fig4 Three types of 3D output

131 3D voxelmesh reconstruction from a single 2D image The first two approaches develop methodologies based on general 3D voxelmesh reconstruction pipelines The first one utilizes a Multi-View LSTM (Long-Short Term Memory) neural network (3D-R2N2 [2] implementation) to reconstruct 3D voxel urban morphology from single or multiple images By training on 3D building models from GIS datasets and their 2D renderings it can predict a 3D voxel model of urban morphology from a new aerial rendering image However the voxel outcome is not easily capable of becoming clean building geometry or of being used in urban design modeling software Due to the high computational cost of large 3D voxel data improving its performance is very challenging To distribute the computation workload the second approach proposes two machine learning neural networks 1) one to translate an aerial rendering image of building typology to a 2D location map and 2) one to produce a 3D mesh reconstruction from rendering the image and the 2D location map (Pixel2Mesh [7] implementation) This multi-task approach offers urban designers flexible options for which parts of manual workflow they want to replace The outcome is a closed mesh object for every building geometry although it is not simplified enough to be directly used in urban design modeling software

14

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction
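Most of these loss targets can be computed directly from the 16 footprint vertices. As a sketch (my own implementation, not the thesis's script), for one closed footprint:

import numpy as np

def geometry_descriptors(verts):
    # Descriptors behind the geometry losses of fig17, for one closed footprint.
    verts = np.asarray(verts, dtype=float)
    center = verts.mean(axis=0)                           # center
    rel = verts - center                                  # relative vertices
    edge = np.roll(verts, -1, axis=0) - verts             # edges of the closed ring
    edge_len = np.linalg.norm(edge, axis=1)               # edge length
    tangent = edge / np.maximum(edge_len[:, None], 1e-9)  # unit tangent vector
    cos_turn = np.clip((tangent * np.roll(tangent, 1, axis=0)).sum(axis=1), -1.0, 1.0)
    curvature = np.arccos(cos_turn)                       # discrete curvature (turning angle)
    extent_xy = verts.max(axis=0) - verts.min(axis=0)     # extent x and y
    return center, rel, tangent, curvature, edge_len, extent_xy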


The final combined loss leans more heavily on Loss3, since its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively so that all losses are comparable. Thus all losses lie in the range 0 to 1, and they are balanced by weighting factors that take the factors in Pixel2Mesh into account and are adjusted by experimental results (see the next section). The terms of formula6 are, in order (the equation bodies were images in the original and did not survive this extraction):

The number of buildings (Loss1)
The 10 building heights (Loss2)
Center coordinates
Absolute coordinates
Relative coordinates (via rolled vertices, normalized coordinates, and batch-normalized coordinates)
Unit tangent vector
Discrete curvature
Edge length
Extent x and y
Maximum discrete curvature
The number of corners
Combined loss

formula6 Loss functions
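Since the equation bodies of formula6 are lost here, the following is only a plausible LaTeX reconstruction — assuming squared error for counts and heights and an L1 distance for coordinates, with the masking described above; the seven combined-loss weights are the ones listed with tab5:

L_1 = (n - \hat{n})^2

L_2 = \frac{1}{n} \sum_{i=1}^{n} (h_i - \hat{h}_i)^2 \quad \text{(only the first } n \text{ of the 10 height slots)}

L_{3B} = \frac{1}{32n} \sum_{i=1}^{n} \sum_{j=1}^{16} \lVert v_{ij} - \hat{v}_{ij} \rVert_1 \quad \text{(only the first } n \text{ footprints)}

L = 0.5\,L_1 + 0.1\,L_2 + L_{3A} + L_{3B} + L_{3C} + 10\,L_{3D} + L_{3E}

where n is the ground-truth number of buildings in the parcel.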


5 RESULT AND EVALUATION As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. Since the final approach is a solution intended only for urban design usage, its evaluations are run by modifying design elements. 51 Extracting building typologies The first Mask R-CNN implementation is trained for 80 epochs and achieves a 0.11 bounding-box loss and 0.41 mask loss on the training set, and a 0.15 bounding-box loss and 0.40 mask loss on the validating set (fig18). As we can see, the network does not perform well in mask prediction but performs well in bounding-box prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output is an ambiguous probability model that cannot be cleaned or simplified without introducing algorithmic bias. Voxel models also become heavier in space and computation as their scale increases, much like the size comparison between vectorized and rasterized files. Moreover, voxel models are rare in urban design practice because of their poor editability. Still, we can see the image-learning capacity from the different results predicted from high-rise and low-rise reference images. Similarly to the first approach, adjusting the current mesh construction does not produce acceptable urban morphology models. The network intended to predict the location map generates similar results from various inputs. Furthermore, it cannot predict the angle of a building footprint well, yielding unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most urban design models. Linking these two networks (location-map prediction and 3D mesh reconstruction) also cannot easily avoid shape-grammar bias. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________


Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and the experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution
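A minimal training-setup sketch, with the seven factors taken from the weights listed in tab5 and the optimizer settings from the text; the dictionary keys and the helper are mine:

import tensorflow as tf

LOSS_WEIGHTS = {"loss1": 0.5, "loss2": 0.1, "loss3A": 1.0, "loss3B": 1.0,
                "loss3C": 1.0, "loss3D": 10.0, "loss3E": 1.0}

def combined_loss(terms):
    # terms: dict of loss name -> scalar tensor, each already normalized to [0, 1]
    return tf.add_n([LOSS_WEIGHTS[k] * v for k, v in terms.items()])

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)  # 50 epochs in our runs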

Also, we test the ablation effect of the various loss functions (see tab5). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.

loss (weight)  Baseline  3D-R2N2-  VGG16-  isolated  Top-    w/o rel.  w/o unit  w/o curv-  w/o     w/o edge
                         GRU       LSTM    dataset   view    distance  tangent   ature      center  length
loss           0.8842    0.9645    0.8745  0.8351    0.9704  0.8353    0.7166    0.7189     0.4403  0.8416
loss 1 (0.5)   0.0601    0.0665    0.0598  0.0473    0.1226  0.0638    0.0583    0.0577     0.0489  0.0621
loss 2 (0.1)   0.0658    0.0636    0.0658  0.0507    0.0949  0.0652    0.0647    0.0652     0.0584  0.0660
loss 3A (1)    0.0518    0.0519    0.0514  0.0527    0.0324  0.0589    0.0525    0.0478     0.0498  0.0500
loss 3B (1)    0.1796    0.1793    0.1794  0.1840    0.0947  0.1795    0.1947    0.1839     0.1754  0.1801
loss 3C (1)    0.1623    0.1614    0.1618  0.1632    0.1381  0.1616    0.1744    0.2392     0.1602  0.1636
loss 3D (10)   0.0433    0.0514    0.0437  0.0371    0.0592  0.0439    0.0433    0.0432     0.0598  0.0435
loss 3E (1)    0.0445    0.0450    0.0448  0.0408    0.0428  0.0427    0.0440    0.0424     0.0441  0.0514

tab5 losses of ablation test (the first five columns are the baseline and reference groups; the last five each drop one loss function; decimal points restored from the extracted text)

The isolated dataset with VGG16-GRU learns best among the reference groups, and the losses of the geometries demonstrate their effects on one another. The 2D top-view dataset is only slightly worse in heights/numbers but better in the shape loss. In tab6 we can also see how shape, corner angle, and position are constrained in the different ablations of the loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after T-SNE dimension reduction, representing learning capacities for reference images and geometries that cannot simply be judged by loss values. The isolated dataset still shows the best distribution visually. A sketch of producing such a plot follows.
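Assuming the DC1 activations have been exported to a file (the file name is hypothetical), the plot can be produced with scikit-learn:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

dc1 = np.load("dc1_features.npy")   # (n_samples, 1000) encoded image features
xy = TSNE(n_components=2, perplexity=30).fit_transform(dc1)
plt.scatter(xy[:, 0], xy[:, 1], s=4)
plt.title("T-SNE of DC1 (reference image) features")
plt.show()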

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial to be able to steer the learning network toward various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after the inputs are encoded as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties); a sketch of this latent-space blend follows. In the building interpolation we can see how the prediction changes from one building to two. In the parcel interpolation we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the dense and taller ones in the larger parcel.
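The interpolation itself is a plain linear blend in latent space; a sketch, with encoder/decoder standing in for the trained sub-networks (hypothetical handles, not the thesis's API):

import numpy as np

def interpolate_latent(z_a, z_b, steps=5):
    # Linear interpolation between two latent vectors (e.g. two 1000-d image
    # features or two 300-d parcel features), endpoints included.
    return [(1 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, steps)]

# z_a, z_b = encoder.predict(img_a), encoder.predict(img_b)
# for z in interpolate_latent(z_a, z_b):
#     buildings = decoder.predict(z)   # preview each intermediate prediction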


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained from the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights with the isolated dataset and a better size/number of buildings with the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimization, the final solution can serve two functions that integrate rule-based approaches into the decision-making process of urban development: a) create a customized "rule" intuitively, and b) apply a "rule" and obtain a GeoJSON-like csv file carrying the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g. height, height limit, FAR), users can run a new training and validating process on their own computers. Using the pre-trained model provided by my study, or a customized one, users can input both geospatial geometries (e.g. one or multiple parcels) and image references of building typologies, obtaining information-rich geospatial building geometries as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g. shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry much as a shape grammar does. A sketch of the csv-to-GeoJSON conversion follows fig32 below.

fig32 How weights from images work to generate geometries, as a rule-based pipeline does
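As a sketch of the csv-to-GeoJSON step, assuming one parcel per row with per-building vertex lists and heights (the column names are assumptions, not the exact schema of the deliverable):

import csv, json

def csv_to_geojson(csv_path, out_path):
    features = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            n = int(float(row["num_buildings"]))
            for b in range(n):
                coords = json.loads(row[f"building_{b}_vertices"])  # [[lon, lat], ...]
                features.append({
                    "type": "Feature",
                    "properties": {"height": float(row[f"building_{b}_height"])},
                    "geometry": {"type": "Polygon",
                                 "coordinates": [coords + coords[:1]]},  # close the ring
                })
    with open(out_path, "w") as f:
        json.dump({"type": "FeatureCollection", "features": features}, f)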


62 Contribution Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortages of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved by the study.

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language into a script as building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

622 Data-driven capacity Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g. FAR, coverage, height limit). Our approach does not require decision trees to handle and combine multiple sources of data, directly producing a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

623 First-time time-consumption As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or new ideas from designers' minds) is a pain point of rule-based systems. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the very beginning of visualizing urban morphologies.

624 Skill requirement As writing a rule requires both programming and urban design skill, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


625 Other contributions The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of the manual urban design or rule-based workflow. It not only saves time but also makes decision-making more fluent, through an instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and of experimentation. This study also compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a great deal of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future work.

63 Future works With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape grammar scripts, rule-based systems can create mass training datasets in various styles to feed these future extensions. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It could contribute to predicting the relationship between buildings and auxiliaries, opening the possibility of predicting detailed building outputs. The Mask R-CNN can also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website: https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI: https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page: https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 15: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

132 Spatial data reconstruction from a single 2D image

fig5 The network of the final solution

Rather than converting our data structure to the ones used in general 3D reconstruction pipelines our final approach learns and produces spatial geometry data (GeoJSON-like) only for the urban design usage Our neural network (fig5) converges the input data of spatial parcel data and 2D aerial rendering images to the ground-truth label of spatial building data Crucially spatial geometry construction is compatible with general GIS platforms The optimization is at the level of spatial geometry (2D with properties) bringing a much lighter computational cost When 3D urban morphology can be constructed from image references and preserves the essential information (eg location size and building alignment) a ldquorulerdquo of generating building envelopes from a parcel can be made by either decision-makers or urban designers intuitively and free of the limitation of current rule-based approaches (fig6)

fig6 rule-based pipeline vs ours for generating building typology from a single parcel

15

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does
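The csv-to-GeoJSON conversion mentioned above is mechanical. As a minimal sketch using only the Python standard library (the column names x0..x15, y0..y15, and height are illustrative stand-ins for the actual schema defined in the dataset section):

```python
# Minimal sketch: converting predicted building records (16 lon/lat vertices plus
# a height each) into a GeoJSON FeatureCollection. Column names x0..x15, y0..y15,
# and "height" are illustrative placeholders for the real csv schema.
import csv
import json

features = []
with open("predicted_buildings.csv", newline="") as f:
    for row in csv.DictReader(f):
        ring = [[float(row[f"x{i}"]), float(row[f"y{i}"])] for i in range(16)]
        ring.append(ring[0])  # close the polygon ring as GeoJSON requires
        features.append({
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [ring]},
            "properties": {"height": float(row["height"])},  # z-value for extrusion
        })

with open("predicted_buildings.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)
```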


62 Contribution
Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved by the study.

621 Creativity
Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

622 Data-driven capacity
Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data, directly producing a matching rule from given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

623 First-time time-consumption
As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from the designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the very beginning of visualizing urban morphologies.

624 Skill requirement
As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


625 Other contributions
The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period required by manual urban design or a rule-based system. It not only saves time but also makes decision-making more fluent, enabling an instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and makes the approach easier to try. Also, this study compares the differences among existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide an experimental reference for other related studies. The final solution also saves a lot of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future works.

63 Future works
With the fundamental pipeline built in this study, we can imagine future steps for improving relevant AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented to more styles, such as photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create massive training datasets in various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain wall or brick wall) as part of the building properties (see the roof-type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of SDM-NET [21]). It could contribute to predicting the relationship between buildings and auxiliaries, which opens the possibility of predicting detailed building outputs. Mask R-CNN could also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner
11) Google Docs product about page. https://www.google.com/docs/about
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 16: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

2 HYPOTHESIS By creating rules of synthesizing 3D building typologies via machine learning pipeline 3D urban morphology can be generated from rules intuitively facilitating the decision-making process in urban development Key terms in this context

Block land separated by streets or a collection of parcels which are owned by different landlords

3D building typologies 3D building envelopes on a parcel visualizing size alignment and style

Machine learning pipeline AI-empowered approach in the computer vision and computer graphics realm There are also non-machine-learning algorithms of computer graphics for data processing and feature extraction These techniques are used for matching the features of building parcel to the ones of image references

Synthesizing using linear interpolation in latent space to create novel outputs

3D urban morphology a collection of 3D building typologies It represents the relationship (eg distances and other combinatorial rules) among buildings on blocks

Rules the shape grammar generating building envelopes from a parcel geometry via the geospatial information of the parcel and parameters of buildingzoning code The core algorithm in rule-based systems

Decision-making process a) Urban designers use 3D urban morphology for

visualizing ideas from decision-makers b) decision-makers make decisions of urban development based on the evaluation (eg FAR ROI transportation) of urban morphology decision-makers usually involve elected officials (or their designees) specialists and resident representatives

16

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas

17

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 17: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

3 PRECEDENTS The hypothesis and approaches are inspired by precedents in the multidisciplinary discussion of collective decision-making tools generative design computer graphics and machine learning 31 Collective decision-making tools for urban development 311 CityMatrix Since an urban design proposal always gathers numbers of decision-makers to negotiate their ideas together how to create a collective workspace for decision-making becomes the target of related studies Yan Zhang [9] in his ldquoCityMatrixrdquo provided a collective decision-making platform by using Lego toys as a tangible interface where Lego blocks are used to represent buildings ldquoCityMatrixrdquo augmented the Lego interface via machine learning computation which can deliver instant feedback about the socio-economic impacts of each change made by decision-makers The Lego interface is friendly to users especially for the public without the experience of working on professional CAD or GIS tools However since streets are not always regular the modules and grids can only be feasible in a small number of cases The pick-and-drop process of the Lego interface also slows down the speed of decision making a scenario of 16 blocks needs 40~60 minutes to be built The main takeaway is that CityMatrix provides a low-skill-required pipeline of decision-making in urban development 312 Geoplanner for ArcGIS Another participatory design attempt derived from Geographic Information System (GIS) platform is Geoplanner for ArcGIS [10] Taking advantage of open-source spatial data researchers can now analyze and visualize buildings and streets In 2014 ESRI announced the product line of ldquoGeoDesignrdquo a new workflow allowing decision-makers to collaborate similar to using Google Docs (an online sharing text editor) in urban development Users (specialists from different realms) can input various data sources into a weighted overlapping system and compare different urban development scenarios By integrating features from desktop GIS software Geoplanner is a potential solution for making decisions about urban development on a one-stop platform It proposes a transparent decision-making process absorbing massive input of data and ideas The weighted system also inspires our idea to synthesize novel outcomes from multiple design ideas


3.2 Generative design tools

3.2.1 CityEngine
In 2008, ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called CityEngine. The software has geometric algorithms that migrated from ArcGIS and ETH computational studies. Its core is a shape grammar language called Computer Generated Architecture (CGA). CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry. It can also take user input as non-spatial parameters. This solution is broadly applied as procedural modeling in the industries of animation (e.g., Big Hero 6, Zootopia) and 3D video games (e.g., the Assassin's Creed series). As stated before, CityEngine can quickly update the 3D model via parameter manipulations. However, the problems of creativity, data-driven capacity, first-time cost, and skill requirements still impede its application in the real-world decision-making process.

3.3 Related machine learning works

3.3.1 3D voxel reconstruction
To reconstruct 3D shapes from 2D pixels, 3D voxel data is an option for data transition due to its compatibility with image algorithms. Jiajun Wu and his colleagues at the MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images. They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset.

Girdhar and his team [13] realized logical shape arithmetic via their TL-embedding network, creating novel 3D outputs. 3D voxel reconstruction showed this study the possibility of 3D reconstruction. As they presented, a voxel model can transform smoothly into another by changing features in latent space linearly and gradually, which is able to create various novel outputs for design purposes. However, voxel is not an output format compatible with most design modeling software.

3.3.2 3D mesh reconstruction
In mesh reconstruction pipelines, the core idea is stretching (deforming) a basic sphere's control vertices and matching the stretched geometry with its ground-truth geometry by evaluating the chamfer distance [7] or the pixel differences of virtual renderings [8]. To produce a more precise outcome, more loss functions, like face normal and edge length, are also used in Pixel2Mesh.
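For reference, the chamfer distance between two point sets $S_1$ and $S_2$ used in these pipelines [6][7] is commonly written as follows (a standard formulation; the exact weighting varies by implementation):

$$d_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2$$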


The mesh reconstruction outputs a feasible model for design modeling software. However, basic 3D mesh reconstruction is only able to reconstruct a single object per task, which is not compatible with our urban morphology case: there is usually more than one building on a parcel. Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations, as our data structure is also vectorized.

3.3.3 Object detection
To slice a multi-object task into single-object tasks, Mask R-CNN [14] is a potent way, extending Faster R-CNN [15] by adding a branch that predicts an object mask alongside the bounding-box recognition. The mask branch predicts segmentation masks on each Region of Interest (RoI), in parallel with the classification and bounding-box regression. Compared to object detection methods like DenseNet [16] or YOLO [17], Mask R-CNN keeps a balance between accuracy and prediction speed. Mask R-CNN has become popular in computational urban planning studies in recent years. The Images-to-OSM project [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting sports fields from satellite imagery. Meanwhile, the AI research group at ESRI used Mask R-CNN to classify roof typologies from aerial LiDAR data [19], providing more delicate visualization in 3D scenes.

3.3.4 3D model dataset
As Wang et al. [20] demonstrated, researchers can easily gather models of almost any kind from the Trimble 3D model warehouse, which is open-source and offers a massive number of shapes for free download, especially famous single buildings. They generated 2D images taken by a surrounding virtual camera from different angles as the training set for 3D recognition or reconstruction studies. In this thesis, a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset. Wang's methodology of making the dataset inspired many later reconstruction studies, including this one.


4 SOLUTION AND METHODOLOGY

4.1 Solution
To establish an intuitive methodology of creating "rules" of building typologies, we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules.

In a typical workflow, designers first receive a project description with the requirements from clients/users and the constraints on a site (spatial boundary). Designers usually start by searching urban morphology or building typology precedents from their experiences or from image references on search engines like Pinterest and ArchDaily. Clients/users will also provide their favorite image references, ensuring urban designers understand their preferences and requirements. Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies. Afterward, they adjust these prototypes to different parcels on their site in 3D modeling software (e.g., Rhinoceros, SketchUp) based on the parcel shape and street orientation. Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code. After this work, urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations. Once approved by decision-makers, this 3D urban morphology is archived as a geospatial dataset in GIS platforms for further urban data management.

As addressed in the introduction section, in current rule-based approaches (e.g., CityEngine), urban designers can create rules to generate building typologies from parcels, avoiding the manual work of drawing 3D geometries in 3D modeling software. Urban designers need to translate a building typology, extracted from references, into a rule. From a technical perspective, urban designers write the code of a big decision tree, organizing the corresponding built-in functions and parameters from the properties of geospatial data. The properties usually include area, perimeter, land-use, height limit, Floor Area Ratio, greenspace coverage, building coverage, etc. Therefore, urban designers have to consider many cases to adjust their building typologies to the parcels on their site. The efficiency empowered by the rule-based system is thus harmed by the process of translating image references into rules. That is, rule-based systems require decision trees to link a) the features from image references to b) the features of the corresponding 3D building shapes.


| approach | find building typologies from reference images | translate design language to scripts | adjust to a site | output data as | update (after collecting decision-makers' comments) |
| manual | find by experience | analyze the building typology | draw building footprints and extrude them | static 3D mesh/NURBS models; static 2D drawings | draw again from building footprints |
| rule-based | find by experience | create a rule via built-in functions | apply a rule onto parcel geometries | generated 3D mesh models; geospatial data; rule script | change parameters or apply onto new parcel geometries |
| spatial data reconstruction (ours) | extract by computer | use a pre-trained model | predict spatial buildings for parcel geometries | geospatial data; trained model | modify or predict again from new parcel geometries |

tab1 the mechanisms of the manual, rule-based, and our approach

(tab1) In contrast, this thesis explores improved replacements by a) utilizing computational algorithms to extract features from reference images, parcel geometries, and building geometries, and b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees. After creating rules intuitively, designers can apply rules to their site, enjoying the same advantages of rule-based systems in the following stages: achieving output data as geospatial data and updating geometries as groups.

4.2 Dataset
To allow the computer to learn 3D building typologies, a collection of 3D building models is necessary. Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2). Two raw datasets are required in the series of machine learning pipelines: a) parcel geometry with planning properties (e.g., land-use code, height limits) and b) building footprint geometry with height information. They are augmented by scripts and prepared for the different machine learning pipelines (see the corresponding sections).


| city | source (parcel/building) | parcel count | parcel properties | building count | building properties |
| Los Angeles | SCAG_county_zoning / Lariac 2008 building footprint | 2,376,370 | land-use zoning; height limit | 3,141,244 | height; elevation |

tab2 The list of raw data

4.3 Extracting building typologies
Because our resulting "rule" is intended to generate building typologies from parcel geometries, the 3D reconstruction pipelines should process at the parcel level. Hence, this first network performs before 3D reconstruction, extracting 2D aerial images of building typologies from input images of urban morphology (at the block level). The training/validating data is augmented from the raw spatial dataset. Given 2D geometries of building footprints with height information, we extrude them via a Blender script and obtain a 3D model file (obj) of the 3D building envelopes on each block. The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via a Blender script. These images are converted into binary images by masking each parcel of the block, and the multiple binary images are stacked as a multi-channel image for each view. Finally, a block model generates a) input data, 24 views of aerial images, and b) ground-truth labels, 24 multi-channel mask images, where N is the number of parcels in this block (formula1). Random center cropping and random horizontal flipping are applied during data loading to avoid overfitting. The augmented dataset is separated into training and validating data at a ratio of 0.8.

formula1 the ground-truth label as a multi-channel image; N is the number of parcels in a block
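A minimal sketch of assembling one multi-channel ground-truth label from per-parcel binary masks (NumPy; array names and the 512 x 512 resolution are illustrative, not from the thesis code):

```python
import numpy as np

def build_label(parcel_masks):
    """Stack N per-parcel binary masks (H x W each) into one
    multi-channel label image (H x W x N) for a single view."""
    # Each mask is 1 inside its parcel's buildings, 0 elsewhere.
    return np.stack(parcel_masks, axis=-1).astype(np.uint8)

# Example: a block with 3 parcels rendered at an assumed 512 x 512
masks = [np.zeros((512, 512), dtype=np.uint8) for _ in range(3)]
label = build_label(masks)   # shape: (512, 512, 3)
```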

After loading the dataset, a Mask R-CNN network uses ResNet-101 as the base model, predicting masks as output from input images (fig7). This part of the pipeline is an implementation of He's Mask R-CNN network [14], including the Region Proposal Network (RPN) and the predictions (classification, box, and binary mask prediction). For each Region of Interest (RoI), the loss function is constructed from the classification, bounding-box, and mask losses. The extracted parcel rendering images output here serve as part of the input to the following 3D reconstruction approaches.


fig7 The pipeline of Mask R-CNN [14]
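As a sketch of this stage, torchvision's off-the-shelf Mask R-CNN (ResNet-50 FPN backbone, standing in for the ResNet-101 model used in the thesis) can be fine-tuned on the parcel masks; the dummy tensors and the single "parcel" class are assumptions:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pre-trained Mask R-CNN; one foreground class ("parcel") + background.
model = maskrcnn_resnet50_fpn(pretrained=True)

# During training the model expects images plus targets with boxes,
# labels, and masks, and returns the classification, box, and mask losses.
images = [torch.rand(3, 512, 512)]            # dummy aerial view
targets = [{
    "boxes":  torch.tensor([[10., 10., 200., 200.]]),
    "labels": torch.tensor([1]),
    "masks":  torch.zeros(1, 512, 512, dtype=torch.uint8),
}]
model.train()
loss_dict = model(images, targets)            # includes loss_mask, loss_box_reg
total_loss = sum(loss_dict.values())
```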

4.4 3D reconstruction
In this thesis, three reconstruction approaches were attempted on different types of inputs and ground-truth labels. The first two approaches build on general 3D reconstruction and serve as references. The comparisons of the approaches (tab3) and of 3D model formats (tab4) are shown below. While voxel and mesh models are popular in general machine learning studies, our final solution is built only for urban planning cases, absorbing the corresponding techniques from the first two approaches. This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms.

| approach | input | ground truth label | software platform for data processing |
| 1. 3D voxel reconstruction | aerial image of building typologies (png) | 3D building model (voxel, mat) | ArcGIS Pro, CityEngine, Binvox, PyTorch |
| 2. 3D mesh reconstruction | aerial image of building typologies (png); 2D bitmap (png) | 3D building model (point cloud, xyz) | ArcGIS Pro, Blender, Tensorflow (Keras) |
| 3. geospatial data prediction | aerial image of building typologies (png); parcel data (csv) | building data (csv) | QGIS, Blender, Tensorflow (Keras) |

tab3 data structure of the three approaches


| | Voxel | Point cloud | Mesh | NURBS | GeoJSON-like |
| data loading | N x N x N x 1 (mass/void) | N x 3 (x, y, z) | N x (v1, v2, v3) | degree/control pts/weights/params | 2D geometry (long/lat, N x 2) + property (with height or more info) |
| reconstruction | from 2D pixels | from 2D pixels | deform from a sphere/cube | translate from mesh; detect shape grammar | deform from 2D geometries |
| evaluation | logical is-or-not; Intersection over Union (IoU) | chamfer distance (CD); Earth Mover's Distance (EMD) | chamfer distance (CD); silhouette rendering; logical is-or-not | chamfer distance (CD); silhouette rendering; logical is-or-not | chamfer distance (CD); Earth Mover's Distance (EMD) |
| software | Minecraft… | 3D scanning… | SketchUp, GIS… | Rhino… | GIS |
| ML project | 3D-R2N2, MarrNet… | PointNet, PointNet++… | Pixel2Mesh, Neural Renderer | | |

tab4 the comparison of 3D geometry formats

4.4.1 3D voxel reconstruction
The first attempt implements the 3D-R2N2 approach, which originally reconstructs ten categories of furniture from its training dataset. In this implementation of synthesizing 3D urban morphology with the 3D-R2N2 approach, the input and ground-truth data have been modified to feed 3D building models into the network. The hyper-parameters of the 3D-R2N2 network have also been adjusted for 3D building models.

4.4.1.1 Data structure of the training/validating dataset
The training/validating set follows the ShapeNet structure, which contains two parts derived from 3D models: input rendering images and ground-truth labels (fig8).


fig8 examples of the training/test dataset

The input aerial images are taken by 24 virtual cameras, the same ones used in the building-typology extraction network (matching the image setting of the original 3D-R2N2). An ArcGIS Python script exports each block with the buildings on it to an ESRI Shapefile (shp). Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj). Each ground-truth label is stored as voxel data, converted from a mesh model via Binvox into an nd-array (as in the original 3D-R2N2). The entire dataset includes 177 models and 4240 rendered images, separated into training and validating sets at a ratio of 0.8. As a multi-view 3D reconstruction, a 3D-LSTM (Long-Short Term Memory) treats the multi-view images as time-sequential data, leading to 5D input data [view_id, batch_id, channel, width, height]. Inputs for each model are randomly picked from the 24 images a random number of times (in a range from 1 to 5). The picked images are randomly center-cropped and randomly horizontally flipped to avoid overfitting in the training process (a sketch of this sampling is given below). The labels as voxel data are constructed with five dimensions [batch_id, the channel of masks, x-axis, y-axis, z-axis]. The channels represent original or masked objects (entity true or false).
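A compact sketch of this sampling scheme (PyTorch-style; function and variable names are illustrative):

```python
import random
import torch

def sample_views(views, max_views=5):
    """Pick a random number (1..max_views) of the 24 rendered views
    and stack them into a [n_views, channel, width, height] tensor."""
    n = random.randint(1, max_views)
    picked = random.sample(views, n)   # each view: a [3, W, H] tensor
    # Random center crop / horizontal flip would be applied here.
    return torch.stack(picked)         # batching later adds batch_id

4.4.1.2 Network
```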

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN). Given the encoded input, a set of proposed 3D Convolutional LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate. At time step t, when a new input is received, the operation of an LSTM unit can be expressed as in formula2, where $i_t$, $o_t$, and $f_t$ refer to the input gate, the output gate, and the forget gate respectively, and $s_t$ and $h_t$ refer to the memory cell and the hidden state respectively.

formula2 3D-LSTM kernel: forget and update gates [2]
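As a reference transcription of the gates from the 3D-R2N2 paper [2] (whose 3D-LSTM variant omits a separate output gate, so this is an approximation of the formula2 image rather than an exact copy):

$$f_t = \sigma(W_f T(x_t) + U_f * h_{t-1} + b_f)$$
$$i_t = \sigma(W_i T(x_t) + U_i * h_{t-1} + b_i)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s T(x_t) + U_s * h_{t-1} + b_s)$$
$$h_t = \tanh(s_t)$$

where $*$ denotes 3D convolution, $\odot$ the element-wise product, and $T(x_t)$ the encoded input feature.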

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability that each voxel cell is occupied, computed with a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap or as a solid 3D model by setting a threshold.

4.4.2 3D mesh reconstruction
A voxel model is difficult to clean and simplify into a feasible model for urban design modeling software. Also, a voxel model stores data inside closed objects, which is inefficient and becomes increasingly large when scaled up. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy in the training process. Since a building typology is influenced by properties like the height limit or land-use, additional bitmaps of these properties can help our network during training. To distribute the computational workload, two machine learning neural networks are proposed: a) translating a 2D image of a building typology to a top-view location map, and b) 3D mesh reconstruction.

4.4.2.1 Data structure of the training/validating dataset
At the parcel level, an aerial image is cropped from its parent block-level image and scaled to a fixed resolution. Additional information, like height limit and land-use, is stored as gray-scale images on the 2D top view. Ground-truth labels are 3D point clouds with normals (6 dimensions in total) calculated from mesh models (fig10).


fig10 examples of the training/validating dataset

4.4.2.2 Network A: translating a 2D image of a building typology to a top-view location map
Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of a target parcel. As mesh reconstruction does not preserve the location of the 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by 24 surrounding cameras and cropped by the bounding box of this parcel; N, the number of views per parcel, is chosen randomly in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space. The feature array is merged into a column vector via Gated Recurrent Units (GRU).

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula3) stores the features of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g., the height-limit bitmap). The following parts of the network are the decoder and generator of a classic GAN. In iterations (formula4), we have the MSE loss of the 2D height-map prediction, the least-squares reconstruction loss of a reconstructed rendering image (compared with all angles of renderings of this parcel, returning the minimum one), and the MSE losses of the additional constraints. The final combined loss is optimized by Adam with a learning rate of 0.001.

formula4 combined loss

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typologies with their ground-truth 3D models (stored as point clouds with normals) through the chamfer distance loss, normal loss, and Laplacian regularization. In Pixel2Mesh, the chamfer loss constrains the location of mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are placed by the location maps predicted by Network A.

4.5 Spatial data prediction
In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding unnecessary computation on images and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of the training/validating dataset
The training/validating datasets are made with QGIS, Blender, and OpenCV scripts, extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential parcels with two buildings occupy the most), we scale it by land-use and building-number factors to 10216 samples (tab5). This prevents our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial rendering images of building typologies via their bounding boxes. Additionally, in this solution, an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information. Spatial parcel data are csv files with geometries and properties. After interpolation, simplification (fig14), and smoothing (fig15), each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. The csv table of each spatial parcel input stores these geometry and property values.


The ground-truth label is the spatial building data corresponding to the parcels. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. For each csv table of spatial building data, the properties hold the number of buildings in the parcel, ten height values, and ten geometry entries. The table thus includes 331 values (1 building count + 10 heights + 10 x 16 x 2 coordinates).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
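A minimal NumPy sketch of the interpolation step in fig14, resampling a polygon boundary to a fixed 16 vertices by arc length (the function name and approach are illustrative; smoothing as in fig15 could follow, e.g., by averaging each vertex with its neighbors):

```python
import numpy as np

def resample_polygon(vertices, n=16):
    """Resample a closed polygon (k x 2 array of lon/lat) to n vertices
    spaced evenly along its boundary arc length."""
    pts = np.vstack([vertices, vertices[:1]])           # close the ring
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)  # segment lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])       # cumulative length
    targets = np.linspace(0.0, cum[-1], n, endpoint=False)
    out = np.empty((n, 2))
    for i, t in enumerate(targets):
        j = np.searchsorted(cum, t, side="right") - 1
        r = (t - cum[j]) / seg[j] if seg[j] > 0 else 0.0
        out[i] = pts[j] + r * (pts[j + 1] - pts[j])     # linear interpolation
    return out
```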


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, which can be learned by neural networks. After analyzing the dataset, we filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the maximum extent of its parcel boundary (vertices), spreading out the coordinate values.

Parcel fields:
| extent | -9980~9994 | -9988~9989 | 396~34567 | 83~761 | 0~100 |
| output extent | -1~1 | -1~1 | 0~1 | 0~1 | 0~1 |
| preserve | 10 | 10 | 10 | 10 | 10 |

Building fields:
| extent | -9928~9828 | -9934~9982 | 1~9 | 1~985 |
| output extent | -1~1 | -1~1 | 0~1 | 0~1 |
| preserve | 10 | 10 | 10 | 10 |

tab6 The conversion of the dataset (from the entire training/validating dataset)
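A sketch of the per-field min-max scaling in tab6 and the per-parcel coordinate normalization described above (function names are illustrative):

```python
import numpy as np

def minmax_scale(x, lo, hi, out_lo=0.0, out_hi=1.0):
    """Linearly map values from [lo, hi] to [out_lo, out_hi]."""
    return (x - lo) / (hi - lo) * (out_hi - out_lo) + out_lo

def normalize_by_parcel(coords, parcel):
    """Normalize lon/lat coordinates by the max extent of the parcel
    boundary, spreading them over roughly the [-1, 1] interval."""
    center = (parcel.max(axis=0) + parcel.min(axis=0)) / 2.0
    extent = (parcel.max(axis=0) - parcel.min(axis=0)).max() / 2.0
    return (coords - center) / extent
```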


4.5.2 Network
The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). First, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (ResNet as in 3D-R2N2, or VGG16, as the base model) convolves the input images (N views are randomly picked per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Second, we convolve the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3) and densify them, together with a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, building heights, and building geometry as three separate outputs, with a convolution layer used only in the geometry prediction.

fig16 geospatial prediction network structure
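A skeleton of fig16 in Keras; the feature sizes 1000 and 300 follow the latent-space discussion later in this section, while the image resolution, branch widths, and layer choices marked in comments are assumptions, not the thesis configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_VIEWS, IMG = 5, 256                 # image size assumed
n_parcel_pts, n_props = 16, 4         # 16 vertices; property count assumed

# Image branch: per-view VGG16 features merged over time by a GRU (FC1).
img_in = layers.Input((N_VIEWS, IMG, IMG, 3))
vgg = tf.keras.applications.VGG16(weights=None, include_top=False,
                                  pooling="avg", input_shape=(IMG, IMG, 3))
fc1 = layers.GRU(1024)(layers.TimeDistributed(vgg)(img_in))

# Parcel geometry branch (FC2): 1D convolution over the 16 vertices.
geo_in = layers.Input((n_parcel_pts, 2))
geo_f = layers.Flatten()(layers.Conv1D(32, 3, padding="same")(geo_in))
fc2 = layers.Dense(300)(geo_f)

# Parcel property branch (FC3): fully connected.
prop_in = layers.Input((n_props,))
fc3 = layers.Dense(300)(prop_in)

# Dense features (DC1..DC3) concatenated into the latent vector (MID).
mid = layers.Concatenate()([layers.Dense(1000)(fc1), fc2, fc3])

# Three decoders: building count, 10 heights, 10 x 16 x 2 footprint coords.
n_out = layers.Dense(1, name="num_buildings")(mid)
h_out = layers.Dense(10, name="heights")(mid)
g_out = layers.Reshape((10, 16, 2), name="geometries")(
    layers.Dense(10 * 16 * 2)(mid))

model = Model([img_in, geo_in, prop_in], [n_out, h_out, g_out])
```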


formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between the prediction and the ground-truth label in terms of the number of buildings, building heights, and the vertices of building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel. For an instance of a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers several parts to let the neural network learn the geometries, including center, relative vertices, unit tangent vector, curvature, edge length, and more. They constrain different geometry characteristics (fig17):

Center: the location of building footprints
Relative vertices: the shape and size of each geometry
Unit tangent vector: the rotation angle of each geometry
Discrete curvature: the corners of each geometry
Edge length: the edge length of each geometry
Extent x and y: the horizontal and vertical scale of building footprints
Maximum discrete curvature: the maximum corner angle of each geometry
Number of corners: the count of corners
Normalized coordinates: the shape and size of each geometry
Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction


The final combined loss leans more on Loss3, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for better loss organization. Thus all losses are in the range of 0 to 1, and they are balanced by weights that consider the factors in Pixel2Mesh, adjusted by experimental results (see the next section).

Formula6 comprises: Loss1, on the number of buildings; Loss2, on the 10 building heights; Loss3, with terms on center coordinates, absolute coordinates, relative coordinates, normalized coordinates (including rolled-vertex and batch-normalized variants), the unit tangent vector, discrete curvature, edge length, extent x and y, maximum discrete curvature, and the number of corners; and the final combined loss.

formula6 Loss functions
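A sketch of the masking idea behind Loss2 and Loss3, where only the slots of buildings that actually exist contribute to the loss (TensorFlow; the function and tensor names are illustrative, not the thesis code):

```python
import tensorflow as tf

def masked_height_loss(y_true, y_pred, n_buildings):
    """MSE over the first n_buildings of the 10 height slots per parcel."""
    # mask[b, i] = 1 where slot i < n_buildings[b], else 0
    idx = tf.range(10, dtype=tf.float32)[tf.newaxis, :]
    mask = tf.cast(idx < tf.cast(n_buildings[:, tf.newaxis], tf.float32),
                   tf.float32)
    se = tf.square(y_true - y_pred) * mask          # zero out unused slots
    return tf.reduce_sum(se) / tf.maximum(tf.reduce_sum(mask), 1.0)
```

The same mask, broadcast over the 16 x 2 coordinates per building, would restrict the geometry terms of Loss3 to the available footprints.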


5 RESULT AND EVALUATION

As the final solution of this thesis, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution aimed only at urban design usage; hence, its evaluations are tested by modifying design elements.

5.1 Extracting building typologies
The first Mask R-CNN model is trained for 80 epochs and achieves a 0.11 bounding-box loss and 0.41 mask loss on the training set, and a 0.15 bounding-box loss and 0.40 mask loss on the validating set (fig18). As we can see, the network performs well in bounding-box prediction but not as well in mask prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results
As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output has ambiguous probability models that cannot be cleaned or simplified without algorithmic bias. Also, voxel models consume increasingly more space and computation as their scale grows, mirroring the size comparison between vectorized and rasterized files. Again, voxel models are rare in urban design with respect to their editability. However, we can see the image-learning capacity from the different results predicted from high-rise and low-rise reference images. Similarly to the first approach, adapting current mesh reconstruction is not able to produce acceptable urban morphology models. As we can see, the network intended to predict the location map generates similar results from various inputs. Furthermore, the network cannot predict the angle of building footprints well, yielding unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most urban design model cases. Linking these two networks (location-map prediction and 3D mesh reconstruction) cannot easily avoid the bias of shape grammar. Our final solution, spatial data reconstruction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel:

Mesh:

Spatial data:

fig20 outputs from the three approaches

5.3 Other evaluations
Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and the experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

We also test the ablation effect of the various loss functions (see tab5 below). The baseline test uses VGG16-GRU with all loss functions, training 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.

| loss (weight) | baseline | 3D-R2N2-GRU | VGG16-LSTM | isolated dataset | top-view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length |
| combined loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416 |
| loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621 |
| loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.066 |
| loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.05 |
| loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.184 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801 |
| loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636 |
| loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435 |
| loss 3E (1) | 0.0445 | 0.045 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.044 | 0.0424 | 0.0441 | 0.0514 |

tab5 losses of the ablation test

The learning on the isolated dataset with VGG16-GRU shows the best results among the reference groups, and the losses of the geometries demonstrate their mutual effects. The 2D top-view dataset is only slightly worse in heights/numbers but better in the shape loss. In tab6 we can also see the constraints of shape, corner, angle, and position in the different ablations of the loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after t-SNE dimensionality reduction, representing the learning capacities on reference images and geometries that cannot simply be determined by loss values. The isolated dataset still shows the best distribution visually.

fig22 T-SNE mapping of reference images' encoded features (DC1)


fig23 T-SNE mapping of parcel geometries' encoded features (DC2)

fig24 T-SNE mapping of reference images' encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images' encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after encoding as latent feature vectors (1000 for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. The smaller parcel also predicts lower and more decentralized buildings, compared to the denser and taller ones in the larger parcel.


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained from the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights with the isolated dataset and a better size/number of buildings with the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable
After training and optimizing, the final solution serves two functions integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively and b) applying a "rule" to achieve a GeoJSON-like csv file containing the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR, etc.), users can run a new training and validating process on their own computers. By using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry, as shape grammar does separately.

fig32 How weights from images work to generate geometries, as the rule-based pipeline does
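A sketch of reading the predicted csv back into GeoJSON features with a z-value for extrusion; the column names are assumptions about the delivered format, not the actual schema:

```python
import csv

def csv_to_geojson(path):
    """Convert predicted building rows (16 lon/lat pairs + height)
    into a GeoJSON FeatureCollection with an extrusion attribute."""
    features = []
    with open(path) as f:
        for row in csv.DictReader(f):
            ring = [[float(row[f"lon{i}"]), float(row[f"lat{i}"])]
                    for i in range(16)]
            ring.append(ring[0])                     # close the polygon
            features.append({
                "type": "Feature",
                "geometry": {"type": "Polygon", "coordinates": [ring]},
                "properties": {"height": float(row["height"])},  # z-value
            })
    return {"type": "FeatureCollection", "features": features}

# The returned dict can be written with json.dump() and loaded into
# QGIS or any GeoJSON-aware platform, extruding by the height property.
```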


6.2 Contribution
Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Accordingly, this thesis addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also arise from the study.

6.2.1 Creativity
Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity
Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when involving a large number of data categories together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and cooperate with multiple data sources, directly producing a matching rule from the given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time cost
As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology to a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

6.2.4 Skill requirement
As writing a rule requires both programming and urban design skill, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


6.2.5 Other contributions
The application of our approach can also bring rapid decision-making iterations in urban planning or design. It can visualize decisions, strategies, and ideas on a site without the gap period of the manual urban design or rule-based system. It not only saves time but also makes decision-making more fluent through instant comparison of two strategies or accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and of experimentation. This study also compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration): on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future work.

6.3 Future works
With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. First, the input rendering images can be augmented to more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more scripts of shape grammar, rule-based systems can create mass training datasets with various styles, feeding these future elements. Second, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof-type learning project [19]). After recognizing these types from image references, the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationship between buildings and auxiliaries, which opens the possibility of predicting detailed building outputs. Mask R-CNN can also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner
11) Google Docs product about page. https://www.google.com/docs/about
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681–690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) YOLO: Real-Time Object Detection. https://pjreddie.com/darknet/yolo
18) Images-to-OSM. https://github.com/jremillard/images-to-osm
19) Reconstructing 3D buildings from aerial LiDAR with AI. https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 18: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

32 Generative design tools 321 CityEngine In 2008 ESRI purchased a computational design lab at ETH and polished their study of a generative urban design software platform called City Engine The software has geometric algorithms that immigrated from ArcGIS and ETH computational studies Its core is a shape grammar language called Computer Generative Architecture (CGA) CGA can utilize geospatial data (shapefile format) and generate realistic 3D models from a single geometry It also can take user input as non-spatial parameters This solution is broadly applied as procedural modeling in the industries of animation (eg Big Hero 6 Zootopia) and 3D video games (eg Assassin Creeds series) As stated before CityEngine can quickly update the 3D model via parameter manipulations However the problems of creativity data-driven capacity first-time time consumption and skill requirements still impede its application in the real-world decision-making process 33 Related Machine Learning Works 331 3D voxel reconstruction To reconstruct 3D shapes from 2D pixels 3D voxel data is an option of data transition due to its compatibility of applying image algorithms Jiajun Wu and his colleagues at MIT Computer Vision Group contributed a series of studies of 3D reconstruction from 2D images They used MarrNet [4] to reconstruct 3D IKEA furniture from a single 2D image by training on a correlated 2D and 3D model dataset Girdhar and his team [13]

realized logical shape arithmetic via their TL-embedding network creating 3D novel outputs 3D voxel reconstruction inspired this study the possibility of 3D reconstruction As they presented a voxel model can transform smoothly to another by changing features in latent space linearly gradually which is able to create various novel outputs for design purposes However the output voxel is not a format that is compatible with most design modeling software 332 3D mesh reconstruction In mesh reconstruction pipelines the core idea is stretching(deforming) a basic spherersquos control vertices and matching the stretched geometry with its ground truth geometry through evaluating the chamfer distance [7] or the pixel differences of virtual rendering [8] To produce a more precise outcome more loss functions like facial normal and edge-length are also used in Pixel2Mesh

18

The mesh reconstruction outputs a feasible model for design modeling software Recall this theory basic 3D mesh reconstruction is only able to reconstruct a single object in a task which is not compatible with our urban morphology case--there is usually more than one building on a parcel Our solution builds a different pipeline from mesh reconstruction but uses similar loss calculations as our data structure is also vectorized 333 Object detection To slice a multi-objects task into single-object tasks Mask R-CNN [14] is a potent way extending Faster R-CNN [15] by adding a branch for predicting an object mask as well as the bounding box recognition The predicting segmentation masks randomly convolute the image to predict the Region of Interest (RoI) as classification and bounding box regression Comparing to object detection methods like DenseNet [16] or YOLO [17] Mask R-CNN keeps the balance between accuracy and prediction speed Mask R-CNN became popular in the computational urban planning studies in recent years Images from OSM [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting the sports fields from satellite imagery maps Meanwhile the AI research group from ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19] providing more delicate visualization in 3D scenes

334 3D model dataset As Wang et al [20] demonstrated researchers can easily gather any kind of model from the Trimble 3D model warehouse This warehouse is open-source and free to download massive shapes especially 3D famous single buildings They generated 2D images taken by a surrounding virtual camera from different angles as the training set of 3D recognition or reconstruction studies In this thesis a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset Wangrsquos methodology of making the dataset inspires many later reconstruction studies and this study

19

4 SOLUTION AND METHODOLOGY 41 Solution To establish an intuitive methodology of creating ldquorulesrdquo of building typologies we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules In a typical workflow designers firstly receive a project description with the requirements from clientsusers and the constraints on a site (spatial boundary) And then designers usually start with searching urban morphology or building typology precedents from their experiences or image references on search engines like Pinterest and Archdaily Clientsusers will also provide their favorite image references ensuring urban designers understanding their preferences and requirements Urban designers need to extract building typologies from references and draw diagrams as the prototypes of building typologies Afterward they adjust these prototypes into different parcels on their site in 3D modeling software (eg Rhinoceros SketchUp) based on the parcel shape and street orientation Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code After these works urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations Once approved by decision-makers this 3D urban morphology will be archived as a geospatial dataset in GIS platforms for further urban data management As addressed in the introduction section in current rule-based approaches (eg CityEngine) urban designers can create rules to generate building typologies from parcels avoiding the manual modeling work of drawing 3D geometries in 3D modeling software Urban designers need to translate a building typology to a rule which is extracted from references From a technical perspective urban designers write the code of a big decision tree organizing corresponding built-in functions and parameters from the properties of geospatial data The properties usually include area perimeter land-use height limit Floor Area Ratio greenspace coverage building coverage etc Therefore urban designers have to consider many cases to adjust their building typologies to the parcels on their site The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules That is rule-based systems require decision trees to link a) the features from image references to b) the features from the corresponding 3D building shapes

20

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

4.4.1.2 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN). Given the encoded input, a set of proposed 3D Convolutional LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate. At time step t, when a new input is received, the operation of an LSTM unit can be expressed as in formula2, where $i_t$, $o_t$, and $f_t$ refer to the input gate, the output gate, and the forget gate respectively, and $s_t$ and $h_t$ refer to the memory cell and the hidden state respectively.

formula2 3D-LSTM kernel: forget and update gates [2]
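Following [2], the forget gate, input gate, memory cell, and hidden state of the 3D Convolutional LSTM can be written as below (the output gate is dropped in the paper's simplified variant):

$$
\begin{aligned}
f_t &= \sigma\!\left(W_f \mathcal{T}(x_t) + U_f * h_{t-1} + b_f\right) \\
i_t &= \sigma\!\left(W_i \mathcal{T}(x_t) + U_i * h_{t-1} + b_i\right) \\
s_t &= f_t \odot s_{t-1} + i_t \odot \tanh\!\left(W_s \mathcal{T}(x_t) + U_s * h_{t-1} + b_s\right) \\
h_t &= \tanh(s_t)
\end{aligned}
$$

where $\mathcal{T}(x_t)$ is the encoded input feature, $*$ denotes 3D convolution, and $\odot$ the element-wise product.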

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability of the existence of a voxel cell at each location, computed with a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

4.4.2 3D mesh reconstruction

A voxel model is difficult to clean and simplify into a feasible model in urban design modeling software. Also, a voxel model stores data inside closed objects, which is inefficient and becomes increasingly large after scaling. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy in the training process. Since a building typology is influenced by properties like the height limit or land-use, additional bitmaps of these properties can help our network in the training process. To distribute the computational workload, two machine learning neural networks have been proposed: a) translating a 2D image of a building typology to a top-view location map, and b) 3D mesh reconstruction.

4.4.2.1 Data structure of the training/validating dataset

At the parcel level, an aerial image is cropped from its parent block-level image and scaled to a fixed size. Additional information like the height limit and land-use is stored as gray-scale images on the 2D top view. Ground truth labels are 3D point clouds with normals (six dimensions in total) calculated from mesh models (fig10).


fig10 Examples of the training/validating dataset

4.4.2.2 Network A: translating a 2D image of building typology to a top-view location map

Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of a target parcel. As mesh reconstruction does not preserve the location of the 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by 24 surrounding cameras and cropped by the bounding box of this parcel; N, the number of views per parcel, is picked randomly in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space. The feature array merges into a column vector via Gated Recurrent Units (GRU).

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula3) stores the feature of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g., the height-limit bitmap). The following parts of the network are a decoder and a generator, as in a classic GAN. In iterations (formula4), we have the MSE loss of the 2D height map prediction, the least-squared reconstruction loss of a reconstructed rendering image (compared to all angles of renderings of this parcel, returning the minimum one), and the MSE losses of the additional constraints, as multiple losses. The final combined loss is optimized by Adam with a learning rate of 0.001.

formula4 combined loss
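With the individual terms named above, one plausible reading of formula4 is sketched below; the equal weighting of the terms is an assumption, as the original factors are not recoverable here:

$$
L \;=\; L_{\mathrm{MSE}}\big(\hat{H}, H\big)
\;+\; \min_{k \in \{1,\dots,24\}} L_{\mathrm{LSQ}}\big(\hat{I}, I_k\big)
\;+\; \sum_{c} L_{\mathrm{MSE}}\big(\hat{B}_c, B_c\big)
$$

where $\hat{H}$ is the predicted 2D height map, $\hat{I}$ the reconstructed rendering compared against all 24 view renderings $I_k$, and $\hat{B}_c$ the prediction for each additional constraint bitmap $B_c$.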

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typologies with their ground truth 3D models (stored as point clouds with normals) through the Chamfer distance loss, the normal loss, and Laplacian regularization. In Pixel2Mesh, the Chamfer loss constrains the locations of mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.
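The Chamfer distance used here (and again in the evaluations of tab4) can be sketched as a brute-force computation; production pipelines would use a batched GPU implementation, but the quantity computed is the same.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N x 3) and q (M x 3).

    For every point in one set, find its nearest neighbor in the other set,
    then average both directions.
    """
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # N x M pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```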

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are placed by the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding unnecessary computation on images and mesh data. In this approach, we adjust the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of the training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential parcels with two buildings occupy the most), we scale it by land-use and building-number factors to 10216 samples (tab5). This prevents our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and the scaled training dataset (lower), LA sample

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from 24 aerial renderings of building typologies via their bounding boxes. Additionally, in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information. Spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. The csv table of each spatial parcel data input is organized accordingly.


The ground-truth label is the spatial building data corresponding to the parcels. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. For each csv table of spatial building data, its property has the number of buildings in the parcel, ten height values, and ten geometry records. The table thus includes 331 values (1 count + 10 heights + 10 x 16 x 2 coordinates).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
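One way to realize the interpolation/simplification of fig14 and the smoothing of fig15 is sketched below: uniform arc-length resampling to exactly 16 vertices, and Chaikin corner cutting for smoothing. Both are assumptions about the scripts' details, not the exact code of this study.

```python
import numpy as np

def resample_ring(vertices, n=16):
    """Interpolate/simplify a closed polygon to exactly n vertices by
    uniform arc-length resampling (cf. fig14)."""
    pts = np.asarray(vertices, dtype=float)
    ring = np.vstack([pts, pts[:1]])                    # close the ring
    seg = np.linalg.norm(np.diff(ring, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])         # cumulative length
    s = np.linspace(0.0, t[-1], n, endpoint=False)
    return np.stack([np.interp(s, t, ring[:, 0]),
                     np.interp(s, t, ring[:, 1])], axis=1)

def smooth_ring(vertices, iterations=1):
    """Chaikin corner cutting on a closed polygon (cf. fig15)."""
    pts = np.asarray(vertices, dtype=float)
    for _ in range(iterations):
        nxt = np.roll(pts, -1, axis=0)
        q = 0.75 * pts + 0.25 * nxt   # point 1/4 along each edge
        r = 0.25 * pts + 0.75 * nxt   # point 3/4 along each edge
        pts = np.empty((2 * len(q), 2))
        pts[0::2], pts[1::2] = q, r
    return pts

# Example: parcel16 = resample_ring(smooth_ring(raw_vertices, 2))
```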


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval so that it can be learned by neural networks. After analyzing the dataset, we can filter the raw dataset and scale each field via its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices), spreading out the coordinate values.

Parcel data:

| Field name | (field 1) | (field 2) | (field 3) | (field 4) | (field 5) |
| --- | --- | --- | --- | --- | --- |
| Extent | -0.9980~0.9994 | -0.9988~0.9989 | 396~34567 | 83~761 | 0~100 |
| Output extent | -1~1 | -1~1 | 0~1 | 0~1 | 0~1 |
| Preserve | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |

Building data:

| Field name | (field 1) | (field 2) | (field 3) | (field 4) |
| --- | --- | --- | --- | --- |
| Extent | -0.9928~0.9828 | -0.9934~0.9982 | 1~9 | 1~985 |
| Output extent | -1~1 | -1~1 | 0~1 | 0~1 |
| Preserve | 1.0 | 1.0 | 1.0 | 1.0 |

tab6 The conversion of the dataset (from the entire training/validating dataset)
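The per-field rescaling and the per-parcel coordinate normalization described above can be sketched as below; treating the scale method as a linear min-max map is an assumption consistent with the listed extents.

```python
import numpy as np

def rescale(values, lo, hi, out_lo=0.0, out_hi=1.0):
    """Linearly map a field from [lo, hi] to [out_lo, out_hi] (cf. tab6)."""
    values = np.asarray(values, dtype=float)
    return out_lo + (values - lo) * (out_hi - out_lo) / (hi - lo)

def normalize_by_parcel(coords, parcel):
    """Normalize lon/lat vertices into roughly [-1, 1] by the max extent
    of the parcel boundary, as described above."""
    parcel = np.asarray(parcel, dtype=float)
    center = (parcel.max(axis=0) + parcel.min(axis=0)) / 2.0
    half_extent = (parcel.max(axis=0) - parcel.min(axis=0)).max() / 2.0
    return (np.asarray(coords, dtype=float) - center) / half_extent
```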


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). Firstly, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the methodology of the 3D-R2N2 implementation in the first approach, or a 3D-GRU (Gated Recurrent Units) as used in Pixel2Mesh. This network (ResNet as in 3D-R2N2, or VGG16, as the base model) convolves the input images (N views are randomly picked per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 in the first approach. Secondly, we convolve the input parcel geometry and fully-connect the parcel property data into two feature vectors (FC2, FC3) and densify (fully-connect) them, as well as a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in the latent space. The numbers of features are inspired by the settings in 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, the building heights, and the building geometry as three separate outputs, with a convolution layer only in the geometry prediction.

fig16 geospatial prediction network structure


formula5 the concatenated mid feature vector
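A compact functional-API sketch of fig16 and formula5 follows. The 16-vertex geometry encoding, the 10-building cap, and the GRU view merge follow the text; the image size, the layer widths, and the dense geometry decoder (the text places a convolution layer there) are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

img_in = layers.Input(shape=(None, 128, 128, 3))  # N views, N in 1..5
geom_in = layers.Input(shape=(16, 2))             # 16 parcel vertices
prop_in = layers.Input(shape=(4,))                # area, perimeter, ...

# Per-view CNN encoder (VGG16 base), then a GRU merges the view sequence.
vgg = tf.keras.applications.VGG16(include_top=False, weights=None,
                                  input_shape=(128, 128, 3))
encoder = tf.keras.Sequential([vgg, layers.GlobalAveragePooling2D()])
fc1_r = layers.TimeDistributed(encoder)(img_in)   # FC1_R: time-sequential
fc1 = layers.GRU(1024)(fc1_r)                     # FC1: merged column vector

fc2 = layers.Flatten()(layers.Conv1D(32, 3, padding="same")(geom_in))  # FC2
fc3 = layers.Dense(64, activation="relu")(prop_in)                     # FC3

dc1 = layers.Dense(512, activation="relu")(fc1)
dc2 = layers.Dense(256, activation="relu")(fc2)
dc3 = layers.Dense(64, activation="relu")(fc3)
mid = layers.Concatenate()([dc1, dc2, dc3])       # MID (formula5)

n_out = layers.Dense(1, name="num_buildings")(mid)
h_out = layers.Dense(10, name="heights")(mid)
g_out = layers.Reshape((10, 16, 2), name="geometries")(
    layers.Dense(10 * 16 * 2, activation="tanh")(mid))

model = Model([img_in, geom_in, prop_in], [n_out, h_out, g_out])
```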

Three loss functions (formula6) calculate the difference between the prediction and the ground-truth label in terms of the number of buildings, the building heights, and the vertices of the building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel. For an instance of a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers several parts to let the neural network learn the geometries, including center, relative vertices, unit tangent vector, curvature, edge length, etc. They constrain different geometry characteristics (fig17):

- Center: the location of building footprints
- Relative vertices: the shape and size of each geometry
- Unit tangent vector: the rotation angle of each geometry
- Discrete curvature: the corners of each geometry
- Edge length: the edge length of each geometry
- Extent x and y: the horizontal and vertical scale of building footprints
- Maximum discrete curvature: the maximum corner angle of each geometry
- Number of corners: the number of corners of each geometry
- Normalized coordinates: the shape and size of each geometry
- Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction


The final combined loss leans more toward Loss3, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for a better organization of the losses. Thus all losses are in the range of 0 to 1, and they are balanced by weighting factors that consider the factors in Pixel2Mesh and are adjusted by experimental results (see the next section). In order, formula6 defines: the number of buildings (Loss1); the 10 building heights (Loss2); center coordinates; absolute coordinates; relative coordinates; normalized coordinates (with rolled-vertex and batch-normalized variants); unit tangent vector; discrete curvature; edge length; extent x and y; maximum discrete curvature; the number of corners; and the combined loss.

formula6 Loss functions
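The individual formulas are best read in the original document; their overall shape, with the masking described above, is along these lines, where the per-term definition is an illustrative assumption (e.g., a mean penalty over only the first $n_b$ buildings of a parcel):

$$
Loss_2 \;=\; \frac{1}{n_b}\sum_{j=1}^{n_b}\big\lVert h_j - \hat{h}_j \big\rVert,
\qquad
L \;=\; \lambda_1\,Loss_1 + \lambda_2\,Loss_2 + \sum_{i}\lambda_{3i}\,Loss_{3i}
$$

with $n_b$ the true number of buildings in the parcel, and the $\lambda$ weights the factors set in formula7.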


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence, its evaluations are tested by modifying design elements.

5.1 Extracting building typologies

The first Mask R-CNN implementation model is trained for 80 epochs and achieves a 0.11 bounding-box loss and a 0.41 mask loss on the training set, and a 0.15 bounding-box loss and a 0.40 mask loss on the validating set (fig18). As we can see, the network performs well in the bounding-box prediction but not as well in the mask prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results

As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output is an ambiguous probability model that cannot be cleaned or simplified without the bias of an algorithm. Also, voxel models demand increasingly heavy storage and computation as their scale grows, much like the size comparison between rasterized and vectorized files. Moreover, voxel models are rare in urban design with respect to their editability. However, we can see the image learning capacity from the different results predicted from high-rise and low-rise reference images. As with the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network intended to predict the location map generates similar results from various inputs. Furthermore, the network is not able to predict the angle of a building footprint well, predicting unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain too many trivial mesh faces for most cases of urban design models. Linking these two networks (location-map prediction and 3D mesh reconstruction) cannot easily avoid the bias of a shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

fig20 Outputs from the three approaches (voxel, mesh, and spatial data)

5.3 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and the experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

Also, we test the ablation effect of the various loss functions (see tab5). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustments of the multiple loss functions.

| loss (factor) | Baseline | 3D-R2N2-GRU | VGG16-LSTM | isolated dataset | top view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| combined loss | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416 |
| loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621 |
| loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.066 |
| loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.05 |
| loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.184 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801 |
| loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636 |
| loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435 |
| loss 3E (1) | 0.0445 | 0.045 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.044 | 0.0424 | 0.0441 | 0.0514 |

tab5 losses of the ablation test

The learning of the isolated dataset with VGG16-GRU shows the best result among the reference groups, and the losses of the geometries demonstrate their effects relative to one another. The 2D top-view dataset shows only slightly worse results in heights/numbers, but better results in the shape loss. In tab6 we can also see the constraints on shape, corner, angle, and position in the different ablations of the loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distribution of the DC1/DC2 layer features after t-SNE dimension reduction, representing the learning capacities for reference images and geometries that cannot simply be determined by loss values. The isolated dataset still shows the best distribution visually.
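Such plots can be produced along these lines; the feature dump and its file name are hypothetical stand-ins for the collected DC1/DC2 activations.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# "dc1_features.npy" is an assumed dump of DC1 activations over the
# validation set, one row per parcel.
features = np.load("dc1_features.npy")
xy = TSNE(n_components=2, perplexity=30).fit_transform(features)

plt.scatter(xy[:, 0], xy[:, 1], s=4)
plt.title("t-SNE of DC1 encoded features")
plt.show()
```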

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial to be able to manipulate the learning network to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after the inputs are encoded as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. The smaller parcel also predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.
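The interpolations themselves are simple linear blends in the latent space; the decoder and preview calls below are hypothetical stand-ins for the pipeline-specific steps.

```python
import numpy as np

def interpolate_latent(z_a, z_b, steps=5):
    """Linearly blend two encoded feature vectors, e.g. the 1000-d
    reference-image codes or 300-d parcel codes mentioned above."""
    return [(1.0 - t) * z_a + t * z_b for t in np.linspace(0.0, 1.0, steps)]

# for z in interpolate_latent(z_one_building, z_two_buildings):
#     preview(decode(z))   # decode()/preview() stand in for the pipeline
```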


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained from the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights with the isolated dataset and a better size/number of buildings with the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution can serve two functions integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like csv file including the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR, etc.), users can run a new training and validating process on their computers. By using the pre-trained model provided by my study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, achieving the geospatial building geometries with rich information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry as a shape grammar does.

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does
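A hedged sketch of the csv-to-GIS conversion step follows; the file and column names ("coords", "height") are assumptions about the csv layout rather than the exact field names of the thesis scripts.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon
from ast import literal_eval

df = pd.read_csv("predicted_buildings.csv")          # assumed file name
# "coords" is assumed to hold the 16 lon/lat pairs of each footprint.
geoms = [Polygon(literal_eval(ring)) for ring in df["coords"]]
gdf = gpd.GeoDataFrame(df.drop(columns=["coords"]),
                       geometry=geoms, crs="EPSG:4326")

gdf.to_file("buildings.geojson", driver="GeoJSON")   # GeoJSON for web GIS
gdf.to_file("buildings.shp")                         # shapefile for desktop GIS
# Reading the "height" field as the extrusion (z) value yields the 3D model.
```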


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved from the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a larger number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and cooperate with multiple sources of data, directly producing a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time time consumption

As a part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skill, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply tell their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of the manual urban design or rule-based workflow. It not only saves time but also makes decision-making more fluent through the instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach serves its deliverable features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and experiments. Also, this study compares the differences between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a lot of computational energy, as a comparison of the batch size (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB graphics memory, it is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future works.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine the future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented to more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more scripts of shape grammar, rule-based systems can create massive training datasets with various styles to feed these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as a part of the building properties (see the roof-type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationship between buildings and auxiliaries, which means the possibility of predicting a detailed building output. The Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website: https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI: https://doc.arcgis.com/en/geoplanner
11) Google Docs product about page: https://www.google.com/docs/about
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520


The mesh reconstruction outputs a feasible model for design modeling software. Recall this theory: basic 3D mesh reconstruction is only able to reconstruct a single object per task, which is not compatible with our urban morphology case, as there is usually more than one building on a parcel. Our solution builds a different pipeline from mesh reconstruction, but uses similar loss calculations, as our data structure is also vectorized.

3.3.3 Object detection

To slice a multi-object task into single-object tasks, Mask R-CNN [14] is a potent approach, extending Faster R-CNN [15] by adding a branch for predicting an object mask in addition to the bounding-box recognition. It convolves the image to propose Regions of Interest (RoI) and performs classification, bounding-box regression, and segmentation-mask prediction on each. Compared to object detection methods like DenseNet [16] or YOLO [17], Mask R-CNN keeps the balance between accuracy and prediction speed. Mask R-CNN has become popular in computational urban planning studies in recent years. The Images-to-OSM project [18] used Mask R-CNN to enrich the segments in OpenStreetMap by predicting sports fields from satellite imagery. Meanwhile, the AI research group at ESRI used Mask R-CNN to classify roof typologies from LiDAR satellite data [19], providing a more delicate visualization in 3D scenes.

3.3.4 3D model dataset

As Wang et al. [20] demonstrated, researchers can easily gather all kinds of models from the Trimble 3D model warehouse. This warehouse is open-source and free for downloading massive numbers of shapes, especially famous single buildings in 3D. They generated 2D images taken by a surrounding virtual camera from different angles as the training set for 3D recognition or reconstruction studies. In this thesis, a huge challenge comes from the lack of a feasible way of making 3D models for the training dataset. Wang's methodology of making the dataset inspires many later reconstruction studies, including this one.


4 SOLUTION AND METHODOLOGY

4.1 Solution

To establish an intuitive methodology of creating "rules" of building typologies, we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules. In a typical workflow, designers first receive a project description with the requirements from clients/users and the constraints on a site (spatial boundary). Then designers usually start by searching for urban morphology or building typology precedents from their experiences or from image references on search engines like Pinterest and Archdaily. Clients/users will also provide their favorite image references, ensuring that urban designers understand their preferences and requirements. Urban designers need to extract building typologies from the references and draw diagrams as the prototypes of building typologies. Afterward, they adjust these prototypes to different parcels on their site in 3D modeling software (e.g., Rhinoceros, SketchUp) based on the parcel shape and street orientation. Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code. After these works, urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations. Once approved by the decision-makers, this 3D urban morphology is archived as a geospatial dataset in GIS platforms for further urban data management.

As addressed in the introduction section, in current rule-based approaches (e.g., CityEngine), urban designers can create rules to generate building typologies from parcels, avoiding the manual modeling work of drawing 3D geometries in 3D modeling software. Urban designers need to translate a building typology into a rule, which is extracted from references. From a technical perspective, urban designers write the code of a big decision tree, organizing the corresponding built-in functions and parameters from the properties of geospatial data. The properties usually include area, perimeter, land-use, height limit, Floor Area Ratio, greenspace coverage, building coverage, etc. Therefore, urban designers have to consider many cases to adjust their building typologies to the parcels on their site. The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules. That is, rule-based systems require decision trees to link a) the features from image references to b) the features of the corresponding 3D building shapes.


| | Find building typologies from reference images | Translate design language to scripts | Adjust to a site | Output data as | Update (after collecting decision-makers' comments) |
| --- | --- | --- | --- | --- | --- |
| manual | Find by experience | Analyze the building typology | Draw building footprints and extrude them | Static 3D mesh/NURBS models; static 2D drawings | Draw again from building footprints |
| rule-based | Find by experience | Create a rule via built-in functions | Apply a rule onto parcel geometries | Generated 3D mesh models; geospatial data; rule script | Change parameters or apply onto new parcel geometries |
| spatial data reconstruction (ours) | Extract by computer | Use a pre-trained model | Predict spatial buildings for parcel geometries | Geospatial data; trained model | Modify or predict again from new parcel geometries |

tab1 the mechanisms of the manual, rule-based, and our approaches

In contrast (tab1), this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images, parcel geometries, and building geometries, and b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees. After creating rules intuitively, designers can apply the rules to their site, enjoying the same advantages as rule-based systems in the following stages: achieving output data as geospatial data and updating geometries as groups.

4.2 Dataset

To allow computers to learn 3D building typologies, a collection of 3D building models is necessary. Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2). Two raw datasets are required in the series of machine learning pipelines: a) parcel geometry with planning properties (e.g., land-use code, height limits) and b) building footprint geometry with height information. They are augmented by scripts and prepared for the different machine learning pipelines (see the corresponding sections).


| city | source (parcel/building) | Parcel count | Parcel properties | Building count | Building properties |
| --- | --- | --- | --- | --- | --- |
| Los Angeles | SCAG_county_zoning / Lariac 2008 building footprint | 2376370 | Land-use zoning; height limit | 3141244 | Height; elevation |

tab2 The list of raw data

4.3 Extracting building typologies

Because our resulting "rule" is purposed to generate building typologies from parcel geometries, the 3D reconstruction pipelines should process at the parcel level. Hence, this first network performs before the 3D reconstruction, extracting 2D aerial images of building typologies from input images of urban morphology (at the block level). The training/validating data are augmented from the raw spatial dataset. Given the 2D geometries of building footprints with height information, we extract them via a Blender script and achieve a 3D model file (obj) of the 3D building envelopes on each block. The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via a Blender script. These images are converted into binary images by masking each parcel of the block. The multiple binary images are stacked as a multi-channel image for each view. Finally, a block model generates a) input data: 24 views of aerial images, and b) ground truth labels: 24 multi-channel mask images, where N is the number of parcels in this block (formula1). Random center cropping and random horizontal flipping are applied during data loading to avoid overfitting. The augmented dataset is separated into training and validating data at a ratio of 0.8.

formula1 The ground-truth label as a multi-channel image, where N is the number of parcels in a block
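In shape terms, one way to write formula1 is given below; H and W stand for the render resolution, which is not specified in this excerpt:

$$
X = \{\, x_v \in \mathbb{R}^{H \times W \times 3} \,\}_{v=1}^{24},
\qquad
Y = \{\, y_v \in \{0,1\}^{H \times W \times N} \,\}_{v=1}^{24}
$$

with one binary mask channel per parcel stacked along the last axis of each view's label.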

After loading the dataset, a Mask R-CNN network uses ResNet 101 as the base model, predicting masks as the output from the input images (fig7). This part of the pipeline is an implementation of He's Mask R-CNN network [14], including the Region Proposal Network (RPN) and the predictions (classification, box, and binary mask prediction). For each Region of Interest (RoI), the loss function is constructed from the classification, bounding-box, and mask losses. The output extracted parcel rendering images serve as a part of the input in the following 3D reconstruction approaches.


fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

The first five columns are model/dataset variants; the last five columns each drop one loss function (relative distance, unit tangent vector, curvature, center, edge length).

loss (weight)  Baseline  3D-R2N2-GRU  VGG16-LSTM  isolated  Top-view  w/o rel.dist.  w/o tangent  w/o curvature  w/o center  w/o edge len.
loss           0.8842    0.9645       0.8745      0.8351    0.9704    0.8353         0.7166       0.7189         0.4403      0.8416
loss 1 (0.5)   0.0601    0.0665       0.0598      0.0473    0.1226    0.0638         0.0583       0.0577         0.0489      0.0621
loss 2 (0.1)   0.0658    0.0636       0.0658      0.0507    0.0949    0.0652         0.0647       0.0652         0.0584      0.0660
loss 3A (1)    0.0518    0.0519       0.0514      0.0527    0.0324    0.0589         0.0525       0.0478         0.0498      0.0500
loss 3B (1)    0.1796    0.1793       0.1794      0.1840    0.0947    0.1795         0.1947       0.1839         0.1754      0.1801
loss 3C (1)    0.1623    0.1614       0.1618      0.1632    0.1381    0.1616         0.1744       0.2392         0.1602      0.1636
loss 3D (10)   0.0433    0.0514       0.0437      0.0371    0.0592    0.0439         0.0433       0.0432         0.0598      0.0435
loss 3E (1)    0.0445    0.0450       0.0448      0.0408    0.0428    0.0427         0.0440       0.0424         0.0441      0.0514

tab.5 losses of the ablation test

The isolated dataset with VGG16-GRU learns best among the reference groups, and the losses of the geometries demonstrate their mutual effects. The 2D top-view dataset is only slightly worse in heights/numbers but better in the shape losses. In tab.6 we can also see the constraints of shape, corner angle, and position under the different ablations of the loss functions.

tab.6 spatial building predictions of the ablation test

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig.22 to fig.25) show the distribution of the DC1/DC2 layers after T-SNE dimension reduction, representing learning capacities of reference images and geometries that cannot simply be judged from loss values. The isolated dataset still shows the best distribution visually.
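As a sketch of how such plots can be produced, assuming the DC1 activations and per-sample land-use labels are exported as numpy arrays (the file names here are illustrative):

# Minimal sketch: project encoded DC1 features to 2D with t-SNE and plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

dc1 = np.load("dc1_features.npy")        # assumed export, (n_samples, feature_dim)
labels = np.load("landuse_labels.npy")   # assumed per-sample category codes

xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(dc1)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=4, cmap="tab10")
plt.title("T-SNE of reference-image features (DC1)")
plt.show()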

fig.22 T-SNE mapping of reference-image encoded features (DC1)


fig.23 T-SNE mapping of parcel-geometry encoded features (DC2)

fig.24 T-SNE mapping of reference-image encoded features, isolated dataset (DC1)


fig.25 T-SNE mapping of reference-image encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig.26 to fig.29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after the inputs are encoded as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the dense and taller ones of the larger parcel.
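A minimal sketch of the interpolation procedure; encode and decode stand for the trained network split at the latent (MID) layer, so the names are illustrative rather than fixed APIs:

# Linear interpolation between two encoded inputs in latent space.
import numpy as np

def interpolate_predictions(encode, decode, input_a, input_b, steps=5):
    z_a, z_b = encode(input_a), encode(input_b)   # latent feature vectors
    outputs = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b             # linear blend in latent space
        outputs.append(decode(z))                 # building count, heights, footprints
    return outputs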


fig.26 Test results of different building typologies


fig.27 Interpolating the shapes of a parcel

fig.28 Test results of different land-uses of a parcel


fig.29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (see fig.30 and fig.31). Using the same network and settings, we get better heights with the isolated dataset and better sizes/numbers of buildings with the 2D top-view dataset.

fig.30 Test results of the isolated image dataset


fig.31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution serves two functions for integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like csv file including the information of the 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g. height, height limit, FAR), users can run a new training and validating process on their own computers. Using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (e.g. one or multiple parcels) and image references of building typologies, achieving information-rich geospatial building geometries as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g. shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), the 3D urban morphology can be visualized on most platforms. In other words (fig.32), the network predicts, through a matrix of weights learned from the training set, building footprint geometries from a parcel geometry, much as a shape grammar does.

fig.32 How weights learned from images work to generate geometries, as a rule-based pipeline does
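A sketch of this conversion step with geopandas and shapely; the column names of the exported csv ("wkt", "height") are assumptions about the export format:

# Convert the predicted csv into a GeoJSON file readable by most GIS platforms.
import pandas as pd
import geopandas as gpd
from shapely import wkt

df = pd.read_csv("predicted_buildings.csv")
gdf = gpd.GeoDataFrame(
    df.drop(columns=["wkt"]),
    geometry=df["wkt"].apply(wkt.loads),   # building footprint polygons
    crs="EPSG:4326",                       # longitude/latitude
)
gdf.to_file("predicted_buildings.geojson", driver="GeoJSON")
# A viewer then extrudes each polygon by its z-value ("height") for 3D display.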


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). This thesis thereby addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also result from the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option that allows users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g. FAR, coverage, height limit). Our approach does not require decision trees to handle and combine multiple sources of data; it directly produces a matching rule from the given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time time-consumption

As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can operate a rule-based system. Our approach offers a bigger opportunity, delivered to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from those provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It can visualize decisions, strategies, and ideas on a site without the gap period of the manual urban design or rule-based workflow. This not only saves time but also makes decision-making more fluent, enabling an instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the cost of licenses and of experimentation. This study also compares the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks; the exploration of data structures and data augmentation can provide experimental help to other related studies. Finally, the solution saves a great deal of computation, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future work.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine the next steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented to more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create mass training datasets in various styles to feed these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof type learning project [19]). After recognizing the types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may further enhance the pipeline of this study (like the general approach of [21]); it could contribute to predicting the relationships between buildings and auxiliary structures, which means the possibility of predicting detailed building outputs. Mask R-CNN could also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner
11) Google Docs product about page. https://www.google.com/docs/about
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520



4 SOLUTION AND METHODOLOGY

4.1 Solution

To establish an intuitive methodology for creating "rules" of building typologies, we can start by interpreting how human urban designers make typologies in the classic workflow and how rule-based approaches create rules. In a typical workflow, designers first receive a project description with the requirements from clients/users and the constraints of a site (spatial boundary). Designers then usually start by searching for urban morphology or building typology precedents from their experience or from image references on search engines like Pinterest and Archdaily. Clients/users also provide their favorite image references, ensuring that urban designers understand their preferences and requirements. Urban designers need to extract building typologies from the references and draw diagrams as the prototypes of building typologies. Afterward, they adjust these prototypes to the different parcels of their site in 3D modeling software (e.g. Rhinoceros, SketchUp), based on the parcel shapes and street orientations. Urban designers also need to evaluate whether the adjusted building typologies satisfy the building or zoning code. After this work, urban designers assemble all adjusted building typologies on their site as a comprehensive 3D urban morphology for renderings and presentations. Once approved by decision-makers, this 3D urban morphology is archived as a geospatial dataset in GIS platforms for further urban data management. As addressed in the introduction section, in current rule-based approaches (e.g. CityEngine) urban designers can create rules to generate building typologies from parcels, avoiding the manual modeling work of drawing 3D geometries in 3D modeling software. Urban designers need to translate a building typology to a rule extracted from references. From a technical perspective, urban designers write the code of a big decision tree, organizing the corresponding built-in functions and parameters from the properties of geospatial data. The properties usually include area, perimeter, land-use, height limit, Floor Area Ratio, greenspace coverage, building coverage, etc. Therefore, urban designers have to consider many cases to adjust their building typologies to the parcels of their site. The efficiency empowered by the rule-based system is thus harmed by the process of translating image references to rules. That is, rule-based systems require decision trees to link a) the features of image references to b) the features of the corresponding 3D building shapes.


manual
  find building typologies from reference images: find by experience
  translate design language to scripts: analyze the building typology
  adjust to a site: draw building footprints and extrude them
  output data as: static 3D mesh/NURBS models; static 2D drawings
  update (after collecting decision-makers' comments): draw again from building footprints

rule-based
  find building typologies from reference images: find by experience
  translate design language to scripts: create a rule via built-in functions
  adjust to a site: apply a rule onto parcel geometries
  output data as: generated 3D mesh models; geospatial data; rule script
  update (after collecting decision-makers' comments): change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)
  find building typologies from reference images: extract by computer
  translate design language to scripts: use a pre-trained model
  adjust to a site: predict spatial buildings for parcel geometries
  output data as: geospatial data; trained model
  update (after collecting decision-makers' comments): modify or predict again from new parcel geometries

tab.1 the mechanisms of the manual, rule-based, and our approaches

(tab.1) In contrast, this thesis study explores an improved replacement by a) utilizing computational algorithms to extract features from reference images, parcel geometries, and building geometries, and b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees. After creating rules intuitively, designers can apply the rules to their sites and enjoy the same advantages as rule-based systems in the following stages: achieving output data as geospatial data and updating geometries as groups.

4.2 Dataset

To allow a computer to learn 3D building typologies, a collection of 3D building models is necessary. Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab.2). Two raw datasets are required in the series of machine learning pipelines: a) parcel geometries with planning properties (e.g. land-use code, height limits), and b) building footprint geometries with height information. They are augmented by scripts and prepared for the different machine learning pipelines (see the corresponding sections).


city: Los Angeles
source (parcel/building): SCAG_county_zoning / Lariac 2008 building footprint
parcel count: 2,376,370
parcel properties: land-use zoning, height limit
building count: 3,141,244
building properties: height, elevation

tab.2 The list of raw data

4.3 Extracting building typologies

Since our resulting "rule" is purposed to generate building typologies from parcel geometries, the 3D reconstruction pipelines should operate at the parcel level. Hence this first network runs before 3D reconstruction, extracting 2D aerial images of building typologies from input images of urban morphology (at the block level). The training/validating data are augmented from the raw spatial dataset. Given the 2D geometries of building footprints with height information, we extrude them via a Blender script into a 3D model file (.obj) of the 3D building envelopes of each block. The input images are RGB images taken by 24 virtual cameras surrounding each 3D model, again via a Blender script. These images are converted to binary images by masking each parcel of the block, and the multiple binary images are stacked as a multi-channel image for each view. Finally, a block model generates a) input data, 24 views of aerial images, and b) ground truth labels, 24 multi-channel mask images, where N is the number of parcels in the block (formula 1). Random center cropping and random horizontal flipping are applied during data loading to avoid overfitting. The augmented dataset is separated into training and validating data at a ratio of 0.8.

formula 1. The ground-truth label as a multi-channel image, where N is the number of parcels in a block
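A sketch of assembling one such label, where the N binary parcel masks of a view are stacked into one multi-channel image:

# Stack the N binary parcel masks of one block view into a multi-channel label.
import numpy as np

def stack_parcel_masks(parcel_masks):
    # parcel_masks: list of N boolean arrays (H, W), one per parcel in a block.
    label = np.stack(parcel_masks, axis=-1)   # (H, W, N)
    return label.astype(np.uint8)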

After loading the dataset, a Mask R-CNN network uses ResNet-101 as the base model, predicting masks from the input images (fig.7). This part of the pipeline is an implementation of He's Mask R-CNN network [14], including the Region Proposal Network (RPN) and the prediction heads (classification, box, and binary mask prediction). For each Region of Interest (RoI), the loss function is constructed from the classification, bounding-box, and mask losses. The extracted parcel rendering images serve as a part of the input in the following 3D reconstruction approaches.


fig.7 The pipeline of Mask R-CNN [14]
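For orientation, a fine-tuning loop in this spirit can be sketched with torchvision; note that the stock torchvision model uses a ResNet-50-FPN backbone, whereas our implementation uses ResNet-101, so this is indicative only.

# Indicative Mask R-CNN fine-tuning setup (two classes: parcel vs background).
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_one_epoch(data_loader):
    # Each target dict carries "boxes", "labels", and "masks" for one image.
    model.train()
    for images, targets in data_loader:
        losses = model(images, targets)   # classification, box, mask, RPN losses
        total = sum(losses.values())
        optimizer.zero_grad()
        total.backward()
        optimizer.step()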

4.4 3D reconstruction

In this thesis study, three reconstruction approaches were attempted on different types of inputs and ground-truth labels. The first two approaches build on general 3D reconstruction and serve as references. Comparisons of the approaches (tab.3) and of the 3D model formats (tab.4) are shown below. While voxel and mesh models are popular in general machine learning studies, our final solution is built only for urban planning cases, absorbing the corresponding techniques from the first two approaches. Its GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms.

1. 3D voxel reconstruction
   input: aerial image of building typologies (.png)
   ground truth label: 3D building model (voxel, .mat)
   software platform for data processing: ArcGIS Pro, CityEngine, Binvox, PyTorch

2. 3D mesh reconstruction
   input: aerial image of building typologies (.png); 2D bitmap (.png)
   ground truth label: 3D building model (point cloud, .xyz)
   software platform for data processing: ArcGIS Pro, Blender, Tensorflow (Keras)

3. Geospatial data prediction
   input: aerial image of building typologies (.png); parcel data (.csv)
   ground truth label: building data (.csv)
   software platform for data processing: QGIS, Blender, Tensorflow (Keras)

tab.3 data structure of the three approaches


Voxel
  data loading: N x N x N x 1 (mass/void)
  reconstruction: from 2D pixels
  evaluation: logical is-or-not; Intersection over Union (IoU)
  software: Minecraft, ...
  ML projects: 3D-R2N2, MarrNet, ...

Point cloud
  data loading: N x 3 (x, y, z)
  reconstruction: from 2D pixels
  evaluation: Chamfer distance (CD); Earth Mover's Distance (EMD)
  software: 3D scanning, ...
  ML projects: PointNet, PointNet++, ...

Mesh
  data loading: N x (v1, v2, v3)
  reconstruction: deform from a sphere/cube; translate from mesh
  evaluation: Chamfer distance (CD); silhouette rendering; logical is-or-not
  software: SketchUp, GIS, ...
  ML projects: Pixel2Mesh, Neural Renderer

NURBS
  data loading: degree, control points, weights, parameters
  reconstruction: detect shape grammar
  evaluation: Chamfer distance (CD); silhouette rendering; logical is-or-not
  software: Rhino, ...

GeoJSON-like
  data loading: 2D geometry (long/lat, N x 2) with properties (height or more info)
  reconstruction: deform from 2D geometries
  evaluation: Chamfer distance (CD); Earth Mover's Distance (EMD)
  software: GIS

tab.4 the comparison of 3D geometry formats

4.4.1 3D voxel reconstruction

The first attempt implements the 3D-R2N2 approach, which was originally able to reconstruct ten categories of furniture from its training dataset. In this implementation of synthesizing 3D urban morphology with the 3D-R2N2 approach, the input and ground truth data have been modified to feed 3D building models into the network. The hyper-parameters of the 3D-R2N2 network have also been adjusted for 3D building models.

4.4.1.1 Data structure of the training/validating dataset

The training/validating set follows the ShapeNet structure, which contains two parts derived from 3D models: input rendering images and ground truth labels (fig.8).


fig.8 examples of the training/test dataset

The input aerial images are taken by the same 24 virtual cameras used in the network for extracting building typologies. An ArcGIS Python script exports each block with its buildings to an ESRI Shapefile (.shp); a CityEngine Python script then exports every block with 3D buildings to a mesh model (.obj). Each ground truth label is stored as voxel data converted from a mesh model via Binvox into an nd-array (32 x 32 x 32 in the original 3D-R2N2). The entire dataset includes 177 models and 4,240 rendered images, separated into training and validating datasets at a ratio of 0.8. As a multi-view 3D reconstruction, the 3D-LSTM (Long-Short Term Memory) treats the multi-view images as time-sequential data, leading to a 5D input [view_id, batch_id, channel, width, height]. The inputs for each model are picked randomly from the 24 images, a random number of times (in a range from 1 to 5). The picked images are randomly center-cropped and randomly horizontally flipped to avoid overfitting during training. The labels, as voxel data, are constructed with five dimensions [batch_id, channel of masks, x-axis, y-axis, z-axis]; the channels represent original or masked objects (entity true or false).
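The sampling and augmentation step can be sketched as follows; the crop size is an assumption:

# Pick a random subset of the 24 views, then random-crop and flip each image.
import random
import numpy as np

def sample_views(views, min_n=1, max_n=5, crop=127):
    # views: list of 24 (H, W, 3) arrays rendered around one block model.
    picked = random.sample(views, random.randint(min_n, max_n))
    out = []
    for img in picked:
        h, w = img.shape[:2]
        top = random.randint(0, h - crop)
        left = random.randint(0, w - crop)
        img = img[top:top + crop, left:left + crop]
        if random.random() < 0.5:
            img = img[:, ::-1]               # horizontal flip
        out.append(img)
    return np.stack(out)                     # (n_views, crop, crop, 3)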

4.4.1.2 Network

fig.9 3D-R2N2 network architecture [2]

As demonstrated in fig.9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN). Given the encoded input, a set of proposed 3D Convolutional LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate. At time step t, when a new input is received, the operation of an LSTM unit can be expressed in terms of the input gate, the output gate, and the forget gate, together with the memory cell and the hidden state (formula 2).

formula 2. The 3D-LSTM kernel: forget and update gates [2]
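The printed gate equations did not survive extraction; a standard convolutional-LSTM formulation consistent with [2] reads as follows, where \ast denotes the 3D convolution and \odot the element-wise product:

i_t = \sigma(W_i x_t + U_i \ast h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f \ast h_{t-1} + b_f)
o_t = \sigma(W_o x_t + U_o \ast h_{t-1} + b_o)
s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s x_t + U_s \ast h_{t-1} + b_s)
h_t = o_t \odot \tanh(s_t)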

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid. The prediction is the probability of the existence of each voxel cell, computed with a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

4.4.2 3D mesh reconstruction

A voxel model is difficult to clean and simplify into a feasible model in urban design modeling software. Also, a voxel model stores data inside closed objects, which is inefficient and becomes increasingly large after scaling. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy in the training process. Since a building typology is influenced by properties like the height limit or land-use, additional bitmaps of these properties can help our network during training. To distribute the computational workload, two machine learning networks are proposed: a) translating a 2D image of building typology to a top-view location map, and b) 3D mesh reconstruction.

4.4.2.1 Data structure of the training/validating dataset

At the parcel level, an aerial image is cropped from its parent block-level image and rescaled. Additional information like the height limit and land-use is stored as gray-scale bitmaps on the 2D top view. The ground truth labels are 3D point clouds with normals (6 dimensions in total) calculated from the mesh models (fig.10).
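A sketch of preparing one such label with trimesh; the sample count and file names are assumptions:

# Sample a 6-D ground-truth point cloud (xyz + normal) from a block .obj mesh.
import numpy as np
import trimesh

mesh = trimesh.load("block_buildings.obj", force="mesh")
points, face_idx = trimesh.sample.sample_surface(mesh, count=2048)
normals = mesh.face_normals[face_idx]     # per-sample surface normals
label = np.hstack([points, normals])      # (2048, 6)
np.savetxt("block_buildings.xyz", label)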


fig.10 examples of the training/validating dataset

4.4.2.2 Network A: translating a 2D image of building typology to a top-view location map

Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of a target parcel. As mesh reconstruction does not preserve the location of the 3D output objects, this location map is used to place the reconstruction results.

fig.11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by the 24 surrounding cameras and cropped by the bounding box of the parcel; N, the number of views per parcel, is chosen randomly in a range from one to five (fig.11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space. The feature array is merged into a column vector via Gated Recurrent Units (GRU).

formula 3. The concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as columns. The concatenated column vector (formula 3) stores the feature of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g. the height-limit bitmap). The following parts of the network are the decoder and generator of a classic GAN. In each iteration (formula 4), we have the MSE loss of the 2D height-map prediction, the least-squares reconstruction loss of a reconstructed rendering image (compared to all angles of renderings of this parcel, returning the minimum one), and the MSE losses of the additional constraints. The final combined loss is optimized by Adam with a learning rate of 0.001.

formula 4. The combined loss
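The printed formula is lost to extraction; from the description above, a consistent reconstruction of the combined loss is, with assumed weights \lambda:

L = \lambda_{hm}\,\mathrm{MSE}(H, \hat{H}) + \lambda_{re}\,\min_{k}\lVert R_k - \hat{R}\rVert_2^2 + \sum_i \lambda_{i}\,\mathrm{MSE}(C_i, \hat{C}_i)

where H is the ground-truth 2D height map, R_k the k-th rendering of the parcel, and C_i the additional constraint bitmaps.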

4.4.2.3 Network B: 3D mesh reconstruction

fig.12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig.12). The training matches the extracted images of building typologies with their ground truth 3D models (stored as point clouds with normals) via the chamfer distance loss, the normal loss, and Laplacian regularization. In Pixel2Mesh, the chamfer loss constrains the locations of the mesh vertices, the normal loss enforces the consistency of the surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.
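For reference, the chamfer term can be sketched in a few lines:

# Squared chamfer distance between two point sets, as used to supervise the
# mesh vertices against the ground-truth point cloud.
import torch

def chamfer_distance(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # p: (n, 3) predicted points; q: (m, 3) ground-truth points.
    d = torch.cdist(p, q)   # (n, m) pairwise distances
    return (d.min(dim=1).values ** 2).mean() + (d.min(dim=0).values ** 2).mean()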

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2,466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The achieved 3D mesh models are placed according to the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding the unnecessary computation on images and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of the training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential two-building parcels occupy the most), we scale the samples by land-use and building-number factors to 10,216 samples (tab.5). This prevents our network from overfitting the bias of the raw dataset.


tab.5 The list of raw data (upper) and the scaled training dataset (lower), LA sample

The input of this solution includes two elements (fig.13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial rendering images of building typologies via their bounding boxes. Additionally, in this solution, an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings. Also, 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information. The spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig.14), and smoothing (fig.15), each parcel geometry is reduced to 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code information, which can be expanded further from user input. The csv table of each spatial parcel data input thus contains the 32 coordinate values plus these property fields.

The ground-truth label is the spatial building data corresponding to the parcels. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. The property of each csv table of spatial building data holds the number of buildings in the parcel, ten height values, and ten geometry records; the table thus includes 331 values (1 + 10 + 10 x 16 x 2).

fig.13 Sample format of the training set

fig.14 (pseudo code) Interpolating and simplifying a geometry

fig.15 (pseudo code) Smoothing a geometry
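The pseudo code of fig.14 and fig.15 did not survive extraction; a sketch consistent with the description (16 equally spaced vertices after simplification, then neighbor-averaged smoothing) follows. The simplification tolerance and the averaging weights are assumptions.

# Resample a parcel/building polygon to a fixed 16 vertices (cf. fig.14) and
# smooth it by averaging each vertex with its neighbors (cf. fig.15).
import numpy as np
from shapely.geometry import Polygon

def resample(poly: Polygon, n: int = 16) -> np.ndarray:
    ring = poly.simplify(1e-7).exterior     # drop redundant vertices first
    steps = np.linspace(0.0, ring.length, n, endpoint=False)
    return np.array([ring.interpolate(d).coords[0] for d in steps])

def smooth(pts: np.ndarray, rounds: int = 1) -> np.ndarray:
    # Closed-ring smoothing: each vertex moves toward the mean of its neighbors.
    for _ in range(rounds):
        pts = (np.roll(pts, 1, axis=0) + 2.0 * pts + np.roll(pts, -1, axis=0)) / 4.0
    return pts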


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, so that it can be learned by the neural network. After analyzing the dataset, we filter the raw dataset and scale the fields by their corresponding extents, preserving all features (tab.6). In particular, we normalize each parcel and building geometry by the maximum extent of its parcel boundary (vertices), sparsifying the coordinate values.

parcel fields (inferred: longitude, latitude, area, perimeter, height limit/land-use)
  extent          -0.9980~0.9994   -0.9988~0.9989   396~34567   83~761   0~100
  output extent   -1~1             -1~1             0~1         0~1      0~1
  preserve        1.0              1.0              1.0         1.0      1.0

building fields (inferred: longitude, latitude, number of buildings, height)
  extent          -0.9928~0.9828   -0.9934~0.9982   1~9         1~985
  output extent   -1~1             -1~1             0~1         0~1
  preserve        1.0              1.0              1.0         1.0

tab.6 The conversion of the dataset (from the entire training/validating dataset)
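A sketch of the geometry normalization described above:

# Normalize parcel/building vertices by the maximum extent of the parcel
# boundary, mapping coordinates into roughly [-1, 1] (cf. tab.6).
import numpy as np

def normalize_by_parcel(coords: np.ndarray, parcel: np.ndarray) -> np.ndarray:
    # coords, parcel: (n, 2) longitude/latitude vertex arrays.
    center = (parcel.max(axis=0) + parcel.min(axis=0)) / 2.0
    extent = (parcel.max(axis=0) - parcel.min(axis=0)).max() / 2.0
    return (coords - center) / extent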


4.5.2 Network

The network (fig.16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). Firstly, we extract image features via a 3D-LSTM (Long-Short Term Memory), following the 3D-R2N2 implementation of the first approach, or a 3D-GRU (Gated Recurrent Units), as used in Pixel2Mesh. This network (ResNet as in 3D-R2N2, or VGG16, as the base model) convolves the input images (N views picked randomly per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula 2 of the first approach. Secondly, we convolve the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3), and dense (fully connect) them, together with a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula 5), we have a feature vector (MID) in the latent space. The numbers of features are inspired by the settings of 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders then predict the number of buildings, the building heights, and the building geometry as three separate outputs, with convolution layers only in the geometry prediction.

fig.16 geospatial prediction network structure


formula 5. The concatenated mid feature vector
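A condensed Keras sketch of this architecture; the feature sizes follow the text (FC1 of 1000, dense features of 1000/300/300, ten 16-vertex footprints), while the kernel sizes, the per-view encoder, and the fixed five-view input are simplifying assumptions.

# Sketch of the three-input / three-output geospatial prediction network.
from tensorflow.keras import layers, models

img_in = layers.Input(shape=(5, 127, 127, 3))           # up to 5 views, padded
feat = layers.TimeDistributed(                          # per-view image encoder
    models.Sequential([layers.Conv2D(64, 3, activation="relu"),
                       layers.GlobalAveragePooling2D(),
                       layers.Dense(1000)]))(img_in)
fc1 = layers.GRU(1000)(feat)                            # merge the view sequence

geo_in = layers.Input(shape=(16, 2))                    # parcel vertices
fc2 = layers.Dense(300)(layers.Flatten()(layers.Conv1D(32, 3)(geo_in)))

prop_in = layers.Input(shape=(3,))                      # e.g. area, perimeter, height limit
fc3 = layers.Dense(300)(prop_in)

mid = layers.concatenate([layers.Dense(1000)(fc1), fc2, fc3])   # MID vector
n_out = layers.Dense(1, name="num_buildings")(mid)
h_out = layers.Dense(10, name="heights")(mid)
g_out = layers.Reshape((10, 16, 2))(layers.Dense(320)(mid))     # ten footprints

model = models.Model([img_in, geo_in, prop_in], [n_out, h_out, g_out])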

Three loss functions (formula 6) calculate the differences between prediction and ground-truth label in terms of the number of buildings, the building heights, and the vertices of the building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel: for an instance of a two-building parcel, we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers multiple parts to let the neural network learn the geometries, including the center, relative vertices, unit tangent vector, curvature, edge length, etc. They constrain different geometry characteristics (fig.17):

Center: the location of the building footprints
Relative vertices: the shape and size of each geometry
Unit tangent vector: the rotation angle of each geometry
Discrete curvature: the corners of each geometry
Edge length: the edge lengths of each geometry
Extent x and y: the horizontal and vertical scale of the building footprints
Maximum discrete curvature: the maximum corner angle
Number of corners: the count of corners
Normalized coordinates: the shape and size of each geometry
Absolute coordinates: the shape and location of each geometry

fig.17 ten losses of building geometry prediction


The final combined loss leans more on Loss3, as its terms are the crucial parts of learning the shapes of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for a better organization of the losses. Thus all losses lie in the range of 0 to 1, and they are balanced by weight factors that follow the factors of Pixel2Mesh, adjusted by experimental results (see the next section). The terms of formula 6 are, by name:

the number of buildings (Loss1)
the 10 building heights (Loss2)
center coordinates, absolute coordinates, relative coordinates, and normalized coordinates (with rolled-vertex and batch-normalized variants)
unit tangent vector, discrete curvature, and edge length
extent x and y, maximum discrete curvature, and the number of corners
the combined loss: the weighted sum of the terms above

formula 6. Loss functions
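The masking described above (only the slots of buildings that actually exist contribute to a parcel's loss) can be sketched for Loss2 as:

# Masked height loss: only the first n_buildings of the 10 height slots count.
import tensorflow as tf

def masked_height_loss(y_true, y_pred, n_buildings):
    # y_true, y_pred: (batch, 10) height slots; n_buildings: (batch,) true counts.
    mask = tf.sequence_mask(n_buildings, maxlen=10, dtype=y_pred.dtype)
    sq = tf.square(y_true - y_pred) * mask
    return tf.reduce_sum(sq) / tf.reduce_sum(mask)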


5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 21: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

Find building typologies from reference images

Translate design language to scripts

Adjust to a site Output data as Update (after collecting decision-makersrsquo comments)

manual Find by experience

Analyze the building typology

Draw building footprints and extrude them

Static 3D meshNURBS models Static 2D drawings

Draw again from building footprints

rule-based Find by experience

Create a rule via built-in functions

Apply a rule onto parcel geometries

Generated 3D mesh models Geospatial data Rule script

Change parameters or apply onto new parcel geometries

spatial data reconstruction (ours)

Extract by computer

Use a pre-trained model

Predict spatial buildings for parcel geometries

Geospatial data Trained model

Modify or predict again from new parcel geometries

tab1 the mechanisms of manual rule-based and our approach (tab1) In contrast this thesis study explores improving replacements by a) utilizing computational algorithms to extract features from reference images parcel geometries and building geometries b) matching features (image references + parcel geometries) to features (building geometries) via neural networks instead of decision trees After creating rules intuitively designers can apply rules to their site enjoying the same advantages of rule-based systems in the following stages achieving output data as geospatial data and updating geometries as groups 42 Dataset To allow computer learning 3D building typologies a collection of 3D building models is necessary Our raw data are collected from open-source datasets (OpenStreetMap) and the city public data warehouse (tab2) Two raw datasets are required in the series machine learning pipelines a) parcel geometry with planning properties (eg land-use code height-limits) and b) building footprint geometry with height information They will be augmented by scripts and prepared for different machine learning pipelines (see in corresponding sections)

21

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel
  Data loading: N x N x N x 1 (mass/void)
  Reconstruction: from 2D pixels
  Evaluation: logical is-or-not; Intersection over Union (IoU)
  Software: Minecraft…
  ML projects: 3D-R2N2, MarrNet…

Point cloud
  Data loading: N x 3 (x, y, z)
  Reconstruction: from 2D pixels
  Evaluation: Chamfer distance (CD); Earth Mover's Distance (EMD)
  Software: 3D scanning…
  ML projects: PointNet, PointNet++…

Mesh
  Data loading: N x (v1, v2, v3)
  Reconstruction: deform from a sphere/cube
  Evaluation: Chamfer distance (CD); silhouette rendering; logical is-or-not
  Software: SketchUp, GIS…
  ML projects: Pixel2Mesh, Neural Renderer

Nurbs
  Data loading: degree, control points, weights, params
  Reconstruction: translate from mesh; detect shape grammar
  Evaluation: Chamfer distance (CD); silhouette rendering; logical is-or-not
  Software: Rhino…

GeoJSON-like
  Data loading: 2D geometry (long/lat, N x 2) with properties (height or more info)
  Reconstruction: deform from 2D geometries
  Evaluation: Chamfer distance (CD); Earth Mover's Distance (EMD)
  Software: GIS

tab4 the comparison of 3D geometry formats

4.4.1 3D voxel reconstruction

The first attempt implements the 3D-R2N2 approach, which was originally designed to reconstruct the ten categories of furniture in its training dataset. To synthesize 3D urban morphology with 3D-R2N2, the input and ground-truth data have been modified so that 3D building models can be fed into the network, and the hyper-parameters of the network have been adjusted for 3D building models.

4.4.1.1 Data structure of training/validating dataset

The training/validating set follows the ShapeNet structure, containing two parts derived from the 3D models: input rendering images and ground-truth labels (fig8).


fig8 examples of training/test dataset

The input aerial images are taken by the same 24 virtual cameras used in the network for extracting building typologies. An ArcGIS Python script exports each block with the buildings on it to an ESRI Shapefile (.shp). A CityEngine Python script then exports every block with its 3D buildings to a mesh model (.obj). Each ground-truth label is stored as voxel data, converted from a mesh model via Binvox into an nd-array. The entire dataset includes 177 models and 4,240 rendered images, separated into training and validating datasets by a ratio of 0.8.

As a multi-view 3D reconstruction network, the 3D-LSTM (Long Short-Term Memory) treats the multi-view images as time-sequential data, leading to a 5D input [view_id, batch_id, channel, width, height]. The inputs for each model are randomly picked from its 24 images a random number of times (in a range from 1 to 5). The picked images are randomly center-cropped and randomly horizontally flipped to avoid overfitting during training. The labels, as voxel data, are constructed with five dimensions [batch_id, the channel of masks, x-axis, y-axis, z-axis], where the channels represent original or masked objects (entity true or false).

4.4.1.2 Network
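For illustration, a minimal sketch (NumPy-based; crop sizes and padding strategy are assumptions, not the thesis code) of assembling the 5D multi-view batch described above:

import random
import numpy as np

def augment(img, size=127):
    # random crop around the center, then a random horizontal flip
    h, w, _ = img.shape
    top = min(max((h - size) // 2 + random.randint(-8, 8), 0), h - size)
    left = min(max((w - size) // 2 + random.randint(-8, 8), 0), w - size)
    crop = img[top:top + size, left:left + size]
    return crop[:, ::-1] if random.random() < 0.5 else crop

def make_view_batch(models, max_views=5, size=127):
    """models: list per block of its 24 rendered views (H x W x 3 arrays)."""
    batch = []
    for views in models:
        n = random.randint(1, max_views)               # pick 1-5 views
        picked = [augment(v, size) for v in random.sample(views, n)]
        while len(picked) < max_views:                 # pad to a fixed length
            picked.append(picked[-1])
        batch.append(np.stack(picked))                 # [view, H, W, C]
    x = np.stack(batch)                                # [batch, view, H, W, C]
    return np.transpose(x, (1, 0, 4, 2, 3))            # [view, batch, C, H, W]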

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9, 3D-R2N2 is composed of three parts: a 2D Convolutional Neural Network (encoder), a 3D Convolutional LSTM, and a 3D Deconvolutional Neural Network (3D-DCNN). Given the encoded input, a set of proposed 3D Convolutional LSTM (3D-LSTM) units either update their cell states or retain them by closing the input gate. At time step t, when a new input is received, the operation of an LSTM unit can be expressed as in formula2, where $i_t$, $o_t$, and $f_t$ refer to the input gate, the output gate, and the forget gate respectively, and $s_t$ and $h_t$ refer to the memory cell and the hidden state respectively (formula2).

formula2 3D-LSTM kernel forget and update gate [2]
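For reference, the standard LSTM gate updates with input $x_t$ (a sketch; [2] adapts these by replacing the matrix products with 3D convolutions over the voxel grid) can be written as:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$s_t = f_t \odot s_{t-1} + i_t \odot \tanh(W_s x_t + U_s h_{t-1} + b_s)$$
$$h_t = o_t \odot \tanh(s_t)$$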

Finally, the 3D-DCNN decodes the output of the 3D-LSTM units and generates a 3D probabilistic voxel grid, where the prediction at each cell is the probability of its existence, computed with a voxel-wise softmax. The voxel data can be visualized as a 3D heatmap, or as a solid 3D model by setting a threshold.

4.4.2 3D mesh reconstruction

A voxel model is difficult to clean and simplify into a feasible model in urban design modeling software. A voxel model also stores data inside closed objects, which is inefficient and becomes increasingly large as the scale grows. Therefore, the next exploration focuses on a mesh format compatible with design modeling software. As inspired by MarrNet, additional features can improve accuracy during training: since a building typology is influenced by properties like height limit or land-use, additional bitmaps of these properties can help the network learn. To distribute the computational workload, two machine learning networks are proposed: a) translating a 2D image of a building typology to a top-view location map, and b) 3D mesh reconstruction.

4.4.2.1 Data structure of training/validating dataset

At the parcel level, an aerial image is cropped from its parent block-level image and rescaled. Additional information like height limit and land-use is stored as gray-scale images on the 2D top view. Ground-truth labels are 3D point clouds with normals (6 dimensions in total) calculated from mesh models (fig10).


fig10 examples of training/validating dataset

4.4.2.2 Network A: translating a 2D image of building typology to a top-view location map

Network A takes multiple inputs, including extracted building typology images, a 2D parcel shape image, and additional bitmaps (like height limit and land-use), to predict a 2D building location map that fits the shape of the target parcel. As mesh reconstruction does not preserve the location of the 3D output objects, this location map is used to place the reconstruction results.

fig11 multi-task GAN structure


The input images are a sequence of multi-view aerial renderings of a single parcel, taken by the 24 surrounding cameras and cropped by the bounding box of the parcel; N, the number of views per parcel, is picked randomly in a range from one to five (fig11). This image data is encoded by a ResNet50 model, producing a feature vector in the latent space, and the feature array is merged into a column vector via Gated Recurrent Units (GRU).

formula3 the concatenated mid feature vector

Thus, we obtain a feature vector after learning a sequence of renderings of a single parcel. This feature vector is decoded multiple times, compared with the additional constraint bitmaps, and encoded again as feature columns. The concatenated column vector (formula3) stores the feature of the building typology, the shape of its parent parcel (from the parcel mask), and the additional constraints (e.g. the height-limit bitmap). The following parts of the network are the decoder and generator of a classic GAN. In each iteration (formula4), we have the MSE loss of the 2D height-map prediction, the least-squares reconstruction loss of a reconstructed rendering image (compared with all angles of renderings of this parcel, returning the minimum), and the MSE losses of the additional constraints. The final combined loss is optimized by Adam with a learning rate of 0.001.

formula4 combined loss
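For illustration, a minimal Keras/TF sketch (tensor shapes are assumptions, not the thesis code) of the minimum-over-views reconstruction loss described above:

import tensorflow as tf

def min_view_reconstruction_loss(generated, all_views):
    """generated: [batch, H, W, C] reconstructed rendering.
    all_views: [batch, 24, H, W, C] ground-truth renderings of the parcel."""
    g = tf.expand_dims(generated, axis=1)                  # [batch, 1, H, W, C]
    per_view = tf.reduce_mean(tf.square(all_views - g),    # least-squares error
                              axis=[2, 3, 4])              # [batch, 24]
    return tf.reduce_mean(tf.reduce_min(per_view, axis=1)) # best-matching view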

4.4.2.3 Network B: 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12). The training matches the extracted images of building typologies with their ground-truth 3D models (stored as point clouds with normals) via the chamfer distance loss, normal loss, and Laplacian regularization. In Pixel2Mesh, the chamfer loss constrains the locations of the mesh vertices, the normal loss enforces the consistency of surface normals, the Laplacian regularization maintains the relative locations between neighboring vertices during deformation, and an edge-length regularization prevents outliers. These losses are applied with equal weight on both the intermediate and final meshes.
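For reference, the chamfer distance between the predicted vertex set $P$ and the ground-truth point set $Q$, as used in [7], is:

$$L_{cd} = \sum_{p \in P} \min_{q \in Q} \lVert p - q \rVert_2^2 + \sum_{q \in Q} \min_{p \in P} \lVert q - p \rVert_2^2$$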

For each Graph Convolutional Network (GCN) in Pixel2Mesh, the basic ellipsoid is deformed by the product of its control points and vertex-translation matrices. The number of control points is un-pooled from 156 to 2466. The network takes the input images and an initial ellipsoid with 156 vertices and 462 edges. The resulting 3D mesh models are placed by the location maps predicted by Network A.

4.5 Spatial data prediction

In the second approach, predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too many computational resources (CPU, GPU, RAM). Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches, avoiding unnecessary computation on images and mesh data. In this approach, we adapt the concepts of dataset augmentation, the 3D-LSTM, and the loss functions from the previous two approaches to urban design usage.

4.5.1 Data structure of training/validating dataset

The training/validating datasets are made by QGIS, Blender, and OpenCV scripts extracting geospatial data from the raw data. Since the raw dataset is biased with respect to land-use type and the number of buildings (single-family residential parcels with two buildings occupy the most), we rescale it by land-use and building-number factors to 10,216 samples (tab5). This prevents our network from overfitting the bias of the raw dataset.


tab5 The list of raw data (upper) and scaled training dataset (lower), LA sample
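For illustration, a minimal pandas sketch (column names are hypothetical, not the thesis code) of rebalancing the raw parcels by land-use and building count before training:

import pandas as pd

def rebalance(parcels: pd.DataFrame, per_group: int) -> pd.DataFrame:
    """parcels needs 'land_use' and 'n_buildings' columns (assumed names)."""
    groups = parcels.groupby(["land_use", "n_buildings"])
    scaled = [g.sample(n=per_group, replace=len(g) < per_group)  # over/undersample
              for _, g in groups]
    return pd.concat(scaled).sample(frac=1.0)  # shuffle the merged samples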

The input of this solution includes two elements (fig13): 2D aerial rendering images and spatial parcel data. As in the previous approaches, the images are extracted from the 24 aerial renderings of building typologies via their bounding boxes. Additionally, an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings, and 2D top-view images augmented in the styles of shadow, wireframe, fill, and mass-void are collected as another dataset to compare the learning results between 2D and 3D input information.

Spatial parcel data are csv files with geometries and properties. By interpolating, simplifying (fig14), and smoothing (fig15), each parcel geometry is reduced to 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry in the raw dataset). In its properties, each parcel includes area, perimeter, height limit, and land-use code, which can be expanded further from user input. The csv table of each spatial parcel data input stores these geometries and properties.


The ground-truth label is the spatial building data corresponding to each parcel. As with a parcel geometry, each building footprint geometry is converted to 16 vertices of longitude and latitude. Each csv table of spatial building data stores, as properties, the number of buildings in the parcel, ten height values, and ten geometry records; the table thus includes 331 values (1 + 10 + 10 x 16 x 2).

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry
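For illustration, a minimal NumPy sketch of one possible interpolation and smoothing routine in the spirit of fig14 and fig15 (parameter choices are assumptions, not the thesis pseudo code):

import numpy as np

def resample(poly, n=16):
    """poly: [m, 2] array of (lon, lat) ring vertices; returns n evenly spaced vertices."""
    ring = np.vstack([poly, poly[:1]])                  # close the ring
    seg = np.linalg.norm(np.diff(ring, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)])         # arc length at each vertex
    s = np.linspace(0.0, t[-1], n, endpoint=False)      # n evenly spaced stops
    x = np.interp(s, t, ring[:, 0])
    y = np.interp(s, t, ring[:, 1])
    return np.stack([x, y], axis=1)

def smooth(poly, iterations=1, alpha=0.5):
    """Laplacian-style smoothing: move each vertex toward its neighbors' mean."""
    p = poly.copy()
    for _ in range(iterations):
        nbr = (np.roll(p, 1, axis=0) + np.roll(p, -1, axis=0)) / 2.0
        p = (1 - alpha) * p + alpha * nbr
    return p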


Before feeding the dataset into the network, we need to scale the data from the urban scale to the [-1, 1] or [0, 1] interval, which can be learned by neural networks. After analyzing the dataset, we filter the raw dataset and scale each field by its corresponding extent, preserving all features (tab6). In particular, we normalize each parcel and building geometry by the max extent of its parcel boundary vertices, which spreads out the coordinate values.

Parcel fields
  extent: -9980~9994 | -9988~9989 | 396~34567 | 83~761 | 0~100
  Output extent: -1~1 | -1~1 | 0~1 | 0~1 | 0~1
  Preserve: 1.0 | 1.0 | 1.0 | 1.0 | 1.0

Building fields
  extent: -9928~9828 | -9934~9982 | 1~9 | 1~985
  Output extent: -1~1 | -1~1 | 0~1 | 0~1
  Preserve: 1.0 | 1.0 | 1.0 | 1.0

tab6 The conversion of the dataset (from the entire training/validating dataset)
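For illustration, a minimal sketch (assumed array layout, not the thesis code) of the per-parcel normalization described above, mapping a parcel and its buildings into [-1, 1]:

import numpy as np

def normalize_by_parcel(parcel_xy, building_xy):
    """parcel_xy: [16, 2] parcel ring; building_xy: [k, 16, 2] footprints."""
    center = (parcel_xy.max(axis=0) + parcel_xy.min(axis=0)) / 2.0
    half_extent = (parcel_xy.max(axis=0) - parcel_xy.min(axis=0)).max() / 2.0
    norm = lambda xy: (xy - center) / half_extent   # shared transform keeps
    return norm(parcel_xy), norm(building_xy)       # parcel and buildings aligned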


4.5.2 Network

The network (fig16) takes three inputs (images, parcel geometry data, parcel properties) and predicts three outputs (number of buildings, building heights, building geometry data). First, we extract image features via a 3D-LSTM (Long Short-Term Memory), following the 3D-R2N2 implementation of the first approach, or a 3D-GRU (Gated Recurrent Units) as used in Pixel2Mesh. This network (with ResNet, as in 3D-R2N2, or VGG16 as the base model) convolves the input images (N views picked randomly per parcel, with N in a range from 1 to 5) into a time-sequential feature (FC1_R) and uses the LSTM/GRU to merge them into a single column vector (FC1). The input gate, the output gate, the forget gate, the memory cell, and the hidden state are the same as in formula2 of the first approach. Second, we convolve the input parcel geometry and fully connect the parcel property data into two feature vectors (FC2, FC3), and densely connect them, as well as a dense feature of the image feature (FC1). After concatenating these dense features (DC1, DC2, DC3) together (formula5), we have a feature vector (MID) in the latent space. The number of features is inspired by the settings of 3D-R2N2 (1024) and Pixel2Mesh (1200). Three decoders predict the number of buildings, the building heights, and the building geometry as three separate outputs, with a convolution layer only in the geometry prediction.

fig16 geospatial prediction network structure
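For illustration, a minimal Keras sketch of this three-input/three-output structure (layer sizes and the pooling/merge details are assumptions, not the exact thesis network):

import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_VIEWS, H, W = 5, 127, 127
N_PROP = 4                                          # area, perimeter, height limit, land-use

views = layers.Input((MAX_VIEWS, H, W, 3))          # image sequence
geom = layers.Input((16, 2))                        # parcel geometry
prop = layers.Input((N_PROP,))                      # parcel properties

# image branch: shared VGG16 over each view, merged by a GRU (FC1)
vgg = tf.keras.applications.VGG16(include_top=False, weights=None,
                                  input_shape=(H, W, 3))
seq = layers.TimeDistributed(layers.GlobalAveragePooling2D())(
    layers.TimeDistributed(vgg)(views))             # FC1_R: [batch, views, feat]
fc1 = layers.GRU(1024)(seq)                         # FC1

# geometry branch (convolution) and property branch (fully connected)
fc2 = layers.Flatten()(layers.Conv1D(64, 3, activation="relu")(geom))
fc3 = layers.Dense(64, activation="relu")(prop)

# dense each branch, then concatenate into the latent MID vector (formula5)
mid = layers.Concatenate()([layers.Dense(1024, activation="relu")(fc1),
                            layers.Dense(128, activation="relu")(fc2),
                            layers.Dense(64, activation="relu")(fc3)])

# three decoders: building count, ten heights, ten 16-vertex footprints
n_out = layers.Dense(1, name="n_buildings")(mid)
h_out = layers.Dense(10, name="heights")(mid)
g = layers.Reshape((10 * 16, 2))(layers.Dense(10 * 16 * 2)(mid))
g_out = layers.Conv1D(2, 3, padding="same", name="geometry")(g)

model = Model([views, geom, prop], [n_out, h_out, g_out])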


formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth labels in terms of the number of buildings, the building heights, and the vertices of the building footprints. Notably, we only consider the available values in Loss2 and Loss3, since the number of buildings varies per parcel: for a two-building parcel, we only calculate the first two of the ten height values in Loss2 and the first 64 coordinate values in Loss3. Loss3 considers multiple parts to let the neural network learn the geometries, including the center, relative vertices, unit tangent vector, curvature, and edge length. They constrain different geometry characteristics (fig17):

Center: the location of building footprints
Relative vertices: the shape and size of each geometry
Unit tangent vector: the rotation angle of each geometry
Discrete curvature: the corners of each geometry
Edge length: the edge length of each geometry
Extent x and y: the horizontal and vertical scale of building footprints
Maximum discrete curvature: the maximum angle of corners
Number of corners: the corners of each geometry
Normalized coordinates: the shape and size of each geometry
Absolute coordinates: the shape and location of each geometry

fig17 ten losses of building geometry prediction
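For illustration, a minimal Keras/TF sketch (tensor layout is assumed) of masking Loss2 so that only the first k building slots of each parcel contribute; Loss3 masks its coordinate values in the same way:

import tensorflow as tf

def masked_height_loss(y_true, y_pred, n_buildings):
    """y_true, y_pred: [batch, 10] heights; n_buildings: [batch] integer counts."""
    mask = tf.sequence_mask(n_buildings, maxlen=10, dtype=y_pred.dtype)
    err = tf.square(y_true - y_pred) * mask          # zero out unused slots
    return tf.reduce_sum(err) / tf.reduce_sum(mask)  # mean over valid slots only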


The final combined loss leans more toward Loss3, as its terms are the crucial parts of learning the shape of building footprints. Since the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for better loss organization. Thus, all losses lie in the range of 0 to 1, and they are balanced by weight factors that follow the factors in Pixel2Mesh and are adjusted by experimental results (see the next section). The terms of formula6 are: the number of buildings; the 10 building heights; the center coordinates; the absolute coordinates; the relative coordinates; the normalized coordinates (with rolled vertices and a batch-normalized variant); the unit tangent vector; the discrete curvature; the edge length; the extent x and y; the maximum discrete curvature; the number of corners; and the combined loss.

formula6 Loss functions
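Schematically (a sketch, with weight factors $w$ as set in formula7), the combined objective takes the form

$$L = w_1 L_{\mathrm{count}} + w_2 L_{\mathrm{height}} + \sum_{i} w_{3,i} L_{3,i}$$

where $L_{3,i}$ ranges over the geometry losses listed above.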


5 RESULT AND EVALUATION

As the final solution of this thesis study, the third approach, spatial data reconstruction, is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence, its evaluations are tested by modifying design elements.

5.1 Extracting building typologies

The first Mask R-CNN implementation is trained for 80 epochs, achieving a 0.11 bounding-box loss and 0.41 mask loss on the training set, and a 0.15 bounding-box loss and 0.40 mask loss on the validating set (fig18). As we can see, the network does not perform well in mask prediction, but performs well in bounding-box prediction (fig19).

fig18 training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental result

As shown in fig20, our final solution predicts the cleanest and most usable outputs among the resulting 3D models. The voxel output consists of ambiguous probability models that cannot be cleaned or simplified without introducing algorithmic bias. Voxel models also become heavier in storage and computation as their scale increases, analogous to the size comparison between vectorized and rasterized files. Moreover, voxel models are rare in urban design practice because of their poor editability. However, we can see the image-learning capacity from the different results predicted from high-rise and low-rise reference images.

Similarly to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network intended to predict the location map generates similar results from various inputs. Furthermore, the network cannot predict the angle of a building footprint well, producing unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they contain too many trivial mesh faces for most urban design use cases. Linking these two networks (location-map prediction and 3D mesh reconstruction) makes it hard to avoid the bias of a shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________


Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches

5.3 Other evaluations

Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, the factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

We also test the ablation effect of the various loss functions (tab5). The baseline test uses VGG16-GRU with all loss functions, training 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.

loss (weight) | Baseline | 3D-R2N2-GRU | VGG16-LSTM | isolated dataset | Top-view | w/o relative distance | w/o unit tangent vector | w/o curvature | w/o center | w/o edge length
loss (combined) | 0.8842 | 0.9645 | 0.8745 | 0.8351 | 0.9704 | 0.8353 | 0.7166 | 0.7189 | 0.4403 | 0.8416
loss 1 (0.5) | 0.0601 | 0.0665 | 0.0598 | 0.0473 | 0.1226 | 0.0638 | 0.0583 | 0.0577 | 0.0489 | 0.0621
loss 2 (0.1) | 0.0658 | 0.0636 | 0.0658 | 0.0507 | 0.0949 | 0.0652 | 0.0647 | 0.0652 | 0.0584 | 0.066
loss 3A (1) | 0.0518 | 0.0519 | 0.0514 | 0.0527 | 0.0324 | 0.0589 | 0.0525 | 0.0478 | 0.0498 | 0.05
loss 3B (1) | 0.1796 | 0.1793 | 0.1794 | 0.184 | 0.0947 | 0.1795 | 0.1947 | 0.1839 | 0.1754 | 0.1801
loss 3C (1) | 0.1623 | 0.1614 | 0.1618 | 0.1632 | 0.1381 | 0.1616 | 0.1744 | 0.2392 | 0.1602 | 0.1636
loss 3D (10) | 0.0433 | 0.0514 | 0.0437 | 0.0371 | 0.0592 | 0.0439 | 0.0433 | 0.0432 | 0.0598 | 0.0435
loss 3E (1) | 0.0445 | 0.045 | 0.0448 | 0.0408 | 0.0428 | 0.0427 | 0.044 | 0.0424 | 0.0441 | 0.0514

tab5 losses of ablation test

Learning from the isolated dataset with VGG16-GRU shows the best results among the reference groups, and the losses of the geometries demonstrate their relevant effects on one another. The 2D top-view dataset is only slightly worse in heights/numbers but better in the shape losses. In tab6, we can also see the constraint of shape, corner angle, and position in the different ablations of the loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after T-SNE dimension reduction, representing the learning capacities for reference images and geometries that cannot simply be determined by loss values. The isolated dataset still shows the best distribution visually.

fig22 T-SNE mapping of reference images' encoded features (DC1)


fig23 T-SNE mapping of parcel geometries' encoded features (DC2)

fig24 T-SNE mapping of reference images' encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images' encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29, we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after encoding as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and a better size/number of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable

After training and optimization, the final solution can serve two functions that integrate rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to obtain a GeoJSON-like csv file containing the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g. height, height limit, FAR, etc.), users can run a new training and validating process on their own computers. By using the pre-trained model provided by my study, or a customized one, users can input both geospatial geometries (e.g. one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries with height information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g. Shapefile, TopoJSON, GeoJSON) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized in most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry, as a shape grammar does separately.

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does
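For illustration, a minimal sketch of converting a predicted row into GeoJSON with extrusion heights (the csv layout is assumed as: count, ten heights, then 10 x 16 longitude/latitude pairs; file names are hypothetical):

import csv, json

def row_to_features(row):
    n = int(float(row[0]))                            # number of buildings
    heights = [float(v) for v in row[1:11]]           # ten height slots
    coords = [float(v) for v in row[11:]]             # 320 coordinate values
    features = []
    for i in range(n):
        ring = [[coords[i * 32 + 2 * j], coords[i * 32 + 2 * j + 1]]
                for j in range(16)]
        ring.append(ring[0])                          # close the polygon ring
        features.append({"type": "Feature",
                         "properties": {"height": heights[i]},  # z-value
                         "geometry": {"type": "Polygon", "coordinates": [ring]}})
    return features

with open("prediction.csv") as f:                     # hypothetical file name
    rows = list(csv.reader(f))
feats = [ft for r in rows for ft in row_to_features(r)]
json.dump({"type": "FeatureCollection", "features": feats},
          open("prediction.geojson", "w"))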


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). This thesis thereby addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also follow from the study.

6.2.1 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language into a script as a building typology. Our approach is an option that allows users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, liberating the possibility of creating unique urban morphologies.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g. FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data, directly producing a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking those properties to building typologies.

6.2.3 First-time time consumption

As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology to a site. Our approach breaks this limit and brings a better experience to users from the very beginning of visualizing urban morphologies.

6.2.4 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach provides a bigger opportunity, delivered to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning or design. It can visualize decisions, strategies, and ideas on a site without the gap period of the manual urban design or rule-based workflow. It not only saves time but also makes decision-making more fluent, through the instant comparison of two strategies or the accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and Tensorflow (Keras), without a dependency on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the cost of licenses and of experimentation. This study also compares the differences between existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a lot of computational resources, as a comparison of batch size (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future works.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create mass training datasets with various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof-type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It could contribute to predicting the relationship between buildings and auxiliary structures, which means the possibility of predicting detailed building outputs. The Mask R-CNN could also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 22: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

city source (parcelbuilding)

Parcel count

Parcel properties

Building count

Building properties

Los Angeles

SCAG_county_zoning Lariac 2008 building footprint

2376370 Land-use zoning height limit

3141244 Height elevation

tab2 The list of raw data 43 Extracting building typologies Due to that our resulting ldquorulerdquo is purposed to generate building typologies from parcel geometries 3D reconstruction pipelines should process at the parcel level Hence this first network performs before 3D reconstruction extracting 2D aerial images of building typologies from input images of urban morphology (at the block level) The trainingvalidating data is augmented from the raw spatial dataset Given 2D geometries of building footprints with height information we extract them via Blender script and achieve a 3D model file (obj) of 3D building envelopes on each block The input images are RGB images taken by 24 virtual cameras surrounding each 3D model via Blender script These images are converted as binary images by masking each parcel of this block The multiple binary images are stacked as a multi-channel image for each view Finally a block model generates a) input data 24 views of aerial images ( ) and b) ground truth label 24 multi-channel mask images ( ) where N is the number of parcels in this block (formula1) Randomly center cropped and randomly horizontal flipped will serve during data loading to avoid overfitting The augmented dataset separates into training and validating data by a ratio of 08

formula1 the ground-truth label as a multi-channel image N is the

number of parcels in a block

After loading the dataset a Mask R-CNN network uses ResNet 101 as the base model predicting masks as output from input images (fig7) This part of the pipeline is the implementation of Hersquos Mask R-CNN network [14] including Region Proposal Network (RPN) and predictions (classification box and binary mask prediction) For each Region of Interest (RoI) the loss function is constructed by the loss of classification bounding-box and mask The output extracted parcel rendering images will serve as a part of input in the following 3D reconstruction approaches

22

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

622 Data-driven capacity Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data, directly producing a matching rule from given data to an outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

623 First-time cost As a part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of rule-based systems. It reduces the speed advantage of rule-based approaches when applying a building typology to a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

624 Skill requirement As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply describe their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


625 Other contributions The application of our approach can also bring rapid decision-making iterations to urban planning or design. It is capable of visualizing decisions, strategies, and ideas on a site without the gap period of manual urban design or a rule-based system. It not only saves time but also makes decision-making more fluent, enabling an instant comparison of two strategies or an accurate translation of decisions from clients/users.

Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and TensorFlow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and experimentation. This study also compares the differences between existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies.

The final solution also saves a great deal of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future work.

63 Future works With the fundamental pipeline built in this study, we can imagine the future steps for improving relevant AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented to more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create massive training datasets with various styles, feeding these future experiments. Secondly, the machine learning pipeline can also learn roof and facade types (e.g., curtain wall or brick wall) as a part of building properties (see the roof-type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationships between buildings and auxiliary structures, which means the possibility of predicting a detailed building output. Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website: https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." http://hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI: https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page: https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 23: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

fig7 The pipeline of Mask R-CNN [14]

44 3D reconstruction In this thesis study three reconstruction approaches had been attempted on different types of inputs and ground-truth labels The first two approaches are developed on general 3D construction and serve as references The comparisons of the approaches (tab3) and 3D model formats (tab4) are shown below While voxel and mesh models are popular in general machine learning studies our final solution is built only for urban planning cases absorbing the corresponding techniques from the first two approaches This GeoJSON-like data (stored as geometries and properties) can be converted losslessly and used in most GIS platforms Approach Input Ground truth label Software platform

for data processing

1 3D voxel reconstruction

Aerial image of building typologies(png)

3D Building model (Voxel mat) ArcGIS Pro CityEngine Binvox PyTorch

2 3D mesh reconstruction

Aerial image of building typologies(png) 2D bitmap(png)

3D Building model (Point cloud xyz)

ArcGIS Pro Blender Tensorflow(Keras)

3 Geospatial data prediction

Aerial image of building typologies(png) Parcel data (csv)

Building data (csv) QGIS Blender Tensorflow(Keras)

tab3 data structure of three approaches

23

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 24: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

Voxel Point cloud Mesh Nurbs GeoJSON-like

Data loading N x N x N x 1 (massvoid)

N x 3 (x y z) N x (v1 v2 v3)

Degreecontrol ptsweightsparams

2D Geometry (longlat N x 2) Property(with height or more info)

Reconstruction from 2D pixel from 2D pixel Deform from a spherecube

Translate from Mesh Detect shape grammar

Deform from 2D geometries

Evaluation Logical is or Not Intersection of Union (IoU)

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Silhouette Rendering Logical is or Not

Chamfer distance (CD) Earth Moverrsquos Distance (EMD)

Software Minecrafthellip 3D scanninghellip

SketchUpGIShellip

Rhinohellip GIS

ML project 3D-R2N2 MarrNethellip

PointNetPointNet++hellip

Pixel2Mesh Neural Renderer

tab4 the comparison of 3D geometry formats 441 3D voxel reconstruction The first attempt implements the 3D-R2N2 approach which is originally able to 3D reconstruct ten categories of furniture in the training dataset In the implementation of synthesizing 3D urban morphology on the 3D-R2N2 approach input and ground truth data have been modified for feeding 3D building models into the network Also the hyper-parameter of the 3D-R2N2 network has been adjusted for 3D building models 4411 Data structure of trainingvalidating dataset The trainingvalidating set is following the ShapeNet structure which contains two parts derived from 3D models input rendering images and ground truth labels (fig8)

24

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________


Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches

53 Other evaluations

Recalling the combined loss function, we have seven factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, the factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions
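
A hedged Keras sketch of this training setup follows. Only the Adam learning rate (1e-4), the formula7 loss weights, and the 1 + 10 + 320 output split (the 331-value spatial building table) are taken from the text; the head names, the MID width of 1324, and the MSE/MAE loss choices are assumptions:

import tensorflow as tf

# Hypothetical stand-in for the fig16 network: the image/geometry encoders
# are omitted and replaced by a single MID input whose width is assumed.
mid = tf.keras.Input(shape=(1324,), name="mid")
num = tf.keras.layers.Dense(1, name="num")(mid)              # number of buildings
heights = tf.keras.layers.Dense(10, name="heights")(mid)     # ten height slots
geometry = tf.keras.layers.Dense(320, name="geometry")(mid)  # 10 x 16 x 2 coords

model = tf.keras.Model(mid, [num, heights, geometry])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),   # as in the text
    loss={"num": "mse", "heights": "mse", "geometry": "mae"}, # assumed forms
    loss_weights={"num": 0.5, "heights": 0.1, "geometry": 1.0},  # formula7
)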


fig21 Loss plotting of the final solution

Also, we test the ablation effects of the various loss functions (see tab5). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.

Columns: (1) Baseline, (2) 3D-R2N2-GRU, (3) VGG16-LSTM, (4) isolated dataset, (5) Top-view; (6)-(10) without the loss function of relative distance, unit tangent vector, curvature, center, and edge length, respectively.

loss (weight)    (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)
loss (combined)  0.8842  0.9645  0.8745  0.8351  0.9704  0.8353  0.7166  0.7189  0.4403  0.8416
loss 1 (0.5)     0.0601  0.0665  0.0598  0.0473  0.1226  0.0638  0.0583  0.0577  0.0489  0.0621
loss 2 (0.1)     0.0658  0.0636  0.0658  0.0507  0.0949  0.0652  0.0647  0.0652  0.0584  0.0660
loss 3A (1)      0.0518  0.0519  0.0514  0.0527  0.0324  0.0589  0.0525  0.0478  0.0498  0.0500
loss 3B (1)      0.1796  0.1793  0.1794  0.1840  0.0947  0.1795  0.1947  0.1839  0.1754  0.1801
loss 3C (1)      0.1623  0.1614  0.1618  0.1632  0.1381  0.1616  0.1744  0.2392  0.1602  0.1636
loss 3D (10)     0.0433  0.0514  0.0437  0.0371  0.0592  0.0439  0.0433  0.0432  0.0598  0.0435
loss 3E (1)      0.0445  0.0450  0.0448  0.0408  0.0428  0.0427  0.0440  0.0424  0.0441  0.0514

tab5 losses of the ablation test

Training on the isolated dataset with VGG16-GRU performs best among the reference groups, and the losses of the geometry terms demonstrate their effects on each other. The 2D top-view dataset performs only slightly worse on heights and building numbers, but better on the shape losses. In tab6 we can also see the constraints of shape, corner, angle, and position under the different ablations of the loss functions.

tab6 spatial building predictions of ablation test

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after t-SNE dimension reduction, representing the learning capacities of reference images and geometries, which cannot simply be judged by loss values. The isolated dataset still shows the best distribution visually.
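
Plots like fig22 to fig25 can be reproduced with scikit-learn's t-SNE; this is a minimal sketch, and the dumped feature file name and shape are hypothetical:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

dc1 = np.load("dc1_features.npy")   # hypothetical dump, shape (num_parcels, dim)
xy = TSNE(n_components=2, perplexity=30).fit_transform(dc1)  # reduce to 2D
plt.scatter(xy[:, 0], xy[:, 1], s=4)
plt.title("t-SNE of DC1 encoded features")
plt.show()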

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)


fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after the inputs are encoded as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate predictions between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.
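
The interpolations in fig26 to fig29 amount to blending two encoded latent vectors before decoding. A minimal sketch follows, where encode_image, encode_parcel, and decode are hypothetical handles to the trained sub-networks; only the 1000- and 300-dimensional feature sizes come from the text:

import numpy as np

z_a = encode_image(reference_images_a)     # hypothetical encoder, 1000-d vector
z_b = encode_image(reference_images_b)
z_parcel = encode_parcel(parcel_geometry)  # hypothetical encoder, 300-d vector

for t in np.linspace(0.0, 1.0, num=5):
    z_blend = (1.0 - t) * z_a + t * z_b                       # linear latent blend
    prediction = decode(np.concatenate([z_blend, z_parcel]))  # footprints, heights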


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-uses of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained from the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and better sizes and numbers of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

61 Deliverable

After training and optimizing, the final solution can serve two functions that integrate rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to achieve a GeoJSON-like CSV file including the information of 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR), users can run a new training and validating process on their own computers. By using the pre-trained model provided by my study or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, achieving the geospatial building geometries as an information-rich CSV file. This CSV file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, TopoJSON, GeoJSON) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized in most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry just as the rule in a rule-based pipeline does, but learned separately.

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does
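
As a minimal sketch of the lossless conversion described above (column names such as num_buildings, height_i, and geometry_i are assumptions about the CSV layout, not the thesis's exact schema):

import csv
import json

features = []
with open("prediction.csv") as f:
    for row in csv.DictReader(f):
        for i in range(int(float(row["num_buildings"]))):
            ring = json.loads(row[f"geometry_{i}"])       # 16 [lon, lat] vertices
            features.append({
                "type": "Feature",
                "geometry": {"type": "Polygon",
                             "coordinates": [ring + ring[:1]]},  # close the ring
                "properties": {"height": float(row[f"height_{i}"])},  # z-value
            })

with open("prediction.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)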


62 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved from the study.

621 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language to a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, liberating the possibility of creating a unique urban morphology.

622 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data, directly producing a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

623 First-time time-consumption

As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

624 Skill requirement

As writing a rule requires both programming and urban design skill, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


625 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It can visualize decisions, strategies, and ideas on a site without the gap period of manual urban design or rule-based systems. It not only saves time but also makes decision-making more fluent through an instant comparison of two strategies or an accurate translation of decisions from clients and users. Meanwhile, our approach serves its deliverable features via open-source Python scripts for QGIS, Blender, and TensorFlow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and of first attempts. Also, this study compares the differences between existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structures and data augmentation can provide experimental help to other related studies. The final solution also saves a lot of computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future works.

63 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving the relevant AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented with more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create mass training datasets with various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationship between buildings and auxiliaries, which means the possibility of predicting a detailed building output. The Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) YOLO. https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 25: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

fig8 examples of trainingtest dataset

The input aerial images are taken by 24 virtual cameras as the same as the ones used in the network of extracting building typologies ( in original 3D-R2N2) An ArcGIS Python script exports each block of buildings on it to ESRI Shapefile (shp) Then a CityEngine Python script exports every block with 3D buildings to a mesh model (obj) Each ground truth label stores as voxel data converted from a mesh model via Binvox into nd-array ( in original 3D-R2N2) The entire dataset includes 177 models and 4240 rendered images which is separated into training and validating dataset by the ratio of 08 As a multi-view 3D reconstruction a 3D-LSTM (Long-short Term Memory) takes consideration of multi-view images as a time-sequential data leading a 5D input data [view_id batch_id channel width height] Inputs for each model are randomly picked from 24 images for random times (in a range from 1 to 5) The picked images will be randomly center cropped and randomly horizontal flipped to avoid overfitting in the training process The labels as voxel data are constructed with five dimensions [batch_id the channel of masks x-axis y-axis z-axis] The channels represent original or masked objects (entity true or false) 4412 Network

fig9 3D-R2N2 network architecture [2]

As demonstrated in fig9 3D-R2N2 is composed of three parts a 2D Convolutional Neural Network (encoder) a 3D Convolutional LSTM and a 3D deconvolutional Neural Network (3D-DCNN) Given the encoded input a set of proposed 3D Convolutional

25

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 26: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

LSTM (3D-LSTM) units either update their cell states or retain the states by closing the input gate At time step t when a new input is received the operation of an LSTM unit can be expressed as refer to the input gate the output gate and the forget gate respectively and refer to the memory cell and the hidden state respectively (formula2)

formula2 3D-LSTM kernel forget and update gate [2]

Finally the 3D-DCNN decodes the output of 3D-LSTM units and generates a 3D probabilistic voxel of The prediction is the probability of the existence of voxel cell at using voxel-wise softmax A voxel data can be visualized as a 3D heatmap or solid 3D model by setting a threshold 442 3D mesh reconstruction A voxel model is difficult for cleaning and being simplified to a feasible model in urban design modeling software Also the voxel model stores data inside closed objects which is inefficient and will become increasingly large after scaling Therefore the next exploration focuses on a mesh format compatible with design modeling software As inspired by MarrNet additional features can improve the accuracy performance in the training process Since a building typology will be influenced by the properties like height limit or land-use additional bitmaps of these properties can help our network in the training process To distribute computational workload two machine learning neural networks have been proposed a) translate a 2D image of building typology to a top view location map and b) 3D mesh reconstruction 4421 Data structure of trainingvalidating dataset In the parcel-level an aerial image will be cropped from its parent block-level image and scaled to Additional information like height-limit and land-use will be stored as gray-scale images ( ) on the 2D top view Ground truth labels are 3D point clouds with normals (6 dimensions totally) calculated from mesh models (fig10)

26

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

(fig20 rows, top to bottom: Voxel, Mesh, Spatial Data)

fig20 Outputs from three approaches

53 Other evaluations

Recalling the combined loss function, we have 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and the experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions


fig21 Loss plotting of the final solution

Also, we test the ablation effects of the various loss functions (see tab5). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.

The columns are: the baseline, four reference settings (3D-R2N2-GRU, VGG16-LSTM, isolated dataset, Top-view), and five runs each trained without one loss term (relative distance, unit tangent vector, curvature, center, edge length).

loss (factor)  Baseline  3D-R2N2-GRU  VGG16-LSTM  isolated  Top-view  w/o rel.dist.  w/o tangent  w/o curvature  w/o center  w/o edge len.
loss           0.8842    0.9645       0.8745      0.8351    0.9704    0.8353         0.7166       0.7189         0.4403      0.8416
loss 1 (0.5)   0.0601    0.0665       0.0598      0.0473    0.1226    0.0638         0.0583       0.0577         0.0489      0.0621
loss 2 (0.1)   0.0658    0.0636       0.0658      0.0507    0.0949    0.0652         0.0647       0.0652         0.0584      0.066
loss 3A (1)    0.0518    0.0519       0.0514      0.0527    0.0324    0.0589         0.0525       0.0478         0.0498      0.05
loss 3B (1)    0.1796    0.1793       0.1794      0.184     0.0947    0.1795         0.1947       0.1839         0.1754      0.1801
loss 3C (1)    0.1623    0.1614       0.1618      0.1632    0.1381    0.1616         0.1744       0.2392         0.1602      0.1636
loss 3D (10)   0.0433    0.0514       0.0437      0.0371    0.0592    0.0439         0.0433       0.0432         0.0598      0.0435
loss 3E (1)    0.0445    0.045        0.0448      0.0408    0.0428    0.0427         0.044        0.0424         0.0441      0.0514

tab5 losses of ablation test

The learning on the isolated dataset with VGG16-GRU shows the best results among the reference groups, and the losses of the geometry terms demonstrate their effects on each other. The 2D top-view dataset shows slightly worse heights/numbers but a better shape loss. In tab6 we can also see the constraints of shape, corner angle, and position under the different ablations of the loss functions (a sketch of the loss-weight toggling used for such ablations follows tab6).

tab6 spatial building predictions of ablation test
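The ablation columns of tab5 can be produced by zeroing one factor at a time and retraining; below is a minimal sketch assuming a Keras model with one named output per loss term. The output names, the pairing of names to factors, and the use of "mse" per term are assumptions for illustration only.

from tensorflow import keras

BASE_FACTORS = {"num": 0.5, "heights": 0.1, "center": 1.0, "relative": 1.0,
                "tangent": 1.0, "curvature": 10.0, "edge": 1.0}

def compile_for_ablation(model, dropped=None):
    # Zero one factor so its term no longer drives training, then
    # compare the remaining losses against the baseline run.
    factors = dict(BASE_FACTORS)
    if dropped is not None:
        factors[dropped] = 0.0
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss={name: "mse" for name in factors},
                  loss_weights=factors)
    return model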

Besides the loss plotting, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distribution of the DC1/DC2 layer after t-SNE dimension reduction, representing learning capacities of the reference images and geometries that cannot simply be judged by loss values. The isolated dataset still shows the best distribution visually.
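A minimal sketch of producing such a plot with scikit-learn, assuming the DC1 (or DC2) activations have been exported as a NumPy array; the file names and the coloring key are assumptions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.load("dc1_features.npy")   # (n_samples, feature_dim) activations
labels = np.load("land_use_codes.npy")   # used only to color the points

embedded = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=4, cmap="tab10")
plt.title("t-SNE of DC1 encoded features")
plt.show()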

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)

fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial to be able to steer the learning network to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after the inputs are encoded as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties); an interpolation sketch follows this paragraph. In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate predictions between two parcel shapes. The land-use predictions show more buildings in the single-family residential prediction and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the dense and taller ones in the larger parcel.
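A sketch of the interpolation procedure, assuming the trained network has been split into an encoder up to the MID vector and a decoder from it (the encode/decode helpers are hypothetical names):

import numpy as np

def interpolate(encode, decode, input_a, input_b, steps=5):
    # Encode both inputs to latent vectors (e.g. 1000-d for reference
    # images, 300-d for parcel geometries), then decode evenly spaced
    # linear blends between them.
    z_a, z_b = encode(input_a), encode(input_b)
    return [decode((1.0 - t) * z_a + t * z_b)
            for t in np.linspace(0.0, 1.0, steps)]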

fig26 Test results of different building typologies

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare results trained on the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and a better size/number of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

61 Deliverable

After training and optimizing, the final solution can serve two functions integrating rule-based approaches into the decision-making process of urban development: a) create a customized "rule" intuitively, and b) apply a "rule" and obtain a GeoJSON-like csv file including the information of the 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR), users can run a new training and validating process on their own computers. By using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, obtaining information-rich geospatial building geometries as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms; a conversion sketch follows fig32. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry just as a shape grammar does.

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does
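A conversion sketch with GeoPandas, assuming one csv row per predicted building with flattened vertex columns x0,y0,...,x15,y15 and a height column (all column names are assumptions); the resulting file opens in any GIS platform that reads GeoJSON:

import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

df = pd.read_csv("predicted_buildings.csv")

def row_to_footprint(row, n_vertices=16):
    # Rebuild a footprint polygon from the flattened lon/lat columns.
    return Polygon([(row[f"x{i}"], row[f"y{i}"]) for i in range(n_vertices)])

gdf = gpd.GeoDataFrame(df[["height"]],
                       geometry=df.apply(row_to_footprint, axis=1),
                       crs="EPSG:4326")
# Most platforms can extrude by reading the height (z) attribute.
gdf.to_file("morphology.geojson", driver="GeoJSON")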


62 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). Because of that, this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved by the study.

621 Creativity

Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language into a script as a building typology. Our approach is an option allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, liberating the possibility of creating a unique urban morphology.

622 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data; it directly produces a matching rule from the given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

623 First-time time-consumption

As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of rule-based systems. It reduces the speed advantage of rule-based approaches when applying a building typology to a site. Our approach breaks this limit and brings users a better experience from the beginning of visualizing urban morphologies.

624 Skill requirement

As writing a rule requires both programming and urban design skills, very few users can manipulate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply describe their favorite scenarios through image references collected online, or choose a suggestion from the ones provided by professional urban designers.


625 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It can visualize decisions, strategies, and ideas on a site without the gap period of manual urban design or a rule-based system. It not only saves time but also makes decision-making more fluent, enabling an instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and TensorFlow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and of experimentation. Also, this study compares the differences between existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a lot of computational energy when we compare batch sizes (the number of parcels calculated per iteration): on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future work.

63 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving relevant AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create massive training datasets in various styles to feed these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (e.g., curtain wall or brick wall) as part of the building properties (see the roof-type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline in this study (like the general approach of [21]). It can contribute to predicting the relationship between buildings and auxiliaries, which means the possibility of predicting a detailed building output. The Mask R-CNN can also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 27: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

fig10 examples of trainingvalidating dataset

4422 Network A translate a 2D image of building typology to a top view location map Network A is taken multiple inputs including extracted building typology images 2D parcel shape image and additional bitmaps (like height limit and land-use) to predict a 2D building location map that fits the shape of a target parcel As mesh reconstruction will not preserve the location of 3D output objects this location map is used to place the reconstruction results

fig11 multi-task GAN structure

27

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 28: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

The input images are a sequence of multi-view aerial renderings of a single parcel taken by 24 surrounding cameras and cropped by the bounding box of this parcel N is the number of views a parcel randomly in a range from one to five (fig11) This

image data is encoded by a ResNet50 model producing an feature vector in the latent space The feature array merges into a column vector via Gated Recurrent Units (GRU)

formula3 the concatenated mid feature vector

Thus we can get a feature vector after learning a sequence of renderings of a single parcel This feature vector will be decoded multiple times comparing with additional constraints bitmaps ( ) and encoded again as

columns The concatenated column vector ( formula3) stores the feature of building typology the shape of its parent parcel (from the parcel mask) and the additional constraints (eg the height-limit bitmap) The following parts of the network are decoder and generator as a classic GAN In iterations (formula4) we have the MSE loss ( ) of 2D height map prediction the least squared reconstruction loss ( ) of a reconstructed rendering image (compared to all angles of renderings of this parcel and return the minimum one) the MSE losses ( ) ) of additional constraints as multiple losses The final combined loss will be optimized by Adam with a learning rate of 0001

formula4 combined loss

4423 Network B 3D mesh reconstruction

fig12 Pixel2Mesh network structure [7]

Network B is an implementation of Pixel2Mesh (fig12) The training is to match the extracted images of building typology with their ground truth 3D model (stored as point

28

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does

6.2 Contribution
Compared to current rule-based approaches, our approach keeps the advantage of the data archive while improving the process of making a rule (translating a building typology into a script). This thesis thereby addresses the four corresponding shortages of current rule-based approaches stated in the introduction: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also come out of the study.

6.2.1 Creativity
Since rule-based approaches provide only limited built-in functions for generating building footprints from a single geometry, users can translate only a limited design language into a script as a building typology. Our approach offers an alternative, allowing users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated as rules, opening up the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity
Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when many data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple data sources; it directly produces a matching rule from the given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time time consumption
As discussed above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of rule-based systems, and it erodes their speed advantage when applying a building typology to a site. Our approach breaks this limit and gives users a better experience from the very beginning of visualizing urban morphologies.

6.2.4 Skill requirement
As writing a rule requires both programming and urban design skills, very few users can operate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply express their favorite scenarios through image references collected online, or choose a suggestion from those provided by professional urban designers.


6.2.5 Other contributions
The application of our approach can also bring rapid decision-making iterations to urban planning and design. It can visualize decisions, strategies, and ideas on a site without the gap period of manual urban design or a rule-based system. It not only saves time but also makes decision-making more fluent, through an instant comparison of two strategies or an accurate translation of decisions from clients and users. Meanwhile, our approach delivers these features via open-source Python scripts for QGIS, Blender, and TensorFlow (Keras), without depending on commercial GIS platforms such as ESRI ArcGIS or CityEngine, which reduces the cost of licenses and experiments. This study also compares the existing 3D reconstruction approaches with respect to urban design usage and creates an approach specific to urban design tasks. The exploration of data structures and data augmentation can provide experimental help to other related studies. Finally, the solution saves considerable computational energy, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future work.

6.3 Future works
With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, such as photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes; by involving more shape-grammar scripts, rule-based systems can create massive training datasets in these styles to feed such future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain wall or brick wall) as part of the building properties (see the roof-type learning project [19]); after recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may further enhance the pipeline in this study (like the general approach of SDM-NET [21]); it could contribute to predicting the relationship between buildings and their auxiliaries, which means the possibility of predicting detailed building output. Mask R-CNN could also be integrated into the encoding part of the network to achieve a better multi-object learning result.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine official website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient convnet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) YOLO: https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 29: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

clouds with normals) by their losses of the chamfer distance loss normal loss and Laplacian regularization In Pixel2Mesh the Chamfer loss constraints the location of mesh vertices the normal loss enforces the consistency of surface normal the Laplacian regularization maintains relative location between neighboring vertices during deformation and an edge length regularization to prevent outliers These losses are applied with equal weight on both the intermediate and final mesh

For each Graph Convolutional Network (GCN) in Pixel2Mesh the basic ellipsoid is deformed by the product of its control points and vertice-translation matrices The number of control points will be un-pooled from 156 to 2466 The network takes input images as and initial ellipsoid with 156 vertices and 462 edges The achieved 3D mesh models will be placed by location maps predicted by Network A 45 Spatial data prediction In the second approach predicting a location map of buildings on a parcel and reconstructing complex 3D models cost too much computational energy (CPU GPU RAM) Our final solution uses the spatial geometry format (GeoJSON-like) of current rule-based approaches avoiding the unnecessary computational usage on images and mesh data In this approach we adjust the concepts of dataset augmentation 3D-LSTM loss functions from the previous two approaches to urban design usage 451 Data structure of trainingvalidating dataset The trainingvalidating datasets are made by QGIS Blender OpenCV scripts extracting geospatial data from raw data Since the raw dataset is biased with respect to the land-use type and the numbers of buildings (single-family residential two-building parcel is occupying the most) we can scale them by land-use and number factors to 10216 samples (tab5) This can avoid our network to overfit the bias of raw datasets

29

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 30: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

tab5 The list of raw data (upper) and scaled training dataset (lower) LA sample

The input of this solution includes two elements (fig13) 2D aerial rendering images and spatial parcel data As the previous approaches the images are extracted from 24 aerial rendering images of building typologies via their bounding-boxes Additionally in this solution an isolated parcel rendering dataset is prepared as a reference group to test the influence of surrounding buildings Also 2D top view images augmented as the style of shadow wireframe fill and mass-void are collected as another dataset to test the different learning results between 2D and 3D input information Spatial parcel data are csv files with geometries and properties By interpolating simplifying (fig14) and smoothing (fig15) each parcel geometry contains 16 vertices of longitude and latitude (13 is the average number of vertices of a geometry shape in the raw dataset) In properties each parcel includes area perimeter height limit and land-use code information which can be expanded more from user input The csv table of each spatial parcel data input is

30

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 31: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

The ground-truth label is spatial building data corresponding to parcels As well as a parcel geometry each building footprint geometry is converted to 16 vertices of longitude and latitude For each csv table of spatial building data its property has the number of buildings in the parcel ten height values and ten geometry data The table thus includes 331 values ( )

fig13 Sample format of the training set

fig14 (pseudo code) Interpolating and simplifying a geometry

fig15 (pseudo code) Smoothing a geometry

31

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings, the building heights, and the vertices of building footprints. Note that we only consider the available values in Loss2 and Loss3, since the number of buildings varies across parcels. For a two-building parcel, for instance, we only calculate the first two of the ten height values in Loss2 and the first 64 coordinate values (2 buildings × 16 vertices × 2 coordinates) in Loss3. Loss3 considers five main parts (center, relative vertices, unit tangent vector, curvature, edge length), among others, to let the neural network learn the geometries. They constrain different geometry characteristics (fig17):

- Center: the location of building footprints
- Relative vertices: the shape and size of each geometry
- Unit tangent vector: the rotation angle of each geometry
- Discrete curvature: the corners of each geometry
- Edge length: the edge length of each geometry
- Extent x and y: the horizontal and vertical scale of building footprints
- Maximum discrete curvature: the maximum corner angle of each geometry
- Number of corners: the number of corners of each geometry
- Normalized coordinates: the shape and size of each geometry
- Absolute coordinates: the shape and location of each geometry

fig17 Ten losses of building geometry prediction


The final combined loss weights Loss3 more heavily, since its terms are the crucial parts of learning the shape of building footprints. Because the losses related to coordinates have maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates), they are normalized respectively for better loss organization. Thus all losses fall in the range of 0 to 1, and they are balanced by weighting factors that follow the factors in Pixel2Mesh, adjusted by experimental results (see the next section). formula6 comprises the following terms: the number of buildings (Loss1); the 10 building heights (Loss2); and, within Loss3, the center coordinates, the absolute coordinates, the relative coordinates, the normalized coordinates (computed over rolled vertices and with batch normalization), the unit tangent vector, the discrete curvature, the edge length, the extent x and y, the maximum discrete curvature, and the number of corners, all summed into the combined loss.

formula6 Loss functions
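A sketch of the masking idea behind Loss2 and Loss3, assuming TensorFlow; only the first n building slots of each parcel contribute to the loss:

```python
import tensorflow as tf

def masked_l1(y_true, y_pred, n_buildings, slots=10):
    """L1 loss over only the first n_buildings of the 10 per-parcel slots."""
    mask = tf.sequence_mask(n_buildings, maxlen=slots, dtype=y_pred.dtype)
    # Broadcast the (batch, slots) mask over trailing dims, e.g. 16 vertices x 2 coords.
    while mask.shape.rank < y_pred.shape.rank:
        mask = mask[..., tf.newaxis]
    diff = tf.abs(y_true - y_pred) * mask
    denom = tf.reduce_sum(mask * tf.ones_like(y_pred))   # count of unmasked elements
    return tf.reduce_sum(diff) / tf.maximum(denom, 1.0)
```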


5 RESULT AND EVALUATION
As the final solution of this thesis study, the third approach (spatial data reconstruction) is the main subject of evaluation in this section. The extracting network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. The final approach is a solution intended only for urban design usage; hence, it is evaluated by modifying design elements.

5.1 Extracting building typologies
The Mask R-CNN model of the first implementation is trained for 80 epochs and achieves a 0.11 bounding-box loss and 0.41 mask loss on the training set, and a 0.15 bounding-box loss and 0.40 mask loss on the validation set (fig18). As we can see, the network performs well in bounding-box prediction but not as well in mask prediction (fig19).

fig18 Training/test losses of extracting building typologies

fig19 Examples of extracting building typologies

5.2 3D reconstruction experimental results
As shown in fig20, our final solution predicts the cleanest and most usable outputs as the resulting 3D models. The voxel output is an ambiguous probability model that cannot be cleaned or simplified without the bias of the algorithm. Also, voxel models demand heavier storage and computation as their scale increases, mirroring the size comparison between rasterized and vectorized files. Moreover, voxel models are rare in urban design with respect to their editability. However, we can see the image learning capacity from the different results predicted from high-rise and low-rise reference images. Similarly to the first approach, adjusting the current mesh construction is not able to produce acceptable urban morphology models. As we can see, the network that aims to predict the location map generates similar results from various inputs. Furthermore, the network cannot predict the angle of building footprints well, learning unsuccessful rules of building envelopes. The 3D mesh results reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they also contain overly trivial mesh faces for most urban design models. Linking these two networks (location map prediction and 3D mesh reconstruction) cannot easily avoid the bias of shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

fig20 Outputs from three approaches (Voxel / Mesh / Spatial Data)

5.3 Other evaluations
Recall that the combined loss function has 7 factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and the experimental results, our factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of Loss functions
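Continuing the sketches above, this training setup could be wired in Keras as follows; masked_l1_heights and combined_geometry_loss are hypothetical wrappers in the spirit of the masked_l1 sketch, and only the optimizer, learning rate, loss1/loss2 weights, epoch count, and batch size are taken from the text:

```python
# Weighted multi-output training, matching the settings reported here.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),   # as in fig21
    loss={"num": "mse",
          "heights": masked_l1_heights,        # hypothetical masked Loss2
          "geom": combined_geometry_loss},     # hypothetical bundle of Loss3A-3E
    loss_weights={"num": 0.5, "heights": 0.1, "geom": 1.0},   # loss1/loss2 weights from tab5
)
model.fit([train_images, train_parcels, train_props],
          [train_counts, train_heights, train_geometries],
          batch_size=128, epochs=50)           # 128 parcels per iteration, 50 epochs
```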


fig21 Loss plotting of the final solution

We also test the ablation effects of the various loss functions (tab5). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are also shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.

tab5 Losses of the ablation test. The first five columns are the baseline and reference settings; the last five columns each remove one loss function (relative distance, unit tangent vector, curvature, center, edge length).

loss (weight)   Base-   3D-R2N2  VGG16   isolated  Top-    w/o rel.  w/o unit  w/o     w/o     w/o edge
                line    -GRU     -LSTM   dataset   view    distance  tangent   curv.   center  length
total loss      0.8842  0.9645   0.8745  0.8351    0.9704  0.8353    0.7166    0.7189  0.4403  0.8416
loss 1 (0.5)    0.0601  0.0665   0.0598  0.0473    0.1226  0.0638    0.0583    0.0577  0.0489  0.0621
loss 2 (0.1)    0.0658  0.0636   0.0658  0.0507    0.0949  0.0652    0.0647    0.0652  0.0584  0.0660
loss 3A (1)     0.0518  0.0519   0.0514  0.0527    0.0324  0.0589    0.0525    0.0478  0.0498  0.0500
loss 3B (1)     0.1796  0.1793   0.1794  0.1840    0.0947  0.1795    0.1947    0.1839  0.1754  0.1801
loss 3C (1)     0.1623  0.1614   0.1618  0.1632    0.1381  0.1616    0.1744    0.2392  0.1602  0.1636
loss 3D (10)    0.0433  0.0514   0.0437  0.0371    0.0592  0.0439    0.0433    0.0432  0.0598  0.0435
loss 3E (1)     0.0445  0.0450   0.0448  0.0408    0.0428  0.0427    0.0440    0.0424  0.0441  0.0514

The isolated dataset with VGG16-GRU learns best among the reference groups, and the geometry losses demonstrate their mutual effects. The 2D top-view dataset performs only slightly worse in heights and building numbers, but better in the shape losses. In tab6 we can also see the constraints on shape, corner angle, and position under the different loss-function ablations.

tab6 Spatial building predictions of the ablation test

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and the output building geometries. The latent space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after T-SNE dimension reduction, representing learning capacities for reference images and geometries that cannot simply be judged from loss values. The isolated dataset still shows the best distribution visually.
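A sketch of how such plots are typically produced, assuming scikit-learn and matplotlib; encoder is a hypothetical sub-model exposing the DC1 (or DC2) activations:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

dc1 = encoder.predict([val_images, val_parcels, val_props])   # (n_samples, 1000)
xy = TSNE(n_components=2, perplexity=30).fit_transform(dc1)   # reduce to 2D
plt.scatter(xy[:, 0], xy[:, 1], s=4)
plt.title("T-SNE of DC1 (reference image features)")
plt.show()
```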

fig22 T-SNE mapping of reference images encoded features (DC1)


fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features, isolated dataset (DC1)

fig25 T-SNE mapping of reference images encoded features, 2D top-view dataset (DC1)

As a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated on the encoded latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two; in the parcel interpolation, we can find the intermediate predictions between two parcel shapes. The land-use predictions show more buildings in the single-family residential case and larger single buildings in the mixed-use parcel. The smaller parcel also predicts lower and more decentralized buildings, compared to the denser and taller ones in the larger parcel.
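A sketch of this latent interpolation, with image_encoder and decoder as hypothetical sub-models split out of the trained network:

```python
import numpy as np

z_a = image_encoder.predict(images_a)      # e.g. a one-building reference
z_b = image_encoder.predict(images_b)      # e.g. a two-building reference
for t in np.linspace(0.0, 1.0, 5):
    z = (1.0 - t) * z_a + t * z_b          # linear blend in latent space
    counts, heights, footprints = decoder.predict([z, parcel_feats, prop_feats])
```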


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (fig30 and fig31). Using the same network and settings, we get better heights with the isolated dataset and better sizes and numbers of buildings with the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable
After training and optimizing, the final solution serves two functions for integrating rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to obtain a GeoJSON-like csv file including the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR), users can run a new training and validation process on their own computers. By using the pre-trained model provided by this study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries as an information-rich csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, TopoJSON, GeoJSON) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized on most platforms. In other words (fig32), the network predicts a matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry just as a shape grammar does.

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does
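A sketch of consuming the csv deliverable, assuming GeoPandas and a hypothetical column layout (a WKT geometry column plus a height column; the file names are illustrative):

```python
import geopandas as gpd
import pandas as pd
from shapely import wkt

df = pd.read_csv("predicted_buildings.csv")
df["geometry"] = df["geometry"].apply(wkt.loads)          # footprint polygons
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")
gdf.to_file("predicted_buildings.geojson", driver="GeoJSON")
gdf.to_file("predicted_buildings.shp")                    # extrude by the height column in any GIS
```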


6.2 Contribution
Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology to a script). This thesis thereby addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions are also achieved by the study.

6.2.1 Creativity
Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry, users can only translate a limited design language into a script as a building typology. Our approach is an option that allows users to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, liberating the possibility of creating unique urban morphologies.

6.2.2 Data-driven capacity
Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when many data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple sources of data; it directly produces a matching rule from given data to outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time cost
As part of the discussion above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It reduces the speed advantage of rule-based approaches when applying a building typology onto a site. Our approach breaks this limit and brings a better experience to users from the beginning of visualizing urban morphologies.

6.2.4 Skill requirement
As writing a rule requires both programming and urban design skills, very few users can operate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from those provided by professional urban designers.


6.2.5 Other contributions
The application of our approach can also bring rapid decision-making iterations to urban planning and design. It can visualize decisions, strategies, and ideas on a site without the gap period of manual urban design or rule-based systems. It not only saves time but also makes decision-making more fluent, with instant comparison of two strategies and accurate translation of decisions from clients and users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and TensorFlow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine. This reduces the cost of licenses and experiments. This study also compares the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. Finally, the solution saves considerable computational resources, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh one, and 128 for ours. This leaves more room for improvement in future work.

6.3 Future works
With the fundamental pipeline built in this study, we can imagine future steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, like photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape grammar scripts, rule-based systems can create massive training datasets in various styles, feeding these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (curtain or brick wall) as part of the building properties (see the roof type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may potentially enhance the pipeline of this study (like the general approach of [21]). It could contribute to predicting the relationship between buildings and auxiliary structures, opening the possibility of predicting detailed building outputs. The Mask R-CNN could also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES
More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 32: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

Before feeding the dataset into the network we need to scale the data from the urban scale to [-1 1] or [0 1] interval which can be learned by neural networks After analyzing the dataset we can filter the raw dataset and scale them via the corresponding extent preserving all features (tab6) Especially we normalize each parcel and building geometry by the max extent of its parcel boundary (vertices) sparsing the value of coordinates

Field name

extent -9980~9994 -9988~9989 396~34567 83~761 0~100

Scale method

Output extent

-1~1 -1~1 0~1 0~1 0~1

Preserve 10 10 10 10 10

Field name

extent -9928~9828 -9934~9982 1~9 1~985

Scale method

Output extent

-1~1 -1~1 0~1 0~1

Preserve 10 10 10 10

tab6 The conversion of the dataset

(from the entire trainingvalidating dataset)

32

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 33: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

452 Network The network (fig16) takes three inputs (images parcel geometry data parcel properties) and predicts three outputs (number of buildings building heights building geometry data) Firstly we extract image features via a 3D-LSTM (Long-Short Term Memory) which is following the methodology of 3D-R2N2 implementation in the first approach and a 3D-GRU (Gated Recurrent Units) which is used in Pixel2Mesh This network (ResNet in 3D-R2N2 or VGG16 as the base model) convolutes the input images (N views randomly are picked per parcel when N is in a range from 1 to 5) to a time-sequential feature (FC1_R ) and uses the LSTMGRU to merge them into a (FC1

) column vector The input gate the output gate and the forget gate memory cell and the hidden state are the same as formula2 in the first approach Secondly we convolute the input geometry and fully-connect property data of parcel as two feature vector (FC2 FC3) and dense (fully-connect) them as well as a dense feature ( ) of image feature (FC1) After concatenating these dense features (DC1 DC2 DC3) together (formula5) we have a feature vector (MID) in latent space The different number of features is inspired by the setting in 3D-R2N2 (1024) and Pixel2Mesh (1200) Three decoders predict the number of buildings building heights and building geometry as three outputs differently with the convolution layer only in geometry prediction

fig16 geospatial prediction network structure

33

formula5 the concatenated mid feature vector

Three loss functions (formula6) calculate the difference between prediction and ground-truth label in terms of the number of buildings building heights and the vertices of building footprints Especially we only consider the available values in Loss2 and Loss3 since the number of buildings is various in each parcel For an instance of a two-building parcel we only calculate the first two out of ten height values in Loss2 and the first 64 coordinate values in Loss3 Loss3 considers five parts to let the neural network learn the geometries including center relative vertices unit tangent vector curvature edge length and etc They constrain different geometry characteristics as (fig17)

Center the location of building footprints Relative vertices the shape and size of each geometry Unit tangent vector the rotation angle of each geometry Discrete curvature the corners of each geometry Edge length the edge length of each geometry Extent x and y the horizontal and vertical scale of building footprints Maximum discrete curvature the maximum angle of corners number of corners corners Normalized coordinates the shape and size of each geometry Absolute coordinate the shape and location of each geometry

fig17 ten losses of building geometry prediction

34

The final combined loss inclines more in Loss3 as they are the crucial parts of learning the shape of building footprints Since the losses related to coordinates have the maximum values of 2 (difference of edge length) and 4 (L1 distance of two coordinates) they are normalized respectively for better losses organization Thus all losses are in the range of 0 to 1 and they are balanced by

which considers the factors in Pixel2Mesh and adjusted by experimental results (see in the next section) The number of buildings

where and

10 building heights

where and

Center coordinates

where and

Absolute coordinates

where and

Relative coordinates

where and

35

Normalized coordinates

where and

Roll vertices

Normalized coordinates

Batch-normalized coordinates

Unit tangent vector

where and

Discrete curvature

where and

Edge length

where and

36

Extent x and y

where and

Maximum discrete curvature

where and

The number of corners

where and

Combined loss

formula6 Loss functions

37

5 RESULT AND EVALUATION As the final solution of this thesis study the third approach of spatial data reconstruction is the main evaluation in this section The extracting network will be introduced briefly and the first two 3D reconstruction approaches serve as reference groups The final approach is a solution with only urban design usage Hence its evaluations are tested by modifying design elements 51 Extracting building typologies The first Mask R-CNN implement model is trained for 80 epochs and achieved 011 bounding box loss 041 mask loss in training and 015 bounding box loss 040 mask loss in validating set (fig18) As we can see the network performs not well in the mask prediction but well in the bounding-box prediction (fig19)

fig18 trainingtest losses of extracting building typologies

fig19 Examples of extracting building typologies

52 3D reconstruction experimental result As shown in fig20 our final solution predicts cleanest and usable outputs as the resulting 3D models The voxel output has ambiguous probability models that cannot be cleaned or simplified without the bias of the algorithm Also voxel models will become heavier space computation when increasing their scales which is the same as the size comparison between vectorized and rasterized files Again voxel models are rare in urban design models with respect to its edibility However we can see the image

38

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine official website: https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI: https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page: https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681–690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) Darknet YOLO: https://pjreddie.com/darknet/yolo/
18) images-to-osm: https://github.com/jremillard/images-to-osm
19) Reconstructing 3D buildings from aerial LiDAR with AI: https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520


621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 38: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

5 RESULT AND EVALUATION
As the final solution of this thesis, the third approach (spatial data reconstruction) is the main subject of evaluation in this section. The extraction network is introduced briefly, and the first two 3D reconstruction approaches serve as reference groups. Because the final approach targets urban design usage only, its evaluations are conducted by modifying design elements.

5.1 Extracting building typologies
The Mask R-CNN extraction model is trained for 80 epochs, reaching a 0.11 bounding-box loss and a 0.41 mask loss on the training set, and a 0.15 bounding-box loss and a 0.40 mask loss on the validation set (fig18). As these curves show, the network performs well in bounding-box prediction but less well in mask prediction (fig19).

fig18 Training/test losses of extracting building typologies

fig19 Examples of extracting building typologies
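For context, the extraction step can be set up roughly as follows — a minimal sketch assuming the widely used Matterport Keras implementation of Mask R-CNN; the class names, config values, and paths are placeholders rather than the exact settings of this study:

```python
from mrcnn.config import Config
from mrcnn import model as modellib

class TypologyConfig(Config):
    NAME = "building_typology"   # hypothetical experiment name
    NUM_CLASSES = 1 + 1          # background + building
    IMAGES_PER_GPU = 2
    STEPS_PER_EPOCH = 500

config = TypologyConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")

# Start from COCO weights, replacing the class-specific heads.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train / dataset_val are mrcnn Dataset objects prepared
# elsewhere from the aerial images and footprint annotations.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=80, layers="heads")
```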

5.2 3D reconstruction experimental results
As shown in fig20, our final solution predicts the cleanest and most usable outputs among the resulting 3D models. The voxel output is an ambiguous probability volume that cannot be cleaned or simplified without introducing algorithmic bias. Voxel models also demand increasingly heavy storage and computation as their scale grows, much like the size gap between rasterized and vectorized files, and they are rare in urban design models because of their poor editability. Still, the voxel results reveal the image-learning capacity of the network through the different outputs predicted from high-rise and low-rise reference images. Similarly to the first approach, adjusting the current mesh construction cannot produce acceptable urban morphology models. The network that aims to predict the location map generates similar results from various inputs; furthermore, it fails to predict the angles of building footprints, yielding unsuccessful rules for building envelopes. The 3D meshes reconstructed by Pixel2Mesh can show the distinguishable shapes of the inputs, but they contain too many trivial mesh faces for most urban design models, and linking these two networks (location-map prediction and 3D mesh reconstruction) makes it hard to avoid the bias of the shape grammar. Our final solution, spatial data construction, predicts the same format used by urban design tools. Detailed evaluations are demonstrated in the next section.

fig20 Outputs from three approaches (rows: Voxel, Mesh, Spatial Data)

5.3 Other evaluations
Recall that the combined loss function has 7 weighting factors to organize. Considering the similar combined-loss factors in Pixel2Mesh and our experimental results, the factors are finally set as in formula7. As shown in fig21, the losses decrease gradually over 50 epochs with the Adam optimizer (learning rate 1e-4).

formula7 The factors of the loss functions
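From the per-term weights listed in tab5 (loss 1 through loss 3E), the combined loss plausibly takes the weighted-sum form below; the exact arrangement in formula7 is an assumption:

```latex
L_{\text{total}} = 0.5\,L_{1} + 0.1\,L_{2}
                 + L_{3A} + L_{3B} + L_{3C}
                 + 10\,L_{3D} + L_{3E}
```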


fig21 Loss plot of the final solution

We also test the ablation effects of the various loss functions (see tab5). The baseline test uses VGG16-GRU with all loss functions, trained for 25 epochs on the cropped dataset. The output geometries are shown as images in tab6. We use these test results for further adjustment of the multiple loss functions.
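For illustration, a minimal sketch of how such a weighted multi-term loss and its ablations can be organized in TensorFlow; the term names follow tab5, while the tensors behind them are placeholders:

```python
import tensorflow as tf

# Per-term weights, following tab5 / formula7.
LOSS_WEIGHTS = {"loss1": 0.5, "loss2": 0.1, "loss3A": 1.0, "loss3B": 1.0,
                "loss3C": 1.0, "loss3D": 10.0, "loss3E": 1.0}

def combined_loss(terms, weights=LOSS_WEIGHTS, ablate=()):
    """Weighted sum of scalar loss tensors; names listed in `ablate`
    are dropped, reproducing one column of the ablation test."""
    return tf.add_n([w * terms[name] for name, w in weights.items()
                     if name not in ablate])

# Training used the Adam optimizer with learning rate 1e-4 (fig21).
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```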

The last five columns ablate one geometry loss each ("w/o" = without the loss function of relative distance, unit tangent vector, curvature, center, or edge length); the weights in parentheses follow formula7.

loss (weight)    Base-    3D-R2N2  VGG16-   isolated Top-     w/o rel. w/o tan. w/o      w/o      w/o edge
                 line     -GRU     LSTM     dataset  view     dist.    vector   curv.    center   length
loss (combined)  0.8842   0.9645   0.8745   0.8351   0.9704   0.8353   0.7166   0.7189   0.4403   0.8416
loss 1 (0.5)     0.0601   0.0665   0.0598   0.0473   0.1226   0.0638   0.0583   0.0577   0.0489   0.0621
loss 2 (0.1)     0.0658   0.0636   0.0658   0.0507   0.0949   0.0652   0.0647   0.0652   0.0584   0.0660
loss 3A (1)      0.0518   0.0519   0.0514   0.0527   0.0324   0.0589   0.0525   0.0478   0.0498   0.0500
loss 3B (1)      0.1796   0.1793   0.1794   0.1840   0.0947   0.1795   0.1947   0.1839   0.1754   0.1801
loss 3C (1)      0.1623   0.1614   0.1618   0.1632   0.1381   0.1616   0.1744   0.2392   0.1602   0.1636
loss 3D (10)     0.0433   0.0514   0.0437   0.0371   0.0592   0.0439   0.0433   0.0432   0.0598   0.0435
loss 3E (1)      0.0445   0.0450   0.0448   0.0408   0.0428   0.0427   0.0440   0.0424   0.0441   0.0514

tab5 Losses of the ablation test

Learning on the isolated dataset with VGG16-GRU performs best among the reference groups, and the geometry losses demonstrate their effects on one another. The 2D top-view dataset performs only slightly worse on heights and building counts but better on the shape losses. In tab6, we can also see how shape, corner angle, and position are constrained under the different ablations of the loss functions.

tab6 Spatial building predictions of the ablation test

Besides the loss plots, we plot the intermediate latent space with respect to the two inputs (building images and parcel geometries) and their output building geometries. The latent-space plots (fig22 to fig25) show the distribution of the DC1/DC2 layers after t-SNE dimension reduction, representing learning capacities of the reference images and geometries that cannot simply be judged from loss values. The isolated dataset still shows the best distribution visually.
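A minimal sketch of how such plots can be produced with scikit-learn and matplotlib; the feature file and variable names are assumptions, standing in for features extracted from the trained network:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# dc1_features: (n_samples, 1000) array of encoded reference-image
# features taken from the DC1 layer of the trained network (assumed
# to have been exported to this file).
dc1_features = np.load("dc1_features.npy")

embedded = TSNE(n_components=2, perplexity=30,
                random_state=0).fit_transform(dc1_features)

plt.scatter(embedded[:, 0], embedded[:, 1], s=4)
plt.title("t-SNE mapping of reference-image encoded features (DC1)")
plt.savefig("fig22_tsne_dc1.png", dpi=150)
```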

fig22 t-SNE mapping of reference-image encoded features (DC1)


fig23 t-SNE mapping of parcel-geometry encoded features (DC2)

fig24 t-SNE mapping of reference-image encoded features, isolated dataset (DC1)

fig25 t-SNE mapping of reference-image encoded features, 2D top-view dataset (DC1)

For a design-assist machine learning pipeline, it is crucial that the learning network can be manipulated to match various design ideas. In fig26 to fig29, we can see the capacity of the network and how it interpolates to create novel outputs. All interpolations are calculated after the inputs are encoded as latent feature vectors (1000 dimensions for reference images, 300 for parcel geometries or parcel properties). In the building interpolation, we can see how the prediction changes from one building to two. In the parcel interpolation, we can find the intermediate prediction between two parcel shapes. The land-use predictions show more buildings in the single-family residential predictions and larger single buildings in the mixed-use parcel. Also, the smaller parcel predicts lower and more decentralized buildings compared to the denser and taller ones in the larger parcel.
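A minimal sketch of this interpolation given the stated dimensions; the encoder and decoder handles in the usage comment are placeholders for the trained sub-models:

```python
import numpy as np

def interpolate_latent(z_a, z_b, steps=8):
    """Linearly interpolate between two latent feature vectors,
    e.g. two 1000-dim encoded reference images or two 300-dim
    encoded parcel geometries."""
    return [(1.0 - t) * z_a + t * z_b
            for t in np.linspace(0.0, 1.0, steps)]

# Usage sketch (encoder/decoder stand for the trained sub-models):
# z_a, z_b = image_encoder(img_a), image_encoder(img_b)
# for z in interpolate_latent(z_a, z_b):
#     footprints = decoder([z, parcel_feature])
```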


fig26 Test results of different building typologies


fig27 Interpolating shapes of a parcel

fig28 Test results of different land-uses of a parcel


fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset, we also compare the results trained on the isolated image dataset and the top-view dataset (see fig30 and fig31). Using the same network and settings, we get better heights from the isolated dataset and a better size/number of buildings from the 2D top-view dataset.

fig30 Test results of the isolated image dataset


fig31 Test results of the 2D top-view image dataset


6 DELIVERABLE AND CONTRIBUTION

6.1 Deliverable
After training and optimizing, the final solution can serve two functions that integrate rule-based approaches into the decision-making process of urban development: a) creating a customized "rule" intuitively, and b) applying a "rule" to obtain a GeoJSON-like csv file containing the information of a 3D urban morphology. After selecting the parcel dataset, the building footprint dataset, and the information alias names (e.g., height, height limit, FAR), users can run a new training and validation process on their own computers. Using the pre-trained model provided by my study, or a customized one, users can input both geospatial geometries (e.g., one or multiple parcels) and image references of building typologies, obtaining the geospatial building geometries with height information as a csv file. This csv file can be converted losslessly to other GeoJSON-like formats (e.g., shapefile, topojson, geojson) for most GIS platforms. By simply reading the z-value (extrusion height), a 3D urban morphology can be visualized in most platforms. In other words (fig32), the network predicts a translation matrix of weights learned from the training set, generating building footprint geometries from a parcel geometry as a shape grammar does.
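As an illustration of this deliverable, a minimal sketch of converting such a csv file into GeoJSON; the column names "geometry" (a WKT polygon) and "height" are assumptions about the file layout, not the study's exact schema:

```python
import json

import pandas as pd
from shapely import wkt
from shapely.geometry import mapping

# Hypothetical layout: one building per row, with a WKT polygon in a
# "geometry" column and the predicted extrusion height in "height".
df = pd.read_csv("predicted_buildings.csv")

features = [{
    "type": "Feature",
    "geometry": mapping(wkt.loads(row["geometry"])),
    "properties": {"height": row["height"]},
} for _, row in df.iterrows()]

with open("morphology.geojson", "w") as f:
    json.dump({"type": "FeatureCollection", "features": features}, f)
```

A GIS platform can then extrude each footprint by its height property to display the 3D morphology.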

fig32 How weights learned from images work to generate geometries, as a rule-based pipeline does

6.2 Contribution
Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology into a script). Accordingly, this thesis addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction: creativity, data-driven capacity, first-time cost, and skill requirements. Other contributions also follow from the study.

6.2.1 Creativity
Since rule-based approaches provide only limited built-in functions for generating building footprints from a single geometry, users can translate only a limited design language into a script as a building typology. Our approach gives users an option to translate any strategy into a shape grammar via image references. Meanwhile, novel or complex building typologies can be translated into rules, opening the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity
Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when a large number of data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple data sources; it directly produces a matching rule from the given data to the outcome form. Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time cost
As part of the discussion above, the first-time cost of translating a building typology (whether a precedent from image references or a new idea from a designer's mind) is a pain point of the rule-based system. It erodes the speed advantage of rule-based approaches when applying a building typology to a site. Our approach breaks this limit and gives users a better experience from the very beginning of visualizing urban morphologies.

6.2.4 Skill requirement
Since writing a rule requires both programming and urban design skills, very few users can operate a rule-based system. Our approach delivers a bigger opportunity to every decision-maker: they can simply convey their favorite scenarios through image references collected online, or choose a suggestion from those provided by professional urban designers.


6.2.5 Other contributions
The application of our approach can also bring rapid decision-making iterations to urban planning and design. It can visualize decisions, strategies, and ideas on a site without the gap period of manual urban design or a rule-based system. It not only saves time but also makes decision-making more fluent, through instant comparison of two strategies or accurate translation of decisions from clients/users. Meanwhile, our approach delivers its features via open-source Python scripts for QGIS, Blender, and TensorFlow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the cost of licenses and of experimentation. This study also compares the existing 3D reconstruction approaches with respect to urban design usage and creates an approach specific to urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. The final solution also saves a great deal of computation, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24 GB of graphics memory, the batch size is 2 for the voxel pipeline, 3 for the mesh pipeline, and 128 for ours. This leaves more room for improvement in future works.

6.3 Future works
With the fundamental pipeline built in this study, we can imagine the next steps for improving related AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, such as photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes; by involving more shape-grammar scripts, rule-based systems can create mass training datasets in various styles to feed these future elements. Secondly, the machine learning pipeline can also learn roof and facade types (e.g., curtain wall or brick wall) as part of the building properties (see the roof-type learning project [19]); after recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than building envelopes. Last but not least, a hierarchical learning system may further enhance the pipeline in this study (like the general approach of [21]); it could predict the relationship between buildings and their auxiliaries, which means the possibility of predicting a detailed building output. Mask R-CNN could also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES
More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner
11) Google Docs product about page. https://www.google.com/docs/about
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681–690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 39: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

learning capacity from the different results predicted from high-rise and low-rise reference images Similarly to the first approach adjusting the current mesh construction is not able to produce acceptable urban morphology models As we can see the network that purposes to predict the location map generates similar results from various inputs Furthermore the network is not able to predict the angle of building footprint as well predicting unsuccessful rules of building envelopes The 3D mesh reconstructed results from Pixel2Mesh can show the distinguishable shapes of inputs but they also contain too trivial mesh faces for most cases of urban design models Linking these two networks (predicting location map and 3D mesh reconstruction) is not easy to avoid the bias of shape grammar Our final solution spatial data construction predicts the same format used by urban design tools Detailed evaluations are demonstrated in the next section

Voxel ______________________________________________________________________

Mesh ______________________________________________________________________

39

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 40: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

Spatial Data ______________________________________________________________________

fig20 Outputs from three approaches 53 Other evaluations Recall the combined loss function we have 7 factors to organize Considering the similar combined loss factors in Pixel2Mesh and the experimental results our factors are finally setting as formula7 As shown in fig21 they are decreasing gradually in 50 epochs with Adam optimizer (learning rate 1e-4)

formula7 The factors of Loss functions

40

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 41: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

fig21 Loss plotting of the final solution

Also we test the ablation effect of various loss functions (see in tab5) The baseline test uses VGG16-GRU with all loss functions training 25 epochs on a cropped dataset The output geometries are also shown as images in tab6 We use these test results for further adjustments of multiple loss functions

without the loss function of

loss ( ) Base- line

3D-R2N2- GRU

VGG16-LSTM

isolated dataset

Top- view

relative distance

unit tangent vector

curvature

center edge length

loss 08842 09645 08745 08351 09704 08353 07166 07189 04403 08416

loss 1 (05) 00601 00665 00598 00473 01226 00638 00583 00577 00489 00621

loss 2 (01) 00658 00636 00658 00507 00949 00652 00647 00652 00584 0066

loss 3A (1) 00518 00519 00514 00527 00324 00589 00525 00478 00498 005

loss 3B (1) 01796 01793 01794 0184 00947 01795 01947 01839 01754 01801

loss 3C (1) 01623 01614 01618 01632 01381 01616 01744 02392 01602 01636

loss 3D (10)

00433 00514 00437 00371 00592 00439 00433 00432 00598 00435

loss 3E (1) 00445 0045 00448 00408 00428

00427 0044 00424 00441 00514

tab5 losses of ablation test The learning of the isolated dataset with VGG16-GRU shows the best among the reference groups and the losses of geometries demonstrate their relevant effects with each other 2D top-view dataset shows only a slightly worse in heightsnumbers but

41

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from an image work to generate geometries, as a rule-based pipeline does


6.2 Contribution

Compared to current rule-based approaches, our approach keeps the advantage of the data archive and improves the process of making a rule (translating a building typology into a script). This thesis therefore addresses the four corresponding shortcomings of current rule-based approaches stated in the introduction: creativity, data-driven capacity, first-time cost, and skill requirements. Further contributions of the study are listed below.

6.2.1 Creativity

Since rule-based approaches provide only limited built-in functions for generating building footprints from a single geometry, users can translate only a limited design language into a script as a building typology. Our approach offers an alternative, allowing users to translate any strategy into a shape grammar via image references. Novel or complex building typologies can likewise be translated into rules, liberating the possibility of creating a unique urban morphology.

6.2.2 Data-driven capacity

Leveraging multiple sources of data to generate geometries is hard for rule-based approaches, especially when many data categories are involved together (e.g., FAR, coverage, height limit). Our approach does not require decision trees to handle and coordinate multiple data sources; it directly learns a matching rule from the given data to the outcome form (see the sketch after 6.2.4 below). Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies.

6.2.3 First-time cost

As discussed above, the first-time cost of translating a building typology (either a precedent from image references or a new idea from a designer's mind) is a pain point of rule-based systems, and it erodes their speed advantage when applying a building typology to a site. Our approach breaks this limit and gives users a better experience from the very beginning of visualizing urban morphologies.

6.2.4 Skill requirement

Because writing a rule requires both programming and urban design skills, very few users can operate a rule-based system. Our approach opens this opportunity to every decision-maker: they can simply express their favorite scenarios through image references collected online, or choose a suggestion from those provided by professional urban designers.
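The sketch below illustrates the data-driven idea from 6.2.2 in Keras: numeric planning properties are concatenated onto the encoded parcel features, so one learned mapping absorbs all data categories with no per-category rule branches. Layer sizes and input names are illustrative assumptions, not the thesis network:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative inputs: a 300-d encoded parcel-geometry feature vector plus
# three numeric planning properties (e.g., FAR, coverage, height limit).
parcel_features = layers.Input(shape=(300,), name="parcel_features")
properties = layers.Input(shape=(3,), name="planning_properties")

# Concatenating the sources lets one dense stack learn the data-to-form
# mapping directly, instead of branching hand-written rules per category.
x = layers.Concatenate()([parcel_features, properties])
x = layers.Dense(256, activation="relu")(x)
conditioned = layers.Dense(300, activation="relu", name="conditioned_code")(x)

model = tf.keras.Model([parcel_features, properties], conditioned)
model.summary()
```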


6.2.5 Other contributions

The application of our approach can also bring rapid decision-making iterations to urban planning and design. It can visualize decisions, strategies, and ideas on a site without the gap period of a manual urban design or rule-based workflow. It not only saves time but also makes decision-making more fluent, through an instant comparison of two strategies or an accurate translation of decisions from clients/users. Meanwhile, our approach delivers these features via open-source Python scripts for QGIS, Blender, and TensorFlow (Keras), without depending on commercial GIS platforms like ESRI ArcGIS or CityEngine; this reduces the cost of licenses and of experimentation. This study also compares the existing 3D reconstruction approaches with respect to urban design usage and develops an approach specific to urban design tasks. The exploration of data structure and data augmentation can provide experimental help to other related studies. Finally, the solution saves considerable computation, as a comparison of batch sizes (the number of parcels calculated per iteration) shows: on one TITAN RTX GPU with 24 GB of graphics memory, the voxel pipeline fits a batch of 2, the mesh pipeline 3, and ours 128. This leaves ample room for improvement in future work.

6.3 Future works

With the fundamental pipeline built in this study, we can imagine future steps for improving AI-empowered design-assist pipelines. Firstly, the input rendering images can be augmented into more styles, such as photorealistic pictures, wireframe line drawings, or vivid scenes with landscapes. By involving more shape-grammar scripts, rule-based systems can create massive training datasets in these various styles to feed such extensions. Secondly, the machine learning pipeline can also learn roof and facade types (e.g., curtain wall or brick wall) as part of the building properties (see the roof-type learning project [19]). After recognizing types from image references, the visualization can follow the shape grammar of an existing rule-based system to generate realistic buildings rather than bare building envelopes. Last but not least, a hierarchical learning system may further enhance the pipeline of this study (like the general approach of SDM-NET [21]); it could predict the relationship between buildings and their auxiliaries, opening the possibility of a detailed building output. Mask R-CNN could also be integrated into the encoding part of the network to achieve better multi-object learning results.


7 APPENDICES

More experimental results


8 BIBLIOGRAPHY

1) CityEngine Official Website. https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A unified approach for single and multi-view 3D object reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D shape reconstruction via 2.5D sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D convolutional neural network for real-time object class recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A point set generation network for 3D object reconstruction from a single image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D mesh models from single RGB images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D mesh renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI. https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page. https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681–690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing efficient ConvNet descriptor pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) YOLO: https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520

55

Page 42: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

better in the shape loss In tab6 we can also see the constraint of shape corner angle position in the different ablations of loss functions

tab6 spatial building predictions of ablation test

Besides the loss plotting we plot the intermediate latent space with respect to two inputs (building images and parcel geometries) and their output building geometries The latent space plots (fig22 to fig25) are shown the distribution of the DC1DC2 layer ( ) after T-SNE dimension reductions representing the learning capacities of reference images and geometries that cannot simply be determined by values The isolated dataset still shows the best distribution visually

fig22 T-SNE mapping of reference images encoded features (DC1)

42

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 43: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

fig23 T-SNE mapping of parcel geometries encoded features (DC2)

fig24 T-SNE mapping of reference images encoded features

isolated dataset (DC1)

43

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 44: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

fig25 T-SNE mapping of reference images encoded features

2D top-view dataset (DC1) As a design-assist machine learning pipeline it is crucial to manipulate the learning network to match various design ideas In fig26 to fig29 we can see the capacity of the network and how it interpolates to create novel outputs All interpolations are calculated after encoded as latent feature vectors (1000 for reference images 300 for parcel geometries or parcel properties) In the building interpolation we can see how the prediction changes from one building to two In the parcel interpolation we can find the intermediate prediction of two shapes of parcels The land-use predictions show more buildings in single-family residential predictions and larger single buildings in the mixed-use parcel Also the smaller parcel predicts also lower and more decentralized buildings comparing to the dense and taller ones in the larger parcel

44

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 45: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

fig26 Test results of difference building typologies

45

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine Official Website httpswwwesricomen-usarcgisproductsesri-cityengineoverview 2) Choy Christopher B et al 3d-r2n2 A unified approach for single and multi-view 3d object

reconstruction European conference on computer vision Springer Cham 2016 3) Wu Jiajun et al Learning a probabilistic latent space of object shapes via 3d

generative-adversarial modeling Advances in neural information processing systems 2016 4) Wu Jiajun et al Marrnet 3d shape reconstruction via 25 d sketches Advances in neural

information processing systems 2017 5) Garcia-Garcia Alberto et al Pointnet A 3d convolutional neural network for real-time

object class recognition 2016 International Joint Conference on Neural Networks (IJCNN) IEEE 2016

6) Fan Haoqiang Hao Su and Leonidas J Guibas A point set generation network for 3d object reconstruction from a single image Proceedings of the IEEE conference on computer vision and pattern recognition 2017

7) Wang Nanyang et al Pixel2mesh Generating 3d mesh models from single rgb images Proceedings of the European Conference on Computer Vision (ECCV) 2018

8) Kato Hiroharu Yoshitaka Ushiku and Tatsuya Harada Neural 3d mesh renderer Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018

9) Zhang Yan ldquoCityMatrix An Urban Decision Support System Augmented by Artificial Intelligencerdquo hdlhandlenet1721190205

10) Geoplanner for ArcGIS produced by ESRI httpsdocarcgiscomengeoplanner 11) Google docs product about pagehttpswwwgooglecomdocsabout 12) Vanegas Carlos A et al ldquoProcedural Generation of Parcels in Urban Modelingrdquo Computer

Graphics Forum vol 31 no 2pt3 2012 pp 681ndash690 doi101111j1467-8659201203047x 13) Girdhar et al ldquoLearning a Predictable and Generative Vector Representation for Objectsrdquo

2016 arXiv160308637v2 14) He K et al Mask r-cnnrdquo arxiv preprint arxiv 170306870 (2017) 15) Ren Shaoqing et al Faster r-cnn Towards real-time object detection with region proposal

networks Advances in neural information processing systems 2015 16) Iandola Forrest et al Densenet Implementing efficient convnet descriptor pyramids

arXiv preprint arXiv14041869 (2014) 17) httpspjreddiecomdarknetyolo 18) httpsgithubcomjremillardimages-to-osm 19) httpsmediumcomgeoaireconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb307

9c0 20) Wang et al ldquoProjective Analysis for 3D Shape Segmentationrdquo ACM Transactions on

Graphics (Proceedings of SIGGRAPH Asia 2013) 21) Gao L Yang J Wu T Yuan Y Fu H Lai Y Zhang H (2019) ldquoSDM-NET Deep

Generative Network for Structured Deformable Meshrdquo httpsarxivorgabs190804520

55

Page 46: SYNTHESIZING 3D MORPHOLOGY FROM A COLLECTION OF URBAN …

fig27 Interpolating shapes of a parcel

fig28 Test results of different land-use of a parcel

46

fig29 Test results of varying the area of a parcel

After testing on the cropped image dataset we also compare the results trained from the isolated image dataset and top view dataset (see in fig30 and fig31) By using the same network and setting we can get better heights in the isolated dataset and better sizethe number of buildings in the 2D top view dataset

fig30 Test results of the isolated image dataset

47

fig31 Test results of the 2D top-view image dataset

48

6 DELIVERABLE AND CONTRIBUTION 61 Deliverable After training and optimizing the final solution can serve two functions integrating rule-based approaches in the decision-making process of urban development a) create a customized ldquorulerdquo intuitively and b) apply a ldquorulerdquo and achieve a GeoJSON-like csv file including the information of 3D urban morphology After selecting the parcel dataset building footprint dataset and the information alias names (eg height height limit FAR etc) users can run a new training and validating process on their computers By using the pre-trained model provided from my study or a customized one users can input both geospatial geometries (eg one or multiple parcels) and image references of building typologies achieving the geospatial building geometries with high information as a csv file This csv file can be converted losslessly to other GeoJSON-like formats (eg shapefile topojson geojson) for most GIS platforms By simply reading the z-value (extrusion height) a 3D urban morphology can be visualized in most platforms In other words (fig32) the network predicts a translation matrix of weights learned from the training set generating building footprint geometries from a parcel geometry as shape grammar does separately

fig32 How weights from image work to generate geometries

as rule-based pipeline does

49

62 Contribution Comparing to current rule-based approaches our approach keeps the advantage of the data archive and improves the process of making a rule (translating building typology to a script) Because of that this thesis contributes to the four corresponding shortages of current rule-based approaches stated in the introduction section creativity data-driven capacity first-time cost and skill requirements Other contributions are also achieved from the study

621 Creativity Since rule-based approaches only provide limited built-in functions for generating building footprints from a single geometry users can only translate limited design language to a script as building typology Our approach is an option allowing users to translate any strategy into a shape grammar via image references Meanwhile novel or complex building typologies can be translated as rules liberating the possibility of creating a unique urban morphology 622 Data-driven capacity Leveraging multiple sources of data to generating geometries is hard for rule-based approaches especially when involving a larger number of data categories together (eg FAR coverage height limit) Our approach does not require decision trees to handle and cooperate with multiple sources of data directly producing a matching rule from given data to outcome form Users can thus enroll any data into the agenda once they have a training dataset linking these properties to building typologies 623 First-time time-consumption As a part of the discussion above the first-time cost of translating a building typology (either precedent from image references or new ideas from designersrsquo minds) is a pain of the rule-based system It reduces the speed advantage of rule-based approaches when applying building typology onto a site Our approach breaks this limit and brings a better experience to users from the beginning of the visualizing urban morphologies 624 Skill requirement As writing a rule requires both programming and urban design skill very few users can manipulate a rule-based system A bigger opportunity provided by our approach delivers to every decision-maker They can simply tell their favorite scenarios through image references collected online or choose a suggestion from the ones provided by professional urban designers

50

625 Other contributions The application of our approach can also bring rapid decision-making iterations in urban planning or design It is capable to visualize decisions strategies ideas on a site without a gap period in the manual urban design or rule-based system It not only saves time but also makes decision-making more fluent by an instant comparison of two strategies or an accurate translation of decisions from clientsusers Meanwhile our approach serves the deliverable features via open-source python scripts of QGIS Blender and Tensorflow (Keras) without the dependency of commercial GIS platforms like ESRI ArcGIS or CityEngine This reduces the cost of licenses and attempts Also this study compares the difference between the existing 3D reconstruction approaches with respect to urban design usage and creates a specific approach for urban design tasks The exploration of data structure and data augmentation can provide experimental helps to other relative studies The final solution saves a lot of computational energy when we compare the batch size (the number of parcels calculated per iteration) On one TITAN RTX GPU with 24GB graphics memory 2 for voxel pipeline 3 for mesh one and 128 for ours This allows for more improving space in future works 63 Future works After the fundamental pipeline built in this study we can imagine the future steps for improving the relevant AI-empowered design-assist pipelines Firstly the input rendering images can be augmented to more styles like photorealistic pictures wireframe line drawing or vivid scenes with landscapes By involving more scripts of shape grammar rule-based systems can create mass training datasets with various styles feeding these future elements Secondly the machine learning pipeline can also learn the roof and facade types (curtain or brick wall) as a part of building properties (see the roof type learning project [19]) After recognizing types from image references the visualization can follow the shape grammar of the existing rule-based system to generate realistic buildings rather than building envelopes Last but not least a hierarchy learning system may potentially enhance the pipeline in this study (like the general approach [21]) It can contribute to predicting the relationship between buildings and auxiliaries which means the possibility of predicting a detailed building output The Mask R-CNN can also be integrated in the encoding part of the network to achieve a better multi-objects learning result

51

7 APPENDICES More experimental results

52

53

54

8 BIBLIOGRAPHY

1) CityEngine official website: https://www.esri.com/en-us/arcgis/products/esri-cityengine/overview
2) Choy, Christopher B., et al. "3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction." European Conference on Computer Vision. Springer, Cham, 2016.
3) Wu, Jiajun, et al. "Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling." Advances in Neural Information Processing Systems. 2016.
4) Wu, Jiajun, et al. "MarrNet: 3D Shape Reconstruction via 2.5D Sketches." Advances in Neural Information Processing Systems. 2017.
5) Garcia-Garcia, Alberto, et al. "PointNet: A 3D Convolutional Neural Network for Real-Time Object Class Recognition." 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016.
6) Fan, Haoqiang, Hao Su, and Leonidas J. Guibas. "A Point Set Generation Network for 3D Object Reconstruction from a Single Image." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
7) Wang, Nanyang, et al. "Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
8) Kato, Hiroharu, Yoshitaka Ushiku, and Tatsuya Harada. "Neural 3D Mesh Renderer." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
9) Zhang, Yan. "CityMatrix: An Urban Decision Support System Augmented by Artificial Intelligence." hdl.handle.net/1721.1/90205
10) GeoPlanner for ArcGIS, produced by ESRI: https://doc.arcgis.com/en/geoplanner/
11) Google Docs product about page: https://www.google.com/docs/about/
12) Vanegas, Carlos A., et al. "Procedural Generation of Parcels in Urban Modeling." Computer Graphics Forum, vol. 31, no. 2pt3, 2012, pp. 681-690. doi:10.1111/j.1467-8659.2012.03047.x
13) Girdhar, et al. "Learning a Predictable and Generative Vector Representation for Objects." 2016. arXiv:1603.08637v2
14) He, K., et al. "Mask R-CNN." arXiv preprint arXiv:1703.06870 (2017).
15) Ren, Shaoqing, et al. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." Advances in Neural Information Processing Systems. 2015.
16) Iandola, Forrest, et al. "DenseNet: Implementing Efficient ConvNet Descriptor Pyramids." arXiv preprint arXiv:1404.1869 (2014).
17) https://pjreddie.com/darknet/yolo/
18) https://github.com/jremillard/images-to-osm
19) https://medium.com/geoai/reconstructing-3d-buildings-from-aerial-lidar-with-ai-details-6a81cb3079c0
20) Wang, et al. "Projective Analysis for 3D Shape Segmentation." ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2013).
21) Gao, L., Yang, J., Wu, T., Yuan, Y., Fu, H., Lai, Y., Zhang, H. (2019). "SDM-NET: Deep Generative Network for Structured Deformable Mesh." https://arxiv.org/abs/1908.04520
