
Predicting Future Funding Rounds using Graph Neural Networks

submitted in partial fulfillment for the degree of master of science

Carlo Harprecht (13060775)

master information studies, data science

faculty of science, university of amsterdam

2021-07-01

Internal Supervisor: Niklas HΓΆpner, UvA - AMLab, [email protected]
External Supervisor: Deep Kayal, Prosus, [email protected]


Predicting Future Funding Rounds using Graph Neural Networks

Carlo Harprecht

Universiteit van Amsterdam
Amsterdam, Netherlands

[email protected]

ABSTRACT
We investigate the applicability of graph neural networks (GNN) to future funding round prediction for startups. We compare two state-of-the-art GNN architectures, namely R-GCN and HGT, against random forest baseline models with respect to their predictive performance. We find that end-to-end GNN models are not able to significantly outperform baseline models with aggregated graph information. However, relational features improve all models' predictive capabilities. As recent work proposes that end-to-end GNNs are not able to learn non-linear manifolds and instead perform feature de-noising, we create a pre-training GNN architecture. We create node embeddings by leveraging GNNs with a self-supervised objective based on link prediction and then pass these embeddings to a downstream model. This pre-training architecture significantly improves all models' predictive performance. In our experiments, a downstream random forest model was best able to predict future funding rounds of startups, with an $F_{0.5}$ of 70.5. Investment networks additionally include valuable edge features that most existing GNN architectures cannot process. We therefore modify the existing R-GCN architecture to leverage edge features, resulting in an architecture we call "Edge-GCN". This architecture allows all downstream models to access edge features and subsequently improves the performance of the final models.

KEYWORDS
Technology Startups, Company Success Prediction, Future Funding Round Prediction, Machine Learning, Random Forest, Deep Learning, Graph Neural Networks, Graph Convolutional Networks, Relational-GCN, Heterogeneous Graph Transformer

1 INTRODUCTION
The US alone is home to an estimated 171'000 technology startups [5, 21, 48]. Venture Capital (VC) investors play a crucial role in a startup's survival and success [16, 20, 48]. They have to be able to accurately assess startups for their investment decisions. This task is even more challenging for technology-focused companies, whose products require expert knowledge to understand. However, assessing only the 171'000 US tech startups, let alone those of all remaining countries, is intractable for individuals: completing the task within one year would require assessing 20 startups an hour, 24 hours a day, only to start over the next year with new information and new startups. Consequently, at least partly automating the assessment of startup companies offers VC investors the possibility to discover investment opportunities that otherwise would have been missed.


With these clear benefits in mind, the problem of startup success prediction has recently attracted attention from the machine learning community [3, 9, 20, 37, 39, 42].

Next to classical company information, such as financial data, startups are also part of a network of investors and individuals. Previous studies have been able to show that a startup's social network helps attract future rounds of funding [4, 37, 39, 51, 52]. Investors tend to fund startups that they already have relations to [51]. Most of these studies focused on forecasting future investments by predicting links in the investment network [37, 51, 52] using traditional graph learning methods, such as random walks [37].

A startup's social network can naturally be modeled as a graph, where investors, companies, and individuals form nodes and are connected via edges representing relations such as "funded by" or "employed at". Recent advances in the field of graph learning have enabled deep neural networks to be applied to graph data, leading to a class of models called Graph Neural Networks (GNN) [18, 43, 44]. Having shown promising performance for tasks such as node classification and link prediction on datasets from the fields of citation networks and molecular biology [18, 44], it is natural to ask whether GNNs can also be applied to the problem of startup success prediction.

Investment networks differ from other networks in that their edge features hold valuable additional information, such as investment date or size (USD). Most existing GNN architectures focus on leveraging the general graph structure as well as node features [18, 43, 44]. Some recent architectures can account for different relation types [23, 38], but very few can leverage multi-dimensional edge features. We offer a simple extension of the existing R-GCN and GAT models that can be applied to multi-dimensional edge features. Additionally, there has been an increased interest in the capability of GNNs to approximate non-linear functions [33], with work showcasing that simpler architectures can achieve similar results to state-of-the-art models [47]. Therefore, we investigate the ability of graph neural networks to work as classifiers and their potential to generate de-noised feature vectors via self-supervised pre-training.

This thesis is concerned with the task of future funding round prediction for startup companies using GNNs, based on a dataset combining classical features with a startup's position in an investment network. We add value to the fields of startup success prediction and neural graph learning in four ways. Firstly, we further investigate the value of various relational graph features for startup success prediction. Secondly, we assess the prediction accuracy of recent graph neural network architectures compared to random forest baseline models. Thirdly, we adapt a GNN architecture to work with multi-dimensional edge features and show empirical evidence for its competitive performance. Lastly, we show that the predictive performance of all investigated models can be improved by passing them pre-trained embeddings from a GNN trained in a self-supervised manner on the task of link prediction.


Furthermore, we pay close attention to potential data leakage, an often underestimated problem in company success prediction [53].

The remainder of this thesis is set up in five parts. In section 2, we give an overview of related work in the field of company success prediction and background knowledge on graph learning and specifically GNNs. We also review existing studies of company success prediction using graph learning. In section 3, we motivate our task, introduce the methodology, and explain the applied techniques and models. Here, we also introduce our model architecture, the Edge-GCN (E-GCN). Afterward, we lay out the experiments in section 4. In section 5, the results of the experiments are shown and discussed. Finally, in section 6, we conclude the thesis and discuss directions for future research.

2 BACKGROUND AND RELATED WORK
2.1 Startup Success and Survival Prediction
Traditionally, the focus of startup success prediction research has been on leveraging company features such as the founding team, founding date, past funding rounds, and financial KPIs [9, 27, 42]. These features, often obtained from publicly available data sources such as Crunchbase [20, 42, 53], are used to predict either the success of startup companies or their survival (e.g., not going bankrupt). Consequently, binary classification is the predominant task.

As company success is inherently subjective, typical startup success proxies include future funding rounds, initial public offerings, mergers and acquisitions, revenue growth, or combinations thereof [1, 4, 9, 27, 53]. Bankruptcy prediction tries to avoid this subjectivity by predicting venture existence in the coming years [2, 3, 10, 34]. Tree-based models, such as Random Forest and Gradient Boosted Trees, consistently outperform other methods for both company success and bankruptcy prediction tasks [1, 2, 34, 42, 53].

Depending on the task, available data, and processing, results vary considerably. Since the goal is often to identify successful companies, precision-based metrics are frequently chosen [2, 20, 52, 53]. Precision scores ranging from around 50% [20, 53] up to around 80% [2] have been achieved.

2.2 Neural Graph Learning
In general, a graph $G = (V, E)$ is described as a set of nodes $V$ and a set of edges $E$ that connect these nodes [17]. In the field of company success prediction, we often deal with multi-dimensional graphs that have different types of relations $r$. Therefore, we denote an edge of type $r$ that connects nodes $u$ and $v$ as $(u, r, v) \in E$. In the case that $(u, r_i, v), (v, r_i, u) \in E$, the relation $r_i$ is said to be undirected. If all relations in a graph $G$ are undirected, $G$ is referred to as an undirected graph. A special type of multi-dimensional graph that includes different relation types and different node types is called a heterogeneous graph. An example of a triple describing a semantic relation in a heterogeneous graph of an investment network is the relation "funds" between the two node types "investor" and "company", denoted as ("investor", "funds", "company"). Additionally, a graph can be equipped with node features as well as edge features. Typical tasks in the field of machine learning on graphs include node classification (predicting a property of a specific node), link prediction (predicting the existence of a specific relation between two nodes), and clustering [17].

Recently, new advances in the field of graph learning have been made by applying deep learning approaches to graph-structured data [18, 29, 43]. These GNNs compute node embeddings by iteratively passing messages between nodes along the edges of a graph, followed by an update of the node embeddings via an aggregation of the incoming messages [29]. Node embeddings can be trained in an unsupervised manner by predicting whether two nodes are neighbors (e.g., link prediction) from their respective node embeddings [18], or trained for a specific task using a supervised objective (e.g., node classification) [29]. Equation 1 shows the general idea of first extracting all neighboring node embeddings $h_j \;\forall j \in N(i)$, including the target node's embedding $h_i$, and aggregating these embeddings to obtain the new target node embedding $h_i^{(l+1)}$ [23].

$$h_i^{(l+1)} \leftarrow \underset{\forall j \in N(i),\; \forall e \in E(j,i)}{\mathrm{Aggregate}} \Big(\mathrm{Extract}\big(H^{l}[j];\, H^{l}[i],\, e\big)\Big) \quad (1)$$

State-of-the-art performance has been achieved on benchmark datasets for several link prediction and node classification tasks using GNN models [18, 29, 43]. Graph Convolutional Networks (GCN), adapted from the field of Computer Vision (CV) to work on graph-structured data [29], are frequently used. As opposed to the Convolutional Neural Networks (CNN) known from CV, these models use the same weight matrix for all neighboring nodes instead of a grid (kernel) of weight vectors.

2.2.1 Relational-GCN. A special architecture of GCN models, called Relational-GCN (R-GCN), is able to model different types of relations [38]. This is particularly useful for heterogeneous graphs, where different types of relations often should not be modeled in the same way, as they represent differing concepts. The embedding of a node given its neighbors is therefore computed by aggregating all neighbors' embeddings per relation type:

$$h_i^{(l+1)} = \sigma\Big(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}}\, W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\Big) \quad (2)$$

Equation 2 differs from the original GCN implementation [29] in that the model can distinguish between different relations by learning a different weight matrix per relation $r \in R$ [38]. It also directly adds the current node's embedding $h_i^{(l)}$ to the new node embedding after transforming it with a separate weight matrix $W_0^{(l)}$. The $\frac{1}{c_{i,r}}$ term is used to normalize scores for different node degrees, where $c_{i,r} = |N_i^r|$, and $\sigma$ describes a non-linear activation function that adds non-linearity to the model.
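As an illustration of equation 2, the sketch below implements a single R-GCN layer in PyTorch using dense per-relation adjacency matrices; it mirrors the formula only and is not the DGL implementation used later in this thesis.

```python
# Minimal PyTorch sketch of the R-GCN update in Equation 2, assuming dense
# per-relation adjacency matrices A[r] of shape (num_nodes, num_nodes).
import torch
import torch.nn as nn

class ToyRGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_rels):
        super().__init__()
        self.w_rel = nn.ModuleList(nn.Linear(in_dim, out_dim, bias=False)
                                   for _ in range(num_rels))
        self.w_self = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj_per_rel):
        out = self.w_self(h)                                # W_0 h_i (self-loop term)
        for A, w in zip(adj_per_rel, self.w_rel):
            deg = A.sum(dim=1, keepdim=True).clamp(min=1)   # c_{i,r} = |N_i^r|
            out = out + (A @ w(h)) / deg                    # sum_j W_r h_j / c_{i,r}
        return torch.relu(out)                              # sigma

h = torch.randn(5, 8)                                       # 5 nodes, 8 features
adj = [torch.bernoulli(torch.full((5, 5), 0.3)) for _ in range(2)]
print(ToyRGCNLayer(8, 16, num_rels=2)(h, adj).shape)        # torch.Size([5, 16])
```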

2.2.2 Graph Attention Networks. Another advanced GNN architecture is the Graph Attention Network (GAT) [44] and its extension, the Heterogeneous Graph Transformer (HGT) [23]. GAT makes use of the attention mechanism, popular in deep learning models for natural language processing [23, 44]. The $\alpha_{ij}$ in equation 3 represents the learned attention score between target node $i$ and source node $j$ [25, 44]. The HGT extends this formula by adapting it to heterogeneous graphs, for example by learning a distinct set of projection weights per relation [23]. A detailed overview of the HGT architecture can be found in figure 6 in the appendix.


The HGT model can learn the relative importance of specific node features and incoming messages. The model outperforms other GNN architectures on benchmark datasets (namely the open academic graph and the computer science as well as medicine academic graphs) [23].

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in N_i} \alpha_{ij}^{(l)}\, W^{(l)} h_j^{(l)}\Big) \quad (3)$$

2.2.3 Shortcomings of GNNs. Due to their recent introduction, GNN models have mostly been tested on a few benchmark datasets [40]. However, some studies apply recent GNN models to other datasets and compare them to baseline models, often tree-based architectures [12, 13, 25]. These comparisons highlight potential shortcomings of the GNN models. The GNNs are often not able to outperform the baselines [13, 14, 25] with respect to their predictive quality, return mixed findings depending on the setting [12], or only slightly improve over the baseline models [45]. For example, in a very recent study in the field of drug discovery, the authors are unable to outperform a random forest baseline model using advanced GNN models such as GCN and GAT [25]. Depending on the setting, the baselines even outperform the more advanced GCN models.

In another study, the authors claim that GNN models in general do not have the non-linear manifold learning property [33]. This leads to poor performance when solely stacking GNN layers to create the graph learning model, resulting in overfitting. They also identify the models' ability to perform low-pass filtering to de-noise features. Additionally, the authors find evidence that GNNs perform poorly in situations with a non-linear feature space and advocate for viewing GNN layers as a de-noising mechanism.

2.3 Startup Success Prediction using Graph Learning

In an early approach, researchers were able to predict links between startups and investors using relational data [51]. They aggregate graph information with techniques such as shortest-path features and use a support vector machine to show that investors tend to invest in companies they already have relations to. Other approaches, such as preferential attachment or random walks, have also been explored [37, 52].

Using tree-based models, such as Random Forest, correlations between a company's relations to investors and its success can be found [35, 39]. Most research in the area of predicting startup success so far uses data obtained from Crunchbase and has been able to show benefits of using graph learning techniques [14, 37, 39, 51, 52]. Other data sources have rarely been explored. The application of advanced text processing methods to enrich node features, as well as the use of an enriched investment network, has been suggested as an important avenue for future work [37].

Due to the recent advances in graph learning described in section 2.2, there have been some experiments using GNNs for funding round prediction [14]. The authors combine classical company features such as founding date, funding rounds, investor information, and graph centrality measures with a simple graph relating startups and investors. Using the GCN architecture [29], the authors are not able to outperform a random forest baseline model, with an $F_1$ score of 0.63 for both models (AUC: 0.63) [14]. This is in line with the studies discussed in section 2.2.3. However, they do report higher precision values for the GNN model than for the random forest model. Their GCN model achieves better results for seed funding rounds than for later ones, as the authors argue that a startup's network is more important for earlier funding rounds.

3 METHODOLOGY
3.1 Motivation of Task
As a first step, company success prediction calls for defining the success of a startup. In general, this is a very subjective measure [1, 53]. However, the occurrence of a funding round is an objective, well-documented event that, in theory, should be related to the success of a startup business [14, 39, 53]. A funding round shows strong interest by an informed agent and is, therefore, a powerful indicator of successful business operations [39]. Moreover, the ability to predict funding rounds is valuable by itself, regardless of its correlation with subjective success: investors are thereby able to identify trending companies and areas of operations early. As startups need around 12-18 months on average to raise new capital [8], we consider a prediction time frame of 24 months.

3.2 Exploratory Data Analysis
The data used for this project contains information on approximately 12'000 technology-focused startups, as defined manually by our proprietary data provider (PDP). Furthermore, the data contains information on around 18'000 investors and 200'000 individuals. However, not all of these investors and individuals are connected to these technology startups.

The data is obtained from our PDP, as opposed to previous work that largely relies on freely accessible Crunchbase data (see 2.3). An overview of the data and features used in this project is given in table 2 in the appendix.

The headquarters of most technology startups, as well as investors, are located in the US (38% and 32%, respectively), followed by China (14%, 8%) and the UK (7%, 5%). The companies were founded between 1980 and 2021, with most of the tech startups founded between 2015 and 2018 (57%). Most investments into the startups were recorded in 2019 (10'508, 21%), but the highest total amount of money was invested in 2018, with 57.1 B USD. The average investment round size for the tech startups in the data is 3.9 M USD (median: 1.1 M USD), and the mean number of days between two investment rounds per company is 302 days (standard deviation: 226 days, median: 262 days). This high variance further supports our choice of a 24-month prediction time frame.

3.3 Data pre-processing
Features that contain more than 50% missing values are excluded from the analysis. This, for example, concerns information on financial KPIs for startups. Companies founded before 1990 or companies that have already received investments exceeding 1 B USD are excluded from the analysis, as these do not fit the focus of this work: companies this old or this highly financed should no longer be considered startups and are not the kind of company that typically appeals to VC investors.


Missing values are filled using standard imputation techniques designed with expert knowledge. For example, missing deal sizes are filled using the median deal size per investor, missing patent information with 0's, and missing information in the company reviews with a dummy value of -1.

We remove all time-sensitive information that does not have a timestamp attached, as the impact of potential data leakage cannot be assessed. For example, if we use typical features such as the "financing stage" of a startup without knowing the date this status was last changed, we potentially give the model information it would not have in a real-world scenario. Additionally, company databases often only include companies with previous investment rounds. This can also lead to data leakage, as the probability of a future funding round for a company with no deal in the training time frame (see 3.4.1) would be very high. The models could easily exploit this pattern to inflate their performance artificially. Consequently, companies without a deal in the training data are excluded from the dataset.

Finally, categorical variables are transformed into numerical ones using one-hot encoding. All features are standardized to have a mean of 0 and a standard deviation of 1.

3.4 Data aggregation and combination
Some of the available data has to be aggregated to be accessible to the models. For example, the time-series data has to be encoded in tabular form to be used in the baseline as well as the GNN models (these do not have a recurrent network structure). The same holds for the relational data for the baseline models and for the text data. Afterward, we combine the data from the multiple data sources into one dataset for the modeling part. A complete overview of the final variables can be found in table 2 in the appendix.

3.4.1 Test date. To predict future funding rounds, we choose a test date. This date splits all data into training features, which are available to the models during training, and label data. We pretend the current date is equal to the test date, so all data beyond this date is unknown to us and not available to the models. The label data is needed to create the labels in section 3.4.2, which indicate whether a company attracted a funding round after the test date. To obtain predictions for the next 24 months, the test date is set to 01-02-2019.

3.4.2 Target label. We create the labels for the supervised learning tasks based on the label data. All investments into the technology startups after the test date are transformed into binary labels, indicating whether a given company received an investment or not (see equation 4, where $d_i$ describes the deals of company $i$ after the test date). As a result, we obtain labels describing whether a given company obtained funding in the future from the perspective of the test date. This allows us to train a model to predict future funding rounds.

$$y_i = \begin{cases} 0, & \text{if } \mathrm{COUNT}(d_i) = 0 \\ 1, & \text{if } \mathrm{COUNT}(d_i) > 0 \end{cases} \quad (4)$$
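A minimal sketch of this label construction, assuming a hypothetical deals table with a company identifier and a deal date; the column names and the interpretation of the test date as 1 February 2019 are illustrative, not the actual schema of our data provider.

```python
# Minimal sketch of the label construction in Equation 4 on illustrative data.
import pandas as pd

TEST_DATE = pd.Timestamp("2019-02-01")   # test date from section 3.4.1 (assumed format)

deals = pd.DataFrame({
    "company_id": [1, 1, 2, 3],
    "deal_date": pd.to_datetime(["2018-05-01", "2019-06-15",
                                 "2020-01-10", "2018-11-30"]),
})

# Count deals strictly after the test date per company, then binarize.
label_deals = deals[deals["deal_date"] > TEST_DATE]
counts = label_deals.groupby("company_id").size()
companies = deals["company_id"].unique()
y = pd.Series(companies, index=companies).map(counts).fillna(0).gt(0).astype(int)
print(y.to_dict())   # {1: 1, 2: 1, 3: 0}
```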

3.4.3 Textual Data. Additionally, we cluster companies into 60 groups based on their company descriptions using GPT-3 [7]. First, we manually annotate around 20 company descriptions with keywords. Next, we apply the few-shot capabilities of GPT-3, inputting this handful of examples to extract meaningful keywords from all other company descriptions: the model is fed the full description and predicts n keywords for each company. Using GPT-3 Semantic Search, we compute pairwise semantic similarity scores between the 200 most frequent keywords. Finally, we cluster this similarity matrix using agglomerative hierarchical clustering from Scikit-Learn [36], resulting in 60 clusters of keywords. We assign companies to keyword clusters, where a company can be part of multiple clusters.
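The clustering step can be sketched as follows, with random values standing in for the GPT-3 semantic similarity scores between the 200 keywords; the linkage choice is an illustrative assumption.

```python
# Minimal sketch of clustering a keyword similarity matrix into 60 groups with
# agglomerative hierarchical clustering from Scikit-Learn.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
sim = rng.random((200, 200))
sim = (sim + sim.T) / 2                 # symmetrize the similarity matrix
np.fill_diagonal(sim, 1.0)

dist = 1.0 - sim                        # convert similarity to distance
clustering = AgglomerativeClustering(
    n_clusters=60,
    metric="precomputed",               # named "affinity" in older scikit-learn versions
    linkage="average",
)
keyword_cluster = clustering.fit_predict(dist)
print(keyword_cluster[:10])             # cluster id per keyword
```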

3.4.4 Time-series Data. The time-series data is aggregated (mean) on a monthly (website traffic) or yearly (employee counts) basis. All data after the test date is discarded. The remaining data is added as one feature column per period to the tabular data per company.

3.4.5 Relational Data. We aggregate the funding round information over time by creating various simple aggregation features. We create features describing the most recent funding event of a startup ("days since last round", "last round's size (in M USD)"), the same information for the first funding round ("days since first round", "first round's size (in M USD)"), the total number of funding rounds, and the total amount of funding a startup has acquired.

Further advanced graph information is added using the eigenvector centrality of the companies and of their main investor, defined as $e_i = \frac{1}{\lambda} \sum_{j \in J} A[i, j]\, e_j$ for all nodes $i$, where $\lambda$ is a constant, $A$ is the adjacency matrix, and $J$ describes all neighbors of node $i$ [17]. Eigenvector centrality differs from the simple node degree in that it also considers the importance of a node's neighbors. An intuitive view is that it captures the likelihood that a node is visited on a random walk of infinite length in the graph [17]. Also included are the number of board seat holders, the headquarters location of the main investor, and the number of similar competitors.
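As a small illustration, eigenvector centrality can be computed with NetworkX on a toy investor-company graph; the node names are purely illustrative.

```python
# Minimal sketch of the eigenvector-centrality feature from section 3.4.5,
# computed with NetworkX on a toy undirected investment graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("investor_A", "company_1"),
    ("investor_A", "company_2"),
    ("investor_B", "company_2"),
    ("company_1", "company_2"),   # "similar to" relation between two companies
])

centrality = nx.eigenvector_centrality(G)
print(sorted(centrality.items(), key=lambda kv: -kv[1]))
```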

3.4.6 Data Split. The pre-processing and cleaning steps result in around 8'000 companies. Finally, we split these randomly into three parts: training, validation, and test data (80%, 10%, 10%). The sparsity of target labels across the three sets is 0.541, 0.521, and 0.548, respectively.

3.5 Baseline Models
3.5.1 Simple Baseline (B0). A simple baseline just predicts the most common class (0) for all examples. This model always achieves an accuracy exactly equal to the sparsity scores described in section 3.4.6 (54.8% on the test data).

3.5.2 Random Forest Baseline 1 (RF B1). The first baseline model is a random forest model from the Python library Scikit-Learn [6, 36]. We choose a random forest model as it makes few prior assumptions about the data [6], performed well in previous work [9, 20], and also performs best on our dataset when compared to other models (Logistic Regression, Decision Tree, AdaBoost, Gradient Boosted Machine). We train it on the tabular company features as well as the aggregated time-series and text features. A detailed overview of the features available to each model can be found in table 2 in the appendix. Additionally, it uses the simple aggregated investment features from section 3.4.5. Notice that these features already contain some aggregated information about a company's neighborhood in the investment network. The model's hyperparameters are tuned using grid search and cross-validation on the training data.


Finally, we choose the best model based on its validation $F_{0.5}$ score (defined in 3.8) and test its final performance on the test data.
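The baseline training procedure can be sketched as follows, with a synthetic feature matrix and an illustrative parameter grid standing in for the actual data and search space; the $F_{0.5}$ scorer matches the selection criterion described above.

```python
# Minimal sketch of a random forest baseline tuned with grid search and
# cross-validation against an F0.5 scorer (data and grid are placeholders).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

f05 = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [10, 30]},
    scoring=f05,
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, f05(grid.best_estimator_, X_val, y_val))
```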

3.5.3 Random Forest Baseline 2 (RF B2). To assess the importance of relational features for the model's predictive capabilities, we also create a second baseline model. The setup is identical to that of the first baseline model. However, the second random forest classifier can additionally access the advanced aggregated graph features. Therefore, we can compare the influence of these features under otherwise unchanged conditions.

3.6 Graph Construction
All graph operations are performed using the Python graph library DGL [46]. We create two different undirected, heterogeneous graphs.

The first graph contains the three main node types from the relational datasets, "company", "investor", and "person" (also called "individual"), as displayed in the meta graph illustration in figure 1. These primary nodes are colored dark blue. We also add their corresponding node features. The loop on the company nodes reflects the similarity relation between two companies.

Our data also includes valuable edge features, such as the investment date and investment size (USD). However, most of the established GNN architectures are not able to leverage edge features [38, 43, 44]. Therefore, the second graph additionally contains nodes representing edge relations. These artificial nodes (light blue) hold the edge features. This adds further complexity to the graph but allows edge features to be used directly in the models without altering their architectural design. All edges between these artificial nodes and the original dark blue nodes are undirected as well.

Figure 1: Meta graph
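A minimal sketch of how such a heterogeneous graph can be constructed in DGL, with tiny illustrative edge lists and placeholder feature dimensions; relation names follow figure 1 rather than the exact schema of our dataset, and the reverse relations make the graph effectively undirected.

```python
# Minimal sketch of a heterogeneous DGL graph with "company", "investor",
# and "person" node types (toy data only).
import dgl
import torch

graph_data = {
    ("investor", "funds", "company"): (torch.tensor([0, 1]), torch.tensor([0, 1])),
    ("company", "funded_by", "investor"): (torch.tensor([0, 1]), torch.tensor([0, 1])),
    ("person", "employed_at", "company"): (torch.tensor([0]), torch.tensor([1])),
    ("company", "employs", "person"): (torch.tensor([1]), torch.tensor([0])),
    ("company", "similar_to", "company"): (torch.tensor([0]), torch.tensor([1])),
}
g = dgl.heterograph(graph_data)

# Attach node features per node type (feature sizes are placeholders).
g.nodes["company"].data["feat"] = torch.randn(g.num_nodes("company"), 16)
g.nodes["investor"].data["feat"] = torch.randn(g.num_nodes("investor"), 8)
g.nodes["person"].data["feat"] = torch.randn(g.num_nodes("person"), 4)
print(g)
```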

3.7 Graph Neural Networks
We implement the GNN models as described in section 2.2, following the DGL implementations [46]. We add a dense, fully connected linear input layer per node type to allow for differing feature input sizes. Further adjustments include variable hidden layer sizes and a dense linear output layer. The R-GCN model uses the leaky ReLU activation function [49], whereas the HGT model uses the GELU activation [19]. The models are trained for 200 epochs using the AdamW optimizer [28, 30], a learning rate scheduler [41] (max learning rate = 0.001), and early stopping (patience = 20 epochs).

3.7.1 Edge-GCN. Additionally, we create our own model architecture, based on ideas from the R-GCN (2.2.1) and the GAT (2.2.2) and also related to the EGNN [15]. The main difference is the model's ability to directly leverage multi-dimensional edge features without transforming them into artificial nodes (see 3.6). We therefore call the architecture defined in equation 5 Edge-GCN (E-GCN). Note that in the following, the indices $v$ and $e$ indicate weights related to nodes and edges, respectively, while the indices $i$ and $j$ refer to specific nodes. Here, $W_{v,r}^{(l)}$ describes a weight matrix per relation for node features at layer $l$, and $W_{e,r}^{(l)}$ is a weight matrix per relation for edge features. The two transformed feature embedding vectors (where $h_{v,j}$ represents node feature embeddings and $h_{e,ij}^{(0)}$ represents edge features) are stacked on top of each other. The attention mechanism $\alpha_r$ is used to learn the relative importance of edge and node features per relation type $r$. To normalize the values for node degrees, we set $c_{i,r} = |N_i^r|$. The weight matrices use the dropout mechanism (with a dropout probability of 0.2), as does the HGT model. Here, $\sigma$ represents the non-linear leaky ReLU [49] activation function.

$$h_{v,i}^{(l+1)} = \sigma\Big(\sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}}\, \alpha_r\big(\big[\,W_{v,r}^{(l)} h_{v,j}^{(l)},\; W_{e,r}^{(l)} h_{e,ij}^{(0)}\,\big]\big)\Big) \quad (5)$$
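A toy PyTorch sketch of the update in equation 5 is given below; it uses dense per-relation adjacency and edge-feature tensors and reduces the attention $\alpha_r$ over the stacked node and edge parts to a learned per-relation softmax over the two components. This illustrates the formula under those simplifying assumptions and is not the thesis implementation.

```python
# Minimal PyTorch sketch of the E-GCN update in Equation 5 (toy, dense version).
import torch
import torch.nn as nn

class ToyEdgeGCNLayer(nn.Module):
    def __init__(self, node_dim, edge_dim, out_dim, num_rels):
        super().__init__()
        self.w_v = nn.ModuleList(nn.Linear(node_dim, out_dim, bias=False)
                                 for _ in range(num_rels))
        self.w_e = nn.ModuleList(nn.Linear(edge_dim, out_dim, bias=False)
                                 for _ in range(num_rels))
        self.alpha = nn.Parameter(torch.zeros(num_rels, 2))    # node vs. edge weight

    def forward(self, h, adj_per_rel, edge_feat_per_rel):
        out = 0.0
        for r, (A, E) in enumerate(zip(adj_per_rel, edge_feat_per_rel)):
            msg_v = self.w_v[r](h)                              # W_{v,r} h_{v,j}
            msg_e = self.w_e[r](E)                              # W_{e,r} h_{e,ij}
            a = torch.softmax(self.alpha[r], dim=0)             # alpha_r over the stack
            # message per (target i, source j): weighted mix of node and edge parts
            msg = a[0] * msg_v.unsqueeze(0) + a[1] * msg_e      # (N, N, out_dim)
            deg = A.sum(dim=1).clamp(min=1).view(-1, 1)         # c_{i,r} = |N_i^r|
            out = out + (A.unsqueeze(-1) * msg).sum(dim=1) / deg
        return torch.relu(out)                                  # sigma (leaky ReLU in the thesis)

N, node_dim, edge_dim = 4, 8, 3
h = torch.randn(N, node_dim)
adj = [torch.bernoulli(torch.full((N, N), 0.4)) for _ in range(2)]
edge_feat = [torch.randn(N, N, edge_dim) for _ in range(2)]
print(ToyEdgeGCNLayer(node_dim, edge_dim, 16, num_rels=2)(h, adj, edge_feat).shape)
```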

3.8 Evaluation
As mentioned in section 2.1, the focus of startup success prediction in general, and of this work in particular, is on precision-based metrics. To prevent the model from only predicting the companies it is most certain about, thereby limiting its practical applications, recall must also be considered. Consequently, we choose the $F_\beta$ score as our main evaluation metric. The calculation for our binary label case is straightforward: it weighs precision against recall based on $\beta$, as shown in equation 6. $\beta > 1$ weighs recall over precision, $\beta = 1$ weighs them equally, and $\beta < 1$ favors precision. We use $\beta = 0.5$.

$$F_\beta = \frac{(1 + \beta^2)\,(\mathrm{precision} \cdot \mathrm{recall})}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}} \quad (6)$$

Additional evaluation metrics include accuracy, area under the curve (AUC), precision, recall, and precision at the top-k predictions (precision@k).
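For reference, the $F_{0.5}$ score can be computed manually or with scikit-learn; the predictions below are illustrative only.

```python
# Minimal sketch of the F0.5 metric from Equation 6 on illustrative predictions.
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)      # 3 / 4
r = recall_score(y_true, y_pred)         # 3 / 4
beta = 0.5
f_beta = (1 + beta**2) * p * r / (beta**2 * p + r)
assert abs(f_beta - fbeta_score(y_true, y_pred, beta=beta)) < 1e-12
print(round(f_beta, 3))                  # 0.75
```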

4 EXPERIMENTS
We tune the hyperparameters of the GNN models and choose the best performing models using grid search based on their validation $F_{0.5}$ score. Due to the high variability caused by the random initialization of parameters, we repeat each experiment 5 times with different seed values and average the results. For the R-GCN model, we use 4 layers and 256 embedding dimensions; for the HGT, 2 layers and 128 embedding dimensions; and for the E-GCN, 3 layers and 256 embedding dimensions. We use early stopping based on the validation $F_{0.5}$ score and test the models' performance on the test data. An overview of the tested hyperparameters can be found in table 3 in the appendix.

We create two distinct model learning architectures. The first architecture is a complete end-to-end GNN model for node classification. The second architecture is a pre-trained model: the pre-text model is a GNN architecture for link prediction, where "pre-text" describes training a model on a different task to generate the embeddings. The embeddings are then used by a downstream model for node classification.


The final objective of both architectures is still node classification using the labels described in 3.4.2.

4.1 End-to-End Node Classification Models
The objective function for the node classification models is the binary cross-entropy loss from equation 7, applied to the labels from 3.4.2. We test models that only use node information, models that leverage edge information by transforming edges into artificial nodes (see 3.6), and also models that do not use any node features at all (only the graph structure). As these models learn to classify nodes directly, we also call them end-to-end models.

𝑙𝑛 (π‘₯,𝑦) = 𝑦𝑛 βˆ— π‘™π‘œπ‘”(π‘₯𝑛) + (1 βˆ’ 𝑦𝑛) βˆ— π‘™π‘œπ‘”(1 βˆ’ π‘₯𝑛)) (7)

4.2 Pre-trained Models
The objective function for the pre-text GNN link prediction models is a margin loss, defined in equation 8 following the DGL implementation [46]. The idea is to increase the score for the actual edge $x_{u,v}$ between nodes $u$ and $v$ and decrease the model's scores for all negative samples $v_i \sim P_n(v),\; i = 1, \ldots, k$, thereby enabling the model to learn the general graph structure. $P_n(v)$ describes an arbitrary noise distribution, and $k$ is a constant describing the number of negatively sampled edges; we use $k = 10$. We also use the graph sampling functionality provided by DGL [46].

We use R-GCN and E-GCN models for the link prediction task, with 256 hidden embedding dimensions and 3 layers. These hyperparameters proved to work best for this particular task. The data split into train, validation, and test sets is kept exactly the same as for the node classification task (see 3.4.6) to avoid data leakage.

𝐿𝑛 (π‘₯) = 1 βˆ’ π‘₯𝑒,𝑣 +βˆ‘

π‘£π‘–βˆΌπ‘ƒπ‘› (𝑣),𝑖=1,..,π‘˜

1π‘˜π‘₯𝑒,𝑣𝑖 (8)

The downstream model is either another GNN (HGT or R-GCN) or a random forest model. The input for the downstream GNN models is the same graph structure, but the node features are the embeddings generated by the pre-text link prediction model, and the downstream model is optimized using the binary cross-entropy loss from equation 7. In contrast, the random forest model uses only the embeddings from the upstream link prediction model as tabular input features to predict the binary target labels from 3.4.2.
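The downstream random forest step can be sketched as follows, with random embeddings and labels standing in for the E-GCN output and the labels from 3.4.2.

```python
# Minimal sketch of the pre-training pipeline's downstream stage: pre-text GNN
# node embeddings used as tabular features for a random forest classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

num_companies, emb_dim = 8000, 256
embeddings = np.random.randn(num_companies, emb_dim)   # placeholder GNN embeddings
labels = np.random.randint(0, 2, size=num_companies)   # placeholder target labels

X_tr, X_te, y_tr, y_te = train_test_split(embeddings, labels, test_size=0.1, random_state=0)
rf = RandomForestClassifier(n_estimators=200, max_depth=30, class_weight="balanced")
rf.fit(X_tr, y_tr)
print(fbeta_score(y_te, rf.predict(X_te), beta=0.5))
```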

5 RESULTS
The results of the experiments are shown in table 1. All scores are averages of 5 runs with different seeds on the test data; standard deviations between runs are indicated in brackets. In general, we observe high standard deviations for the p@k scores, as these are calculated based on only a few (k) instances and are thereby subject to high variability. Recall scores also show high variability, whereas all other metrics show robust results. We observe higher precision than recall scores, as we select the best models based on an $F_\beta$ score that favors precision over recall. We deem the difference between two models statistically significant if $\mathrm{MEAN}_i - 2 \cdot \mathrm{STD}_i > \mathrm{MEAN}_j + 2 \cdot \mathrm{STD}_j$, hereby applying an $\alpha \approx 0.05$ and assuming a normal distribution of the results.

All pre-trained models use the embeddings created by the E-GCN link prediction model from section 3.7.1. The E-GCN outperformed the R-GCN model on the link prediction task in terms of accuracy, and its embeddings proved to be more informative, as the E-GCN model can also leverage multi-dimensional edge features. The E-GCN thereby enables all downstream models to indirectly access edge features as well.

5.1 Baseline Model Results
5.1.1 Evaluation and Comparison. We performed a grid search with cross-validation to determine the best hyperparameters for the models B1 and B2: class-weight = balanced and max-depth = 30 with 200 estimators. Both random forest models B1 and B2 are clearly able to capture significant statistical information compared to the simple B0 model. The two random forest models perform similarly on the test data. One has to keep in mind that the first baseline model (RF B1) uses 92 features, whereas the second baseline (RF B2) uses 7 more, for a total of 99. However, B2 is able to outperform B1 in most of the crucial metrics, including $F_{0.5}$. We can, therefore, already state that graph features, even on a highly aggregated basis, seem to bring some value in predicting future funding rounds for startups in the technology space.

5.1.2 RF B2 Features. When investigating the feature importance (mean decrease in impurity) plot derived from the B2 model, we can see the importance of aggregated graph features. In figure 2, we find the features "board seats" (number of board seat positions), "eigen cen" (a company's eigenvector centrality), and the eigenvector centrality of a company's main investor ("inv eigen cen") among the top-10 most important features (in red). Aggregated time-series features also prove helpful, even though they are not contained in the top-10 (in green). We also observe that the pre-processed text features (from 3.4.3) are not considered very important by the RF model. A complete description of the features can be found in table 2 in the appendix.


Figure 2: Top-20 most important features of the RF B2 model; aggregated graph features are among the top-10 most important features.

The partial dependence plots in figure 7 in the appendix for selected variables of the B2 model show how the values of a specific feature influence the predicted probability of the target label. The feature values are not meaningful by themselves, as they were standardized.


Model             | Input   | Edge-feat | Accuracy   | Precision  | Recall      | AUC         | P@20        | P@50        | F0.5
Pre-trained RF    | emb     | x         | 72.3 (0.2) | 73.5 (0.1) | 60.9 (0.3)  | 78.1 (0.1)  | 96.7 (2.4)  | 86.0 (0.0)  | 70.5 (0.0)
Pre-trained HGT   | emb + G | x         | 70.8 (0.0) | 72.1 (0.2) | 58.0 (0.4)  | 76.6 (1.3)  | 88.0 (2.2)  | 83.0 (1.8)  | 68.4 (0.0)
Pre-trained R-GCN | emb + G | x         | 70.0 (0.3) | 71.8 (0.9) | 55.6 (2.1)  | 77.1 (0.5)  | 99.0 (2.2)  | 90.0 (1.4)  | 67.8 (0.3)
RF B2             | tab     | -         | 64.2 (0.6) | 61.2 (0.9) | 57.5 (4.3)  | 69.1 (0.2)  | 72.5 (2.9)  | 68.0 (1.6)  | 60.3 (0.7)
R-GCN             | G       | (x)       | 63.6 (0.2) | 59.4 (0.2) | 62.1 (2.2)  | 67.6 (0.4)  | 81.7 (10.6) | 72.5 (2.8)  | 59.9 (0.3)
HGT               | G       | (x)       | 63.2 (1.4) | 59.4 (2.4) | 59.4 (3.2)  | 67.1 (1.0)  | 87.5 (3.5)  | 76.0 (2.8)  | 59.4 (1.3)
E-GCN             | G       | x         | 63.4 (1.0) | 60.4 (0.7) | 55.4 (6.3)  | 67.4 (0.4)  | 68.3 (7.6)  | 67.3 (4.2)  | 59.3 (1.6)
R-GCN             | G       | -         | 62.5 (0.5) | 58.1 (1.5) | 61.9 (6.0)  | 66.6 (1.2)  | 77.5 (3.5)  | 73.0 (4.2)  | 58.8 (0.1)
RF B1             | tab     | -         | 62.5 (0.6) | 58.5 (1.6) | 59.6 (5.5)  | 67.0 (0.8)  | 72.5 (2.9)  | 67.5 (1.9)  | 58.6 (0.5)
HGT               | G       | -         | 61.2 (0.2) | 57.3 (0.4) | 56.4 (2.4)  | 65.1 (0.3)  | 70.0 (13.2) | 69.3 (3.2)  | 57.1 (0.3)
R-GCN             | G-rand  | -         | 57.8 (1.5) | 53.0 (1.6) | 60.3 (2.2)  | 61.9 (2.1)  | 66.7 (7.6)  | 72.0 (3.5)  | 54.3 (1.0)
HGT               | G-rand  | -         | 54.0 (7.2) | 50.8 (4.9) | 69.6 (22.2) | 57.9 (8.5)  | 53.3 (20.2) | 56.0 (19.7) | 52.9 (2.2)
E-GCN             | G-rand  | -         | 49.2 (6.9) | 47.4 (3.8) | 92.0 (13.8) | 52.3 (10.1) | 48.3 (30.6) | 54.7 (19.4) | 52.3 (2.6)
Simple B0         | none    | -         | 54.8 (0.0) | 0.0 (0.0)  | 0.0 (0.0)   | 50.0 (0.0)  | 0.0 (0.0)   | 0.0 (0.0)   | 0.0 (0.0)

Table 1: Model results on test data, sorted by $F_{0.5}$; brackets indicate standard deviations (G = graph, emb = embeddings, tab = tabular, feat = features, rand = random features).

The model predicts a higher probability of a future funding round for younger startups than for older ones; perhaps younger startups need capital more often and in shorter intervals. Also, more rounds and more capital raised in the past increase the probability, but the effect quickly reaches a plateau for strong outlier values. The same goes for the number of board seats. All these findings are somewhat intuitive. Also, a high eigenvector centrality score corresponds to a higher probability of a future funding round, illustrating the importance of company-investor relations.

The more recent the last funding round, the higher the probability of a future funding round, which is unexpected. This could be interpreted as a signal that the startup is active and in some way successful. The size of the last financing round seems to have little impact. More website traffic and more employees increase the probability of a future funding round, which is also intuitive. Additionally, more similar competitors increase the probability of future funding; potentially, this could be interpreted as a signal of a generally trending area of operations.

A general pattern in the partial dependence plots in figure 7 seems to be that strong outlier values often do not strongly influence the probabilities due to plateauing relations. This can be attributed to the robustness-to-outliers property of random forest models, which they inherit from the underlying decision trees [26].

5.2 End-to-End GNN Results
5.2.1 Evaluation and Comparison. The GNN hyperparameters that were found to work best are shown in table 3 in the appendix. Table 1 shows how the end-to-end R-GCN and HGT models trained with artificial nodes (to leverage edge features indirectly) outperform the corresponding models without edge information. The difference in performance between the R-GCN with edge information and the R-GCN without is statistically significant. The E-GCN model for node classification (end-to-end model) is also able to outperform the R-GCN and the HGT models without artificial nodes. This shows that models that can leverage edge features perform better than models without these features. The more recent HGT architecture is unable to outperform the R-GCN architecture; in fact, the R-GCN's $F_{0.5}$ is higher than the HGT's, though not statistically significantly so.

Finally, the GNN architectures trained on randomly created features perform worst, and the differences in $F_{0.5}$ are significant. This is in line with expectations. However, their scores are above the random baseline and indicate that the models are still able to leverage some (minor) information from the general graph structure. Here, the R-GCN outperforms the HGT, which in turn performs better than the E-GCN model; however, the results are subject to very high variability. This shows that the R-GCN model makes better use of graph-structured information. The high recall score for the E-GCN model stems from the fact that the model primarily predicts the positive class, though the other two models also show high recall scores. Apparently, the graph structure alone provides some meaningful, rather general information. Previously identified essential features such as "total rounds" are encapsulated by the node degree in the graph structure and leveraged by the models.

The end-to-end node classification GNN models are not able to outperform the second baseline model B2, which uses aggregated graph information, as shown in table 1. In fact, B2 with an $F_{0.5} = 60.3$ outperforms all end-to-end GNN models, although the difference is only significant for models that do not have access to edge features. This highlights how much of the information in the graph can easily be aggregated and used with classical machine learning models. The aggregated graph features are highly informative and form the backbone of the models' predictive capabilities. These models are comparably simple and already show strong performance. However, the models that can directly or indirectly access multi-dimensional edge features are able to outperform B1. This further strengthens the hypothesis that graph features provide at least some value for the task of future funding round prediction; edge features in particular prove valuable.

One explanation for these results is that structural graph information (such as the number of incoming nodes or the connectivity) can easily be aggregated for our rather simple graph. Edge features, however, contain valuable information and are much harder to aggregate meaningfully.


Our attempts to do so using the simple investment aggregation features in section 3.4.5 already created some of the most important features for the models (e.g., "days since last funding round"). These are also available to the B1 model, explaining its already strong performance. We can also see this when comparing the B1 model to a random forest model trained without the simple investment aggregation features from section 3.4.5: that model only achieves an $F_{0.5} = 56.5$ (accuracy = 61.2), lower than the B1 model. This also explains the inability of the GNN models without edge information to outperform the first baseline model B1. The most useful (aggregated) information is already available to the B1 model, diminishing the GNNs' advantage. Therefore, leveraging the full edge information increases the performance of the GNN models, whereas the structure of the graph alone describes only a subset of the relevant information.

5.2.2 E-GCN Features. Unfortunately, classical deep learning explainability frameworks are not directly applicable to GNN models. Even though there are studies that try to address this issue [24, 50], their usability and support for various model types and libraries are currently still lacking. Therefore, we naively assess feature importance by dropping single features and relations from the graph and comparing the resulting predictions to those of the GNN given the entire graph. Retraining multiple models for each feature would be computationally demanding and prone to initialization noise. Therefore, we use the complete trained model and create new predictions on a graph with masked-out information (replaced by random values). However, this comes with the caveat that the model cannot re-learn duplicated information from another feature.

We find that the E-GCN model is more strongly influenced by masking relations than by masking features, as shown in figure 3. The most substantial influence comes from removing the information on individuals working at the startups ("employs com"), followed by the funding relations ("funds"). Hiding either of these relations from the model strongly impacts its performance. Next, the node features describing the number of board seats, the founding year, and the days since the last financing have the most impact on the model's performance. Again, the features already identified in earlier experiments also strongly influence the E-GCN model's performance. Even though our dataset only contains information about an individual's role at the company, these relations are already highly informative for the model. A sketch of the masking procedure follows below.
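The naive masking procedure can be sketched for a tabular model as follows; the dataset and classifier are placeholders, and masking a feature means overwriting its column with random noise before re-scoring the already-trained model.

```python
# Minimal sketch of naive feature importance by masking single features with
# random values and measuring the drop in F0.5 (toy data and model).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

rng = np.random.default_rng(0)
base = fbeta_score(y, model.predict(X), beta=0.5)
impact = {}
for col in range(X.shape[1]):
    X_masked = X.copy()
    X_masked[:, col] = rng.standard_normal(len(X))   # mask one feature with noise
    impact[col] = fbeta_score(y, model.predict(X_masked), beta=0.5) - base

# Most negative impact = most important feature for the trained model.
print(sorted(impact.items(), key=lambda kv: kv[1])[:3])
```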

5.3 Pre-trained Models
5.3.1 Evaluation and Comparison. The pre-trained GNN models significantly outperform all end-to-end GNN architectures, with an increase in $F_{0.5}$ of around 10-17%. The best-performing pre-trained GNN model is the HGT model on top of the E-GCN link prediction embeddings. However, the best-performing model architecture overall is the pre-trained random forest model: it significantly outperforms all other models in most of the relevant metrics. The pre-trained models show exceptionally high p@k scores, indicating that their top predictions are highly accurate. The pre-trained models most strongly outperform the other models in terms of precision, with similar recall.


Figure 3: Naive feature importance of the E-GCN model; relations influence its predictive capabilities most strongly.

When using node classification embeddings, the pre-trained models were merely able to match the end-to-end models' results. Therefore, we attribute the superior performance of the link-prediction pre-trained models to the self-supervision process used to generate the node embeddings. We further investigate the embeddings in section 5.3.2.

The self-supervised training is much less prone to random noise, as it only differentiates positive examples from negative ones (see section 4.2), whereas the target labels can be very noisy. These findings are in line with previous observations from the fields of CV [32] and NLP [22]. In NLP, it is a common technique to train a model with self-supervised labels and use the word embeddings for a downstream task. This can increase the number of training examples and decrease the noise inherent in the data [22, 32]. We observe similar patterns in our GNN models as well.

Investigating the precision-recall curves of the models (precision and recall values for differing classification thresholds) in figure 4, we again observe substantial differences between the pre-trained models and the other models. Independent of the threshold value, the pre-trained models achieve higher precision scores at the same recall levels. The B2 model seems to have difficulties with its very top predictions, which tend to be wrong more often. The opposite is true for the pre-trained models, which achieve very high precision scores for their top predictions. These results are similar to those of previous studies [14]. In addition, we observe a sharper decline towards the end of the precision-recall curves. This points to a group of companies (around 20%) that are very challenging to identify as positive examples. When training the models with differing $F_\beta$ values, we can achieve higher recall scores but at a high cost in the accompanying metrics. These companies represent inherent noise in the data, and recall scores above 80% are hardly reached.

5.3.2 Embeddings. The embeddings obtained from link prediction provide significant value for the node classification task. The final node embeddings of the E-GCN model for link prediction and for node classification in figure 5 show substantial differences (the same pattern holds for the R-GCN model as well). The 256-dimensional embeddings are transformed to two dimensions using the UMAP dimensionality reduction technique [31]; the color represents the label and the shape the data split.



Figure 4: Precision-recall curves of the pre-trained HGT and RF models versus the baseline models.

The link prediction embeddings on the left show a clear separation between the two labels, as opposed to the node classification embeddings on the right-hand side.
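The visualization in figure 5 can be reproduced along these lines, with random vectors standing in for the 256-dimensional E-GCN embeddings (the umap-learn and matplotlib packages are assumed to be installed).

```python
# Minimal sketch of the 2-D UMAP embedding plot from Figure 5 (toy data).
import numpy as np
import matplotlib.pyplot as plt
import umap

embeddings = np.random.randn(1000, 256)          # placeholder node embeddings
labels = np.random.randint(0, 2, size=1000)      # placeholder binary labels

coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="coolwarm")
plt.title("2-Dim UMAP of node embeddings")
plt.show()
```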


Figure 5: Comparison of 2-dimensional UMAP E-GCN node embeddings for link prediction and node classification.

By further investigating the link prediction embeddings, we find that the link prediction model separates companies in a comprehensible manner. We cluster companies using the KMeans algorithm on the full embedding space [11]. The clusters reveal how the model groups companies according to past funding rounds, geographical location, estimated size of future funding rounds, and other variables. We are able to identify groups of companies that share specific traits, such as primarily Asia-based startups with few historic funding rounds but high probabilities of a future funding round.

5.3.3 Pre-trained RF Features. The embedding dimensions that are most informative for the downstream RF model, according to mean impurity decrease, also show high correlations with the original features "Year Founded", "total rounds", "last finance", and geographical information. This is in line with our findings from sections 5.1 and 5.2.

Across all model architectures, we can identify a small group of features (6-8 out of 99) that highly influence the models' predictive capabilities, whereas most of the remaining features have only limited influence. These features are "last finance", "total rounds", and "total raised" (derived from the investor-company "funds" relation), "board seats" (derived from the company-individual "employs com" relation), and the classical tabular features "Year Founded" and geographical headquarters information. We are even able to show in figure 3 that the original relations ("employs com" and "funds") that form the basis of most of these aggregated features are a vital source of information for the GNN models.

5.3.4 Pre-trained RF Predictions. The pre-trained model shows better predictive performance for companies with more and also larger (in terms of USD deal size) deals in the test time frame (see 3.4.1). This is intuitive, as more deals and larger deal sizes are often associated with high-potential startups and often also with more information about the respective company; these properties make it easier for the model to accurately predict future funding rounds. In addition, the model achieves higher accuracy scores for younger companies (based on the year the company was founded). For companies founded before 2013, the prediction accuracy is 70.3%; between 2013 and 2016 it is 72.5%; and for companies founded after 2016 the accuracy increases to 78.5%. Consequently, our results also suggest an increased importance of a startup's social network for earlier rounds, in line with previous studies [14]. Regarding predictive quality, we do not find significant differences across geographical regions.

Also, the predictions for deals that occurred in 2019 (recall = 57.1) were less accurate than the predictions for 2020 (recall = 73.8). This is remarkable, as one could expect the pandemic-influenced year of 2020 to pose stronger challenges for the model. In general, all models show higher performance for longer test time frame horizons (24 months versus 12 months versus 6 months). We explain this with the inherent noise in future funding round prediction. With a time frame of only 6 months, the occurrence of a funding round within this small window is heavily influenced by noise: even if a funding round is planned for this period, it can easily slip out of the time frame due to, for example, postponed deadlines or longer due diligence processes. A time frame of 24 months, however, is better suited to average out the noise and is influenced mainly by the actual nature of the startup. This increases the models' performance by providing more regular labels and makes the 24-month time frame both the most realistic and the most informative.

6 CONCLUSION
This thesis investigates the value of graph features in the field of company success prediction on the task of predicting the next funding round for startups. We improve existing models by leveraging diverse time-series, textual, and relational information and applying recent GNN model architectures. We also propose our own extension of existing GNN architectures. Additionally, we try to understand the GNNs' learning and prediction processes as well as the models' shortcomings. Furthermore, we show that self-supervised pre-training can result in node embeddings that significantly improve the predictive performance of all downstream models.

Regarding the area of company success prediction, our results reveal that relational graph features significantly improve the predictive capabilities of future funding round prediction models, by around 20% (Baseline 1 versus pre-trained RF). The second random forest baseline model and all GNN architectures with access to edge features consistently outperform the first baseline model.


The pre-trained models are best able to predict future funding rounds of startups in the technology space, with a significant increase in $F_{0.5}$ of around 17% (Baseline 2 versus pre-trained RF). Therefore, we conclude that recent GNN architectures can improve future funding round prediction models if they are first trained in a self-supervised fashion. The best model consists of a pre-text E-GCN model for link prediction followed by a random forest model for the downstream task of node classification. With a precision of 74%, recall of 61%, AUC of 78%, and precision@20 of 97%, the model is able to predict future funding rounds of startups. Overall, the more recent and more complex HGT architecture does not outperform the R-GCN architecture on this particular task.

Furthermore, we are able to identify a small group of features that contribute the most to the predictive capabilities of all models. These features are primarily derived from relational information, further emphasizing the importance of relational graph features. In particular, the relation between companies and individuals proved valuable, confirming previous studies [51, 53]. Since much of the additional value lies in edge features rather than node features, architectures that leverage multi-dimensional edge features are of great importance for this task. Our simple E-GCN architecture showed promising results.
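As an illustration of how edge features can be injected into GCN-style message passing, the following is a minimal single-relation sketch in the spirit of the E-GCN extension; it is a simplified re-implementation under our own assumptions, not the exact layer used in this work.

```python
# Single-relation message-passing layer that mixes source node states with
# multi-dimensional edge features before aggregation (illustrative sketch).
import torch
import torch.nn as nn

class EdgeFeatureGCNLayer(nn.Module):
    def __init__(self, node_dim: int, edge_dim: int, out_dim: int):
        super().__init__()
        # Messages are built from the source node representation and the edge features.
        self.message = nn.Linear(node_dim + edge_dim, out_dim)
        self.self_loop = nn.Linear(node_dim, out_dim)

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor, edge_attr: torch.Tensor) -> torch.Tensor:
        # h: [num_nodes, node_dim]; edge_index: [2, num_edges] (source, target);
        # edge_attr: [num_edges, edge_dim]
        src, dst = edge_index
        msg = self.message(torch.cat([h[src], edge_attr], dim=-1))
        agg = torch.zeros(h.size(0), msg.size(-1), device=h.device)
        agg.index_add_(0, dst, msg)                       # sum incoming messages
        deg = torch.bincount(dst, minlength=h.size(0)).clamp(min=1).unsqueeze(-1)
        return torch.relu(agg / deg + self.self_loop(h))  # mean aggregation + self-loop
```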

Considering neural graph learning, the experiments reveal that GNNs are valuable models for creating node embeddings using self-supervised training. This is in line with previous findings from the fields of CV and NLP. However, their applicability as end-to-end models for downstream tasks is limited. Our empirical results are consistent with known shortcomings of GNN models (see 2.2.3). Interestingly, the majority of studies that were unable to outperform tree-based baseline models used GNNs as end-to-end node classification or regression models [13, 14, 25], which coincides with our findings. Additionally, we are able to visually show that embeddings created by link prediction appear to be more general, as opposed to the highly distributed embeddings created by node classification models.
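Such a visual comparison can, for example, be produced by projecting both sets of node embeddings to two dimensions with UMAP [31] and plotting them side by side; the sketch below assumes the two embedding matrices and company labels are available, and all variable names are illustrative.

```python
# Illustrative side-by-side UMAP projection of two embedding spaces.
import matplotlib.pyplot as plt
import umap

def plot_embedding_spaces(emb_link_pred, emb_node_clf, labels):
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, emb, title in zip(axes, [emb_link_pred, emb_node_clf],
                              ["link-prediction pre-training", "node classification"]):
        coords = umap.UMAP(n_components=2, random_state=0).fit_transform(emb)
        ax.scatter(coords[:, 0], coords[:, 1], c=labels, s=3, cmap="coolwarm")
        ax.set_title(title)
    fig.tight_layout()
    return fig
```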

Potential improvements of our work include different text and time-series data processing techniques and the application of advanced GNN explainability frameworks.

6.1 Future Work
In the general area of neural graph learning, we see potential for comparisons of advanced GNN architectures that leverage multi-dimensional edge features. These models should ideally generalize well across tasks and fields. Graph learning explainability frameworks are another fruitful area of research. Even though studies in this space already exist [24, 50], they currently do not enjoy frequent usage in practice and are usually limited to specific model architectures and deep-learning frameworks. A well-understood cross-model and cross-framework tool could benefit the applicability of GNNs.

Further investigations into the general properties of GNN models appear to be of great value. For example, investigating their capabilities to learn non-linear manifolds, their de-noising mechanisms, and potential improvements in architectural design could boost theoretical understanding as well as practical applications. Generally, we conclude that investigations into the situations and tasks in which GNN models tend to perform well or poorly are still rare. An avenue of research that should be explored further is the strategy of creating pre-trained embeddings using self-supervised GNN training and passing them to another model architecture for the actual downstream task.

In the specific area of company success prediction, we see value in gathering further relational information. All graph features at our disposal proved valuable, and we therefore see great potential in adding more information to the graph. Specifically, individuals and their relations to startups proved informative, so additional node and edge information about these relations could be of great importance. Testing findings from our specific dataset on data gathered from other sources and on companies operating in different areas could also improve our understanding of startup funding and, potentially, startup success.


REFERENCES
[1] Yu Qian Ang, Andrew Chia, and Soroush Saghafian. Using Machine Learning to Demystify Startups Funding, Post-Money Valuation, and Success. Harvard University Press, 2020.
[2] Torben Antretter, Ivo Blohm, and Dietmar Grichnik. Predicting startup survival from digital traces, 2018.
[3] Ivo Blohm, Torben Antretter, Charlotta SirΓ©n, Dietmar Grichnik, and Joakim Wincent. It's a peoples game, isn't it?! A comparison between the investment returns of business angels and machine learning algorithms. Entrepreneurship Theory and Practice, 00, 2020.
[4] Moreno Bonaventura, Valerio Ciotti, Pietro Panzarasa, Silvia Liverani, Lucas Lacasa, and Vito Latora. Predicting success in the worldwide start-up network. Scientific Reports, 10, 2020.
[5] Niels Bosma, Stephen Hill, Aileen Ionescu-Somers, Donna Kelley, Maribel Guerrero, and Thomas Schott. Global entrepreneurship monitor 2020/2021, 2021.
[6] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[7] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[8] Alejandro Cremades. How long it takes to raise capital for a startup, 2019.
[9] Francisco Ramadas da Silva Ribeiro Bento. Predicting start-up success with machine learning, 2017.
[10] Philippe du Jardin. Bankruptcy prediction using terminal failure processes. European Journal of Operational Research, 242, 2015.
[11] Charles Elkan. Using the triangle inequality to accelerate k-means. In ICML, 2003.
[12] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. In ICLR, 2020.
[13] Pau RodrΓ­guez Esmerats. Graph neural networks and its applications, 2019.
[14] Clement Gastaud, Theophile Carniel, and Jean-Michel Dalle. The varying importance of extrinsic factors in the success of startup fundraising: competition at early-stage and networks at growth-stage, 2019.
[15] Liyu Gong and Qiang Cheng. Exploiting edge features in graph neural networks, 2019.
[16] Beth Hadley. Analyzing VC influence on startup success, 2017.
[17] William L. Hamilton. Graph Representation Learning. 2020.
[18] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In 31st Conference on Neural Information Processing Systems, 2017.
[19] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016.
[20] Thomas Hengstberger. Increasing venture capital investment success rates through machine learning, 2019.
[21] Mike Herrington and Penny Kew. Global entrepreneurship monitor 2016/2017, 2017.
[22] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification, 2018.
[23] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In The Web Conference 2020 (WWW), 2020.
[24] Qiang Huang, Makoto Yamada, Yuan Tian, Dinesh Singh, Dawei Yin, and Yi Chang. GraphLIME: Local interpretable model explanations for graph neural networks, 2020.
[25] Dejun Jiang, Zhenxing Wu, Chang-Yu Hsieh, Guangyong Chen, Ben Liao, Zhe Wang, Chao Shen, Dongsheng Cao, Jian Wu, and Tingjun Hou. Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models. Journal of Cheminformatics, 13, 2021.
[26] George H. John. Robust decision trees: Removing outliers from databases. In KDD, 1995.
[27] Alexander Kessler, Christian Korunka, Hermann Frank, and Manfred Lueger. Predicting founding success and new venture survival: A longitudinal nascent entrepreneurship approach. Journal of Enterprising Culture, 20, 2012.
[28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[29] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.
[30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.
[31] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction, 2020.
[32] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles, 2017.
[33] Hoang NT and Takanori Maehara. Revisiting graph neural networks: All we have is low-pass filters, 2019.
[34] David L. Olson, Dursun Delen, and Yanyan Meng. Comparative analysis of data mining methods for bankruptcy prediction. Decision Support Systems, 52, 2012.
[35] Ramkishan Panthena. Startup success prediction. https://github.com/RamkishanPanthena/Startup-Success-Prediction, 2019.
[36] Fabian Pedregosa, GaΓ«l Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake VanderPlas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. Scikit-learn: Machine learning in Python. CoRR, abs/1201.0490, 2012.
[37] Arushi Raghuvanshi, Tara Balakrishnan, and Maya Balakrishnan. Predicting investments in startups using network features and supervised random walks. Stanford University Press, 2015.
[38] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks, 2017.
[39] Boris Sharchilev, Michael Roizner, Andrey Rumyantsev, Denis Ozornin, Pavel Serdyukov, and Maarten de Rijke. Web-based startup success prediction. In CIKM, pages 22–26, 2018.
[40] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan GΓΌnnemann. Pitfalls of graph neural network evaluation. In NeurIPS, 2019.
[41] Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates, 2018.
[42] Cemre Unal. Searching for a unicorn: A machine learning approach towards startup success prediction, 2019.
[43] Rianne van den Berg, Thomas Kipf, and Max Welling. Graph convolutional matrix completion. In KDD'18 Deep Learning Day, 2018.
[44] Petar VeličkoviΔ‡, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro LiΓ², and Yoshua Bengio. Graph attention networks, 2018.
[45] Jingjing Wang, Haoran Xie, Fu Lee Wang, Lap-Kei Lee, and Oliver Tat Sheung Au. Top-n personalized recommendation with graph neural networks in MOOCs. Computers and Education: Artificial Intelligence, 2, 2021.
[46] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J. Smola, and Zheng Zhang. Deep graph library: Towards efficient and scalable deep learning on graphs. CoRR, abs/1909.01315, 2019.
[47] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr., Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. Simplifying graph convolutional networks, 2019.
[48] John Wu and Robert D. Atkinson. How technology-based start-ups support U.S. economic growth, 2017.
[49] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. CoRR, abs/1505.00853, 2015.
[50] Rex Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. GNN explainer: A tool for post-hoc explanation of graph neural networks. CoRR, abs/1903.03894, 2019.
[51] Eugene Liang Yuxian and Soe-Tsyr Daphne Yuan. Investors are social animals: Predicting investor behavior using social network features via supervised learning approach. In Workshop on Mining and Learning with Graphs, 2013.
[52] Charles Zhang, Ethan Chan, and Adam Abdulhamid. Link prediction in bipartite venture capital investment networks, 2015.
[53] Kamil Ε»bikowski and Piotr Antosiuk. A machine learning, bias-free approach for predicting business success using crunchbase data. Information Processing and Management, 58, 2021.


A APPENDIX

Figure 6: Overview of the full workings of the HGT layer [23].


| Feature | Original Type | Range | Description | RFB1 | RFB2 | GNN |
|---|---|---|---|---|---|---|
| YearFounded | Tabular | 1980 - 2021 | Company year founded | x | x | x |
| HQ long | Tabular | -180 - 180 | Longitude of company HQ location | x | x | x |
| HQ lat | Tabular | -90 - 90 | Latitude of company HQ location | x | x | x |
| GPT Cluster 0-59 | Textual | 0 - 1 | 60 non-exclusive clusters created by GPT-3 based on company description (one-hot encoded) | x | x | x |
| Rep Followers | Tabular | 0 - 5'000 | Company's code repository followers | x | x | x |
| NumPatents | Tabular | 0 - 500 | Number of registered patents per company | x | x | x |
| ComReviews | Tabular | 0 - 5 | Average employee review score per company | x | x | x |
| NumReviews | Tabular | 0 - 7'000 | Number of employee reviews per company | x | x | x |
| emp yyyy | Time-Series | 0 - 450'000 | Average employee counts per year (yyyy) | x | x | x |
| traf yyyy-mm | Time-Series | 0 - 40'000'000 | Average monthly (yyyy-mm) website traffic on company website | x | x | x |
| total raised | Relational | 0 - 30'000 | Total funding raised by company in M USD | x | x | x |
| total rounds | Relational | 0 - 100 | Total number of funding rounds raised by company | x | x | x |
| first finance | Relational | 0 - 10'000 | Days since first financing round of company | x | x | x |
| first size | Relational | 0 - 500 | Size (in M USD) of first financing round of company | x | x | x |
| last finance | Relational | 0 - 8'000 | Days since last financing round of company | x | x | x |
| last size | Relational | 0 - 1'000 | Size (in M USD) of last financing round of company | x | x | x |
| eigen cen | Relational | 0 - 0.2 | Eigenvector centrality of company in company-investor graph | - | x | x |
| inv eigen cen | Relational | 0 - 0.4 | Main investor's eigenvector centrality per company | - | x | x |
| inv HQ long | Relational | -180 - 180 | Main investor's HQ location longitude | - | x | x |
| inv HQ lat | Relational | -90 - 90 | Main investor's HQ location latitude | - | x | x |
| inv deals | Relational | 0 - 1'000 | Main investor's total number of deals | - | x | x |
| board seats | Relational | 0 - 300 | Number of board seat positions per company | - | x | x |
| similar competitors | Relational | 0 - 10 | Number of similar competitors per company | - | x | x |
| Investor Type | Relational (N) | 0 - 1 | Type of investor per investor (one-hot encoded) | - | - | x |
| Preferred industry | Relational (N) | 0 - 1 | Preferred industries of investor per investor (one-hot encoded) | - | - | x |
| Investment preferences | Relational (N) | 0 - 1 | Investment preferences of investor per investor (one-hot encoded) | - | - | x |
| Deal Date | Relational (E) | 0 - 10'000 | Days since deal was announced per investment | - | - | x |
| Deal Size | Relational (E) | 0 - 1'000 | Investment deal size in M USD per investment | - | - | x |
| Business Status | Relational (E) | 0 - 1 | Business status of deal target per investment (one-hot encoded) | - | - | x |
| Deal Type | Relational (E) | 0 - 1 | Type of deal per investment (one-hot encoded) | - | - | x |
| Co Investors | Relational (E) | 0 - 20 | Number of co-investors per investment | - | - | x |
| pos com | Relational (E) | 0 - 1 | Position of individual person at company per relation | - | - | x |
| pos inv | Relational (E) | 0 - 1 | Position of individual person at investor per relation | - | - | x |
| sim com | Relational (E) | 0 - 1 | Similarity score between similar competing companies per relation | - | - | x |
| sim no com | Relational (E) | 0 - 1 | Similarity score between similar non-competing companies per relation | - | - | x |

Table 2: Overview of features used for modeling (N = Node Feature, E = Edge Feature). All edge feature information is only available to GNN architectures that are able to leverage edge features.


[Figure 7 image omitted. The figure shows partial dependence plots (partial dependence values roughly between 0.40 and 0.50 on the y-axis) of the random forest B2 model for the features YearFounded, HQ_long, HQ_lat, total_raised, total_rounds, eigen_cen, last_finance, board_seats, inv_eigen_cen, inv_deals, first_finance, last_size, traf_2019-1, emp_2018, and similar_competitors.]

Figure 7: Partial dependence plots for the random forest B2 model.
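Plots of this kind can be generated with scikit-learn's partial dependence utilities; the following sketch assumes a fitted random forest rf_b2 and a training DataFrame X_train whose columns follow the feature names of Table 2, all of which are illustrative placeholders here.

```python
# Sketch for reproducing partial dependence plots like those in Figure 7.
# `rf_b2` (a fitted random forest) and the DataFrame `X_train` are assumed to exist.
from sklearn.inspection import PartialDependenceDisplay

pdp_features = ["YearFounded", "HQ_long", "HQ_lat", "total_raised", "total_rounds",
                "eigen_cen", "last_finance", "board_seats", "inv_eigen_cen",
                "inv_deals", "first_finance", "last_size", "similar_competitors"]
display = PartialDependenceDisplay.from_estimator(rf_b2, X_train, pdp_features, n_cols=3)
display.figure_.tight_layout()
```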


| Model | n Layers | Hidden Dims | Features | Edge Features | Accuracy | Precision | Recall | AUC | F-0.5 | P@20 | P@50 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pre-trained HGT | 1 | 256 | emb | x | 70.8 | 72.1 | 58.0 | 76.6 | 68.7 | 88.0 | 83.0 |
| Pre-trained HGT | 2 | 128 | emb + G | x | 70.5 | 71.9 | 57.4 | 75.4 | 68.4 | 98.3 | 90.7 |
| Pre-trained HGT | 1 | 128 | emb + G | x | 70.5 | 71.4 | 57.9 | 76.5 | 68.2 | 88.0 | 84.4 |
| Pre-trained R-GCN | 2 | 256 | emb + G | x | 70.0 | 71.8 | 55.6 | 77.1 | 67.8 | 99.0 | 90.0 |
| Pre-trained R-GCN | 1 | 512 | emb + G | x | 64.3 | 60.8 | 59.5 | 68.8.4 | 60.5 | 87.5 | 70.0 |
| R-GCN | 4 | 256 | G | (x) | 63.6 | 59.4 | 62.1 | 67.6 | 59.9 | 72.5 | 66.0 |
| Pre-trained R-GCN | 1 | 256 | emb + G | x | 64.0 | 61.9 | 53.1 | 69.1 | 59.9 | 81.7 | 76.0 |
| R-GCN | 2 | 256 | G | (x) | 63.5 | 59.5 | 60.6 | 67.1 | 59.7 | 75.0 | 74.0 |
| HGT | 4 | 128 | G | (x) | 63.2 | 59.4 | 59.4 | 67.1 | 59.4 | 87.5 | 76.0 |
| E-GCN | 3 | 256 | G | x | 63.4 | 60.4 | 55.4 | 67.4 | 59.3 | 68.3 | 67.3 |
| R-GCN | 4 | 256 | G | - | 62.5 | 58.1 | 61.9 | 66.6 | 58.8 | 77.5 | 73.0 |
| HGT | 2 | 128 | G | (x) | 62.9 | 59.5 | 56.1 | 66.3 | 58.8 | 78.3 | 70.7 |
| E-GCN | 2 | 256 | G | x | 62.4 | 59.3 | 55.2 | 65.7 | 58.3 | 78.3 | 68.7 |
| E-GCN | 4 | 256 | G | x | 62.6 | 59.7 | 53.6 | 67.2 | 58.3 | 82.5 | 70.0 |
| R-GCN | 2 | 256 | G | - | 61.4 | 57.2 | 59.6 | 65.4 | 57.5 | 78.3 | 76.0 |
| HGT | 3 | 128 | G | - | 61.2 | 57.3 | 56.4 | 65.1 | 57.1 | 70.0 | 69.3 |
| HGT | 2 | 128 | G | - | 61.2 | 57.1 | 56.9 | 65.7 | 57.1 | 76.7 | 68.0 |
| R-GCN | 3 | 256 | G | - | 60.7 | 55.9 | 62.5 | 63.3 | 57.1 | 58.3 | 58.7 |
| HGT | 4 | 128 | G | - | 61.0 | 57.1 | 55.4 | 64.8 | 56.7 | 72.5 | 67.0 |
| R-GCN | 2 | 256 | G-rand | - | 57.8 | 53.0 | 60.3 | 61.9 | 54.3 | 66.7 | 72.0 |
| HGT | 2 | 128 | G-rand | - | 54.0 | 50.8 | 69.6 | 57.9 | 52.9 | 53.3 | 56.0 |
| E-GCN | 2 | 256 | G-rand | - | 49.2 | 47.4 | 92.0 | 52.3 | 52.3 | 48.3 | 54.7 |

Table 3: Full GNN model experiment results on test data, each the mean of 5 runs (emb = embeddings, rand = random features, Dim = Dimensions).
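For reference, the ranking metrics reported in Table 3 (P@20, P@50) and the F0.5 score can be computed from per-company probabilities as in the following sketch; y_true and y_score are synthetic stand-ins, and the 0.5 decision threshold is an illustrative choice rather than the thesis' tuned threshold.

```python
# Illustrative computation of precision@k and F0.5 from predicted probabilities.
import numpy as np
from sklearn.metrics import fbeta_score

def precision_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int) -> float:
    """Fraction of actual positives among the k highest-scored companies."""
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.mean(y_true[top_k]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)   # synthetic labels
    y_score = rng.random(1000)               # synthetic predicted probabilities
    print("P@20 :", precision_at_k(y_true, y_score, k=20))
    print("F0.5 :", fbeta_score(y_true, y_score >= 0.5, beta=0.5))
```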
