On Rebel Theorem 3.0

Jared Heyman
Jul 3, 2024

Rebel Fund has now invested in nearly 200 top Y Combinator startups, collectively valued in the tens of billions of dollars and growing. As an extremely data-driven fund, along the way we’ve built the world’s most comprehensive dataset of YC startups outside of YC itself, now encompassing millions of data points across every YC company and founder in history.

One motivation for building such a robust data infrastructure is to train our Rebel Theorem machine learning algorithms, giving us an edge in identifying high-potential YC startups. A couple of years ago, I published a post On Rebel Theorem 2.0, the world’s most advanced ML algorithm for predicting YC startup success at the time. Today, I’m excited to unveil version 3.0, trained on far more data and boasting performance metrics that far surpass its predecessor.

As a Silicon Valley venture fund, we’re fortunate to sit at the forefront of the AI revolution now underway. This not only helps us invest in future AI unicorns, but also exposes us to tools that can enhance our own operations and fund performance. Our Rebel Theorem 3.0 algorithm leverages AI to collect and analyze data in ways previously impossible. Our vast AI-enhanced dataset, coupled with what we know about each YC startup’s latest operating status, fundraising history, and valuation, helped us to train Rebel Theorem 3.0 to identify key predictors of YC startup success.

In this post, I’ll explain what our Rebel Theorem 3.0 model predicts, its impressive performance metrics, and the most important startup characteristics or “features” it analyzes.

Outcomes

Before I get into which features our algorithm found to be most predictive of YC startup success, I should explain exactly what outcomes the algorithm sets out to predict in the first place.

The most straightforward dependent variable (i.e., the thing you want to predict) would be startup valuation growth. However, as I discussed in my previous post On the power law of Y Combinator startups, a very small percentage of YC companies drive the overwhelming majority of investor returns, and if we trained the model to strictly predict valuation growth, it would basically want every company to look just like Airbnb and Stripe :)

So, we instead bucketed the startups into categories, and trained the algorithm to predict which category a startup would most likely fall into. For Rebel Theorem 2.0 algorithm training, we bucketed them into 4 valuation outcome categories (<$150M valuation, $150M-$499M valuation, $500M-$1B valuation, over $1B valuation) and for Rebel Theorem 3.0 we used even broader categories:

“Success” — $60M+ valuation and operating or exited

“Zombie” — Under $60M valuation and still operating¹

“Dead” — No longer operating nor exited
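The three buckets above can be written as a small labeling function. This is only a sketch: the function name, the argument names, and the "Unclassified" fallback for a company that exited below $60M (a case the definitions above do not cover) are assumptions for illustration, not Rebel's actual labeling code.

```python
SUCCESS_THRESHOLD_USD = 60_000_000  # the $60M cutoff from the categories above


def classify_outcome(valuation_usd: float, is_operating: bool, has_exited: bool) -> str:
    """Map a startup's latest status to one of the three outcome buckets."""
    if not (is_operating or has_exited):
        return "Dead"  # no longer operating and never exited
    if valuation_usd >= SUCCESS_THRESHOLD_USD:
        return "Success"  # $60M+ and operating or exited
    if is_operating:
        return "Zombie"  # under $60M but still operating
    # Exited below $60M: not covered by the definitions above (assumption).
    return "Unclassified"
```

For example, `classify_outcome(50_000_000, True, False)` lands in the "Zombie" bucket, which matches the $50M company described in the first footnote.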

I know it sounds less impressive to predict startup outcomes so broadly, but in an asset class where a single big winner can generate a 1000x return and the majority of investments end up being complete write-offs, just knowing which companies not to invest in is incredibly valuable.

Performance

The chart below shows the algorithm’s precision at predicting which category each YC founder’s startup would eventually fall into compared to random, based only on information available about that founder and company at inception. As you’ll see, it’s astonishingly good at predicting which startups will end up successful or dead — more than twice as good as random! Very few human venture investors can consistently outperform the market by such a wide margin, especially at such an early stage.
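To make "twice as good as random" concrete, here is a minimal sketch of how precision for one category can be compared with a random baseline. A random picker is right about as often as the category occurs in the data, so the baseline is simply that category's base rate. The function name and the ten toy labels are made up for illustration and are not Rebel's data or evaluation code.

```python
from collections import Counter


def precision_and_lift(actual: list[str], predicted: list[str], label: str) -> tuple[float, float]:
    """Return (precision, lift over random) for one outcome label."""
    predicted_idx = [i for i, p in enumerate(predicted) if p == label]
    if not predicted_idx:
        raise ValueError(f"model never predicted {label!r}")
    base_rate = Counter(actual)[label] / len(actual)
    if base_rate == 0:
        raise ValueError(f"{label!r} never occurs in the actual outcomes")
    hits = sum(1 for i in predicted_idx if actual[i] == label)
    precision = hits / len(predicted_idx)
    # Lift of 2.0 means the model is twice as precise as picking at random.
    return precision, precision / base_rate


# Toy, made-up outcomes for ten startups (illustration only).
toy_actual = ["Success", "Dead", "Dead", "Zombie", "Success",
              "Dead", "Zombie", "Dead", "Dead", "Dead"]
toy_predicted = ["Success", "Success", "Dead", "Dead", "Dead",
                 "Dead", "Zombie", "Dead", "Dead", "Zombie"]

precision, lift = precision_and_lift(toy_actual, toy_predicted, "Success")
```

In this toy data, the model is right on one of its two "Success" calls, while only two of the ten startups are actual successes, so its precision is well above the random baseline.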

The investors amongst you are probably wondering what this level of model performance could translate into in terms of financial returns. We wondered the same, so we backtested what net portfolio IRR the algorithm would have achieved if we had given it some money and told it to only invest in the new YC startups it thought would be successful from 2012–2020².

As you can see, the algorithm achieved a net portfolio IRR of 44–59% in the more mature YC batches from 2018 and earlier, and a 27–34% IRR in the less mature 2019 & 2020 vintages (in green)³. These are insane levels of performance, dramatically better than investing randomly in YC startups (in red), which itself is much better than an S&P 500 index average (in light dashed gray). Perhaps AI is coming for my job next!⁴
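For readers who want to see the mechanics, below is a rough sketch of the two ingredients of such a backtest: marking each position under the conservative assumptions listed in the third footnote, and solving for IRR. How the 50% dilution is applied (as a flat haircut on the valuation multiple) is an assumption, as are all names and constants; the sketch also ignores fees, so it is not the net IRR calculation behind the figures above.

```python
VALUATION_CAP_USD = 10_000_000_000  # footnote 3: no company counted above $10B
DILUTION = 0.5  # footnote 3: 50% dilution (applied as a flat haircut: assumption)


def marked_value(check_usd: float, entry_valuation_usd: float,
                 latest_valuation_usd: float, outcome: str) -> float:
    """Value of one position under the post's conservative assumptions."""
    if outcome != "Success":
        return 0.0  # Dead and Zombie companies are assumed to return nothing
    capped = min(latest_valuation_usd, VALUATION_CAP_USD)
    return check_usd * (capped / entry_valuation_usd) * (1.0 - DILUTION)


def irr(cash_flows: list[tuple[float, float]], lo: float = -0.99, hi: float = 100.0) -> float:
    """Annualized IRR for (year, amount) cash flows, found by bisection.

    Assumes money goes out first and comes back later, so NPV falls as the
    rate rises and there is a single root in [lo, hi].
    """
    def npv(rate: float) -> float:
        return sum(amount / (1.0 + rate) ** year for year, amount in cash_flows)

    if npv(lo) < 0 or npv(hi) > 0:
        raise ValueError("IRR is outside the search range")
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if npv(mid) > 0:
            lo = mid  # rate too low: present value still exceeds cost
        else:
            hi = mid
    return (lo + hi) / 2.0
```

A portfolio-level figure would sum the checks written in each year as negative cash flows and the marked values as a positive cash flow at the measurement date, then pass that list to `irr`.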

Features

Now that I hopefully have your attention, let’s take a deeper dive into the startup features that our algorithm was trained on and that factor most heavily into its success scores. Because Rebel invests in companies at such an early stage, we can’t always rely on data points like operating metrics, existing investors, and growth rates. We’re limited to more fundamental variables like team quality and basic company attributes, so those are what we trained Rebel Theorem 3.0 on.

In the interest of both brevity and protecting Rebel’s intellectual property, I won’t share a comprehensive list, but here are a few general categories:

Founder history — These are things you could assess by looking at a founder’s resume, like where they went to school, what they studied, how many years of co-founder or general work experience they have, what companies they worked at in the past, etc. Most of these features aren’t hard to assess, but it is hard to know which ones matter most when it comes to predicting startup outcomes.

Previous startups — These are features related to startups that the founders previously co-founded, like how much capital they raised, how many employees they had, whether they were acquired or went public, etc. We learned these features are very strong predictors of startup success.

Founder demographics — For the sake of scientific rigor, we trained an AI to infer various founder demographic features based on their name and history, like their most likely gender, ethnicity, and country of origin. From an ethical standpoint, we were relieved to find that demographics have very little impact on startup success.

Founder personality — It would be impractical to ask every YC founder to take a personality test, but thanks to new AI tools, we can now accurately infer a founder’s personality profile based on their online footprint. It turns out that founder personality traits are a powerful predictor of startup success.

Company attributes — These features are related to the company rather than founders, like its location, sector, headcount, etc. We learned that these don’t matter nearly as much as the founders themselves, nor do they seem to be persistently predictive over time as the technology landscape evolves.

Other stuff — We looked at a wide variety of other founder and company features which I’ll keep confidential. Pretty much any quantifiable startup or founder characteristic we could think of was fair game. A few of them turned out to be quite predictive of startup success and most were not.

We tried to really push the envelope in terms of data collection, AI inference and analytics to uncover new insights around what predicts YC startup performance. To my knowledge, no fund has ever developed a more comprehensive dataset and algorithm for this purpose.

Out of the hundreds of founder and company-related features we considered, our brilliant data scientist Justin Hilliard found several dozen with a statistically significant impact on model performance. I’ll highlight a few of them below, but before doing so, I should explain how our machine-learning model works in the first place.

Our latest model is non-linear, meaning that it goes beyond a simple “more (or less) is better” for each feature. Rather, it looks at the interplay between features to make its predictions.

As a simple example, a linear model might find “more years of co-founder experience is better” but a non-linear model might find “5 years of co-founder experience is statistically ideal for healthcare startups in the US with founders who have high levels of pragmatism, are repeat founders with an acquisition, and went to school at Stanford or MIT.”

This ridiculous run-on sentence actually underplays the interdependency of model features, since in practice, Rebel Theorem 3.0 looks at relationships between dozens of features of each startup and team and runs them through 150+ “decision trees” to calculate each startup’s final success score. To illustrate, I’ll include one such decision tree below.
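As a toy illustration of how a handful of decision trees can capture this kind of interplay, the sketch below walks a startup's features down each tree and averages the leaf scores. Everything here is hypothetical: the feature names, thresholds, and leaf scores are invented, and averaging is only one common way to combine trees (boosted models sum them instead). The post does not say which method Rebel Theorem 3.0 uses.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    score: float  # success score assigned to startups that reach this leaf


@dataclass
class Split:
    feature: str
    threshold: float
    left: "Node"   # followed when the feature value is <= threshold
    right: "Node"  # followed when the feature value is > threshold


Node = Union[Leaf, Split]


def predict_tree(node: Node, x: dict[str, float]) -> float:
    """Walk one tree from the root to a leaf and return that leaf's score."""
    while isinstance(node, Split):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.score


def predict_ensemble(trees: list[Node], x: dict[str, float]) -> float:
    """Average the scores of all trees (one possible combination rule)."""
    return sum(predict_tree(t, x) for t in trees) / len(trees)


# Two made-up trees; a real model would have many more and learn them from data.
toy_trees: list[Node] = [
    Split("cofounder_years", 2, Leaf(0.2),
          Split("pragmatism", 0.5, Leaf(0.4), Leaf(0.9))),
    Split("pragmatism", 0.5, Leaf(0.3), Leaf(0.6)),
]

score = predict_ensemble(toy_trees, {"cofounder_years": 5, "pragmatism": 0.8})
```

Note that in the first toy tree, pragmatism only matters once co-founder experience exceeds the threshold. That dependence of one feature's effect on another is what a linear model cannot express.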

No matter how experienced, no human investor could build a mental model of this many startup features and decision trees, much less use them to objectively and accurately predict the success of a startup… in a matter of seconds. No wonder our algorithm outperforms most human investors!

While it’s impossible to understand or explain the full inner workings of a non-linear machine learning model, we can look to which features the model depends on most heavily when making its predictions. This chart shows how much the model performance degrades when you take certain features away from it — the more degradation there is, the more important the feature is.
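One common way to measure this kind of degradation is permutation importance: shuffle one feature's values across startups so the model can no longer use it, and record how much accuracy drops. The sketch below assumes that approach; the post does not say whether Rebel shuffles features, drops them and retrains, or does something else. The toy model, feature names, and data are invented.

```python
import random
from typing import Callable

Row = dict[str, float]


def accuracy(model: Callable[[Row], int], rows: list[Row], labels: list[int]) -> float:
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model: Callable[[Row], int], rows: list[Row],
                           labels: list[int], features: list[str],
                           n_repeats: int = 5, seed: int = 0) -> dict[str, float]:
    """Average drop in accuracy when one feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importance = {}
    for feature in features:
        drops = []
        for _ in range(n_repeats):
            column = [r[feature] for r in rows]
            rng.shuffle(column)  # break the link between this feature and the label
            shuffled = [{**r, feature: v} for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importance[feature] = sum(drops) / n_repeats
    return importance


def toy_model(row: Row) -> int:
    """Hypothetical model that only looks at co-founder experience."""
    return 1 if row["cofounder_years"] >= 3 else 0


toy_rows = [{"cofounder_years": float(i % 6), "headcount": float(i % 4)} for i in range(24)]
toy_labels = [toy_model(r) for r in toy_rows]
toy_importance = permutation_importance(
    toy_model, toy_rows, toy_labels, ["cofounder_years", "headcount"]
)
```

Because the toy model ignores headcount, shuffling that column changes nothing and its importance is zero, while shuffling co-founder years hurts accuracy.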

The most important features by far are those related to previous startup experience, like the number of years and number of times a founder has been a co-founder, and whether a founder has taken a company to acquisition. The next most important features are related to the founder’s level of general work experience. Several founder personality traits are also quite important when it comes to predicting the success of their startup, like risk aversion, socialness, and pragmatism. Interestingly, only one company-related feature made the top list: whether it serves US customers.

Another way to assess the importance of various features is to see how frequently they are used by the model in the aforementioned decision trees.
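Counting feature usage is simple once the trees are available as data. The sketch below assumes a hypothetical representation where a leaf is a plain number and an internal node is a `(feature, threshold, left, right)` tuple. The trees and feature names are invented and are not taken from the actual model.

```python
from collections import Counter


def split_counts(trees: list) -> Counter:
    """Count how often each feature is used as a split across all trees."""
    counts: Counter = Counter()
    stack = list(trees)
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):  # internal node; leaves are plain numbers
            feature, _threshold, left, right = node
            counts[feature] += 1
            stack.extend([left, right])
    return counts


# Two made-up trees (illustration only).
toy_trees = [
    ("cofounder_years", 2, 0.2, ("pragmatism", 0.5, 0.4, 0.9)),
    ("pragmatism", 0.5, 0.3, ("work_years", 8, 0.5, 0.7)),
]

usage = split_counts(toy_trees)
```

Split frequency and performance degradation can rank features differently, because a feature may be used often for small refinements or rarely for one decisive split.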

While the order of the features shifts a bit and a few new features come up, the general trend is the same — what matters most are features related to the founders’ work experience and previous startup experience, plus certain personality traits (I blinded a couple of features that we consider too proprietary to share publicly).

There’s a lot more I could share about our Rebel Theorem 3.0 algorithm performance, features, and how we use it in practice as a fund, but hopefully this is enough to give you a flavor. One thing I can say with confidence is, as an investor, once you’ve built the technology infrastructure and grown accustomed to making more data-driven startup investing decisions, working off gut instinct alone seems primitive, unpredictable, and even irresponsible by comparison. I have no doubt that data and AI are the future of startup investing.

¹I know this is a harsh term for a $50M company that’s still alive and well, but believe it or not, these are failures from the perspective of most early-stage venture funds after a certain amount of time. The vast majority of returns are made on the outliers that exit for hundreds of millions to billions of dollars.

²This is purely an academic exercise, because in the real world, startup selection is only half the battle. You also must be able to access the startups you wish to invest in, which requires relationship building, partner quality, the right check size, decision-making speed, reputation and brand, a strong network, portfolio support, and more.

³Includes both realized and unrealized investor returns. We conservatively assumed no return for both Dead and Zombie companies, a 50% dilution rate in subsequent rounds, and that no company is worth over $10B regardless of its actual valuation.

⁴I say this jokingly, because AI isn’t yet ready to fully replace VCs. There are still many startup characteristics it can’t yet assess as well as a human, like founder trustworthiness and the timeliness of their idea, not to mention other parts of the job like network building and portfolio support. For now at least, AI/ML + human experts is best for early-stage investing.

Jared Heyman

Tech guy and investor. Founder at Rebel Fund and previously Pioneer Fund, CrowdMed (YC W13), Infosurv & Intengo (acq. LON: NFC). Ex-Bain consultant. Data nerd.