TL;DR
Artifacts to introduce:
Experiment tracker to provide visibility and transparency.
Growth model to assess the impact of a hypothesis/experiment.
Rituals to update:
Make hypotheses stronger with the “We know, we believe, therefore we want to” approach.
Double prioritization: prioritize hypotheses first, then experiments.
Consider available motions and follow-up hypotheses when prioritizing.
Rituals to introduce:
Share the learnings.
Debrief sessions.
Growth Jams.
Far too often, it isn’t obvious to people outside of a growth org what the growth team is going to do and why. This lack of understanding costs quite a few trust points for the growth work and derails the whole concept of growth. Fast forward a few years of those questions going unaddressed, and we have a growth team that only does minor optimizations.
We don’t want that, so here is my take on how to fix (or improve!) the growth process and get everybody on the same page.
We’d need a few ingredients:
Decouple hypotheses from experiments and make those hypotheses stronger.
Craft an experiment and hypothesis tracker to serve as a single source of truth for the growth work.
Wrap up the growth work in a digestible format.
Build new rituals that bridge to other teams.
Making a strong hypothesis
A strong hypothesis a day keeps the 15-minute call with Steve from HR away!
It’s important to recognize the hierarchy here: hypotheses first, experiments second.
Teams that confuse experiments with hypotheses tend to see lower experiment velocity and a lower success rate for those experiments, all of which only reinforces the initial problem.
A hypothesis is the result of analyzing qualitative and quantitative data, plus the vector of motion we want to test after that analysis. When we apply first-principles thinking, we can derive the list of key elements to experiment on, and thus prove or disprove the hypothesis. But never in reverse.
I came up with a template I’ve been using ever since to ensure that my hypothesis is bulletproof:
We know that _____, we believe this is because _____, therefore we want _____.
In the “We know” section, I list all the quantitative evidence that helps me describe an opportunity or a problem. This is typically a correlational observation: when one metric goes up, another one goes down; or a segment of users who used feature X tends to have a conversion rate Y that is Z% higher. Data from previous experiments also goes in this section.
Quantitative data, though, never explains the root cause of the observation; it just states the fact. To do growth work properly, we need to dig deeper, and that’s where qualitative data helps. That is what goes under the “We believe” section. Here I usually draw on user interviews, meta-analyses of recent user research, market trends, or competitors’ moves. You can learn more about different ways of collecting this data in Alex Fishman’s and Behzod Sirjani’s guide on Product Discovery.
If you decide to go with user interviews, I want to call out the Adjacent User theory specifically. To really understand why some people end up activated and others don’t, it’s important to talk to both of those segments. Far too often you’ll discover a gap in how the two groups understand your product, and closing it results in a quicker win. Compared to operating with data from just one side and running countless experiments in a completely different direction, the benefit is obvious.
When I was the activation lead at Bird years ago, we talked to those who didn’t get past the activation stage. We learned a lot about the problems people face riding scooters for the first time. But by talking to those who did activate, we learned that the main difference wasn’t so much the problems, but rather the understanding of which use cases a scooter can cover. That’s the Adjacent User theory in action, unwrapping a completely new path for growth work we didn’t even know existed.
The last section, “Therefore we want”, provides the direction we want to experiment toward. When confidence is high, it can be pretty prescriptive, something like “optimizing the social proof on the paywall”. In other cases, I leave it vaguer, say “optimizing the checkout experience”, to spark the team’s creativity and account for all the possible directions, where social proof is just one of many items.
Let’s take an example. Say we’re building a music streaming app. We know that:
Paid ads got us from $0 to $5M ARR.
We achieved an 80%+ share of voice for our well-performing keywords (KWs).
Other relevant KWs have high competition, with a 30% higher CPC. Our current conversion rate doesn’t give us positive unit economics there.
We believe that:
It’ll take us about a year to be able to sustain the higher competition, according to our growth model and the forecasted results of the initiatives we’re working on.
Therefore, we want to add a new growth loop to support growth during this transition period.
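To make the unit-economics point concrete, here is a minimal sketch with hypothetical numbers (none of them come from the example above; they just show how a ~30% higher CPC can flip the margin negative at a given conversion rate):

```python
# Minimal sketch: why a ~30% higher CPC can break unit economics at the current
# conversion rate. All numbers are hypothetical, for illustration only.

cpc_current = 1.00            # cost per click on keywords we already win, $
cpc_new = cpc_current * 1.3   # the new keywords cost ~30% more per click
visit_to_paid_cr = 0.02       # click -> paying subscriber conversion rate
ltv = 60.00                   # lifetime value of a paying subscriber, $

def cac(cpc: float, cr: float) -> float:
    """Customer acquisition cost: price per click divided by click-to-customer rate."""
    return cpc / cr

for label, cpc in [("current keywords", cpc_current), ("new keywords", cpc_new)]:
    margin = ltv - cac(cpc, visit_to_paid_cr)
    print(f"{label}: CAC=${cac(cpc, visit_to_paid_cr):.2f}, LTV-CAC=${margin:.2f}")
```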
After a couple of brainstorming sessions, the team came up with the idea of adding SEO to the acquisition mix. Let’s break it down into its core components.
Say our SEO initiative is getting people’s public playlists indexed.
For us to win this game, we need the following to hold:
People create playlists.
Some % of those playlists are public.
Those playlists generate a sufficient amount of traffic.
That traffic converts to sign-ups at a high enough rate.
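To see how these four variables chain together into an outcome, here is a minimal sketch; every number is a hypothetical placeholder you would replace with real data:

```python
# Minimal sketch of the four-variable chain behind the SEO hypothesis.
# All numbers are hypothetical placeholders, not real product data.

monthly_active_creators = 100_000   # people who create at least one playlist
public_share = 0.15                 # share of playlists that end up public
visits_per_public_playlist = 3.0    # monthly organic visits a public playlist attracts
visit_to_signup_cr = 0.01           # conversion rate from a playlist visit to a sign-up

public_playlists = monthly_active_creators * public_share
organic_visits = public_playlists * visits_per_public_playlist
signups = organic_visits * visit_to_signup_cr

print(f"Expected monthly sign-ups from the loop: {signups:.0f}")
# Each factor is a separate thing to validate: if any of them turns out to be
# near zero, the whole loop collapses, which is exactly what the experiments test.
```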
Those are the four variables we want to validate, and that’s what we’ll experiment with, having multiple options for each of them. After a few iterations, we might find out that those public playlists don’t generate enough traffic. Does that mean we lost?
No, it doesn't.
We can now revisit the hypothesis and add a few learnings:
People want to make playlists and share them.
It doesn’t make much sense to have those indexed by Google.
The question becomes:
How might we capitalize on the fact that people create playlists and are willing to share them?
One possible solution is to shift our attention to the referral motion: let people work on playlists with their friends and then share and promote them within their friend groups. The “fail” from our original hypothesis leads us to a new track that can support our future growth and maybe even become our number one channel.
Or maybe the problem with our initial idea was how those playlists were named, so we might help people create names that match search terms like “Christmas music”.
The last piece of a strong hypothesis is what I call “a hypothesis chain”. If a hypothesis leads to a dead end either way, it’s not the best thing to work on. But if its outcome helps us choose one direction or another, that’s the one. A combination of qualitative and quantitative data that can spark a chain reaction indicates a potentially very strong hypothesis.
Additionally, if you observe an impact only in a subset of users or in a secondary metric, don’t rush to call it a winner; that’s p-hacking. Instead, re-launch the experiment targeting that subset of users and see if the dynamic holds. You can also replicate an experiment that shows a relatively good p-value; check Ron Kohavi’s work on this.
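As a minimal sketch of what that replication check might look like, here is a simple two-proportion z-test on the re-launched experiment; the library choice (statsmodels) and the numbers are my assumptions, not something prescribed above:

```python
# Minimal sketch: re-checking a "winning" segment with a dedicated re-launch.
# Library choice and all numbers are illustrative assumptions.
from statsmodels.stats.proportion import proportions_ztest

# Results from the re-launched experiment targeting only the promising segment.
conversions = [230, 188]    # [treatment, control] sign-ups
visitors = [4_000, 4_000]   # [treatment, control] users bucketed

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# Only if the effect replicates at a pre-registered threshold do we call it a win;
# a one-off significant slice in the original test is not enough.
if p_value < 0.05:
    print("Effect replicated in the dedicated re-launch.")
else:
    print("Did not replicate; treat the original slice as noise.")
```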
Tracking experiments and hypotheses
With all that said, we need to track our progress. Let me introduce you to the experiment tracker I’ve been using for the past several years, including at Vimeo. There are a few major benefits to using the tracker, which I now believe might be the most crucial artifact a growth team can create:
Clarity: the tracker allows everybody to be on the same page as to why the team is working on something, providing visibility into all the important details behind each experiment and hypothesis.
Knowledge capturing: as the growth team does more and more work, it’s easier for new folks to quickly search through past experiments to see which hypotheses won or were rejected, making sure no work is duplicated.
Roadmapping: it shows how initiatives competed with each other and why some of them were picked for the quarter.
Now, let’s walk through each tab, shall we?
1st prio - Hypotheses
Not all hypotheses are created equal, so I run the hypothesis prioritization exercise before I prioritize experiments. There are a lot of frameworks for this, including one of my favorites, DRICE. In a nutshell, we weigh the expected impact against how much effort it will take.
One thing I learned at Bird is that having a common metric to measure impact is the best way to make sure everybody understands why we do X and don’t do Y. Therefore, I calculate revenue impact whenever possible using the growth model (one of the most popular artifacts at Reforge!). For those of you who prefer not to tie growth to revenue, feel free to swap dollars for any other metric that describes growth; something like uploads for Vimeo or rides for Bird would do.
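Here is a minimal sketch of what that impact-over-effort scoring could look like; the formula and all numbers are illustrative assumptions, not the exact DRICE weights or the growth model itself:

```python
# Minimal sketch: scoring hypotheses by estimated revenue impact vs. effort.
# The scoring formula and every number are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    reach: int              # users affected per month
    lift: float             # expected relative lift on the target conversion
    arpu: float             # average revenue per converted user, $
    baseline_cr: float      # current conversion rate at that step
    confidence: float       # 0..1, how much we trust the estimate
    effort_weeks: float     # rough engineering + design effort

    def monthly_revenue_impact(self) -> float:
        extra_conversions = self.reach * self.baseline_cr * self.lift
        return extra_conversions * self.arpu

    def score(self) -> float:
        return self.monthly_revenue_impact() * self.confidence / self.effort_weeks

backlog = [
    Hypothesis("Paywall social proof", 50_000, 0.05, 40.0, 0.03, 0.7, 2),
    Hypothesis("Checkout redesign", 50_000, 0.10, 40.0, 0.03, 0.4, 6),
]
for h in sorted(backlog, key=Hypothesis.score, reverse=True):
    print(f"{h.name}: ~${h.monthly_revenue_impact():,.0f}/mo, score={h.score():.0f}")
```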
Now, part of the reason I’m writing this article is to highlight something that doesn’t fit into the spreadsheet tracker. The last two parts of prioritization are:
Available motion.
Hypothesis chain.
In bigger companies, we all know it’s hard to do something that goes against the current motion of the company, even if we have high hopes for a given initiative. Therefore, it’s important to consider whether we need to create that motion first or can benefit from an existing one. Feel free to skip this if you work at a smaller company.
Hypothesis chain - if possible, I try to articulate what will happen after we prove or reject the hypothesis. Knowing, at least at first glance, how the result will influence our direction helps me ensure proper experimental design and explain why something is important. The key takeaway here is that these last two prioritization lenses can significantly influence how we pitch ideas and defend the roadmap.
2nd prio - Experiments
Now, we follow the same approach we used for hypotheses, but with experiments. In the hypothesis PRD, we articulate the core principles we want to test, which, once tested, help us define the direction of future work.
I also include a few more fields here: track, experiment overview, and volume. The track’s main goal is to make it easier for other folks to quickly search through a list of hundreds of experiments and see what worked and what didn’t. The experiment overview is just a short description of the test, used to refer to it when working with peers. And volume helps us understand whether we have enough traffic for a new experiment or whether that traffic is already reserved for other tests. I often build a simple calculator right in the spreadsheet for quick math: take the average of two weeks of traffic for a given journey stage, and each time we want to add a new test to the next sprint, subtract the volume needed for that test from the two-week average. If there’s enough left, the test goes into production; otherwise, it gets postponed.
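Here is a minimal sketch of that calculator’s logic outside the spreadsheet; the stage names and volumes are made up for illustration:

```python
# Minimal sketch of the traffic-budget check described above.
# Stage names, sample sizes, and the two-week averages are illustrative assumptions.

two_week_avg_traffic = {"signup": 120_000, "checkout": 30_000}  # users per 2 weeks

# Volume already reserved by tests scheduled for the next sprint.
reserved = {"signup": 80_000, "checkout": 12_000}

def can_schedule(stage: str, required_volume: int) -> bool:
    """A new test fits only if the remaining traffic at this stage covers it."""
    remaining = two_week_avg_traffic[stage] - reserved[stage]
    return required_volume <= remaining

new_test_volume = 25_000
if can_schedule("checkout", new_test_volume):
    reserved["checkout"] += new_test_volume   # book the traffic for the sprint
    print("Test goes into production.")
else:
    print("Not enough traffic left; postpone the test.")
```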
An idea for folks who want to use the tracker above: you can also specify the first principle each experiment aims to challenge. I haven’t done it so far, simply because the maturity of the growth org within the company wasn’t quite there yet.
Postproduction and experiments in recording
All completed tests go under the Postproduction tab. I separately specify when bucketing was active for a test and what the analytical window was. If you work on something like retention, you might want to finish bucketing well before it’s time to analyze the impact on the retention rate.
Experiments that are currently live go under the “Experiments in recording” tab. Notice the choice of words: at Bird, we called active tests “experiments in flight”, a nod to birds. It’s a cultural thing; small as it sounds, it has a positive impact on team culture.
Overall, this double-prioritization approach, combined with thinking about motions and chains, ensures that we only do the work that needs to be done. Poor execution of the right thing beats excellent execution of the wrong thing, and this is my shot at identifying the right things.
Serve the growth nuggets
Once a test is moved into the postproduction stage, it’s time to share the learnings.
There are a few categories of learnings to extract:
Validating one of the first principles or core components - you just proved that something works, or doesn’t work.
The level of dependency between two metrics - the nature of the dependency comes in many flavors: it can be linear, exponential, or anything else. We’re looking for something like “a 1% increase in metric X leads to a Y% decrease in metric Z”, which reinforces our growth model and helps estimate the impact of future work (see the sketch after this list).
Triggering the chain hypotheses as follow-up steps - you can call a hypothesis validated and move on to the next one, or suggest the team run additional experiments to see whether poor test design was the reason it failed.
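Here is the sketch referenced above: a minimal example of estimating the metric-to-metric dependency as an elasticity via a log-log fit. The method and the synthetic data are my assumptions, not something from the tracker:

```python
# Minimal sketch: estimating the "1% change in X -> Y% change in Z" relationship
# as an elasticity via a log-log fit. The data below is synthetic, for illustration.
import numpy as np

# Weekly observations of two metrics, e.g. onboarding completion rate (X)
# and week-1 churn rate (Z).
x = np.array([0.40, 0.42, 0.45, 0.47, 0.50, 0.53, 0.55, 0.58])
z = np.array([0.210, 0.206, 0.199, 0.196, 0.190, 0.186, 0.183, 0.178])

# The slope of log(z) vs. log(x) is the elasticity: % change in Z per 1% change in X.
elasticity, intercept = np.polyfit(np.log(x), np.log(z), deg=1)
print(f"A 1% increase in X is associated with a {elasticity:.2f}% change in Z")
# A negative value reads as "a 1% increase in X leads to a |value|% decrease in Z",
# which is exactly the flavor of learning that feeds back into the growth model.
```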
Those nuggets are usually served in a few ways:
Slack update - while it has the least friction, it also means people might not know how to adapt the learnings to their domains.
Email write-up - similar to Slack updates, but might be beneficial if a lot of images are included.
Growth Jams - regular growth meetings, where all the stakeholders get informed on the performance of previous experiments and new ideas are presented.
The preferred way depends on how stakeholders - both external and(!) internal - want to be informed. Regardless of the broader group’s choice, I suggest holding debrief sessions with the stakeholders who can benefit most from the recent learnings.
Oftentimes, the growth team calls it a day once a learning is shared. A better way to earn trust is to assist those who can ride the learning wave with you. While working at Vimeo, I launched a profile questionnaire and an onboarding checklist. Both were successful in my product area, but I wanted them adopted across the entire org to maximize the ROI for the company. So I called a few debrief sessions with other PMs to show the impact, asked for their feedback, and asked whether they thought they could add something like that to their areas. Some agreed, and some said they didn’t have enough time. With the former group, I then either built it with my team’s resources or influenced their managers/skip managers on the direction, earning trust and visibility within that group along the way.
Debriefs are not the last touch; they’re just the beginning of the next phase of your growth work. After that, I encourage you to hold Growth Jam sessions. Those come in two flavors - internal and external.
Internal growth sessions are for validating assumptions, adding data points other people have to existing hypotheses, checking the bucketing criteria, and reporting on results and signals from past experiments. Growth managers, design and engineering partners, as well as customer research and BI team representatives, are welcome.
External sessions are for PMs from other teams and skip managers to learn about findings from completed experiments and discuss any overlapping initiatives.
Final Thoughts
By engaging other teams in growth jams early on, we make sure that none of our work comes as a surprise to them. Framing hypotheses in the “we know, we believe, therefore we want to” way helps everybody understand why we work on certain things. Pair that with separate prioritization of hypotheses and experiments, and you’ll find people debating facts rather than opinions. Internal growth jams then help conceptualize and verify the upcoming experiments.
All of that ensures that the growth team is doing the right work and is fully transparent, leaving no unanswered questions about what the team is doing and why.
Don’t forget to subscribe and let me know what you think!