BI on a Budget
An intro to my Lean Data Stack series, in which I look at the Modern Data Stack through the lens of a business on a tight budget.
Sometime around the mid-2000s, mullets came back into fashion. Rarely seen since the 80s, the mullet returned as men once again asked their barbers for "business in the front; party in the back".
Somewhere in the grey area between the mid-2000s and the late 2000s, I got a mullet. By this point, most people had already come to their senses and agreed never again to speak of the mullet's brief second wave, and I found myself alone in a sea of Razorlight-inspired longhairs.
I tell you all this not because you're interested in my past hairstyles, but because this wasn't the only time I've seen a trend emerge almost overnight, and quickly become so pervasive that it seemed like it had always been there.
Sometime in the last couple of years, every software company and blogger in the Data & Analytics space started referring to the "Modern Data Stack". This is a collection of tools used to ingest, store, transform, visualise and analyse data, often coupled with tools to support version control, CI/CD, and data governance. Somehow, it seemed like a non-existent term became ubiquitous in a matter of months.
To an extent, a "modern" data stack has always existed. According to the Cambridge Dictionary, the word "modern" means "designed and made using the most recent ideas and methods". Since the “most recent ideas and methods” change with the times, it follows that the Modern Data Stack is not a static concept, but rather something that evolves based on the ideas and technology of the day.
Unless, of course, the word "Modern" refers here to a specific historical era, in which case we could be talking about any data stack in use since the fifteenth century, which seems unlikely. As fun as it may be to speculate about Bronze Age and Mesolithic data stacks, we'll assume that "modern" refers to the current moment, whenever that moment may be.
So, even though a Google search for "Modern Data Stack" returns few results prior to 2021, it's reasonable to assume that it has always existed in some form. It just never had a name.
That being said, I’ve never known data professionals to coalesce around a common set of tools in quite the same way that they typically do today. Between its Series A and Series B fundraises, a tech business will often begin to invest in its analytics capabilities, and hire a team to centralise, model, democratise and analyse data that previously sat across multiple silos. It seems most tech companies that have developed their data stacks within the past two years are using three of the following products:
Snowflake (a cloud data warehouse)
BigQuery (another cloud data warehouse)1
dbt (a tool for building data transformations)
Tableau (a data visualisation tool)
Looker (another data visualisation tool)
Naturally, these all come at a cost. In recent years, capital has been cheap. Low interest rates and, in particular, soaring valuations have allowed founders to raise large amounts of external funding without giving away huge chunks of equity. Consequently, a recently funded business would typically have a substantial war chest, with enough cash to last a couple of years while continuing to grow a (usually) loss-making business.
Investing in expensive data tooling, therefore, is a cost that scaling tech companies have been able to take on, without worrying too much about the impact on their runway. There’s plenty more cash in VC, after all. For this reason, the focus has usually been on selecting the best tools, rather than the most cost-effective tools.
With tech valuations tumbling2 since the start of 2022, those days of cheap capital appear to have ended, for now, and businesses are adapting accordingly. The recent tech layoffs are well-documented, and account for the bulk of the savings that companies are making to preserve cash, but businesses worried about their runway are also having to make tough decisions about software.
So, rather than discussing the ideal data stack for a business with effectively unconstrained financial resources, I will discuss the data stack that offers the most value for money, through the lens of a business operating on a tight budget. Rather than doing this through a single, overlong essay, I will share my thoughts through a series of posts, each dealing with a specific part of the data stack.
Of course, the implied dichotomy between a spendthrift tech industry pre-2022 and a cash-strapped tech industry today can seem disingenuous, and, in truth, finite software budgets are nothing new. As somebody with a dual Head of Data and Head of Finance remit, I am probably more price-sensitive than the average Head of Data. To some extent, this, along with a bias against premature optimisation, explains why I lean towards tools that provide the most value for money based on the current needs of the business, rather than worrying too much about what permutation of the Modern Data Stack might best serve the business at some hypothetical future scale. So, the idea for these articles predates the recent/ongoing crash; it just happens to be even more relevant now.
All this is to say that, if you’re building data and analytics resources at a well-funded business that’s already found product-market fit, you may be coming at this problem from a different angle. However, if you’re interested in this special case of the Modern Data Stack, in which resources are limited by necessity, prudence, or both, do keep an eye out for my next few articles.
Part 1 - Data Warehousing:
Part 2 - Data Integration:
Part 3 - Data Transformation:
Part 4 - Data Visualisation:
1. Amazon Redshift and Microsoft Azure Database are still widely used, but seem (anecdotally) to lag Snowflake and BigQuery among new start-up adopters of data warehousing technology.
2. NASDAQ is a stock exchange that's heavily weighted towards tech companies, so provides a reasonable barometer of tech valuations, with the caveat that it represents a subset of publicly-traded tech companies rather than startups.