A software ecosystem built on data
I remember when I was first introduced to databases in the late 1990s. I was a 12-year-old in India, where personal computers were not yet commonplace, and it was super exciting to learn the technology. Those were the days of floppy disks, Dangerous Dave, and Road Rash. I was a curious 7th grader, tinkering with C programming, when a friend of my dad's introduced me to Visual FoxPro. It was revolutionary technology! FoxPro enabled programmers to write web applications, clients, and middleware, as well as manage the underlying databases and tables.
I was hooked, building contact lists of friends and family, to-do lists, and wish lists of things that I wanted to get, which included a personal computer. Even though I sought a career in the Air Force as a teenager, databases became one of my favourite things.
Databases helped structure software and the Internet.
Tabular representations of lists and aggregates of values seemed to be everywhere in day-to-day decisions. At school and at home, tests, exams, and schedules tied numbers to time and work. The table was a seemingly ubiquitous primitive of information organization and decision making. Like a lot of people, my left brain developed mainly on formulas and tables as I took courses in mathematics, statistics, and computer science.
A quantitative approach to validating empirical evidence and hypotheses was something that I took for granted. The watershed moment for me came in 2011 when I took a biostatistics course. Models built on data were already shaping various aspects of the world. In that class, I learnt about predictive models that guided healthcare systems. Since then, I have had extensive exposure to this pattern in the surveillance, life sciences, manufacturing, automotive, eCommerce, and people analytics industry verticals. And I noticed a basic problem everywhere, hiding right under our noses.
Next time you think about an intelligent application that leverages large amounts of data, dig into the cost and the complexity of the data processing. Learn about the success rate of data initiatives. Ask about the experience of the people working to solve hard problems with data. Identify the capital and operating investment of running a large-scale data project in terms of infrastructure, time, and talent. Look into the return on investment. What value does the data consumer get, and what are they willing to pay for that value? Consider the natural resources consumed to run the software.
We built the entire ecosystem of hardware and software around data, and for a large part, we have optimized it for hoarding data. The rate of data creation has been growing exponentially with the increase in connected devices and improvements in networks and telecommunications.
Here are some stereotypical Big Data facts, found on a lot of blogs for free:
In 2022, we created 3.5 quintillion bytes of data every day.
As per Statista, we have created 97 zettabytes (97 trillion gigabytes) of data globally, and they expect this to almost double, to 181 zettabytes, by 2025.
We have created 99% of all the digital data in the past 10 years and 90% in the past 2 years.
Businesses use up to 20% of the data that they create and store, and the remaining 80% is waste.
Most of us digital natives take our connected reality and data footprint for granted. But imagine what the world would look like if we could optimize our data footprint. If we could create and use only the data that we needed, and deliberately discard what we did not.
It would be a world with simpler software that is easier to use to get our job done. A world where we would not need multiple pieces of software just to manage our software and cloud related expenses. A world with sustainable digital infrastructure that does not consume as much electricity as entire countries while delivering negligible value.
All of this is possible. As we progress into the 21st century, we are at the cusp of a data revolution, one that will shape the future of businesses and society at large. As we approach this transformation, it is our collective responsibility as digital natives to ensure that we deliberately shape data products for the benefit of businesses, people, and society. Three key shifts will define the third data revolution.
1. Data becoming a product
In early 2023, Coalesce published a top data technology trend report, including a prediction of the growth and maturation of data as a product. This prediction is consistent with my experience of launching data as a product at an eCommerce scale-up, where I delivered the first data assets to enterprise customers in late 2021.
The growth of data as a product will be a forcing function for a more deliberate focus on data governance, quality, and contracts. It may also be a catalyst for the growth and maturation of domain-driven data ownership paradigms like data mesh and data fabric.
However, one of the biggest problems that we need to address as we scale data products is duplication and waste.
Thomas H. Davenport, Randy Bean and Shail Jain wrote a Harvard Business Review article making a case for data products and data product managers. The article summary provides a directional description of data products - “As companies have struggled to make use of datasets and AI, many have started to create data products — reusable datasets that can be analyzed in different ways by different users over time to solve a particular business problem.”
The rationale presented in the article is excellent, but the definition of data products is limited. That is likely because it sticks to the SaaS pattern and arrives at DaaS or DaaP as a concept. This pattern addresses the important element of focusing on the problems and the jobs that consumers are hiring the product, or the data, to solve. However, most essays limit the definition of data products in a crucial economic sense. The purpose of a product is not just to enable the consumer to accomplish a job with relative ease.
A product is a vector of value exchange.
The most powerful application of the lens and principles of product management is to maximize the value captured by maximizing efficiency, simplicity, and distribution.
A proper definition of data products needs to index on the meaning of a product more than on the attributes of the data or the underlying technology. A simple definition of a product is an item that is made to be sold to build a business. It is not practical to build a business without cash flow and profits, and a product is how a business packages value and earns the cash flow and profits it needs to survive.
A data product is an item that is rooted in data made to be sold to build a business.
Reusable datasets are data products.
Predictive algorithms and models are data products.
Trends, reports, visualizations, and insights are data products as well.
Tooling that is built to support data products would also belong to the category of data products.
This way of defining data products is better suited to a deliberate, product-centric approach to defining the value of the system, the cost of goods sold, the unit economics, and so on.
Consumers of datasets are ultimately consuming the data to inform their business decisions and improve their profitability. Different data products have different consumers, with different needs and jobs to get done. This mental model will move us into a paradigm of sustainable data products that measure and improve efficiency and reduce waste.
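To make the unit economics concrete, here is a back-of-the-envelope sketch in Rust. Every number and field name is hypothetical; the point is only that a data product has a cost of goods sold and a margin, like any other product:

```rust
// A back-of-the-envelope model of data product unit economics.
// All numbers are hypothetical; the point is that a data product
// has a cost of goods sold and a margin like any other product.

struct DataProductEconomics {
    monthly_storage_cost: f64,  // object storage, replicas, backups
    monthly_compute_cost: f64,  // pipelines, transformations, serving
    monthly_talent_cost: f64,   // engineering and operations time
    paying_consumers: u32,
    price_per_consumer: f64,    // monthly subscription per consumer
}

impl DataProductEconomics {
    fn cogs_per_consumer(&self) -> f64 {
        (self.monthly_storage_cost + self.monthly_compute_cost + self.monthly_talent_cost)
            / self.paying_consumers as f64
    }

    fn gross_margin(&self) -> f64 {
        1.0 - self.cogs_per_consumer() / self.price_per_consumer
    }
}

fn main() {
    let product = DataProductEconomics {
        monthly_storage_cost: 12_000.0,
        monthly_compute_cost: 30_000.0,
        monthly_talent_cost: 50_000.0,
        paying_consumers: 400,
        price_per_consumer: 300.0,
    };
    println!("COGS per consumer: ${:.2}", product.cogs_per_consumer());
    println!("Gross margin: {:.1}%", product.gross_margin() * 100.0);
}
```

With these made-up inputs, the product earns roughly a 23% gross margin. A data product that cannot sustain a positive margin is an R&D project, not a product.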
2. Data stack going back to the basics
In a VentureBeat article, Naveen Zutshi, the CIO of Databricks, says, “For a company to truly move forward with digital transformation, they need to combine data science and data analytics and draw from a single source of truth. We’ll see more CIOs cutting back on vendor spending to simplify their data architecture. Companies that implement an architecture that combines hindsight and predictive analytics to deliver efficient and intelligent solutions will win in the end.”
As an evangelist of a single source of truth, I am aligned with Naveen's comment. But we need to acknowledge the noise created by the growing set of tools that are rethinking layers of abstraction, collaboration, and workflows. The underlying primitives of data collection and processing have been the same for a significant period, and yet the ecosystem of tools has expanded. Most of the available tools are the same old wine in new bottles.
The FirstMark MAD (Machine learning, AI, and Data) Landscape, built using data from CBInsights, has been the talk of the data town on the internet since late February 2023. A comment on the launch post demonstrates the growth of data tooling: “This year, we have a total of 1,416 logos appearing on the landscape. For comparison, there were 139 in our first version in 2012.” That is more than 10x in 10 years.
Integrations and connectors have so far not done much beyond moving data and making more copies. The single source of truth has remained a theoretical ideal, and data silos have prevailed. As competitive as the pricing may be, every additional unit of storage, memory, and compute costs hardware resources, power, and money.
What can we make of the data tools ecosystem?
Competition is great for fostering innovation and the data ecosystem has certainly fostered a high level of competition.
The boundaries between frameworks and platforms, horizontal plays and vertical plays, open source and for-profit have gradually become more and more obscure.
In terms of architecture design, tool stack selection, and return on investment, there are a lot of options, and that is a double-edged sword because of the complexity of data management.
The modern data stack seems to be solving problems that the data stack itself created in the past, and creating more problems by making outsized commitments and under-delivering.
Friendly Reminders:
A foundational trait of algorithms and data structures is optimization.
A critical idea, complementary to the notion of garbage in, garbage out, that the data product ecosystem needs to comprehend is the no-free-lunch theorem.
The most underrated question to ask about any machine learning project is whether we can do it without machine learning. Embedded within that question is the purpose, rationale, and value of solving any problem using machine learning.
Every job that needs to be done is already accomplished by some workaround or alternative product. Every product, therefore, needs to consider the shift in behaviour it demands of consumers and the friction in adoption. Finally, the most important question is the return on investment. In other words, is the juice worth the squeeze?
Tooling redundancies and silos are an anti-pattern of consolidation and a single source of truth. It is bizarre how little value we have historically given to data quality, governance, version control, and so on. If data were framed as a product, we would stop hearing things like ‘we want to treat data as a first class citizen!’ We would treat data products as software, as sketched below.
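One concrete form of treating data products as software is a versioned, testable data contract. Here is a minimal sketch in Rust; the schema and field names are hypothetical, and the point is that the contract lives in code and is enforced in continuous integration like any other test:

```rust
// A minimal sketch of a data contract expressed as code. The schema and
// field names are hypothetical; the point is that the contract is
// versioned with the software and enforced like any other test.

#[derive(Debug)]
struct OrderEvent {
    order_id: String,
    amount_cents: i64,
    currency: String,
}

// Contract rules the producer promises and consumers rely on.
fn validate(event: &OrderEvent) -> Result<(), String> {
    if event.order_id.is_empty() {
        return Err("order_id must not be empty".into());
    }
    if event.amount_cents < 0 {
        return Err("amount_cents must be non-negative".into());
    }
    if event.currency.len() != 3 {
        return Err("currency must be a 3-letter ISO code".into());
    }
    Ok(())
}

fn main() {
    let good = OrderEvent {
        order_id: "ord-1".into(),
        amount_cents: 1999,
        currency: "USD".into(),
    };
    let bad = OrderEvent {
        order_id: "ord-2".into(),
        amount_cents: -100,
        currency: "USD".into(),
    };
    assert!(validate(&good).is_ok());
    // A contract violation fails fast, in CI, instead of polluting a lake.
    assert!(validate(&bad).is_err());
    println!("contract checks passed");
}
```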
3. Increasing focus on privacy, ethics, security, and sustainability
As data products scale, there will be more scrutiny of the side effects. We are nearly two decades into the promises of the efficiency of the big data analytics ecosystem, pay-as-you-go cloud compute, and superior artificial intelligence technology. That is long enough for folks in the technology economy, who have experienced the digital transformation first hand, to start critically evaluating the impact delivered against the promises.
Whatever the future may hold, the value equation is not going to change. There is a significant cost to building data products, in the form of infrastructure utilization, power consumption, and investment in talent.
For many data products, a significant amount of this cost is front-loaded or paid upfront. That can work if the data product is a secondary revenue stream of a large corporation, but not if it is the primary bet of a company trying to make it in the market.
Sure, generative AI is cool, and it is great to have a simple interface to interact with a large language model. Some leaders are celebrating yet another ‘game changer’ technology, while others are sounding alarm bells. Some critics of the technology have been silenced by major tech companies for publishing academic papers questioning its impact, and layoffs are affecting teams of ethicists.
Some of the bigger concerns regarding privacy, ethics, security, and sustainability are surfacing, and it is sensible to consider the key arguments and challenges in these categories. People are far more conscious and intentional about the privacy of their data. Security issues are associated with privacy as well as with the prevention of identity theft and fraud, which could become even more rampant with the general availability of generative artificial intelligence.
Ethics and sustainability have several dimensions in play, including energy consumption, costs, algorithmic biases, and more.
Sustainability is a focus both in terms of the economics of business and the consumption and waste of natural resources. In a controversial paper titled ‘On the Dangers of Stochastic Parrots,’ researchers present the cost of language models: “Training a single BERT base model (without hyperparameter tuning) on GPUs was estimated to require as much energy as a trans-American flight.”[1]
It is sensible to try to learn from past errors to inform the future of the data stack and data products. Building profitable and useful data products requires substantial effort, and the odds have never been favourable.
The Second Data Revolution - How We Got Here
Soon after I started working in 2006, I was captivated by the second data revolution that was taking over. It was exciting to build distributed storage, parallel processing, in-memory computation, predictive models, and machine learning applications. Data warehouses and massively parallel processing stores complemented databases. Extract, Transform, and Load was the party we went to every night, and Big Data promised us a future with insights that would bring unreal returns to businesses.
Never mind the technical jargon, most of these projects resulted in Swiss army knives looking for pockets, or solutions looking for problems.
Data projects are known for a high failure rate, around 87%. Machine learning projects carry an added layer of complexity in technical debt, as outlined in a paper published by folks from Google. They draw a valuable conclusion: “solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice.”[2]
It seems like we are expecting a much larger return on investment, contrary to the first principles of computer science. “We have forgotten that the driver of innovation for the last 100 years has been efficiency,” said Steve Teig, CEO of Perceive. “That is what drove Moore’s Law. We are now in an age of anti-efficiency.”[3]
Since being swayed by the promises of Cloud and Big Data in 2011, I have worked with exceptional data engineers, data scientists, infrastructure developers, and more. I appreciate the challenges with people, communication, professional development, and leadership, and these are certainly a major area of focus in the future of data products. In my experience, the people problem exists in the context of the culture that develops around the frameworks and primitives of engineering.
As we look to the future of data, if we want a sustainable model for developing valuable data products, refactoring the primitives is complementary to solving the challenges of communication and collaboration. It is not enough to just implement processes or train people; ultimately, what we build has to run on silicon and transistors. We need to build systems that focus on the quality of data at the source and that will truly scale in value.
The problem with data
If we consider how the software ecosystem has evolved to make use of data, the challenges are clear. Historically, software has been written with predefined data models optimized for entities and transactions.
As the scale of storage and compute has grown, the engineering ecosystem has looked to develop hardware and software to do more with data at larger scales. We now have several patterns for dealing with data on disk and in memory, using several forms of partitioning, serialization, and formats. Each of these innovations has solved a specific set of problems for a specific set of use cases.
We now have a large number of data serialization and query patterns that do not port easily from context to context. This lack of interoperability, as the same sources are taken through use-case-specific workflows, leads to contextual silos that are extremely hard to clean up.
We could take the ‘data is oil’ analogy and use the model of a refinery to think about this. There is, however, a key difference between the oil that flows through a refinery and the data that flows through layers of transformations: the amount of residue left behind.
Every company auditing the costs of its cloud infrastructure and data deals with data retention in its first steps of optimization. Why is that? Because we have built big data systems for long retention and high availability with little to show in economic value. We have considered the cost of storing and processing data at scale to be negligible compared to the benefits and poured a ton of investment into it, and largely the juice has not been worth the squeeze.
The data industry is running a failing software business
The internet and social media have many examples of economic nightmares associated with large-scale data processing. Take popular cloud computing vendors, as well as data platforms. Ask leaders from any company whether the ratio of their cloud compute costs to bottom-line revenue is of any concern to them. I bet that we will find a statistically significant number of people who are concerned about their digital unit economics and cost of digital goods sold.
In a paper exploring energy and policy considerations for deep learning models, University of Massachusetts researchers estimated the development cost of a part-of-speech tagging model referenced as "Linguistically-Informed Self-Attention." Over six months of model development, cloud compute costs were estimated at up to $350K, and power consumption costs at roughly $9,870.[4]
Anthropologist Steven Gonzalez Monserrate refers to the digital cloud as a carbonivore![5]
These are not illegitimate claims. The power consumption of data centres and digital devices has been growing with technological innovation. Researchers from Meta recommend in a paper: “To develop AI technologies responsibly, we must achieve competitive model accuracy at a fixed or even reduced computational and environmental cost… Not all data is created equal and data collected over time loses its predictive value gradually. Understanding the rate at which data loses its predictive value has strong implications.”[6] Researchers have also built a site to check the carbon footprint of machine learning models.
The Third Data Revolution
To play a long game, it is prudent to treat most architecture and data decisions as one-way doors. It is usually feasible to pivot or change course in a different direction, so technically these decisions are two-way doors, but ones with a costly return ticket. Moving from one pattern and system to another is complicated. It has been easier to choose a model of addition and integration as the path of least resistance.
The road to hell is paved with good intentions…
and littered with low-hanging fruit…
Indeed, it is a good intention to continue business as usual, keep adding more tools, layers of abstraction, and processes, and keep the core primitives intact. But there is no way for companies to build profitable data products by adding and integrating, even if consumers pay. The economics do not work out.
The way forward is a paradigm where the primitives of data capture and transformation are built to power the future of data products and services.
These primitives need to be easy for engineers to use, and optimized for time-sensitive data utilization.
The stack needs to be reduced to the application data model, event streams, and feature stores, with the minimum redundancies required to operate. Finally, the technology needs to be performant and nimble in terms of its storage, compute, and memory consumption footprint.
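As a minimal sketch of that reduced stack, with entirely hypothetical types and field names, the three layers can be expressed as plain Rust types, where stream events and feature records are derived from the application data model rather than copied ad hoc:

```rust
// A minimal sketch of a reduced data stack, with hypothetical types.
// The application data model is the source of truth; stream events and
// feature records are derived from it rather than copied ad hoc.

/// Application data model: owned by the service that writes it.
struct Order {
    id: u64,
    customer_id: u64,
    amount_cents: i64,
}

/// Event stream: append-only facts derived from the data model.
#[derive(Debug)]
enum OrderStreamEvent {
    Placed { order_id: u64, customer_id: u64, amount_cents: i64 },
    Cancelled { order_id: u64 },
}

/// Feature store record: a compact, model-ready aggregate.
#[derive(Debug)]
struct CustomerFeatures {
    customer_id: u64,
    orders_last_30_days: u32,
    total_spend_cents: i64,
}

impl Order {
    /// Derive the event at the source, so quality is enforced once.
    fn placed(&self) -> OrderStreamEvent {
        OrderStreamEvent::Placed {
            order_id: self.id,
            customer_id: self.customer_id,
            amount_cents: self.amount_cents,
        }
    }
}

fn main() {
    let order = Order { id: 1, customer_id: 42, amount_cents: 1999 };
    println!("{:?}", order.placed());
    println!("{:?}", CustomerFeatures {
        customer_id: 42,
        orders_last_30_days: 1,
        total_spend_cents: 1999,
    });
}
```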
The good news is that the shift has been in motion for quite some time now and is gaining momentum. The pattern is not merely theoretical; it is one that people are implementing, and we will see the maturity of this model in the coming years.
I am privileged to be part of one such team, persevering to develop a data platform built for a sustainable future.
The foundation of our worldview at InfinyOn is data products.
At InfinyOn, we are building a platform for the time-sensitive data products of the future. Version control, sharing, reusability, efficiency, security, and scale are part of the acceptance criteria.
Data as a product is a critical step in reducing waste and optimizing the value of software
Treating data as a product would make data assets business critical by default, not an afterthought. One of the biggest shifts of this mindset is that data assets and teams would move from being framed as innovation investments and R&D to being accountable as a cost centre with clear expectations of return on investment.
The product management function, emphasizing product discovery, consistent iterative delivery, and the strategic innovation that has already shaped several companies, would finally have the ability to shape data assets. Data products will gradually move away from ‘solutions in search of problems’ and ‘if we build it, they will come’ framing toward asking what problems consumers are willing to pay to solve, and how we can test and iterate solutions without having to front-load investments.
We will move away from data hoarding, deliberately consider the value of data over time, and apply proper cost-benefit measures to evaluate bets and investments. We will move away from ‘slideware’ projects and invest in systems that are truly valuable. Software engineering practices, infrastructure management practices, operations, and reliability will be a consistent pattern.
Data defined software and hardware. Now it’s time for us to redefine data
We are more aware than ever of how data impacts software. Data has propelled the progress of a large part of the digital ecosystem. The explosion of the data tooling ecosystem has helped us with many concepts of what a future of sustainable data products will look like.
Will the new data paradigm eliminate the headache of engineers on call and small deployments messing up entire pipelines?
Will the leaders and teams that build walls and blockers against flexibility and access to data change their outlook as data becomes an asset?
Will data quality monitoring, observability, and contracts become acceptance criteria?
Will software engineers start treating data, data flows, and pipelines as first-class citizens? Will we move away from the paradigm of human middleware bottlenecked by a barrage of ad hoc queries?
We don’t have to imagine this. There are extremely successful implementations of these use cases in products and services that we use. We have already seen proper implementations of data products at organizations like Netflix, Uber, Coca-Cola, and DBS Bank, among others.
Just like these companies, today more than ever, digital companies have the ability to build sustainable and valuable data products. We are at a critical fork in the road where we need to be thoughtful about our decisions. None of us by ourselves will have all the answers. We must be thoughtful about the decisions we make around data products and collaborate effectively with the ecosystem to reduce waste and increase efficiency.
These capabilities will become generally available to all companies as we move into a data product development paradigm that embraces a flow-based programming experience, leveraging advances in performant and secure languages like Rust and usability advances in WebAssembly.
It’s now possible to experience a world where data processing is simply deconstructed into patterns: capturing inbound data flows, applying operators for augmentation and enrichment, taking action based on parameters and thresholds, and piping the processed data into databases, feature stores, and analytical stores. Machine learning models would make inferences on the inbound data flow using APIs or on-stream operators. We would gradually clean up the data junk and build sustainable, precise data products that are not characterized by the need for ever more data and the law of diminishing returns.
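As a minimal sketch of that capture-enrich-act-sink pattern, with hypothetical event names and thresholds, and with a plain Rust iterator standing in for a real streaming runtime:

```rust
// A minimal sketch of flow-based data processing: capture, enrich,
// act on a threshold, and sink. Event names and thresholds are
// hypothetical; a real deployment would use a streaming runtime
// rather than an in-memory iterator.

#[derive(Debug, Clone)]
struct Reading {
    sensor_id: u32,
    celsius: f64,
}

#[derive(Debug)]
struct Enriched {
    sensor_id: u32,
    celsius: f64,
    fahrenheit: f64,
}

fn main() {
    // Capture: an inbound flow of events (stubbed as a vector here).
    let inbound = vec![
        Reading { sensor_id: 1, celsius: 21.5 },
        Reading { sensor_id: 2, celsius: 88.0 },
        Reading { sensor_id: 1, celsius: 22.1 },
    ];

    let processed: Vec<Enriched> = inbound
        .into_iter()
        // Enrich: augment each event with a derived field.
        .map(|r| Enriched {
            sensor_id: r.sensor_id,
            fahrenheit: r.celsius * 9.0 / 5.0 + 32.0,
            celsius: r.celsius,
        })
        // Act: flag events that cross a threshold before they land anywhere.
        .inspect(|e| {
            if e.celsius > 80.0 {
                eprintln!("alert: sensor {} overheating: {:.1}C", e.sensor_id, e.celsius);
            }
        })
        .collect();

    // Sink: pipe the processed flow to a store (stdout stands in here).
    for event in &processed {
        println!("{:?}", event);
    }
}
```

In a production setting, the iterator would be replaced by a stream consumer and the sink by a database or feature store writer; the shape of the pipeline stays the same.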
P.S.: If you are reading this and you have thoughts on the essay, I would love to hear from you. In the comments, or in conversation.
References
1. On the Dangers of Stochastic Parrots: https://dl.acm.org/doi/pdf/10.1145/3442188.3445922
2. Hidden Technical Debt in Machine Learning Systems: https://proceedings.neurips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
3. AI Power Consumption Exploding: https://semiengineering.com/ai-power-consumption-exploding/; AI’s staggering energy cost: https://www.wired.com/story/ai-great-things-burn-planet/
4. Energy and Policy Considerations for Deep Learning in NLP: https://arxiv.org/pdf/1906.02243.pdf
5. The Cloud Is Material: https://mit-serc.pubpub.org/pub/the-cloud-is-material/release/1
6. Sustainable AI: Environmental Implications, Challenges and Opportunities: https://arxiv.org/pdf/2111.00364.pdf