Data Engineering Digest, April 2024
Hello Data Engineers,
Welcome back to another edition of the Data Engineering Digest, a monthly newsletter containing updates, inspiration, and insights from the community.
Here are a few things that happened this month in the community:
The most common tools data engineers use for orchestration.
How to create pipelines for difficult-to-automate sources.
What’s your preferred file format and why? (Not a good Q for a first date)
How data engineers master SQL.
The most challenging data quality issues we face with the rise of AI.
Why do modern OLAP dbs not have secondary indexes?
Community Discussions
Here are the top posts you may have missed:
1. What tools do you use to orchestrate pipelines?
Workflow orchestrator discussions are extremely common in the DE community. Even so, the tools keep evolving rapidly, and data engineers are constantly trying to keep up with the tradeoffs.
While we’ve indirectly touched on workflow orchestrators in the past, we think it’s a topic that deserves its own post.
💡 Key Insight
The top 5 tools pulled from the discussion are:
Cron - A simple command line scheduler found on most Unix systems. Developed by AT&T Bell Labs back in 1975 and still widely used today.
Windows Task Scheduler - A Microsoft Windows-based scheduler similar to Cron but with a GUI. Built by Microsoft in 1995.
Dagster - A cloud-native orchestrator with built-in software engineering best practices. Released in 2019.
Apache Airflow - A widely-used open source workflow orchestration tool. Created by Airbnb in 2014 (now an Apache project) and still very popular among DEs.
AWS EventBridge Scheduler - A serverless scheduler that’s well integrated with AWS services. Works like a more advanced version of Cron and is compatible with Cron syntax.
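For reference, a Cron expression packs a schedule into five fields: minute, hour, day of month, month, and day of week. Here’s a minimal sketch that previews upcoming runs using the croniter Python library (our choice for illustration; any Cron parser would do):

```python
from datetime import datetime

from croniter import croniter  # pip install croniter

# "At 06:30 on every weekday": minute hour day-of-month month day-of-week
expression = "30 6 * * 1-5"

# Preview the next three run times from a fixed starting point
runs = croniter(expression, datetime(2024, 4, 1))
for _ in range(3):
    print(runs.get_next(datetime))
```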
While the question was about which tools we use to orchestrate pipelines, a few of the top answers are schedulers.
A scheduler is a simple tool that triggers a job on a recurring basis. As you can see from the responses, they work well for their use case and are reliable and cheap to use. The major downsides are: they lack awareness of the state of data pipelines, are difficult to develop/test with, and typically don’t have built-in monitoring or alerting capabilities.
So when should you use a scheduler vs a workflow orchestrator?
Scheduler: Use for small scale and simple workflows with few dependencies.
Workflow orchestrator: Use for large scale/mission critical workflows or complex workflows with multiple dependencies and conditional branching. Additionally, choose this option if you need better monitoring and alerting for errors.
While workflow orchestrators are typically better suited for data engineering, they also come with the overhead of configuring and managing them. Schedulers will always have a role to play, and data engineers shouldn’t overcomplicate simple pipelines.
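To make the scheduler vs. orchestrator contrast concrete, here’s a minimal sketch of an Airflow DAG using the Airflow 2.x TaskFlow API (the task names and logic are hypothetical). Unlike a Cron entry, it expresses the dependency between steps, and the orchestrator tracks each run’s state:

```python
from datetime import datetime

from airflow.decorators import dag, task  # Airflow 2.x TaskFlow API

@dag(schedule="@daily", start_date=datetime(2024, 4, 1), catchup=False)
def example_pipeline():
    @task
    def extract() -> list:
        # Hypothetical: pull rows from a source system
        return [{"id": 1, "value": 42}]

    @task
    def load(rows: list) -> None:
        # Hypothetical: write rows to a warehouse
        print(f"loading {len(rows)} rows")

    load(extract())  # load runs only after extract succeeds

example_pipeline()
```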
2. How often do you run into data sources that you can’t automate?
Example for me would be a vendor portal that’s locked down and only allows export to csv or excel. They won’t allow us to directly connect…Thoughts?
Connecting to data sources has become much easier over the past several years, but data engineers still face situations where there is no easy, automatable way to ingest the data they need. Hopefully you won’t face this too often in your career, but it’s still good to know how data engineers are currently solving the problem.
💡 Key Insight
For the data engineer who asked the question, the typical answer is web scraping via browser automation or robotic process automation (RPA). These are similar, popular techniques for gathering data directly from websites that don’t offer an API.
Browser automation allows you to programmatically traverse the contents of a website to pick out and save data that you care about. You would typically find this data by identifying the HTML/JavaScript/CSS elements that contain the data. Selenium is a commonly used tool for this task in data engineering.
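As a rough sketch of what that looks like in practice (the URL and CSS selector below are hypothetical), a Selenium script might be:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Selenium 4 manages the driver binary itself
try:
    # Hypothetical portal page; a real script would also handle login and waits
    driver.get("https://example.com/reports")
    # Pull the text out of every cell in a hypothetical results table
    cells = driver.find_elements(By.CSS_SELECTOR, "table#results td")
    print([cell.text for cell in cells])
finally:
    driver.quit()
```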
Power Automate, UiPath, and Automation Anywhere are some other popular RPA tools mentioned for web scraping. These tools have become more popular in recent years; they are typically low-code and let users point and click at elements to automate and extract information.
The downside of these approaches is that websites change all the time and you may end up doing a lot of maintenance to keep one of these data pipelines running. It’s worth reaching out to the source first to see if there’s another way before resorting to web scraping.
3. Preferred file format and why? (CSV, JSON, Parquet, ORC, AVRO)
There are several different file formats that are commonly used in data engineering. If you’re new to the field you may wonder what you should be using and when.
💡 Key Insight
CSV - Widely used format, human readable, moderately easy to process with machines.
JSON - Easy to process with machines. Still readable but not as readable as CSV.
Parquet - Columnar file format, ideally suited for analytical workloads.
Avro - Row-based format, better suited for write-heavy workloads than Parquet.
ORC - Columnar file format with less ecosystem support than Parquet.
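To see the practical difference, here’s a small sketch round-tripping a toy DataFrame through CSV and Parquet with pandas (assumes pyarrow is installed for the Parquet calls). CSV relies on type re-inference at read time, while Parquet stores the schema:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "day": pd.to_datetime(["2024-04-01", "2024-04-02", "2024-04-03"]),
})

# CSV: human readable, but the schema is gone; "day" comes back as a plain string
df.to_csv("events.csv", index=False)
print(pd.read_csv("events.csv").dtypes)

# Parquet: columnar and typed; "day" round-trips as a datetime
# (requires pyarrow or fastparquet)
df.to_parquet("events.parquet", index=False)
print(pd.read_parquet("events.parquet").dtypes)
```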
Some other takeaways from this discussion:
Parquet is very popular for analytical workloads, but newer table formats like Iceberg and Delta Lake (which typically store their data as Parquet underneath) are catching up quickly.
CSVs, while widely used, can be brittle/difficult to ingest if quoting or delimiters are not configured correctly.
If all you’re doing is bulk loading data into a data warehouse, CSV may perform better than Parquet (YMMV).
4. How do you guys master SQL?
One member was seeking advice on learning and mastering SQL for a potential career change. Despite knowing the basics, when confronted with more advanced topics like common table expressions (CTEs), window functions, etc. they felt “stuck and confused” and asked the community for tips on mastering the language.
While relatively simple at first glance, SQL is a deep and robust language. The basics are easy to pick up, but mastering it can take years of real-world experience. SQL is a core skill every data engineer needs in order to query and manipulate data efficiently.
💡 Key Insight
Practice, practice, practice. Platforms like LeetCode and HackerRank are great for solving SQL problems, but they aren’t a substitute for experience with real-world data. Data engineers shared that they do in fact use many of the “advanced” SQL concepts the author mentioned in their day-to-day work.
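If you want a concrete taste of those “advanced” concepts, here’s a small sketch (run with DuckDB in Python, on made-up data) that combines a CTE with a window function:

```python
import duckdb  # pip install duckdb

con = duckdb.connect()
con.execute("CREATE TABLE sales (region TEXT, amount INT)")
con.execute(
    "INSERT INTO sales VALUES ('east', 100), ('east', 250), ('west', 300), ('west', 50)"
)

# The CTE filters the data; the window function ranks rows within each region
rows = con.execute("""
    WITH big_sales AS (
        SELECT * FROM sales WHERE amount > 75
    )
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM big_sales
    ORDER BY region, rnk
""").fetchall()
print(rows)  # [('east', 250, 1), ('east', 100, 2), ('west', 300, 1)]
```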
In the discussion, a great free SQL course on Free Code Camp was shared (no affiliation).
We also have a ton of great DE-recommended SQL learning resources on the wiki.
5. What are the most challenging data quality issues you face frequently?
I read this stat somewhere that about 56% of organizations don’t have high enough data quality to take advantage of advanced analytics and AI capabilities. I’m curious to learn what the most common data quality challenges are that you face every day and wish you had a solution for.
Many memes touch on this problem, which is being exacerbated by the demand for AI. One of the key tenets of AI/ML is garbage in, garbage out: if the training datasets are skewed, full of errors, or improperly prepared, the result is biased or ineffective models.
Data engineers play a crucial role in creating a culture of shared ownership of data quality, maintaining high standards, and ultimately blowing the whistle when data is being used inappropriately or would adversely affect decision making.
💡 Key Insight
The biggest data quality challenges data engineers face are:
Not being able to tie together customer data from different systems.
Manually entered data that contains errors, misspellings, or duplicate data.
Schema changes without change management, or the lack of a unified schema.
Missing or incomplete data which may skew analytics/ML/AI.
These are all common problems for data engineers, and perhaps all symptoms of the same underlying issue: master data management (MDM) and data governance aren’t given sufficient priority, or are perceived as less important than other initiatives.
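Most of these issues can at least be caught before they propagate. Here’s a minimal sketch (pandas, with hypothetical file and column names) of the kind of checks a pipeline might run before loading:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input file

# Missing or incomplete data
null_counts = df.isna().sum()

# Duplicate records, e.g. from manual entry
dup_count = df.duplicated(subset=["customer_id"]).sum()

# Schema drift: compare against the columns we expect
expected = {"customer_id", "name", "email", "signup_date"}
unexpected = set(df.columns) - expected
missing = expected - set(df.columns)

if dup_count or unexpected or missing or null_counts.any():
    raise ValueError(
        f"quality check failed: {dup_count} dupes, "
        f"missing cols {missing}, unexpected cols {unexpected}"
    )
```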
Have thoughts on how to fix this? Join the discussion.
6. Why don't modern OLAP data warehouses use secondary indexes like B-trees or similar for indexing arbitrary columns?
One member decided to query some large public datasets with BigQuery for a side project and it led them down a rabbit hole of trying to find answers about why modern OLAP databases don’t typically support secondary indexes.
If you are optimizing your queries on an OLAP database and wondering why it’s scanning the entire column then this post is for you.
💡 Key Insight
OLAP databases are optimized for one thing: big analytics. Think throughput, not latency.
That means aggregating a lot of data across a few columns. To make filtering more efficient, they typically offer strategies like clustering and partitioning, which organize the data on disk so the query engine can skip irrelevant blocks instead of scanning everything.
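As a down-to-earth illustration of partition pruning (using PyArrow on a toy dataset; real warehouses do the equivalent internally), the filter below lets the reader skip entire files rather than consult an index:

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({
    "region": ["east", "east", "west"],
    "amount": [10, 20, 30],
})

# Write one directory per region value (hive-style layout: region=east/, region=west/)
ds.write_dataset(
    table, "events", format="parquet",
    partitioning=ds.partitioning(pa.schema([("region", pa.string())]), flavor="hive"),
)

# The filter prunes whole partitions: only files under region=west/ are opened
dataset = ds.dataset("events", format="parquet", partitioning="hive")
print(dataset.to_table(filter=ds.field("region") == "west"))
```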
TL;DR: If you’re wondering why it’s taking a while to find a single row in a large dataset, the simplified answer is that it’s not what it’s optimized for.
BUT many OLAP databases do offer solutions to this problem. Snowflake has its search optimization service, ClickHouse has data-skipping indexes, StarRocks has a bloom filter index, and there are many other examples. While these solutions differ from a typical index in an OLTP database, they are designed to tackle row lookups on columns that may not make sense to cluster on due to attributes like high cardinality.
🎁 Bonus:
💩 Data Engineers - no matter the situation, they will fix it!
💻 Better way to query a large (15TB) dataset that does not cost $40,000
📅 Upcoming Events
5/16: Index Conference 2024
5/28-5/31: PostgreSQL Development Conference 2024
Share an event with the community here or view the full calendar
Opportunities to get involved:
Share your story on how you got started in DE
Want to get involved in the community? Sign up here.
What did you think of today’s newsletter?
Your feedback helps us deliver the best newsletter possible.
If you are reading this and are not subscribed, subscribe here.
Want more Data Engineering? Join our community.
Want to contribute? Learn how you can get involved.
Stay tuned next month for more updates, and thanks for being a part of the Data Engineering community.