I deleted data in prod and received a T-shirt; what's next?
Sharing how one critical mistake taught me key lessons
I've been a data engineer for almost 10 years now, but one mistake still haunts me: I accidentally deleted data in production. It was a tough lesson, but I learned a lot from it. As I share my story, I'll explain what happened and the important lessons I learned. My goal is to help you avoid making the same mistake and handle it better if it does happen. Believe me, it can happen to anyone. Even big companies like AWS have accidentally deleted data in production, and that's just one example among many. So, what should you do if it happens to you?
The call to delete
Let's roll back a few years to when on-premise big data clusters were the norm. There is no AWS S3 version history to fall back on and no built-in cloud service to restore data from.
It's late Friday afternoon, and I receive a ping from a business stakeholder: "Data is not refreshed. Could you help us out? We need to make a decision for next week's marketing campaign."
That's weird; I didn't get any failure notifications. The root cause turns out to be a silent failure: we ran out of disk space on our cluster. As it's an on-premise cluster, I can't just extend it with new nodes or disks; I have to free up some space to fix the pipelines immediately. Like many companies back then (and still to this day), we also used the production cluster as a dumping ground for data we might use in the future. That data wasn’t activated (that is, not used by the business), and no business case for it had been prioritized yet.
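A pre-flight check is one way to turn that kind of silent failure into a loud one. The snippet below is a minimal sketch, not what we ran at the time: the function names and the 10% threshold are made up for illustration, and shutil.disk_usage only looks at a local filesystem, so a distributed store would need its own equivalent.

```python
import shutil


def has_enough_space(total_bytes, free_bytes, min_free_ratio=0.10):
    """Return True when the free share of the disk is at or above the threshold."""
    if total_bytes <= 0:
        return False  # unknown or empty filesystem: treat as not safe
    return free_bytes / total_bytes >= min_free_ratio


def check_free_space(path, min_free_ratio=0.10):
    """Raise before a job starts if the filesystem holding `path` is nearly full.

    Hypothetical helper: the name and default threshold are assumptions.
    """
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    if not has_enough_space(usage.total, usage.free, min_free_ratio):
        raise RuntimeError(
            f"Not enough free space on {path}: "
            f"{usage.free} of {usage.total} bytes free"
        )
```

Calling check_free_space at the start of a pipeline makes the job fail with an explicit error, which the usual failure notifications can then pick up.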
That's the tricky part when working with data.
In software engineering, it's common best practice to separate environments. Data teams would do the same. However, because the data is so connected, testing things in the development or staging areas can be challenging without using accurate data from production. In addition, it's a no-go for some organizations to have these production datasets available in a development environment for security/PII reasons. But I'm getting off track. I wanted to point out that it's pretty standard to have unused production data.
The above means we could clean that up quickly to unblock active use cases and data pipelines.
So here we go, deleting data in production.
The surgical operation
I'm an engineer who relies on trusted command lines, so let's dive into the terminal to check what's taking a lot of space and delete.
💡Lesson 1: Using the UI for some critical commands is okay. It may be slower, but a UI usually has a safety net, such as a confirmation prompt, around critical operations like deletion. It’s harder to make a quick mistake when using it.
I'm starting to explore, and I suddenly find something: I run ls /data/marketing/project_1, and that folder is taking up a lot of space.
This one is not used at all, and no projects are planned for it anytime soon. I do a quick cd .. to explore something else and then run rm -rf on the whole data/marketing folder 🤦‍♂️.
This happened in a second or two. I went too fast, getting in and out of folders, and didn’t realize my current path was wrong.
💡Lesson 2: When you want to delete files in a critical environment, always run an ls command against the path first, then replace ls with the delete command (rm) on that same line. That forces you to double-check exactly what you are about to delete.
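The same "look first, then delete" habit can be baked into a small helper. This is only a sketch under my own assumptions: list_target and delete_after_preview are hypothetical names, and the "type the folder name to confirm" step is an extra guard I'm adding for illustration, not something from the incident.

```python
import shutil
from pathlib import Path


def list_target(target):
    """Resolve `target` to an absolute path and list its direct children (like ls)."""
    resolved = Path(target).resolve(strict=True)  # raises FileNotFoundError if missing
    if resolved.is_dir():
        children = sorted(p.name for p in resolved.iterdir())
    else:
        children = []
    return resolved, children


def delete_after_preview(target, confirm=input):
    """Show what would be deleted, then delete only if the folder name is typed back."""
    resolved, children = list_target(target)
    if not resolved.name:
        raise ValueError("Refusing to delete a filesystem root")
    print(f"About to delete: {resolved}")
    for name in children:
        print(f"  {name}")
    answer = confirm(f"Type '{resolved.name}' to confirm: ")
    if answer.strip() != resolved.name:
        print("Aborted, nothing deleted.")
        return False
    if resolved.is_dir():
        shutil.rmtree(resolved)  # recursive delete, the equivalent of rm -rf
    else:
        resolved.unlink()
    return True
```

Printing the resolved absolute path is the important part: a stray cd .. shows up immediately, because the name you are asked to type back is marketing rather than project_1.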
In hindsight, I felt under pressure. It was the end of the day just before the weekend, and I didn’t want to stay too long at work.
💡Lesson 3: Don't let business stakeholders or upper management put you under pressure; approaching a problem that way won't help. Especially when there’s an incident, you may need some help handling communication so that you can focus on the actual problem.
But come on, Mehdi, you probably had a backup? Well, we had the trash disabled for another maintenance operation, so the deleted files were not recoverable. Worse, it was almost impossible to recover them from the source for other technical reasons. Yup, I pushed my luck really hard on this one.
💡Lesson 4: While you can't control external circumstances, it's wise to plan for the worst-case scenario so that you are prepared.
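One way to plan for the worst case is to make deletion a two-step process: move first, purge later. The sketch below assumes a local filesystem and a trash folder of your choosing; soft_delete and its timestamped naming are hypothetical, not a feature of any particular cluster. Note that moving data within the same disk does not free any space until the trash is purged, so this buys a safety window rather than an instant fix.

```python
import shutil
import time
from pathlib import Path


def soft_delete(target, trash_dir):
    """Move `target` into `trash_dir` under a timestamped name and return the new path."""
    source = Path(target).resolve(strict=True)  # fail early if the path is wrong
    trash = Path(trash_dir)
    trash.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    destination = trash / f"{stamp}_{source.name}"
    if destination.exists():
        raise FileExistsError(f"{destination} already exists; not overwriting")
    shutil.move(str(source), str(destination))
    return destination
```

A separate, scheduled job can then permanently remove trash entries older than a few days, which leaves time to notice a mistake.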
Sharing the bad news
I had a good relationship with my manager, so I didn’t expect something terrible to happen. It was hard to estimate how much “money” I had made the company lose with that action, as this data was not yet used anyway. That being said, I booked a 1:1 with him right after I had assessed all the consequences of my action.
💡Lesson 5: This was good; I didn't rush to conclusions based on partial information. Take the time to assess the consequences and then spend even more time writing a post-mortem to discuss with the team how we could have avoided this. Together, you can plan for mid- to long-term solutions to prevent it from happening again.
My manager also didn't hide the information and quickly shared it with the relevant upper management/stakeholders.
Avoiding data deletion in production
In summary, here are the main takeaways:
Restrict deletion permissions to a small, relevant group.
Don't hesitate to use the UI for critical tasks; it may limit damage if things go wrong.
If an incident occurs, stay calm and ask for help managing external communication so you can focus on solving the problem.
Don't be too hard on yourself; some things are beyond your control.
Take time to understand the situation before sharing it with the right people.
Write post-mortems and discuss mid- and long-term solutions with the team.
Most importantly, remember that your career is like a roller coaster.
Since that incident, I've heard many stories like mine, some with even worse outcomes. Luckily, my situation wasn't too bad. I didn't lose my job, maybe because I'd been there for over a year and had done many good things. Sure, one mistake can overshadow all the good, but it wasn't intentional. The T-shirt gift I got afterward is a reminder that, with the right workplace culture, you shouldn't be afraid to make mistakes; that's how you grow.
Your career will have ups and downs. Some things are out of your control, and even the things you can control may not always go as planned. But that's okay. Learn from your mistakes and strive to avoid them in the future.
Keep learning.