We rebuilt a data pipeline that had been waking up the on-call engineer three times a week. The fix was not better monitoring. It was better architecture.
The original system
The client had a data pipeline that pulled from six different sources, transformed the data, and loaded it into their analytics warehouse. It ran every night at 2 AM, and it failed about three times a week.
The failures were always different. One night, an API rate limit. The next night, a schema change in a source system. The next night, a timeout on a query that had been running fine for months.
The team had added monitoring, alerts, retries, and a 20-page runbook. The on-call engineer had become an expert at reading error logs at 3 AM.
The real problem
The pipeline was built as a single transaction. If any step failed, the entire pipeline failed. This meant that a rate limit on one source could prevent data from all six sources from loading.
The pipeline had no concept of partial success. It was all or nothing. And because the sources were unreliable, "nothing" happened three times a week.
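To make the failure mode concrete, here is a minimal sketch of an all-or-nothing run. It is not the client's code; the function name `run_all_or_nothing` and the dictionary of extractor callables are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Rows = List[dict]


def run_all_or_nothing(extractors: Dict[str, Callable[[], Rows]]) -> Dict[str, Rows]:
    """Hypothetical sketch: extract every source in one pass.

    There is no error handling, so the first exception aborts the run
    and nothing is returned, even for sources that already succeeded.
    """
    staged: Dict[str, Rows] = {}
    for name, extract in extractors.items():
        staged[name] = extract()  # a rate limit here discards everything staged so far
    return staged  # only reached when every source succeeded
```

With six sources in that dictionary, one raised exception is enough to lose the night's data for all six.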
The rebuild
We broke the pipeline into independent stages. Each source gets its own extraction job. Each job runs independently, retries independently, and fails independently.
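The shape of that change can be sketched in a few lines. This is an illustration under assumptions, not the production code: `JobResult`, `run_job`, `run_all`, the retry count, and the exponential backoff are all hypothetical choices, and a real deployment would more likely hand each job to a scheduler than loop over them in one process.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Rows = List[dict]


@dataclass
class JobResult:
    source: str
    rows: Optional[Rows]   # None when every attempt failed
    error: Optional[str]   # last error message, None on success
    attempts: int


def run_job(source: str, extract: Callable[[], Rows],
            max_attempts: int = 3, base_delay: float = 1.0) -> JobResult:
    """Run one extraction job with its own retry budget (assumed policy)."""
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return JobResult(source, extract(), None, attempt)
        except Exception as exc:  # the failure is recorded, not propagated
            last_error = f"{type(exc).__name__}: {exc}"
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return JobResult(source, None, last_error, max_attempts)


def run_all(extractors: Dict[str, Callable[[], Rows]], **retry_options) -> Dict[str, JobResult]:
    """Run every job; one source failing never stops the others."""
    return {name: run_job(name, fn, **retry_options) for name, fn in extractors.items()}
```

The design choice that matters is that `run_job` returns a result object instead of raising, so a failed source becomes a record to alert on rather than an exception that takes down its neighbors.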
The transformation layer reads from whatever data is available. If one source is missing, the transformations for that source are skipped. The rest of the pipeline continues.
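A minimal sketch of that skip logic, assuming each source has exactly one transformation and that extracted data arrives as a dictionary keyed by source name. The function name `transform_available` and both of those assumptions are illustrative, not a description of the actual system.

```python
from typing import Callable, Dict, List, Tuple

Rows = List[dict]


def transform_available(
    extracted: Dict[str, Rows],
    transforms: Dict[str, Callable[[Rows], Rows]],
) -> Tuple[Dict[str, Rows], List[str]]:
    """Apply each source's transformation only if that source produced data.

    Returns the transformed data plus the names of the skipped sources,
    so the skips can be reported instead of silently ignored.
    """
    transformed: Dict[str, Rows] = {}
    skipped: List[str] = []
    for source, transform in transforms.items():
        if source not in extracted:
            skipped.append(source)  # source is missing tonight: skip, do not fail
            continue
        transformed[source] = transform(extracted[source])
    return transformed, skipped
```

Returning the skipped list is what lets the morning alert name the specific source that needs attention.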
We added idempotency at every stage. Running the same job twice leaves the warehouse in the same state as running it once. This means retries are safe, and rerunning a stage after a fix does not duplicate data.
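One common way to get this property in a load step is to replace a whole partition inside a single transaction. The sketch below uses SQLite from the Python standard library as a stand-in for the warehouse; the `facts` table, its columns, and the (source, run date) partition key are assumptions made for the example, and delete-then-insert is only one of several techniques (upserts on a unique key are another).

```python
import sqlite3
from typing import List, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the hypothetical target table used in this sketch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        "source TEXT, run_date TEXT, record_id TEXT, amount REAL)"
    )


def load_partition(conn: sqlite3.Connection, source: str, run_date: str,
                   rows: List[Tuple[str, float]]) -> None:
    """Idempotent load: replace the (source, run_date) partition atomically."""
    with conn:  # commits on success, rolls back if anything below raises
        conn.execute(
            "DELETE FROM facts WHERE source = ? AND run_date = ?",
            (source, run_date),
        )
        conn.executemany(
            "INSERT INTO facts (source, run_date, record_id, amount) "
            "VALUES (?, ?, ?, ?)",
            [(source, run_date, record_id, amount) for record_id, amount in rows],
        )
```

Calling `load_partition` twice with the same arguments leaves the same rows in `facts` as calling it once, which is what makes a blind retry or a manual rerun safe.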
The result
The pipeline still encounters failures. APIs still rate-limit. Schemas still change. But now, a failure in one source does not cascade to the others. The on-call engineer gets an alert in the morning, fixes the specific issue during business hours, and reruns only the affected stage.
The on-call engineer has not been woken up at 3 AM in four months.