Data Engineering Explained: Pipelines, Warehouses, and Why Your Data Team Is So Expensive

The question comes up in every board meeting at a data-driven company: why does the data team keep getting bigger? You hired four data scientists. Then you needed four data engineers to support them. Then the engineering team said the data platform was costing $500,000 a year in Snowflake bills alone. Then the data quality problems started - the fraud model was trained on stale data, the quarterly report had a discrepancy nobody could explain, the compliance team said they needed a full audit trail of every data transformation.

Data has become infrastructure. Like cloud computing or cybersecurity, it is no longer a department - it is a capability that the entire business depends on, and one that requires continuous investment to maintain. Here is what that investment is actually buying.


The Problem Data Engineering Solves

Data is generated in operational systems. When a customer makes a payment, a row is written to a transactions database. When a fraud analyst reviews a case, a record is updated in a case management system. When a customer service rep closes a ticket, a log is created.

These operational databases are optimized for one thing: fast reads and writes for the application that uses them. They are not optimized for analysis. When a risk officer wants to know the default rate by loan vintage, by product type, by origination channel, across the last three years - they cannot run that query directly against the production lending database. It would slow the database to a crawl and potentially take hours to execute.
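To make the shape of that question concrete, here is a minimal sketch of the aggregation it implies, with SQLite standing in for an analytical store and hypothetical table and column names. Run against billions of rows in a production OLTP database, this is exactly the kind of query that grinds it to a halt; in a warehouse it is routine.

```python
# Sketch only: SQLite stands in for an analytical store; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE loans (
        vintage      TEXT,    -- origination quarter, e.g. '2022-Q1'
        product_type TEXT,
        channel      TEXT,    -- origination channel
        defaulted    INTEGER  -- 1 if the loan defaulted, else 0
    )
""")
conn.executemany("INSERT INTO loans VALUES (?, ?, ?, ?)", [
    ("2022-Q1", "personal", "web",    0),
    ("2022-Q1", "personal", "web",    1),
    ("2023-Q3", "auto",     "branch", 0),
])

# Default rate by vintage, product type, and origination channel.
for row in conn.execute("""
    SELECT vintage, product_type, channel,
           AVG(defaulted) AS default_rate,
           COUNT(*)       AS loan_count
    FROM loans
    GROUP BY vintage, product_type, channel
"""):
    print(row)
```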

Data engineering is the discipline of moving data from operational systems into analytical systems, transforming it into formats suitable for analysis, and maintaining the infrastructure that makes that data trustworthy and accessible.

That sounds straightforward. In practice, it involves an enormous amount of complexity.


ETL vs ELT: The Pipeline That Feeds Everything

The most fundamental data engineering concept is the pipeline - the automated process that moves data from source systems to analytical destinations.

ETL (Extract, Transform, Load) was the dominant approach for decades. Data is extracted from source systems, transformed in flight (cleaned, standardized, business rules applied), and then loaded into a data warehouse. The warehouse contains clean, ready-to-use data.

The problem with ETL: it requires knowing in advance what transformations you need. The transformation logic lives in the pipeline, not in the warehouse. If a data scientist wants to analyze data in a new way, the pipeline must be rebuilt. In a world where analytical questions evolve constantly, this is a bottleneck.

ELT (Extract, Load, Transform) reverses the order. Raw data is loaded into the warehouse first, then transformed inside the warehouse using SQL. Modern cloud data warehouses (Snowflake, BigQuery, Redshift) have enough compute power to run complex transformations cheaply on demand. The raw data is preserved, transformations are code in the warehouse, and changes do not require rebuilding pipelines.

ELT has become the dominant approach for most modern data teams, because it gives analysts and data scientists the flexibility to define new transformations without waiting for engineering pipeline changes.

In fintech, the practical example is clear. A transaction database might contain 20 columns of raw data per transaction. The data team loads all 20 columns into Snowflake. The fraud team writes SQL transformations that compute velocity features (transactions in the last hour, from this device, to this merchant). The credit team writes different transformations that compute payment behavior patterns. Both use the same raw data with different transformation logic - possible because ELT preserved the raw data rather than pre-transforming it.
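A minimal sketch of that pattern, with SQLite standing in for the warehouse and hypothetical table and column names: raw transactions are loaded untouched, and the velocity feature is computed afterwards as SQL inside the store.

```python
# Sketch of the ELT pattern: load raw data as-is, transform in the warehouse.
# SQLite stands in for a cloud warehouse; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE raw_transactions (
        txn_id      TEXT,
        device_id   TEXT,
        merchant_id TEXT,
        amount      REAL,
        txn_ts      INTEGER   -- unix epoch seconds
    )
""")
conn.executemany(
    "INSERT INTO raw_transactions VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "dev-1", "m-9", 25.0, 1_700_000_000),
        ("t2", "dev-1", "m-9", 40.0, 1_700_000_900),
        ("t3", "dev-2", "m-3", 10.0, 1_700_001_200),
    ],
)

# The transformation lives in the store: transactions from the same device in
# the trailing hour. (Requires SQLite >= 3.28 for RANGE window frames.)
rows = conn.execute("""
    SELECT
        txn_id,
        device_id,
        COUNT(*) OVER (
            PARTITION BY device_id
            ORDER BY txn_ts
            RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW
        ) AS txns_last_hour_from_device
    FROM raw_transactions
""").fetchall()
print(rows)
```

The credit team's transformations would simply be a different query over the same raw_transactions table - which is the whole point of loading the raw data first.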


Data Warehouses, Lakes, and Lakehouses: Why the Terminology Keeps Changing

The data storage landscape has three dominant categories, and they have been evolving and converging.

Data Warehouses (Snowflake, BigQuery, Redshift) are structured analytical stores. Data is organized in tables with defined schemas - columns, data types, relationships. Queries run fast because the data is clean and well-organized. The trade-off: they handle structured data only, and storage is relatively expensive compared to raw file storage.

Snowflake has become the dominant cloud data warehouse. Its separation of compute and storage (you pay for query compute and data storage independently) and ability to scale compute up or down on demand made it the default choice for data teams in the mid-2010s. A mid-sized financial services company might spend $200,000-$600,000 annually on Snowflake, depending on data volume and query complexity.

Data Lakes (Amazon S3, Azure Data Lake Storage, Google Cloud Storage) store raw, unstructured data - files in any format, at very low cost. A data lake might contain CSV files of transaction history, JSON logs from mobile apps, Parquet files of model training data, and audio recordings of customer service calls. Storage is cheap. The problem: querying the data directly is slow, and keeping a lake organized enough to be useful requires significant engineering.

Data Lakehouses (Databricks, Delta Lake, Apache Iceberg) attempt to combine the benefits of both. Raw data stored cheaply, with a metadata and query layer that enables warehouse-like query performance without requiring data to be copied into a structured warehouse. Databricks has become the dominant platform in this category, particularly for companies with large-scale machine learning needs.

For most financial services companies, the practical question is less about theoretical architecture and more about "should we be on Snowflake or Databricks?" The answer depends on workload mix. Pure analytics (SQL queries, dashboards, standard reporting): Snowflake wins on simplicity. Large-scale machine learning, streaming data, unstructured data: Databricks wins on capability.


Why Data Quality Is the Actual Hard Problem

The infrastructure is solvable with money. Data quality is the problem that money alone cannot solve.

Data quality failures in financial services are not minor inconveniences - they are regulatory findings, model failures, and incorrect business decisions made at scale.

The credit model trained on stale data. A bank's credit model was retrained quarterly on loan performance data. The data team discovered - after an increase in defaults - that a transformation in the pipeline had been silently dropping records from a specific loan product for six months. The model had been trained on an incomplete picture of performance, and the deployed model was systematically underestimating default risk for that product. Fixing it required retraining the model, reviewing three months of originated loans, and a conversation with regulators.
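A volume check along the following lines would have surfaced the dropped product at the next load rather than six months later. This is a minimal sketch: the drop threshold is an arbitrary example, and the per-product counts would in practice come from a warehouse query.

```python
# Sketch of a per-product volume check. Thresholds and counts are hypothetical;
# in practice the counts come from a query against the warehouse.

def check_row_counts(current, previous, max_drop=0.20):
    """Return alerts for loan products whose record count vanished or fell sharply."""
    alerts = []
    for product, prev_count in previous.items():
        curr_count = current.get(product, 0)
        if curr_count == 0:
            alerts.append(f"{product}: no records in latest load (previously {prev_count})")
        elif prev_count and (prev_count - curr_count) / prev_count > max_drop:
            alerts.append(f"{product}: record count fell from {prev_count} to {curr_count}")
    return alerts

# The silently dropped product shows up on the very first load after the bug.
print(check_row_counts(
    current={"personal_loan": 10_400, "auto_loan": 0},
    previous={"personal_loan": 10_150, "auto_loan": 4_200},
))
```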

The regulatory report with the unexplained discrepancy. A compliance team submitted a quarterly regulatory report. The regulator noted a discrepancy between two tables that should have been consistent. Tracing the discrepancy required reconstructing the data lineage - which transformations had touched each number, in what order, with what business rules. Without documented data lineage, the investigation took weeks.

The fraud feature store serving stale features. A fraud detection model used real-time features - velocity scores, device reputation, merchant risk. A pipeline failure meant the feature store was serving six-hour-old data instead of real-time data. The model's fraud detection rate dropped before anyone noticed the pipeline had failed. There were no alerts on feature freshness.
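A freshness check is just as simple to sketch. The SLA value and the way the newest feature timestamp is obtained are hypothetical assumptions, not any particular feature store's API; the point is that the alert fires on data lag, not on whether the pipeline job reported success.

```python
# Sketch of a feature-freshness alert. FRESHNESS_SLA and the source of
# latest_ts (timestamp of the newest record in the feature store) are
# hypothetical assumptions.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(minutes=5)

def check_feature_freshness(latest_ts, now=None):
    """Return an alert message if the newest feature record is older than the SLA."""
    now = now or datetime.now(timezone.utc)
    lag = now - latest_ts
    if lag > FRESHNESS_SLA:
        return f"Feature store stale: newest record is {lag} old (SLA {FRESHNESS_SLA})"
    return None

# A pipeline failure that leaves six-hour-old features trips the alert immediately.
six_hours_stale = datetime.now(timezone.utc) - timedelta(hours=6)
print(check_feature_freshness(six_hours_stale))
```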

These are not hypothetical scenarios. They are common failure modes in data-heavy financial institutions. The investment in data engineering tooling - data observability platforms, data catalogs, lineage tracking, freshness monitoring - exists to prevent exactly these failures.


What All This Infrastructure Is Actually Buying

The data team budget can look opaque to executives who see Snowflake invoices without understanding what they fund. Here is a concrete mapping.

  • Data warehouse (Snowflake/BigQuery). What it buys: fast, reliable queries for analytics and reporting. What breaks without it: reports take hours; analysts wait for data; finance closes slowly.
  • Pipeline tooling (dbt, Airflow, Fivetran). What it buys: automated, reliable data movement and transformation. What breaks without it: manual data exports; stale dashboards; errors in reports.
  • Data quality monitoring. What it buys: alerts when data is wrong, stale, or incomplete. What breaks without it: silent failures; models trained on bad data; incorrect decisions.
  • Data catalog / lineage. What it buys: documentation of what data means and how it was transformed. What breaks without it: regulatory investigations require weeks of manual reconstruction.
  • Feature store (for ML). What it buys: consistent, fresh features for models in production. What breaks without it: models behave differently in training vs production; fraud rates increase.
  • Data engineers. What they buy: people who maintain, monitor, and improve all of the above. What breaks without them: infrastructure degrades; data scientists spend 80% of their time on data wrangling.


Key Takeaways

  • Data engineering is infrastructure, not a department. It is the plumbing that connects operational systems to every analytical, AI, and reporting use case in the company.
  • ELT has replaced ETL as the dominant approach because loading raw data first and transforming on demand is more flexible and faster to iterate than pre-defining all transformations before loading.
  • The Snowflake bill is not the real cost. The real cost is data quality failures - wrong models, incorrect reports, regulatory discrepancies, and the engineering time to debug them.
  • Data lineage and quality monitoring are not optional in financial services. Regulators expect you to explain every number in a regulatory report. "The data pipeline did something" is not an answer.
  • Data engineering headcount scales with data-dependent decisions. If more teams are making decisions based on data, the pipeline serving those decisions needs to be more reliable - which requires more engineering investment.

Related Reading