
Real-Time Data Pipelines: The Declarative YAML Approach Without Writing Code

March 5, 2025 · 12 min read

In 2025, building a real-time data pipeline often involves hundreds of lines of Python, an orchestrator like Airflow, and weeks of configuration. What if you could define the same pipeline in 30 lines of YAML? The declarative approach changes the game.

The Pipeline Problem in 2025

The data ecosystem is more fragmented than ever:

  • Too many tools — Airflow, Prefect, Dagster, dbt, Fivetran, Airbyte... each tool covers a piece of the puzzle
  • Too much boilerplate code — 80% of pipeline code is plumbing, not business logic
  • Too much maintenance — Every Python dependency is a ticking time bomb (versions, conflicts, deprecations)
  • Too much latency — Most tools are designed for batch, not real-time

According to a 2024 Fivetran report, data teams spend an average of 44% of their time maintaining existing pipelines rather than working on new projects.


Imperative vs Declarative: Python vs YAML

The fundamental difference:

Aspect         | Imperative (Python)  | Declarative (YAML)
Approach       | "How to do it"       | "What to do"
Typical code   | 200-500 lines        | 20-50 lines
Learning curve | Weeks                | Hours
Maintenance    | Python dependencies  | One binary + YAML
Real-time      | Complex to implement | Native
Flexibility    | Unlimited            | Limited to plugins

The imperative approach gives you total control, but at the cost of complexity. The declarative approach sacrifices some flexibility for radical simplicity.

Anatomy of a YAML Pipeline

A declarative pipeline breaks down into three sections:

1. Sources — Where does the data come from?

source:
  type: http
  url: "https://api.example.com/events"
  method: GET
  auth:
    type: oauth2
    token_url: "https://auth.example.com/token"
  rate_limit: 100/minute
  pagination:
    type: cursor
    field: "next_cursor"

2. Transforms — What to do with the data?

transforms:
  - type: sql
    engine: duckdb
    query: |
      SELECT
        user_id,
        event_type,
        timestamp,
        json_extract(payload, '$.amount') as amount
      FROM input
      WHERE event_type IN ('purchase', 'refund')

  - type: pii_mask
    fields: [email, phone]
    method: sha256

3. Sinks — Where to send the results?

sink:
  type: postgresql
  connection: "postgres://user:pass@host:5432/db"
  table: "events_processed"
  batch_size: 1000
  on_conflict: upsert
  key: [event_id]

3 Concrete Use Cases

Case 1: API Sync to Database

You have a REST API emitting events and want to store them in PostgreSQL with SQL enrichment. In Python, that's 200+ lines (requests, psycopg2, error handling, retry...). In declarative YAML, it's 25 lines.
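Assembled from the source, transform, and sink sections shown in the anatomy above, the full Case 1 pipeline fits in roughly 25 lines. This is a sketch; exact field names depend on the framework's connector options:

```yaml
source:
  type: http
  url: "https://api.example.com/events"
  method: GET
  rate_limit: 100/minute

transforms:
  - type: sql
    engine: duckdb
    query: |
      SELECT
        user_id,
        event_type,
        timestamp,
        json_extract(payload, '$.amount') AS amount
      FROM input
      WHERE event_type IN ('purchase', 'refund')

sink:
  type: postgresql
  connection: "postgres://user:pass@host:5432/db"
  table: "events_processed"
  batch_size: 1000
  on_conflict: upsert
  key: [event_id]
```

Retry, pagination, and batching, which would be hand-written in the Python version, are handled by the runtime rather than declared here.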

Case 2: CDC (Change Data Capture)

Capture changes from a source PostgreSQL database and replicate them to Snowflake in real-time. Native CDC eliminates the need for Debezium + Kafka Connect.
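Declared in the same YAML shape, a CDC pipeline might look like the sketch below. The `postgres_cdc` source type, `slot` field, and `snowflake` sink options are illustrative, not exact connector names:

```yaml
source:
  type: postgres_cdc                # logical replication from the source database
  connection: "postgres://user:pass@host:5432/db"
  slot: "cdc_slot"                  # replication slot name (illustrative)
  tables: [public.orders]

sink:
  type: snowflake
  database: "ANALYTICS"
  table: "ORDERS_REPLICA"
  on_conflict: upsert               # replay inserts/updates/deletes idempotently
  key: [order_id]
```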

Case 3: PII Masking

Read data containing personal information, anonymize it (SHA-256 hashing), and send it to a data lake. Masking is declared as a simple transform, not a separate service.
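The `pii_mask` transform from the anatomy section applies directly here. A minimal sketch, where the file source and `s3` sink fields are illustrative:

```yaml
source:
  type: file
  path: "data/users.csv"
  format: csv

transforms:
  - type: pii_mask
    fields: [email, phone]          # columns to anonymize
    method: sha256                  # one-way hash; joins on masked values still work

sink:
  type: s3
  bucket: "my-data-lake"
  prefix: "users_masked/"
  format: parquet
```

Because SHA-256 is deterministic, the masked columns remain joinable across datasets while the raw values never leave the pipeline.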

Tool Comparison

Criteria      | Airflow                  | Singer/Meltano   | Mako
Approach      | Imperative (Python DAGs) | Semi-declarative | Declarative (YAML)
Real-time     | No (batch)               | No (batch)       | Yes (native)
Transforms    | Python                   | dbt (SQL)        | SQL + WASM
Installation  | Complex                  | pip install      | One Go binary
Observability | Web UI                   | Logs             | Prometheus + Grafana

Quick Start in 5 Minutes

Here's how to get started with Mako, an open-source declarative pipeline framework written in Go:

# Clone and build
git clone https://github.com/Stefen-Taime/mako.git
cd mako
go build -o bin/mako .

# Initialize a new pipeline
./bin/mako init

# Run the pipeline
./bin/mako run pipeline.yaml

Mako supports HTTP/REST, Kafka, PostgreSQL CDC, DuckDB, and file sources (JSON, CSV, Parquet). Transforms include SQL via DuckDB, WASM plugins (Go/Rust), schema validation, and PII masking. For sinks: PostgreSQL, Snowflake, BigQuery, ClickHouse, S3, GCS, and more.

Conclusion

The declarative YAML approach won't replace Python for every use case. But for 80% of common data pipelines — API sync, CDC, simple ETL, PII masking — it offers a radically simpler and more maintainable alternative. Mako is an open-source framework (MIT) that embodies this philosophy: YAML in, events out.
