In 2025, building a real-time data pipeline often involves hundreds of lines of Python, an orchestrator like Airflow, and weeks of configuration. What if you could define the same pipeline in 30 lines of YAML? The declarative approach changes the game.
## The Pipeline Problem in 2025
The data ecosystem is more fragmented than ever:
- Too many tools — Airflow, Prefect, Dagster, dbt, Fivetran, Airbyte... each tool covers a piece of the puzzle
- Too much boilerplate code — 80% of pipeline code is plumbing, not business logic
- Too much maintenance — Every Python dependency is a ticking time bomb (versions, conflicts, deprecations)
- Too much latency — Most tools are designed for batch, not real-time
According to a 2024 Fivetran report, data teams spend an average of 44% of their time maintaining existing pipelines rather than working on new projects.
## Imperative vs Declarative: Python vs YAML
The fundamental difference:
| Aspect | Imperative (Python) | Declarative (YAML) |
|---|---|---|
| Approach | "How to do it" | "What to do" |
| Typical code | 200-500 lines | 20-50 lines |
| Learning curve | Weeks | Hours |
| Maintenance | Python dependencies | One binary + YAML |
| Real-time | Complex to implement | Native |
| Flexibility | Unlimited | Limited to plugins |
The imperative approach gives you total control, but at the cost of complexity. The declarative approach sacrifices some flexibility for radical simplicity.
## Anatomy of a YAML Pipeline
A declarative pipeline breaks down into three sections:
1. Sources — Where does the data come from?
```yaml
source:
  type: http
  url: "https://api.example.com/events"
  method: GET
  auth:
    type: oauth2
    token_url: "https://auth.example.com/token"
  rate_limit: 100/minute
  pagination:
    type: cursor
    field: "next_cursor"
```
2. Transforms — What to do with the data?
```yaml
transforms:
  - type: sql
    engine: duckdb
    query: |
      SELECT
        user_id,
        event_type,
        timestamp,
        json_extract(payload, '$.amount') AS amount
      FROM input
      WHERE event_type IN ('purchase', 'refund')
  - type: pii_mask
    fields: [email, phone]
    method: sha256
```
3. Sinks — Where to send the results?
```yaml
sink:
  type: postgresql
  connection: "postgres://user:pass@host:5432/db"
  table: "events_processed"
  batch_size: 1000
  on_conflict: upsert
  key: [event_id]
```
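For readers less familiar with upsert semantics: conceptually, an `on_conflict: upsert` sink with `key: [event_id]` corresponds to a PostgreSQL statement like the one below. This is a sketch of the idea, not the exact SQL the tool generates; the column names are taken from the transform example above.

```sql
-- Insert a row, or update the existing row when event_id already exists.
INSERT INTO events_processed (event_id, user_id, amount)
VALUES ($1, $2, $3)
ON CONFLICT (event_id)
DO UPDATE SET user_id = EXCLUDED.user_id,
              amount  = EXCLUDED.amount;
```

The conflict key must be backed by a unique index on the target table for `ON CONFLICT` to apply.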
## Three Concrete Use Cases
### Case 1: API Sync to Database
You have a REST API emitting events and want to store them in PostgreSQL with SQL enrichment. In Python, that's 200+ lines (requests, psycopg2, error handling, retry...). In declarative YAML, it's 25 lines.
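A minimal sketch of what such a pipeline could look like, assembled from the building blocks shown above (the specific keys and option names are modeled on those examples, not copied from official documentation):

```yaml
# Illustrative sketch: REST API -> PostgreSQL with SQL enrichment.
source:
  type: http
  url: "https://api.example.com/events"
  method: GET
  pagination:
    type: cursor
    field: "next_cursor"

transforms:
  - type: sql
    engine: duckdb
    query: |
      SELECT user_id, event_type, timestamp
      FROM input
      WHERE event_type IS NOT NULL

sink:
  type: postgresql
  connection: "postgres://user:pass@host:5432/db"
  table: "events"
  on_conflict: upsert
  key: [event_id]
```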
### Case 2: CDC (Change Data Capture)
Capture changes from a source PostgreSQL database and replicate them to Snowflake in real-time. Native CDC eliminates the need for Debezium + Kafka Connect.
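A hedged sketch of how such a replication might be declared; the `postgres_cdc` source type and the Snowflake options shown here are assumptions for illustration, not confirmed configuration keys:

```yaml
# Illustrative sketch: PostgreSQL CDC -> Snowflake replication.
source:
  type: postgres_cdc
  connection: "postgres://user:pass@host:5432/db"
  tables: [orders, customers]

sink:
  type: snowflake
  account: "my_account"
  database: "ANALYTICS"
  table: "ORDERS_REPLICA"
```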
### Case 3: PII Masking
Read data containing personal information, anonymize it (SHA-256 hashing), and send it to a data lake. Masking is declared as a simple transform, not a separate service.
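To make the masking step concrete, here is a minimal, standalone Go sketch of what a SHA-256 mask does to a field value. This illustrates the technique only; it is not Mako's actual implementation, and `maskPII` is a hypothetical helper name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// maskPII replaces a sensitive field value with the hex-encoded
// SHA-256 digest of that value: deterministic (the same input always
// maps to the same token, so joins still work) but one-way.
func maskPII(value string) string {
	sum := sha256.Sum256([]byte(value))
	return hex.EncodeToString(sum[:])
}

func main() {
	// The original email never reaches the sink; only the digest does.
	fmt.Println(maskPII("alice@example.com"))
}
```

Because hashing is deterministic, masked columns remain usable as join keys across datasets, which is why hashing is often preferred over random tokenization for analytics.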
## Tool Comparison
| Criteria | Airflow | Singer/Meltano | Mako |
|---|---|---|---|
| Approach | Imperative (Python DAGs) | Semi-declarative | Declarative (YAML) |
| Real-time | No (batch) | No (batch) | Yes (native) |
| Transforms | Python | dbt (SQL) | SQL + WASM |
| Installation | Complex | pip install | One Go binary |
| Observability | Web UI | Logs | Prometheus + Grafana |
## Quick Start in 5 Minutes
Here's how to get started with Mako, an open-source declarative pipeline framework written in Go:
```bash
# Clone and build
git clone https://github.com/Stefen-Taime/mako.git
cd mako
go build -o bin/mako .

# Initialize a new pipeline
./bin/mako init

# Run the pipeline
./bin/mako run pipeline.yaml
```
Mako supports HTTP/REST, Kafka, PostgreSQL CDC, DuckDB, and file sources (JSON, CSV, Parquet). Transforms include SQL via DuckDB, WASM plugins (Go/Rust), schema validation, and PII masking. For sinks: PostgreSQL, Snowflake, BigQuery, ClickHouse, S3, GCS, and more.
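Combining those building blocks, a streaming pipeline from Kafka to ClickHouse might be declared as follows. This is a sketch in the style of the earlier examples; the exact keys for the Kafka source and ClickHouse sink are assumptions.

```yaml
# Illustrative sketch: Kafka topic -> filtered -> ClickHouse table.
source:
  type: kafka
  brokers: ["localhost:9092"]
  topic: "events"

transforms:
  - type: sql
    engine: duckdb
    query: |
      SELECT * FROM input WHERE event_type = 'purchase'

sink:
  type: clickhouse
  connection: "clickhouse://localhost:9000/analytics"
  table: "purchases"
  batch_size: 1000
```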
## Conclusion
The declarative YAML approach won't replace Python for every use case. But for 80% of common data pipelines — API sync, CDC, simple ETL, PII masking — it offers a radically simpler and more maintainable alternative. Mako is an open-source framework (MIT) that embodies this philosophy: YAML in, events out.
Resources: