The Great ETL Unbundling: AWS Zero-ETL and What It Means for Data Engineers
As AWS expands zero-ETL to Salesforce, SAP, and ServiceNow, we examine what this means for traditional data integration patterns and the future of data engineering.
“Zero-ETL” sounds like marketing hype—until you realize AWS just killed half your job description.
At re:Invent 2024, AWS expanded its zero-ETL integrations to include Salesforce, SAP, and ServiceNow. These connectors reached General Availability in December 2024, and they represent a fundamental architectural shift. The days of writing custom Airflow DAGs to sync CRM data are ending.
But what does this actually mean? Not just “faster data”: how does zero-ETL change what data engineers build, and what they no longer need to build?
What Zero-ETL Actually Is (and Isn’t)
Let’s clear up the naming confusion: Zero-ETL doesn’t mean zero transformation. It means zero separate transformation infrastructure.
Traditional ETL Pattern:

```plaintext
Source (Salesforce)
  → Extract (Python script)
  → Transform (dbt/Spark on EC2)
  → Load (COPY to Redshift)
```

Zero-ETL Pattern:

```plaintext
Source (Salesforce)
  → AWS Glue Zero-ETL
  → Redshift (transform on-read with SQL/MV)
```

The transformation still happens. But instead of maintaining a Python-based extraction layer and a separate compute cluster for transformation, you transform in the warehouse using materialized views, stored procedures, or dbt running directly on Redshift.
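What that looks like in practice: the cleanup logic that used to run in a standalone Spark job can become a dbt model compiled and executed inside Redshift. The model below is a minimal sketch, assuming a dbt source named salesforce_zero_etl is declared in sources.yml; the file path and column names are illustrative, not the actual replicated schema.

```sql
-- models/staging/stg_salesforce__accounts.sql (hypothetical dbt model)
-- dbt compiles this and runs it entirely inside Redshift.
{{ config(materialized='view') }}

select
    id                 as account_id,
    name               as account_name,
    trim(lower(email)) as email_canonical,
    industry
from {{ source('salesforce_zero_etl', 'account') }}
where is_deleted = false
```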
The Three Zero-ETL Architectures
AWS offers three patterns, and understanding which to use when is critical.
Pattern 1: Database Mirroring (Aurora, RDS → Redshift)
Latency: Sub-minute
Use Case: Operational database replication for analytics
This is the gold standard. AWS manages continuous CDC (Change Data Capture) from your transactional database into Redshift. You get near-real-time analytics without touching your production database.
Gotcha: The Redshift tables are read-only. You must create materialized views or new tables for any aggregation/transformation logic.
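A minimal sketch of what that looks like, assuming the integration’s destination database is named aurora_zero_etl and the source schema is public (both names are placeholders; the exact qualification depends on how you mounted the integration):

```sql
-- The replicated table is read-only; the aggregation lives beside it.
CREATE MATERIALIZED VIEW daily_order_totals AS
SELECT
    order_date,
    COUNT(*)    AS order_count,
    SUM(amount) AS gross_revenue
FROM aurora_zero_etl.public.orders   -- three-part name into the integration database
GROUP BY order_date;

-- Re-run (or schedule) to pick up newly replicated CDC changes:
REFRESH MATERIALIZED VIEW daily_order_totals;
```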
Pattern 2: SaaS Integration (Salesforce, SAP, ServiceNow → Redshift)
Latency: ~1 hour minimum
Use Case: Enterprise app data warehousing
This is where it gets interesting. AWS Glue zero-ETL pulls data from SaaS platforms using their APIs (e.g., Salesforce Bulk API, SAP OData).
Example:

```sql
-- In Redshift, you now have:
SELECT * FROM salesforce_zero_etl.account;      -- read-only
SELECT * FROM salesforce_zero_etl.opportunity;

-- Build your analytics layer:
CREATE MATERIALIZED VIEW sales_pipeline AS
SELECT
    a.name        AS account_name,
    o.stage_name,
    SUM(o.amount) AS pipeline_value
FROM salesforce_zero_etl.opportunity o
JOIN salesforce_zero_etl.account a ON o.account_id = a.id
WHERE o.is_closed = false
GROUP BY a.name, o.stage_name;
```

Gotcha: Salesforce rate limits matter. AWS uses the Bulk API, but if your Salesforce org has strict API limits, you could hit throttling.
Pattern 3: Zero-Copy Data Sharing (Salesforce Data Cloud ↔ Redshift)
Latency: Instant (query-time)
Use Case: Federated queries across platforms
This is the most futuristic pattern. You don’t replicate any data. Instead, Redshift can query Salesforce Data Cloud directly using external schemas, and vice versa.
```sql
-- In Redshift, query Salesforce Data Cloud without copying data:
CREATE EXTERNAL SCHEMA sfdc_live
FROM DATA CATALOG
DATABASE 'salesforce_data_cloud'
IAM_ROLE 'arn:aws:iam::...';

SELECT * FROM sfdc_live.unified_customer_profile;
```

Gotcha: Query performance depends on Salesforce’s infrastructure. This is best for ad-hoc exploration, not mission-critical dashboards that need sub-second response.
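A common middle ground for dashboards is to keep exploration federated but snapshot the hot slice into a local table on a schedule. A rough sketch, reusing the sfdc_live schema from above (the target table and column list are illustrative):

```sql
-- Materialize only what the dashboard needs into a local Redshift table,
-- so its latency no longer depends on Salesforce's infrastructure.
CREATE TABLE customer_profile_snapshot AS
SELECT customer_id, lifetime_value, last_activity_date
FROM sfdc_live.unified_customer_profile;
```

Rebuild the snapshot on whatever cadence the dashboard can tolerate and point the BI tool at it, keeping the live external schema for ad-hoc questions.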
When Zero-ETL Fails (and Why You Still Need Traditional ETL)
Zero-ETL is not a silver bullet. Here are the scenarios where it breaks down:
1. Complex Business Logic
If your “transformation” involves fuzzy matching customer names across 5 different source systems using custom Python libraries, zero-ETL won’t cut it. You need a general-purpose compute layer (Spark, Fargate, etc.).
2. Non-AWS SaaS Platforms
Zero-ETL only works for AWS-supported sources. If you need HubSpot, Stripe, or Zendesk data, you’re back to writing custom connectors (or using Fivetran).
3. Cost at Massive Scale
Zero-ETL stores data in Redshift. For archival or cold data (logs you only query once a quarter), Redshift is expensive compared to S3 Parquet. Here, traditional ELT (Extract → Load to S3 → Transform on-demand with Athena) wins.
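The cold path looks roughly like this: unload rarely queried history from Redshift to Parquet on S3, register it in the Glue catalog, and pay per query in Athena only when someone actually asks. The bucket, role ARN, and table names below are placeholders.

```sql
-- Redshift: push cold history out to S3 as Parquet (cheap storage).
UNLOAD ('SELECT * FROM analytics.web_events WHERE event_date < ''2024-01-01''')
TO 's3://my-data-lake/cold/web_events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
FORMAT AS PARQUET;

-- Athena (over a Glue catalog table pointing at that prefix):
SELECT COUNT(*)
FROM cold.web_events
WHERE event_date BETWEEN DATE '2023-01-01' AND DATE '2023-03-31';
```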
4. Cross-Cloud Integrations
If your data lake is in Snowflake or BigQuery, AWS zero-ETL doesn’t help. You need a platform-agnostic solution.
The Death of the “Data Integration Engineer”?
Hot Take: Zero-ETL is the Kubernetes of data engineering. It abstracts away infrastructure if you stay within the ecosystem.
In 2020, a “senior data engineer” role description included:
- Building custom connectors
- Managing Airflow DAGs for extractions
- Tuning Spark jobs for transformations
In 2026, those tasks are increasingly automated or eliminated by zero-ETL. What remains?
The New Skillset:
- SQL Mastery: Transformations now happen in the warehouse. You need to be fluent in window functions, recursive CTEs, and materialized view optimization (see the sketch after this list).
- Data Modeling: Without custom Python to hide complexity, your dimensional model is your transformation layer.
- Cost Awareness: Zero-ETL trades engineering time for cloud costs. You need to understand Redshift pricing (storage, compute, concurrency scaling) to avoid bill shock.
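To make the first point concrete: the kind of ranking logic that used to live in a pandas script becomes a plain window function over the replicated tables. A minimal sketch reusing the Salesforce schema from Pattern 2 (the name column on opportunity is an assumption):

```sql
-- Rank each account's open opportunities by size, directly in the warehouse.
SELECT
    account_id,
    name,
    amount,
    RANK() OVER (PARTITION BY account_id ORDER BY amount DESC) AS size_rank
FROM salesforce_zero_etl.opportunity
WHERE is_closed = false;
```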
Architecture Decision Flowchart
```plaintext
Do you need real-time (< 1 min) analytics?
├─ Yes → Is your source Aurora/RDS?
│   ├─ Yes → Use Database Mirroring (Pattern 1)
│   └─ No  → Build custom streaming pipeline (Kinesis/Kafka)
└─ No  → Is your source a supported SaaS platform?
    ├─ Yes → Use SaaS Integration (Pattern 2)
    └─ No  → Traditional ETL (Glue/Airflow)
```

Practical Implementation: Salesforce → Redshift
Here’s a real-world setup guide for AWS Glue zero-ETL with Salesforce.
Step 1: Configure AWS Glue Connection
```bash
aws glue create-connection \
  --connection-input '{
    "Name": "salesforce-zero-etl",
    "ConnectionType": "CUSTOM",
    "ConnectionProperties": {
      "CONNECTOR_TYPE": "salesforce",
      "CONNECTOR_URL": "https://login.salesforce.com",
      "USERNAME": "your-salesforce-user",
      "PASSWORD_SECRET_ID": "salesforce/api/token"
    }
  }'
```

Step 2: Create Zero-ETL Integration
```sql
-- In Redshift:
CREATE INTEGRATION salesforce_integration
TYPE ZERO_ETL
SOURCE 'salesforce-zero-etl'
TARGET current_database();
```

Step 3: Build Transformation Layer
```sql
-- Create a "cleaned" layer using dbt or SQL-based ELT:
CREATE SCHEMA analytics;

CREATE MATERIALIZED VIEW analytics.dim_customer AS
SELECT
    id                 AS customer_id,
    name               AS customer_name,
    TRIM(LOWER(email)) AS email_canonical,
    industry,
    annual_revenue
FROM salesforce_zero_etl.account
WHERE is_deleted = false;

-- Refresh every hour (Redshift has no CREATE TASK; schedule this statement
-- with the query editor's scheduler or an EventBridge rule):
REFRESH MATERIALIZED VIEW analytics.dim_customer;
```

Conclusion: The Unbundling
Zero-ETL is part of a larger trend: the unbundling of the data stack. Tasks that used to require general-purpose code (Python, Scala) are being absorbed into specialized platforms (warehouses, catalogs, orchestrators).
This isn’t the death of data engineering. It’s the maturation of it. Just as DevOps engineers stopped manually provisioning servers and started writing Terraform, data engineers will stop writing extraction scripts and start architecting semantic layers.
The question isn’t “Should I use zero-ETL?” It’s “Which parts of my stack can become declarative SQL, and which still need procedural code?”
Choose wisely. Your AWS bill depends on it.