
Starlake + DuckLake: Start Small, Scale Big

· 7 min read

The Post-Modern Data Stack is about reducing friction, not adding tools. It’s about building end-to-end data systems that are declarative, open, and composable, without the complexity and lock-in of the Modern Data Stack era.

Starlake and DuckLake embody this philosophy. Starlake unifies ingestion, transformation, and orchestration through declarative YAML, while DuckLake delivers a lightweight, SQL-backed lake format with ACID transactions, schema evolution, and time travel, all on open Parquet files.

Together, they let you start small, develop locally, and scale seamlessly to the cloud, without changing your model or mindset.

Why Move Beyond the Modern Data Stack?

The "Modern Data Stack" (MDS) brought cloud agility, but also fragmentation, hidden complexity, vendor lock-in, and brittle pipelines. Performance and openness are equally critical.

For example, in an independent TPC-H SF100 benchmark on Parquet files, DuckDB delivered sub-second query times for most queries. This proves that open file formats plus a high-performance engine can match traditional analytics platforms at a fraction of the cost.
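To give a sense of what such a benchmark query looks like in practice, here is a minimal sketch of a TPC-H Q1-style aggregation run directly on Parquet files with DuckDB. The file path and date filter below are illustrative placeholders, not the actual benchmark setup.

-- Illustrative sketch only: aggregate the TPC-H lineitem table straight from Parquet.
-- The path is a placeholder; point it at your own files or object storage.
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity)                         AS sum_qty,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    AVG(l_extendedprice)                    AS avg_price,
    COUNT(*)                                AS count_order
FROM read_parquet('s3://my-bucket/tpch/lineitem/*.parquet')
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;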

DuckLake: A Lake Format for the Post-Modern Stack

DuckLake introduces a next-generation lake format built on Parquet, coordinated by a real SQL database (PostgreSQL, MySQL, SQLite, or DuckDB). This design unlocks:

  • Multi-user collaboration: SQL-based catalog enables concurrent reads and writes with transactional guarantees.
  • ACID transactions & snapshots: Full transactional integrity, snapshot isolation, time travel, and schema evolution: reliability once reserved for data warehouses (see the sketch after this list).
  • Open & composable: Based on open standards (SQL + Parquet), so you can use your favorite engines and orchestration tools. No proprietary runtimes or hidden metadata.
  • Local-to-cloud consistency: Develop locally with DuckDB, then deploy to a shared PostgreSQL-backed DuckLake in the cloud, no rewrites, no migrations, no friction.
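To make the snapshot and time-travel points above concrete, here is a minimal DuckDB SQL sketch. The catalog, table, and version numbers are illustrative, and it assumes the ducklake extension is installed and attached as in the configuration shown later.

-- List the snapshots recorded in the DuckLake catalog (catalog name is illustrative).
SELECT * FROM ducklake_snapshots('my_ducklake');

-- Query a table as of an earlier snapshot (time travel); table and version are illustrative.
SELECT * FROM my_ducklake.orders AT (VERSION => 3);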

Getting Started: Local to Cloud in Minutes

  1. Install Starlake (docs)
  2. Bootstrap your project:
cd /my/project/folder
starlake bootstrap
  3. Configure DuckLake in application.sl.yml:
application.sl.yml for local development
version: 1
application:
  connectionRef: "{{ACTIVE_CONNECTION}}"
  connections:
    ducklake_local:
      type: jdbc
      options:
        url: "jdbc:duckdb:"
        driver: "org.duckdb.DuckDBDriver"
        preActions: >
          INSTALL ducklake;
          LOAD ducklake;
          ATTACH IF NOT EXISTS 'ducklake:/local/path/metadata.ducklake' AS my_ducklake
            (DATA_PATH '/local/path/');
          USE my_ducklake;

Set ACTIVE_CONNECTION=ducklake_local for local development.

  4. Scale to the cloud: Change your DuckLake catalog connection to PostgreSQL or MySQL, and update DATA_PATH to your cloud storage (e.g., GCS, S3):
application.sl.yml for cloud deployment
version: 1
application:
  connectionRef: "ducklake_cloud"
  connections:
    ducklake_cloud:
      type: jdbc
      options:
        url: "jdbc:postgresql://your_postgres_host/ducklake_catalog"
        driver: "org.postgresql.Driver"
        preActions: >
          INSTALL POSTGRES;
          INSTALL ducklake;
          LOAD POSTGRES;
          LOAD ducklake;
          CREATE OR REPLACE SECRET (
            type gcs,
            key_id '{{DUCKLAKE_HMAC_ACCESS_KEY_ID}}',
            secret '{{DUCKLAKE_HMAC_SECRET_ACCESS_KEY}}',
            SCOPE 'gs://ducklake_bucket/data_files/');
          ATTACH IF NOT EXISTS 'ducklake:postgres:
            dbname=ducklake_catalog
            host=your_postgres_host
            port=5432
            user=dbuser
            password={{DUCKLAKE_PASSWORD}}' AS my_ducklake
            (DATA_PATH 'gs://ducklake_bucket/data_files/');

Set ACTIVE_CONNECTION=ducklake_cloud for cloud deployment. You can now transition from local to cloud without changing your data models or transformation logic.
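Because Starlake transformations are plain SQL, the same transform file can run unchanged against the local DuckDB-backed lake and the PostgreSQL-backed cloud catalog. Below is a minimal, hypothetical sketch; the file path, domain, table, and column names are illustrative examples, not part of the Starlake distribution.

-- metadata/transform/analytics/daily_revenue.sql (hypothetical example)
-- Starlake resolves table references against the active connection, so this
-- query works identically with ducklake_local and ducklake_cloud.
SELECT
    o_orderdate       AS order_date,
    COUNT(*)          AS order_count,
    SUM(o_totalprice) AS daily_revenue
FROM sales.orders
GROUP BY o_orderdate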

Starlake + DuckLake: The Perfect Pair

| Need in Post-Modern Stack | How Starlake addresses it | How DuckLake enables it |
| --- | --- | --- |
| Quality-first ingestion | Validation/quality checks at ingestion | Metadata enables lineage, versioning, auditing for trusted ingestion |
| SQL-only, portable transformations | Transformation logic as plain SQL, no templating | Parquet + catalog via SQL keeps transformations portable and engine-agnostic |
| Local dev, global deployment | DuckDB local dev, deploys with no changes | Supports DuckDB locally, scales to larger catalogs/storage; same data format persists |
| Git-style data branching | Snapshot/branching semantics for datasets | Snapshots/time travel provide data versioning like code branches |
| Orchestration-agnostic pipelines | SQL lineage, DAGs for any orchestrator | Unified metadata for referencing dataset versions, dependencies, and snapshots |
| Semantic modelling agnostic | Outputs semantic layer models for multiple BI platforms | Open, portable dataset format; semantic models not locked to one tool |

Why DuckLake is a Strong Alternative to Cloud Data Warehouses

DuckLake offers a unique set of advantages over cloud-based data warehouse solutions like BigQuery, Snowflake, and Databricks:

  • Lower and Predictable Costs: DuckLake minimizes costs by allowing you to store data directly in open Parquet files on affordable object storage (like S3, GCS, or on-premises solutions), without requiring expensive proprietary compute or storage layers. You avoid per-query or per-user fees, and only pay for the storage and compute you actually use, making budgeting straightforward and cost-efficient.
  • Full Data Control and Privacy: With DuckLake, your data never leaves your environment. This makes it easier to comply with privacy regulations and internal security policies, and lets you implement custom security measures as needed.
  • Optimized Performance: DuckLake achieves high performance by leveraging the efficiency of the Parquet file format and the power of embedded analytical engines like DuckDB. By operating directly on columnar storage and minimizing data movement, DuckLake delivers fast query execution and analytics, even on large datasets, without the overhead of traditional data warehouses.
  • Open and Transparent: The open source codebase means you can audit, modify, and extend DuckLake as you see fit. There are no hidden operations or proprietary formats.
  • Vibrant Community and Ecosystem: DuckLake benefits from an active open source community that continuously improves the platform, provides support, and shares best practices. Its foundation on open standards ensures compatibility with a wide range of tools and platforms, making data migration and integration straightforward as your requirements change.

Added Value of Cloud Data Warehouses

While solutions like DuckLake offer many advantages, cloud data warehouses such as BigQuery, Snowflake, and Databricks also provide significant added value:

  • Fully Managed Service: Cloud data warehouses handle infrastructure, scaling, maintenance, and updates automatically, reducing operational overhead for your team.
  • Elastic Scalability: Instantly scale compute and storage resources up or down to match workload demands, paying only for what you use.
  • Integrated Ecosystem: Seamless integration with a wide range of cloud-native tools for analytics, machine learning, data ingestion, and visualization.
  • High Availability & Disaster Recovery: Built-in redundancy, backup, and failover capabilities ensure data durability and business continuity.
  • Global Accessibility: Access your data securely from anywhere in the world, supporting distributed teams and global operations.
  • Advanced Security & Compliance: Enterprise-grade security features, compliance certifications, and fine-grained access controls are managed by the provider.
  • Performance Optimization: Providers continuously optimize performance behind the scenes, leveraging the latest hardware and software advancements.

These features make cloud data warehouses an attractive choice for organizations seeking minimal operational burden, rapid scaling, and access to a rich ecosystem of managed services.

Conclusion

Starlake and DuckLake represent a decisive shift toward the Post-Modern Data Stack, where openness, simplicity, and scalability coexist. Instead of assembling a tangle of incompatible tools, data teams can now build pipelines that are declarative, SQL-driven, and environment-agnostic from day one.

With Starlake, you define your data flow once: ingestion, transformation, validation, orchestration, all in YAML and SQL. With DuckLake, you store and query your data in an open, transactional lake format that scales from a local DuckDB setup to a cloud-backed PostgreSQL catalog. The result: a development experience as simple as working on your laptop, yet scalable to enterprise-grade reliability and performance.

Recent performance tests show DuckDB, the engine behind DuckLake, querying 600 million TPC-H records in under one second for most queries, proving you don’t need heavyweight infrastructure for warehouse-class performance.

The future of data engineering is declarative, composable, and open. With Starlake + DuckLake, you can truly start small and scale big, without ever compromising on speed, quality, or control.

Dbt Fusion vs. Starlake AI: Why Openness Wins

· 5 min read
Hayssam Saleh
Starlake Core Team

Starlake vs. Dbt Fusion: Why Openness Wins

Dbt recently launched Dbt Fusion, a performance-oriented upgrade to their transformation tooling.
It’s faster, smarter, and offers features long requested by the community — but it comes bundled with tighter control, paid subscriptions, and runtime lock-in.

We've been there all along, but without the trade-offs.

At Starlake, we believe great data engineering doesn’t have to come with trade-offs.

We've taken a different approach from the start:

  • Free and open-source core (Apache 2)
  • No runtime lock-in
  • Auto-generated orchestration for Airflow, Dagster, Snowflake Tasks, and more
  • Production-grade seed and transform tools

Let’s break it down.

Feature-by-Feature Comparison

note

Disclaimer: Dbt offers a free tier for teams with fewer than 15 users. This comparison focuses on organizations with more than 15 users, where most of Dbt Fusion’s advanced features are gated behind a paid subscription.

  • Fast engine: Starlake uses a Scala-based engine for lightning-fast performance, while Dbt Fusion relies on a Rust-based engine.
  • Database offloading: Dbt Fusion uses SDF and DataFusion, while Starlake leverages JSQLParser and DuckDB for cost-effective SQL transformations and database offloading.
  • Native SQL comprehension: Both tools enable real-time error detection, SQL autocompletion, and context-aware assistance without needing to hit the data warehouse. The difference? With Dbt Fusion, it’s a paid feature. With Starlake, it’s free and open.
  • State-aware orchestration: Dbt Fusion's orchestration is limited to Dbt's SaaS offering, while Starlake generates DAGs for any orchestrator, with ready-made ones for Airflow, Dagster, and Snowflake Tasks.
  • Lineage & governance: Dbt Fusion offers lineage and governance features in their paid tier, while Starlake provides these capabilities for free and in the open.
  • Web-based visual editor: Dbt Fusion comes with a YAML editor only, as part of their paid tier, while Starlake offers, in addition to a YAML editor, a free web-based visual editor.
  • Platform integration: a consistent experience across all interfaces. Dbt Fusion gates platform integration behind their paid tier, while Starlake provides free integration with various platforms.
  • Data seeding: Dbt Fusion supports CSV-only data seeding, while Starlake offers full support for various data formats (CSV, JSON, XML, Fixed Length ...) with schema validation and user-defined materialization strategies.
  • On-Premise / BYO Cloud: Dbt Fusion does not offer an on-premise or BYO cloud option, while Starlake supports both, allowing you to use the same tools and codebase across environments.
  • VSCode extension: Dbt Fusion's VSCode extension is free for up to 15 users, while Starlake's extension is always free.
  • SaaS Offering: Dbt Fusion is a SaaS offering, while Starlake is open-source with a SaaS offering coming soon.
  • MCP Server: Dbt Fusion's MCP Server requires a paid subscription for tool use, while Starlake provides a free, full-fledged MCP Server for managing your data pipelines.
  • SQL productivity tools: Dbt comes with dbt Canvas, a paid product; at Starlake this is handled by Starlake Copilot through English prompts, free and open-source.
| Feature | Dbt Fusion | Starlake.ai |
| --- | --- | --- |
| Fast engine | Yes (Rust-based) | Yes (Scala-based) |
| State-aware orchestration | Limited to Dbt's own orchestrator | Yes, on Airflow, Dagster, Snowflake Tasks, etc. |
| Native SQL comprehension | Based on SDF | Based on JSQLParser/JSQLTranspiler |
| Database offloading | DataFusion | DuckDB |
| Lineage & governance | Paid tier | Free |
| Web-based visual editor | No | Yes, always free |
| Platform integration | Paid tier | Free |
| Data seeding | Tiny CSV files only | Production-grade support for various formats with schema validation |
| On-Premise / BYO Cloud | Not available | Yes |
| VSCode extension | Paid tier | Always free |
| MCP Server | Paid tier | Yes (free) |
| SQL productivity tools | Paid product (dbt Canvas) | Free and open-source (Starlake Copilot) |
| SaaS offering | Yes | Coming soon |

Strategy Matters As Much As Features

Many tools advertise flexibility - but in practice, they quietly funnel users into proprietary runtimes.
Dbt Fusion is no exception.

Their orchestrator is gated behind a paid cloud platform, and most features require a subscription once your team grows.

Starlake doesn’t play that game.

We provide:

  • A single declarative YAML layer for extract, ingest, transform, validate, and orchestrate
  • One config = Multiple warehouses (BigQuery, Snowflake, Redshift…)
  • Your orchestrator = Your choice, fully integrated
  • Auto-generated DAGs, no manual workflow wiring
  • Run it locally, in the cloud, or anywhere in between

Who Should Choose Starlake?

Starlake is ideal for:

  • Data teams who want speed without lock-in
  • Enterprises who need production-grade on-premise and cloud data pipelines without vendor lock-in
  • Startups who want open-source pricing and cloud-scale performance
  • Teams who prefer Airflow, Dagster, Google Cloud Composer, AWS Managed Airflow, Astronomer, Snowflake Tasks, or any engine they already trust

Whether you're building your first pipeline or managing thousands across clouds, Starlake lets you grow on your terms.


Final Thought

Dbt Fusion makes bold claims — and to their credit, they’ve pushed the modern data stack forward.

But openness without freedom is just marketing.

Starlake gives you both.
✅ Open-source.
✅ Free to use.
✅ Orchestrate anywhere.

👉 Ready to experience the freedom of open-source, no-lock-in data engineering? Visit starlake.ai, check out our documentation to get started, or join our community to learn more.