lakeFS
| lakeFS | |
|---|---|
| Original authors | Einat Orr, Oz Katz |
| Developer | Treeverse |
| Initial release | August 3, 2020 |
| Stable release | 1.72.0 |
| Repository | https://github.com/treeverse/lakeFS |
| Written in | Go |
| Type | Data version control |
| License | Apache 2.0 |
| Website | lakefs |
lakeFS is a data version control system designed as enterprise data infrastructure for data engineering and AI teams.[1] It brings Git-like operations (branching, committing, merging, and reverting) to large-scale data held in object storage such as Amazon S3, Azure Blob Storage, Google Cloud Storage, and other S3-compatible stores.[2] It is used for multimodal data management, including data quality enforcement, reproducibility, and governance across data lakes and machine learning workflows.[3] lakeFS is available as an open-source project, as an enterprise platform, and as a managed service (lakeFS Cloud).[3][1]
History
lakeFS was created in 2020 by Einat Orr and Oz Katz at Treeverse.[4] Its first public release, v0.8.1, appeared in August 2020 and introduced Git-style operations and Amazon S3 compatibility.[5] In 2021, Treeverse raised $23 million in a Series A funding round led by Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures.[6]
In 2021, lakeFS was included in InfoWorld’s Best of Open Source Software (Bossie) awards.[7]
In June 2022, lakeFS introduced lakeFS Cloud, a managed service extending version control to cloud-based data lakes.[3] Version 1.0 was released in October 2023, marking the transition to production-grade status and adding integrations with Databricks, Apache Iceberg, and orchestration tools such as Apache Airflow.[1][8] Independent reports mention enterprise users such as Microsoft, Volvo, and NASA.[1]
In July 2025, lakeFS secured an additional $20 million in growth capital to expand its enterprise data infrastructure for AI workloads.[9][10]
In November 2025, lakeFS announced the acquisition of the open-source project DVC.[11]
Software
Overview
lakeFS brings Git-like operations (branching, committing, merging, and reverting) to datasets stored in object storage. These operations let teams manage the data lifecycle with the same rigor that software development applies to code: testing changes in isolation, auditing modifications before they reach production, reproducing past data states, and recovering quickly from errors or incidents.[2][1]
Architecture
lakeFS acts as a layer on top of object stores such as Amazon S3, Azure Blob Storage, and Google Cloud Storage. It maintains repository metadata to record commits, branches, and tags, enabling zero-copy data versioning and isolation. lakeFS exposes multiple interfaces, including a web UI, CLI, REST API, and SDKs, allowing integration with existing data engineering and machine learning workflows. It integrates with the modern data stack, supporting query engines, orchestration tools, and data processing frameworks without requiring changes to existing storage layouts.[2][3]
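The idea of zero-copy versioning through metadata can be illustrated with a small in-memory model. The sketch below is a conceptual toy, not lakeFS's implementation or API: the names `ObjectStore`, `Commit`, and `Repo` are hypothetical, and content hashing is assumed here only as a convenient way to key immutable blobs. It shows that creating a branch adds a pointer to an existing commit and leaves the stored objects untouched.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


class ObjectStore:
    """Toy stand-in for an object store: immutable blobs keyed by content hash."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data  # identical content is stored only once
        return key


@dataclass(frozen=True)
class Commit:
    """Metadata only: maps logical paths to blob keys, plus a parent link."""
    message: str
    tree: Dict[str, str]
    parent: Optional["Commit"] = None


class Repo:
    """Hypothetical repository: branch names pointing at commits."""

    def __init__(self, store: ObjectStore) -> None:
        self.store = store
        self.branches: Dict[str, Commit] = {"main": Commit("init", {})}

    def create_branch(self, name: str, source: str) -> None:
        # Zero-copy: the new branch is just another pointer to the same commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, message: str, changes: Dict[str, bytes]) -> Commit:
        head = self.branches[branch]
        tree = dict(head.tree)  # copies path-to-key metadata, not the blobs
        for path, data in changes.items():
            tree[path] = self.store.put(data)
        new_head = Commit(message, tree, head)
        self.branches[branch] = new_head
        return new_head

    def read(self, branch: str, path: str) -> bytes:
        return self.store.blobs[self.branches[branch].tree[path]]


store = ObjectStore()
repo = Repo(store)
repo.commit("main", "load events", {"events/day1.csv": b"id,value\n1,10\n"})

blobs_before = len(store.blobs)
repo.create_branch("experiment", source="main")
print(len(store.blobs) == blobs_before)  # branching wrote no new objects

# A change on the branch is isolated from main.
repo.commit("experiment", "fix value", {"events/day1.csv": b"id,value\n1,11\n"})
print(repo.read("main", "events/day1.csv") != repo.read("experiment", "events/day1.csv"))
```

In this model only the changed file produces a new blob; every unchanged path on the branch keeps pointing at the object that `main` already references.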
Functions
lakeFS provides version control and data management capabilities for object storage–based data lakes. Its core capabilities include:
- Version control and atomic commits: enables reproducible data versions and complete lineage tracking across repositories.[1]
- Zero-copy branching and merging: allows creating isolated branches for development and testing without duplicating data.[2]
- Automated hooks: supports configurable hooks that validate data quality or trigger external workflows before or after merges and commits.[1]
- Rollback and recovery: allows reverting a repository to any previous commit to recover from data errors.[2]
- Data lineage and metadata management: records commit history and metadata changes for auditability.[3]
- Multi-storage support: manages data across multiple storage systems from a single instance and is compatible with major object stores such as Amazon S3, Azure Blob Storage, Google Cloud Storage, and MinIO.[3]
- Reproducibility: enables reproducing experiments and model training based on fixed data versions.[1]
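Two of the capabilities listed above, hooks that gate a merge and rollback to an earlier commit, can be sketched with a toy model. Everything here is an assumption made for illustration: `ToyRepo`, `require_ids`, and the "source overrides destination" merge rule are hypothetical and do not reflect how lakeFS configures hooks or resolves merges.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, List[dict]]    # path -> rows (toy stand-in for data files)
Hook = Callable[[Snapshot], None]   # a hook raises ValueError to block a merge


class ToyRepo:
    """Hypothetical repository: each branch is a list of snapshots."""

    def __init__(self) -> None:
        self.history: Dict[str, List[Snapshot]] = {"main": [{}]}
        self.pre_merge_hooks: List[Hook] = []

    def branch(self, name: str, source: str = "main") -> None:
        self.history[name] = [self.head(source)]

    def commit(self, branch: str, snapshot: Snapshot) -> None:
        self.history[branch].append(snapshot)

    def head(self, branch: str) -> Snapshot:
        return self.history[branch][-1]

    def merge(self, source: str, dest: str) -> None:
        # Simplified merge rule (an assumption): source paths override dest paths.
        candidate = {**self.head(dest), **self.head(source)}
        for hook in self.pre_merge_hooks:
            hook(candidate)  # an exception here aborts before dest changes
        self.history[dest].append(candidate)

    def revert_to(self, branch: str, index: int) -> None:
        # Rollback is recorded as a new commit restoring an earlier snapshot.
        self.history[branch].append(self.history[branch][index])


def require_ids(snapshot: Snapshot) -> None:
    """Example data quality check: every row must carry a non-null id."""
    for path, rows in snapshot.items():
        for row in rows:
            if row.get("id") is None:
                raise ValueError(f"{path}: row without id")


repo = ToyRepo()
repo.pre_merge_hooks.append(require_ids)
repo.commit("main", {"users": [{"id": 1}]})

repo.branch("ingest")
repo.commit("ingest", {"users": [{"id": 1}, {"id": None}]})
try:
    repo.merge("ingest", "main")
except ValueError as err:
    print("merge blocked:", err)

# A bad change committed directly to main can be rolled back.
repo.commit("main", {"users": []})
repo.revert_to("main", 1)
print(repo.head("main"))
```

Because the hook runs against the merge candidate before the destination branch is updated, a failed check leaves `main` exactly as it was.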
Integrations
Independent coverage has noted integrations with Databricks, Apache Iceberg, Red Hat OpenShift, and Trino, as well as compatibility with orchestration tools such as Apache Airflow.[1][2] Independent materials also describe usage with Trino in more detail, including pre-merge validation patterns in versioned data workflows.[12]
References
- ^ a b c d e f g h i Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ a b c d e f Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ a b c d e f Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
- ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).