Approaching Zero-ETL with FOSS

Leaving aside complexities like the Enterprise Integration Patterns, we can treat most integrations as a form of advanced ETL: Extract, Transform, and Load. We extract data from a data store or service. Then we transform it from an input format to an output format. And finally we push, or load, that transformed data into some output channel. It is the need to connect easily with the input and output channels that makes ETL call for a proper integration framework.
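The three steps can be sketched in a few lines of Python. This is only an illustration of the shape of an ETL pipeline: the CSV input, the field renaming, and the JSON output are hypothetical stand-ins for whatever channels your integration actually connects.

```python
import csv
import io
import json

def extract(csv_text):
    """Extract: read raw records from an input channel (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(records):
    """Transform: adapt the input format to the output format."""
    return [{"id": int(r["id"]), "name": r["name"].strip().title()} for r in records]

def load(records):
    """Load: push the transformed data to an output channel (here, JSON text)."""
    return json.dumps(records)

raw = "id,name\n1, ada lovelace\n2, alan turing\n"
print(load(transform(extract(raw))))
```

An integration framework earns its keep precisely in `extract` and `load`: swapping the CSV source for a REST API or the JSON sink for a message queue should not force a rewrite of the whole pipeline.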

Complex integrations will combine these three steps differently. But the outcome is always to move information from one place to another, connecting different systems, where the information may be a full dataset or just a triggered event.

I already tackled the issue of choosing the right integration tool from an engineer’s perspective and what variables to take into account. But when we are talking about data science and data analysis, there is a requirement that comes on top of all the previous ones: the accessibility and ease of use of the tool.


Also called No-Code-ETL or Zero-Code-ETL, Zero-ETL is the next-generation integration paradigm in which users can perform ETL seamlessly. In the traditional way of performing ETL, you needed a developer to write the implementation, even if the code used a simplified language like the ones offered by Apache Camel. And then you needed someone to deploy it.

While writing a workflow in a simplified language is a much easier task than writing it from scratch in Python or Java, it still requires certain skills and maintenance work afterwards.
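As an illustration of such a simplified language, a small route in Apache Camel’s YAML DSL might look like the sketch below. The endpoint URIs are hypothetical placeholders; the point is that even this readable form still has to be written, versioned, and maintained by someone.

```yaml
# A sketch of a Camel route in the YAML DSL: poll a folder,
# convert CSV rows to JSON, and publish them to a Kafka topic.
- from:
    uri: "file:orders/incoming"
    steps:
      - unmarshal:
          csv: {}
      - marshal:
          json: {}
      - to: "kafka:orders"
```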

Zero-ETL helps you focus on a Domain-Driven Design approach. We switch the focus to the conceptual data you are going to use, not the technicalities. Data is moved from one system to the next without worrying about intermediate steps like transformation and cleaning. You connect to those systems without really caring what technology lies behind them.

Zero-ETL is defined differently depending on who you ask. In summary, Zero-ETL differs from classic ETL in that the interaction between the different data warehouses and data lakes is transparent to the user. You can mix data coming from different sources without really worrying about where the data comes from or what format it has.
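To make the "mixing" concrete, the sketch below unifies records from two hypothetical sources with different shapes, a relational-style row and a document-style record, into one view. In a Zero-ETL setting this normalization is exactly the part the platform is supposed to hide from the user.

```python
import json

# Two hypothetical sources with different schemas for the same domain.
sql_rows = [{"customer_id": 1, "full_name": "Ada Lovelace"}]
doc_records = [{"id": "2", "name": {"first": "Alan", "last": "Turing"}}]

def from_sql(row):
    """Normalize a relational-style row into the unified schema."""
    return {"id": row["customer_id"], "name": row["full_name"]}

def from_doc(rec):
    """Normalize a document-style record into the unified schema."""
    return {"id": int(rec["id"]), "name": f'{rec["name"]["first"]} {rec["name"]["last"]}'}

unified = [from_sql(r) for r in sql_rows] + [from_doc(r) for r in doc_records]
print(json.dumps(unified))
```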

The wonders of Data Mesh

Some cloud platforms like Amazon and Google are selling basic Zero-ETL capabilities that transfer data in near real time from one of their services to another, sometimes also transforming the data transparently, such as offering a NoSQL-to-SQL translation.

Let’s forget for a moment that this has already been possible with the proper tools in place for a long time, and focus on how they all conveniently forget that there is a whole universe outside their offerings; how they are trying to force a dependency on their platforms. They deliberately ignore the hybrid cloud, which is what most of us are working with: a cloud composed of several providers, different services, and protocols. A cloud in which a transparent, seamless integration would require providers to speak the same language. Data Mesh across providers is a reality, and the Zero-ETL provided by most cloud platforms does not cover that use case properly.

If you are a Data-Driven organization, you already know that Data Science has gone through several trends in the past decades. We used to hear a lot of buzz about big data. Crowdsourcing. Interoperability. At some point we started talking about data warehouses, data lakes, data ponds, water gardens,…

Is it possible to achieve a Zero-ETL in a hybrid cloud world?

Data Mesh across providers

Once we change the perspective of how we view data, we can treat it more like a product. It makes sense to store each data domain in its own data store and provider, the one that suits it best. Then we need to worry about how we are going to build our applications and services using that data. This paradigm usually comes with Event-Driven architectures and data streams.

Sometimes we will have duplicated data in different formats and schemas, to offer it in shapes more suitable for each domain. We have to carefully consider how and when to synchronize the different data stores to avoid inconsistencies.

With this change of perspective, there is also a shift in how we approach data mesh. Instead of seeing data only as something that can be ingested into our software, we now also have data being served over common protocols, protocols that need to be easy to discover. Instead of centralized pipelines that distribute data changes, we now publish events as streams.

Staying ahead of the curve in Zero-ETL

There are many ways to perform Zero-ETL without tying yourself to a specific provider. The easiest is to use Free and Open Source Software in your software stack.

We will probably need to combine several clusters, but we will still want to use our hybrid cloud as one single cloud. We can trust Skupper to do that work for us.

Distributed event streaming platforms like Kafka can offer us a decentralized solution for connecting different services and data. Shifting to federated data storage with an event-driven stream of changes requires careful synchronization, and Kafka can help with that.
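The synchronization idea can be sketched with nothing but the standard library. This is a toy simulation, not Kafka: a real deployment would use topics, partitions, and consumer groups, and the store names here are hypothetical. The core mechanism is the same, though: every write to the source store is published as an ordered change event, and a consumer replays those events to keep a downstream replica consistent.

```python
from queue import Queue

# Simulated change stream: each event describes one change in the source store.
stream = Queue()

source = {}
replica = {}

def write(key, value):
    """Write to the source store and publish the change as an event."""
    source[key] = value
    stream.put({"op": "upsert", "key": key, "value": value})

def consume():
    """Apply pending change events to the replica, in publication order."""
    while not stream.empty():
        event = stream.get()
        if event["op"] == "upsert":
            replica[event["key"]] = event["value"]

write("user:1", {"name": "Ada"})
write("user:1", {"name": "Ada Lovelace"})
consume()
print(replica)
```

Because the events are consumed in the order they were published, the replica always converges to the latest state of the source, which is exactly the property we need when the same domain data is duplicated across stores.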

Camel K will help us seamlessly deploy and manage the integrations in a Kubernetes cluster. Installed as an operator, it will deploy and monitor the middleware integrations needed for your specific use cases and make them serverless, if possible.
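For illustration, a Camel K Integration is a Kubernetes custom resource wrapping a route; the operator watches it, builds it, and runs it. The sketch below uses hypothetical endpoint names.

```yaml
apiVersion: camel.apache.org/v1
kind: Integration
metadata:
  name: orders-to-kafka
spec:
  flows:
    - from:
        uri: "file:orders/incoming"
        steps:
          - to: "kafka:orders"
```

Applying this resource to the cluster, for example with `kubectl apply`, is essentially all the deployment work left to the user; the operator takes care of the rest.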

The user interface

And last but not least, you need some low-code or no-code editor to build your ETL. That’s what Kaoto is for. Kaoto is a multi-DSL flow editor that allows you to create and deploy integrations visually. Kaoto supports Apache Camel and works seamlessly with Camel K to deploy the workflows it generates.

Building no-code Zero-ETL with Kaoto

You can test Kaoto using the official demo instance. The demo does not allow deployment, but it lets you see how the design of flows works.

With all these pieces of software, you can build a strong stack for your data science and data analysis without creating any kind of dependency on particular providers.

Author: María Arias de Reyna Domínguez

