Updated: May 2, 2022
Copying data to multiple locations in order to integrate it has long been common practice in the industry. But is there a way to see a combined view of all the data sources and silos without physically copying the data?
We created many data warehouses, data marts and data lakes. We have realised that we have copied data to so many places that maintenance, security and data governance have now become a challenge. Did we at least remove silos? No, we still have data silos. So what is the solution? Is there an alternative? The answer is ‘yes’ and it is called data virtualisation. Though it is not a magic bullet, it is still worth exploring. Let’s discuss it in this article.
Data virtualisation combines data from different sources into a single, unified virtual view. The data remains in the source system itself and is not replicated anywhere else.
How does it work?
You can achieve data virtualisation by performing the following three actions: Connect, Combine and Serve.
Connect to each data source, for example through a JDBC connection for databases or an HTTP client URL for JSON files.
Combine: Once you have connected to the data sources, you can create a base view over each source and integrate the base views into a single unified schema. This happens virtually, without physically replicating the source data.
Serve the data to all the consumers, such as data analysts, machine learning engineers and data scientists. The good thing is that they don’t need to know where the data comes from or what its original format was.
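The three actions can be sketched in a few lines of Python. This is only an illustration and not how any particular data virtualisation product works: the in-memory SQLite database stands in for a relational source, the JSON string stands in for a document fetched over HTTP, and every table, field and function name is made up for the example.

```python
import json
import sqlite3

# Connect: two independent, hypothetical sources.
# An in-memory SQLite database plays the role of a relational system.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 120.0), (2, 35.5), (1, 60.0)]
)

# A JSON string plays the role of a document an HTTP endpoint might return.
customers_json = '[{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ben"}]'


def customers_base_view():
    """Base view over the JSON source, parsed on every call."""
    return {c["id"]: c["name"] for c in json.loads(customers_json)}


def orders_base_view():
    """Base view over the relational source, queried on every call."""
    return orders_db.execute(
        "SELECT customer_id, amount FROM orders ORDER BY rowid"
    ).fetchall()


def unified_view():
    """Combine: join the base views at read time; nothing is stored."""
    names = customers_base_view()
    return [
        {"customer": names[cid], "amount": amount}
        for cid, amount in orders_base_view()
    ]


# Serve: consumers call unified_view() without knowing the source formats.
rows = unified_view()
print(rows)
```

The consumer only sees `unified_view()`. Whether the customer names came from JSON or from a database is hidden behind the base views, and no combined copy of the data is ever written anywhere.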
Advantages from a data engineer’s perspective:
1. Accelerated delivery: As you don’t create any physical replica of the data, data virtualisation can deliver a minimum viable product a lot faster than traditional data warehouse solutions.
2. Abstraction: Data virtualisation uses service-oriented architecture and decouples storage from processing.
3. Security: As data virtualisation combines the data and serves it for consumption from one place, you can implement all data security controls and governance in one place.
4. Lineage: You can track the data lineage of the virtual target dataset as it is combined centrally.
5. Reuse: You can apply the same business logic across all the sources, improving developer productivity.
6. POC: You can use data virtualisation as a proof of concept before committing to an expensive data warehouse.
7. Change the source data: You can add or remove columns a lot faster than with ETL processes.
8. Data ownership: As you don’t copy or replicate the data, ownership of the data assets stays with the respective source-side business.
9. Transform and clean: You can perform transformation and data cleaning activities virtually before you serve them to consumers.
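As a rough illustration of points 3 and 9, the sketch below uses a plain SQL view in an in-memory SQLite database to clean a name column and mask an email column at read time. The table, the view and the masking rule are all hypothetical; a real data virtualisation layer would have its own way of defining such rules.

```python
import sqlite3

# Hypothetical source table with untidy names and sensitive emails.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "  asha ", "asha@example.com"), (2, "BEN", "ben@example.com")],
)

# The view cleans and masks the data virtually; the source rows are untouched.
conn.execute(
    """
    CREATE VIEW customers_served AS
    SELECT id,
           lower(trim(name)) AS name,                          -- clean
           '***' || substr(email, instr(email, '@')) AS email  -- mask
    FROM customers
    """
)

served = conn.execute(
    "SELECT id, name, email FROM customers_served ORDER BY id"
).fetchall()
raw_name = conn.execute("SELECT name FROM customers WHERE id = 1").fetchone()[0]

print(served)
print(repr(raw_name))
```

Because consumers query only `customers_served`, the cleaning and masking rules live in one place, while the raw value in the source table stays exactly as its owner stored it.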
Advantages from a data scientist's perspective:
1. Simplicity: Data scientists don’t need to extract data from disparate sources and merge it on the consumption side, as the data is available in a single place.
2. Single source of truth: As data virtualisation provides a unified schema that aggregates data from all sources, it serves as a ready-made single source of truth.
3. Reflect changes to underlying data: As the raw data still stays in the source, the changes in data are reflected at the consumption layer with no additional process.
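The last point is easy to see by comparing a view with a physical copy. In this minimal sketch, which assumes a made-up `stock` table in an in-memory SQLite database, the view picks up a change to the source immediately, while the copied table keeps the old value until some process refreshes it.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE stock (item TEXT, qty INTEGER)")
source.execute("INSERT INTO stock VALUES ('widget', 10)")

# Virtual: a view that reads the source whenever it is queried.
source.execute("CREATE VIEW stock_view AS SELECT item, qty FROM stock")

# Physical: a snapshot copied at this point in time.
source.execute("CREATE TABLE stock_copy AS SELECT item, qty FROM stock")

# The source system changes after the copy was taken.
source.execute("UPDATE stock SET qty = 7 WHERE item = 'widget'")

virtual_qty = source.execute("SELECT qty FROM stock_view").fetchone()[0]
copied_qty = source.execute("SELECT qty FROM stock_copy").fetchone()[0]

print(virtual_qty, copied_qty)  # prints: 7 10
```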
Points to consider:
1. Data virtualisation is not a replacement for data integration techniques that physically copy the data. You may still need data warehouses and data lakes where the integration logic is complex or the data volume is too high.
2. Performance tuning is the key to successful data virtualisation.
3. As the consumption layer serves data from many disparate sources, its uptime depends on the uptime of the source systems, so the two need to be kept in sync.
4. As with other data integration techniques, you will need a comprehensive data catalogue that describes the metadata of the disparate data stores: where each data store is, what it contains, how frequently it is updated, and so on.
5. Data definitions vary from department to department and from business to business. However, to create a meaningful data virtualisation solution, you will need a uniform data definition across the organisation.
I hope this gives you a high-level understanding of data virtualisation. I have tried to minimise the use of jargon and to focus on concepts. Thanks for reading. If you find this article useful, please like, share and comment.
Views are personal and in no way reflect my current & previous organisations and vendor partners.