Over the past few years, I have interacted with many technology managers and leaders, and one thing I have noticed is that not everyone has a correct understanding of Data Ingestion. Perhaps because of the hype the data tech industry has created of late, anything related to data is perceived as “running a complex algorithm.”
Hence, I have tried to demystify one of the most widely used Data Engineering terms: “Data Ingestion.”
Let us start with the definition. Data Ingestion is the process of moving data from a source to a target. It is as simple as that. Notice that no transformation is involved in this process; that is the difference between Data Ingestion and ETL (Extract, Transform, Load).
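To make the distinction concrete, here is a minimal sketch in Python. The function names `ingest` and `etl`, the sample records, and the in-memory lists standing in for a source and a target are all my own assumptions for illustration, not part of any real framework.

```python
from typing import Callable, Iterable, List


def ingest(source: Iterable[dict], target: List[dict]) -> None:
    """Data Ingestion: copy every record from source to target unchanged."""
    for record in source:
        target.append(dict(record))  # shallow copy; no field is added or altered


def etl(source: Iterable[dict],
        transform: Callable[[dict], dict],
        target: List[dict]) -> None:
    """ETL: extract each record, transform it, then load the result."""
    for record in source:
        target.append(transform(dict(record)))


# Hypothetical source data: amounts arrive as strings.
source_rows = [{"name": "alice", "amount": "10"}, {"name": "bob", "amount": "25"}]

lake: List[dict] = []
ingest(source_rows, lake)  # lake now holds the rows exactly as they were

warehouse: List[dict] = []
etl(source_rows, lambda r: {**r, "amount": int(r["amount"])}, warehouse)
```

After this runs, `lake` is an exact copy of the source, while `warehouse` holds the same rows with `amount` converted to an integer. The only difference between the two functions is the `transform` step.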
The term “Data Ingestion” was coined when open-source technologies started to become popular and people started to build “Big Data” infrastructure and platforms to store huge volumes of data and run computations on them.
Once the Big Data platforms were in place, organizations started to move data from multiple sources into one place where it could be processed and eventually consumed. This place is called a Data Lake, and Data Ingestion is how Data Lakes are built. Initially, Data Lakes were created on on-premises Hadoop clusters; in the last couple of years, they have been created on cloud platforms.
To achieve Data Ingestion, organizations usually build a software framework, typically called a Data Ingestion Framework. We will delve into the details of the Ingestion Framework in later articles. Let us now discuss the types of Data Ingestion.
Based on the processing method, Data Ingestion can be classified as follows:
File Ingestion, wherein the source systems push files to the Ingestion Framework’s landing area and the data is ingested from the landing area into the target location(s).
Database Ingestion, wherein the Ingestion Framework connects to the database over JDBC, extracts the data, and moves it to the target location(s).
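The two processing methods above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `ingest_files` and `ingest_table` are hypothetical, local directories stand in for the landing area and the target, and Python's built-in `sqlite3` module stands in for a JDBC connection (a real framework on the JVM would use a JDBC driver instead).

```python
import csv
import shutil
import sqlite3
from pathlib import Path
from typing import List


def ingest_files(landing_dir: Path, target_dir: Path) -> List[str]:
    """File Ingestion: move every file from the landing area to the target, unchanged."""
    target_dir.mkdir(parents=True, exist_ok=True)
    ingested = []
    for path in sorted(landing_dir.iterdir()):
        if path.is_file():
            shutil.move(str(path), str(target_dir / path.name))
            ingested.append(path.name)
    return ingested


def ingest_table(conn: sqlite3.Connection, table: str, target_dir: Path) -> Path:
    """Database Ingestion: extract all rows of a table and write them out as CSV."""
    target_dir.mkdir(parents=True, exist_ok=True)
    # Table names cannot be bound as query parameters, so `table` is
    # assumed to come from trusted configuration, not from user input.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    out_path = target_dir / f"{table}.csv"
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column[0] for column in cursor.description])  # header row
        writer.writerows(cursor)  # data rows, exactly as stored in the source
    return out_path
```

In both cases the data lands in the target as it was at the source. The file path moves bytes as-is, and the database path only changes the container (rows to CSV), not the values.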
Based on the scheduling or triggering mechanism, it can be classified as follows:
Batch Ingestion, wherein the ingestion job runs regularly at a scheduled time.
Near Real-Time Ingestion, wherein the data is ingested whenever the data changes on the source side.
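The difference between the two triggering mechanisms can be shown with a small simulation. Everything here is an assumption for illustration: the `Source` class is a hypothetical in-memory source system that notifies subscribers on every insert, and `batch_ingest` represents a single run of a scheduled job (in practice a scheduler would trigger it, and change events would typically arrive through a change-data-capture or messaging tool).

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class Source:
    """Hypothetical in-memory source system that announces each new record."""

    def __init__(self) -> None:
        self.records: List[Record] = []
        self._listeners: List[Callable[[Record], None]] = []

    def subscribe(self, listener: Callable[[Record], None]) -> None:
        self._listeners.append(listener)

    def insert(self, record: Record) -> None:
        self.records.append(record)
        for listener in self._listeners:
            listener(record)  # change notification to every subscriber


def batch_ingest(source: Source, target: List[Record], offset: int) -> int:
    """One scheduled run: copy the records added since the previous run.

    Returns the new offset to pass to the next run.
    """
    new_records = source.records[offset:]
    target.extend(new_records)
    return offset + len(new_records)


source = Source()
batch_target: List[Record] = []
realtime_target: List[Record] = []

# Near Real-Time: the target is updated as soon as the source changes.
source.subscribe(realtime_target.append)

source.insert({"id": 1})
source.insert({"id": 2})

# Before the scheduled job runs, the batch target is still empty.
pending_before_batch = len(source.records) - len(batch_target)

# Batch: nothing moves until the scheduled job runs.
offset = batch_ingest(source, batch_target, offset=0)
```

Both targets end up with the same records. What differs is when they arrive: the near real-time target receives each record at insert time, while the batch target receives everything accumulated since the last run.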
Many open-source technologies and tools are available to perform Data Ingestion, and the choice among them depends on the needs of the business.
I hope this article serves as an introduction to Data Ingestion; we will discuss more in later articles.
Thanks for reading this article. Please share your views and comments. The content of this article is purely my personal view and in no way reflects the views of my current or earlier organizations or vendor partners.