The image depicts the unlimited opportunities a woman has in front of her. Banner created using canva.com
In an enterprise setup, a data ingestion framework is used to help govern and manage ingestion efficiently. What is a data ingestion framework? Why do we need one? And what should you consider while building one? This article discusses each question in turn.
1. What is a data ingestion framework?
A data ingestion framework is a software framework that handles many types of data ingestion through configuration settings alone or with minimal code. Two terms, Feed and Run, are used throughout the rest of the article, so they are defined first.
A Feed is a setup in the framework that connects a source to a target. A Run is the activity performed each time a Feed is triggered. Hence, a Feed can have many Runs, one for each time it was triggered.
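The relationship between a Feed and its Runs can be sketched as a small data model. This is only an illustration: the class fields, the status values, and the example source and target names are assumptions, not part of any particular framework.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Run:
    """One execution of a Feed."""
    run_id: int
    started_at: datetime
    status: str = "RUNNING"  # hypothetical states: RUNNING, SUCCESS, FAILED


@dataclass
class Feed:
    """Configuration that connects one source to one target."""
    name: str
    source: str
    target: str
    runs: List[Run] = field(default_factory=list)

    def trigger(self) -> Run:
        """Create and record a new Run each time the Feed is triggered."""
        run = Run(run_id=len(self.runs) + 1,
                  started_at=datetime.now(timezone.utc))
        self.runs.append(run)
        return run


# Hypothetical Feed: the names below are made up for illustration.
feed = Feed(name="daily_orders", source="crm.orders", target="warehouse.orders")
first = feed.trigger()
second = feed.trigger()
```

Triggering the same Feed twice leaves one Feed holding two Runs, which is the one-to-many relationship described above.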
2. Why do we need a data ingestion framework?
Ingestion can look like a straightforward process: you could write a piece of code to move the data rather than build an expensive framework that ends up calling the same code. So why do we need a framework in the first place? Here are the top four reasons.
2.1. Automation:
You will often need to ingest data from the same source to the same target on a regular basis. An ingestion framework automates this recurring work.
2.2. Regulatory requirements:
Highly regulated industries, such as banking and pharma, have to present audit logs of each activity in the data pipeline to regulators whenever requested. An ingestion framework that records every Run makes these logs readily available.
2.3. Controls:
With an ingestion framework, you can control the activities performed on the production servers by granting the right access to the right roles.
2.4. Data integrity:
A framework forces you to capture the details required to maintain data integrity and to establish source-to-target data lineage for each Feed.
3. Recipe for a data ingestion framework:
Building an ingestion framework involves many activities. At a minimum, consider the points below. The list is not exhaustive, but it is enough to build a basic framework.
3.1. GUI:
Develop a GUI so that you can set up a Feed, control activities, and democratise information about the Feed by giving users the access they need.
3.2. Know your Run:
Capture basic metadata and present it whenever needed: How many Feeds are scheduled? How many Runs were successful? Where is the data coming from? Where is it stored on the target? What is the size of the data?
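As a rough sketch, the questions above could be answered from a list of Run records. The record fields (feed, source, target, size_bytes, status) and the sample values are hypothetical; a real framework would read them from its own metadata store.

```python
from collections import Counter
from typing import Dict, List, Union

Record = Dict[str, Union[str, int]]

# Hypothetical Run records, made up for illustration.
run_records: List[Record] = [
    {"feed": "orders", "source": "crm.orders", "target": "dw.orders",
     "size_bytes": 1200, "status": "SUCCESS"},
    {"feed": "orders", "source": "crm.orders", "target": "dw.orders",
     "size_bytes": 1300, "status": "FAILED"},
    {"feed": "customers", "source": "crm.customers", "target": "dw.customers",
     "size_bytes": 500, "status": "SUCCESS"},
]


def summarise(records: List[Record]) -> Dict[str, int]:
    """Aggregate basic Run metadata into a few headline numbers."""
    status_counts = Counter(r["status"] for r in records)
    return {
        "feeds": len({r["feed"] for r in records}),
        "runs": len(records),
        "successful_runs": status_counts["SUCCESS"],
        "total_bytes": sum(int(r["size_bytes"]) for r in records),
    }


summary = summarise(run_records)
```

The same records also carry the source and target of each Run, so lineage questions can be answered from the same store.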
3.3. Ensure Quality:
Ensure that a Run is marked successful only after it passes the quality checks. For example, check that the number of rows and columns is the same at the source and the target.
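A minimal sketch of such a check, assuming the framework can already obtain row and column counts from both sides. The TableShape type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TableShape:
    """Row and column counts for one table (hypothetical helper type)."""
    rows: int
    columns: int


def quality_check(source: TableShape, target: TableShape) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if source.rows != target.rows:
        problems.append(f"row count mismatch: {source.rows} != {target.rows}")
    if source.columns != target.columns:
        problems.append(
            f"column count mismatch: {source.columns} != {target.columns}")
    return problems


def final_status(source: TableShape, target: TableShape) -> str:
    """Mark the Run successful only when no quality problem was found."""
    return "SUCCESS" if not quality_check(source, target) else "FAILED"


good = final_status(TableShape(100, 5), TableShape(100, 5))
bad = final_status(TableShape(100, 5), TableShape(98, 5))
```

Matching counts are only a basic check; checksums or column-level comparisons could be added in the same place.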
3.4. Logging & Error Handling:
Capture all information related to a Run in a user-friendly manner so that a data engineer can troubleshoot a failed Run without making assumptions.
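One possible shape for this, using Python's standard logging module. The function name execute_run, the log message layout, and the status strings are assumptions made for this sketch.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingestion")


def execute_run(feed_name: str, run_id: int, step: Callable[[], None]) -> str:
    """Run one step and log its outcome with the Feed and Run identifiers."""
    logger.info("feed=%s run=%s status=STARTED", feed_name, run_id)
    try:
        step()
    except Exception:
        # logger.exception records the full traceback, so the engineer
        # sees the actual error rather than having to guess at it.
        logger.exception("feed=%s run=%s status=FAILED", feed_name, run_id)
        return "FAILED"
    logger.info("feed=%s run=%s status=SUCCESS", feed_name, run_id)
    return "SUCCESS"


def broken_step() -> None:
    """Stand-in for an ingestion step that fails."""
    raise ConnectionError("source database unreachable")


ok_status = execute_run("orders", 1, lambda: None)
failed_status = execute_run("orders", 2, broken_step)
```

Putting the Feed name and Run number in every message lets the log for a single Run be filtered out later.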
3.5. Alerting mechanism:
Depending on the size of the data, ingestion can take several hours. Imagine how much better the teams feel if the framework monitors each Run and alerts them automatically when it fails, rather than someone having to monitor it manually. With alerts, the teams can act as soon as a failure occurs.
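A simple sketch of the idea: scan finished Runs and notify the team about each failure. The notify callback stands in for whatever channel you use (email, chat, pager); it and the record fields are hypothetical.

```python
from typing import Callable, Dict, List


def alert_on_failures(runs: List[Dict[str, str]],
                      notify: Callable[[str], None]) -> int:
    """Send one alert per failed Run and return how many were sent."""
    sent = 0
    for run in runs:
        if run["status"] == "FAILED":
            notify(f"Feed {run['feed']} run {run['run_id']} failed: {run['error']}")
            sent += 1
    return sent


# Hypothetical Run records; in practice these come from the Run metadata.
finished_runs = [
    {"feed": "orders", "run_id": "7", "status": "SUCCESS", "error": ""},
    {"feed": "payments", "run_id": "3", "status": "FAILED", "error": "timeout"},
]

outbox: List[str] = []  # collects messages here instead of really sending them
alerts_sent = alert_on_failures(finished_runs, outbox.append)
```

Passing the notifier as a parameter keeps the monitoring logic independent of the delivery channel.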
3.6. Versioning:
After you set up a Feed, you may have to amend it later as business needs change, for instance to add or delete a few tables. Hence, incorporate version control of Feeds in your framework so that you can trace changes and revert to old Feed settings when needed.
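Feed versioning can be sketched as an append-only history of settings. The VersionedFeed class and its method names are hypothetical, and the history lives in memory here; a real framework would persist it.

```python
import copy
from typing import Any, Dict, List


class VersionedFeed:
    """Keeps every version of a Feed's settings (versions are numbered from 1)."""

    def __init__(self, settings: Dict[str, Any]) -> None:
        self._versions: List[Dict[str, Any]] = [copy.deepcopy(settings)]

    @property
    def current(self) -> Dict[str, Any]:
        """Return a copy of the latest settings."""
        return copy.deepcopy(self._versions[-1])

    def amend(self, settings: Dict[str, Any]) -> int:
        """Store new settings and return the new version number."""
        self._versions.append(copy.deepcopy(settings))
        return len(self._versions)

    def revert_to(self, version: int) -> int:
        """Re-apply an old version as the newest one, keeping history intact."""
        if not 1 <= version <= len(self._versions):
            raise ValueError(f"unknown version {version}")
        self._versions.append(copy.deepcopy(self._versions[version - 1]))
        return len(self._versions)


feed_settings = VersionedFeed({"tables": ["orders", "customers"]})
v2 = feed_settings.amend({"tables": ["orders"]})  # a table was dropped
v3 = feed_settings.revert_to(1)                   # bring the old settings back
```

Reverting adds a new version instead of deleting later ones, so the full trail of changes stays traceable.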
3.7. Smart Rerun:
When a Run fails, your framework should be smart enough to rerun from the point of failure rather than from the beginning.
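One way to sketch this is a checkpoint of completed tables that a rerun skips. It assumes the point of failure is tracked per table and that the checkpoint is an in-memory set; a real framework would persist it. The function and table names are hypothetical.

```python
from typing import Callable, List, Set


def run_feed(tables: List[str],
             load_table: Callable[[str], None],
             completed: Set[str]) -> bool:
    """Load each table not yet in the checkpoint; stop at the first failure."""
    for table in tables:
        if table in completed:
            continue  # already loaded by an earlier attempt
        try:
            load_table(table)
        except Exception:
            return False  # the checkpoint keeps what succeeded so far
        completed.add(table)
    return True


attempts: List[str] = []


def flaky_load(table: str) -> None:
    """Stand-in loader that fails the first time it sees 'payments'."""
    attempts.append(table)
    if table == "payments" and attempts.count("payments") == 1:
        raise RuntimeError("connection dropped")


tables = ["orders", "payments", "refunds"]
checkpoint: Set[str] = set()
first_ok = run_feed(tables, flaky_load, checkpoint)   # fails at payments
second_ok = run_feed(tables, flaky_load, checkpoint)  # resumes at payments
```

On the second attempt the orders table is not loaded again, because the checkpoint already lists it.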
3.8. Load Balance:
If you run your ingestion framework on on-premises servers, make sure you can distribute the processing across different compute resources. A framework without load balancing will become a challenge as the number of Feeds grows over time.
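As an illustration only, the sketch below spreads Feeds over workers with a greedy heuristic: the largest Feed goes first, always to the least-loaded worker. The heuristic, the worker names, and the use of data size as the load measure are all assumptions; a production scheduler would consider far more.

```python
import heapq
from typing import Dict, List


def assign_feeds(feed_sizes: Dict[str, int],
                 workers: List[str]) -> Dict[str, List[str]]:
    """Give each Feed, largest first, to the worker with the least load."""
    heap = [(0, worker) for worker in workers]  # (current load, worker name)
    heapq.heapify(heap)
    assignment: Dict[str, List[str]] = {worker: [] for worker in workers}
    by_size = sorted(feed_sizes.items(), key=lambda item: item[1], reverse=True)
    for feed, size in by_size:
        load, worker = heapq.heappop(heap)  # least-loaded worker right now
        assignment[worker].append(feed)
        heapq.heappush(heap, (load + size, worker))
    return assignment


# Hypothetical Feed sizes (in GB) and worker names.
plan = assign_feeds({"a": 8, "b": 5, "c": 4, "d": 3}, ["w1", "w2"])
```

With these numbers, worker w1 receives Feeds a and d (11 GB) and worker w2 receives Feeds b and c (9 GB).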
3.9. Dashboards:
Last but not least, build a dashboard that shows real-time run statistics for the Feeds running at any point in time.
---------------------------------------------------------------------------------------------------------------------------
I have limited this article to structured data. I hope it provides a good understanding of data ingestion frameworks. Thanks for reading. Please share your views and comments. The content of this article is purely my own view and in no way reflects the views of my current or earlier organizations or vendor partners.
---------------------------------------------------------------------------------------------------------------------------
Image Credit: free images from Canva.com