Real-time data ingestion: an overview
Updated: May 2, 2022
Banner created using free images from Canva
We generate data virtually everywhere, every time we are online. We are swimming in data. Are we using that data for our decision-making? The answer is 'yes'. However, are we using it at the right time? The answer is not always 'yes', because decision-making systems often do not receive the transactional data as soon as it is captured.
To give a rudimentary analogy: a recommendation engine that receives browsing data only at the end of the day can recommend a product only at the end of the day. By then, the potential customer may have already purchased elsewhere what you wanted to recommend.
So, you can make the best use of data only when you use it at the right time. In this analogy, the right time is while the potential customer is still browsing.
Necessity is the mother of invention, and real-time data ingestion is the invention that addresses this problem. What is it? How is it done? That is what we will discuss in this article.
What is real-time ingestion?
Real-time data ingestion is the process of moving data from source to target as soon as it is captured.
Why is it important?
In some use cases, the value of data decreases with latency. Hence, you must analyze and act on the data as soon as you capture it.
How do you achieve it?
Whenever you capture data, you store it in a transactional database. As soon as the data lands in the database, trigger an ingestion feed to move it to the data lake that the decision-making system works on. You can achieve this with a process called "Change Data Capture" (CDC). So what is CDC?
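To make the trigger-based flow concrete, here is a minimal, self-contained sketch using an in-memory SQLite database. The `orders` table, the `change_log` table, and the column names are all hypothetical, chosen only for illustration; production CDC tools typically read the database's transaction log instead of using triggers.

```python
import sqlite3

# A minimal CDC sketch: an AFTER INSERT trigger appends each change to a
# "change_log" table, which a downstream feed can poll and ship onward.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER);
    CREATE TABLE change_log (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT, row_id INTEGER, item TEXT, qty INTEGER
    );
    CREATE TRIGGER orders_cdc AFTER INSERT ON orders
    BEGIN
        INSERT INTO change_log (op, row_id, item, qty)
        VALUES ('insert', NEW.id, NEW.item, NEW.qty);
    END;
""")

# A normal transactional write automatically produces a change event.
conn.execute("INSERT INTO orders (item, qty) VALUES ('book', 2)")
conn.commit()

# The ingestion feed reads only the new changes, not the full table.
changes = conn.execute("SELECT op, row_id, item, qty FROM change_log").fetchall()
print(changes)  # [('insert', 1, 'book', 2)]
```

The key design point is that the feed never rescans the source table; it consumes only the rows recorded in the change log, which is what keeps the load on the source system small.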
Change data capture (CDC) refers to identifying and capturing changes made to data in a database and then delivering those changes in real-time to a downstream process or system.
CDC provides real-time or near-real-time movement of data by processing data continuously as new database events occur. A database event can be any DML operation, such as an insert, update, or delete. The data from these events is captured using an event streaming platform (ESP) for real-time ingestion. So what is an ESP?
An Event Streaming Platform (ESP) is a highly scalable and durable system capable of continuously ingesting gigabytes of events per second from various sources. The data collected is available in milliseconds for intelligent applications that can react to events as they happen.
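The essence of an ESP can be sketched in a few lines: an append-only, ordered event log that producers write to and that each consumer reads from at its own offset. This toy class is purely illustrative (real platforms such as Apache Kafka add partitioning, durability, and replication); the event payloads are hypothetical CDC events.

```python
# A toy event streaming platform sketch: producers append events to an
# ordered log; each consumer tracks its own offset, so consumers can
# read independently and replay past events if needed.
class EventLog:
    def __init__(self):
        self.log = []                # append-only, ordered event log

    def publish(self, event):
        self.log.append(event)

    def read_from(self, offset):
        return self.log[offset:]     # consumers track their own offset

log = EventLog()
log.publish({"op": "insert", "table": "orders", "id": 1})
log.publish({"op": "update", "table": "orders", "id": 1})

# A consumer that already processed the first event resumes at offset 1.
new_events = log.read_from(1)
print(new_events)  # [{'op': 'update', 'table': 'orders', 'id': 1}]
```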
The diagram below should make this easier to follow:
Advantages of real-time ingestion over batch ingestion:
Data is available for analysis and decision-making as soon as it is captured.
Compared to batch ingestion, it reduces the load on the source system because the data is ingested in small volumes.
It helps reduce data duplication because real-time ingestion moves only the changed data, not the full volume.
If there is a failure in the pipeline, you can rerun the feed from the point of failure, which is much faster than rerunning a batch job.
Note: I have described the advantages relative to batch ingestion. Real-time ingestion also enables real-time analytics, which brings many benefits to the business, but those are beyond the scope of this article.
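The "rerun from the point of failure" advantage rests on the consumer committing its offset after each processed event. Here is a small sketch of that pattern, with a simulated crash and restart; the event values and the in-memory checkpoint store are hypothetical stand-ins for a real stream and a durable offset store.

```python
# Resuming a feed from the point of failure: the consumer commits its
# offset after each successfully processed event, so a rerun skips the
# work that already completed instead of replaying the whole batch.
events = ["e0", "e1", "e2", "e3", "e4"]
processed = []
checkpoint = {"offset": 0}          # stand-in for a durable checkpoint store

def run_feed(fail_at=None):
    for offset in range(checkpoint["offset"], len(events)):
        if offset == fail_at:
            raise RuntimeError("pipeline failure")   # simulated crash
        processed.append(events[offset])
        checkpoint["offset"] = offset + 1            # commit progress

try:
    run_feed(fail_at=3)             # first run fails mid-stream
except RuntimeError:
    pass

run_feed()                          # rerun resumes at the saved offset
print(processed)  # ['e0', 'e1', 'e2', 'e3', 'e4'] - no duplicates
```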
Popular use cases:
Real-time data ingestion use cases depend on the business need. A few examples:
Real-time customer relationship management applications that track customers' time, location, and purchases so that merchant outlets can make offers in real time.
Real-time fraud detection systems that must identify fraudulent transactions almost immediately so that action can be taken, such as calling the customer or blocking the card.
Real-time trading apps that alert when upper or lower limits are reached so that the user can act, for instance by selling or buying the stock.
Real-time monitoring systems in hospitals that track patients' vitals in real time, predict their health condition, and alert the emergency staff.
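The trading-alert use case above reduces to checking each incoming tick against configured limits the moment it arrives. A minimal sketch, with a hypothetical symbol, limits, and a simulated tick stream:

```python
# Real-time trading alert sketch: each tick is checked against the
# configured limits as soon as it arrives, and an alert fires the
# moment a limit is crossed (rather than at end-of-day batch time).
def check_tick(symbol, price, lower, upper):
    if price <= lower:
        return f"ALERT {symbol}: {price} hit lower limit {lower} - consider buying"
    if price >= upper:
        return f"ALERT {symbol}: {price} hit upper limit {upper} - consider selling"
    return None

alerts = []
for price in [101.0, 99.5, 97.8, 105.2]:          # simulated tick stream
    alert = check_tick("ACME", price, lower=98.0, upper=105.0)
    if alert:
        alerts.append(alert)

print(len(alerts))  # 2 (one lower-limit alert, one upper-limit alert)
```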
There are many use cases where real-time ingestion is essential. However, it is always advisable to adopt it based on actual requirements rather than for its own sake, as it is more expensive than a batch processing job.
I hope this gives a bird's-eye view of what real-time ingestion is and how it is used. I have limited this article to structured data, and I have made a conscious effort to minimize jargon and focus on concepts. Thanks for reading, and please share your views in the comments section.
The content of this article is based on my own views and in no way reflects those of my current and earlier organizations or vendor partners.
Image Credit:
Photo by Vikram Sundaramoorthy from Pexels
Photo by Amina Filkins from Pexels
References:
https://www.qlik.com/us/change-data-capture/cdc-change-data-capture
https://www.striim.com/change-data-capture-cdc-what-it-is-and-how-it-works/
https://medium.com/event-driven-utopia/anatomy-of-an-event-streaming-platform-part-1-dc58eb9b2412
https://www.sujithchandrasekaran.com/post/data-ingestion-framework-simplified