Photo by Sujithkumar Chandrasekaran at KBR Park, Hyderabad, India. Banner created using canva.com
You may have the most fuel-efficient car in the world, but if you are not driving at optimal speed, you will end up spending too much on fuel.
The same is true of data ingestion. You may have the best ingestion framework, but if you don’t use it properly, you will run into issues in the long run. In this article, I articulate 10 best practices for data ingestion. I am sure you will find them useful, as they will help you minimise technical debt in your area.
1. Use delta instead of full volume:
Wherever possible, ingest only the data that has changed since the last run. Banking transaction data is a good example: transactions carried out in the past will not change. Hence, instead of ingesting the full transaction table every time, ingest only the delta, i.e. the transaction data generated since the last run date. You can then append the delta to the data as of the last run to get the full snapshot.
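Below is a minimal sketch of delta ingestion using a high-watermark date. The table and column names (transactions, txn_date, transactions_snapshot) are illustrative assumptions, and sqlite3 merely stands in for whatever source and staging databases you actually use.

```python
import sqlite3
from datetime import date

# Illustrative assumption: the last successful run date is normally
# read from a control/metadata table, not hard-coded.
last_run_date = date(2024, 1, 31)

source = sqlite3.connect("source.db")    # stand-in for the OLTP source
staging = sqlite3.connect("staging.db")  # stand-in for the lake/staging area

# Pull only the rows created after the last run (the delta),
# not the full transaction table.
delta_rows = source.execute(
    "SELECT txn_id, account_id, amount, txn_date "
    "FROM transactions WHERE txn_date > ?",
    (last_run_date.isoformat(),),
).fetchall()

# Append the delta to the previously ingested snapshot.
staging.executemany(
    "INSERT INTO transactions_snapshot (txn_id, account_id, amount, txn_date) "
    "VALUES (?, ?, ?, ?)",
    delta_rows,
)
staging.commit()
```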
2. Don’t use a data warehouse as a source:
Don’t use a data warehouse as a source when ingesting into a data lake. A data warehouse is built for a specific use case, which means its data is already transformed and no longer in raw form. Data lakes, on the other hand, are built to store raw data. If you source from a data warehouse, you will end up creating another version of the warehouse rather than a data lake. Use the OLTP database as the source instead, as it contains the raw data.
3. Schedule logically:
Set the feed frequency based on how often the data is made available and/or consumed. For example, when the input files land once a day, a feed that triggers every hour runs unproductively for the rest of the day. Avoiding overly frequent cyclic feeds when they are not needed cuts down on unnecessary resource utilization.
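As a small sketch of aligning the feed with data availability, the check below only triggers the feed when something new has actually landed since the last run. The landing directory, file pattern and last-run timestamp are illustrative assumptions.

```python
from pathlib import Path
from datetime import datetime, timezone

# Illustrative assumptions: files land once a day in this directory, and the
# last successful run time is normally read from a control table.
LANDING_DIR = Path("/data/landing/sales")
last_run = datetime(2024, 1, 31, 2, 0, tzinfo=timezone.utc)

def new_files_since(last_run_ts: datetime) -> list[Path]:
    """Return files that landed after the last successful run."""
    return [
        f for f in LANDING_DIR.glob("*.csv")
        if datetime.fromtimestamp(f.stat().st_mtime, tz=timezone.utc) > last_run_ts
    ]

files = new_files_since(last_run)
if files:
    print(f"Triggering feed for {len(files)} new file(s)")
else:
    # Nothing has landed yet; running the feed again would waste resources.
    print("No new files; skipping this cycle")
```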
4. Use a naming convention:
Develop and adhere to an appropriate feed naming convention. It will help you understand a feed at a high level from the name itself. For example, a feed can be named <source>_<target>_<frequency>.
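A tiny sketch of building and parsing names under the <source>_<target>_<frequency> convention; the component values used here are purely illustrative.

```python
def feed_name(source: str, target: str, frequency: str) -> str:
    """Build a feed name following the <source>_<target>_<frequency> convention."""
    return f"{source}_{target}_{frequency}".lower()

def parse_feed_name(name: str) -> dict:
    """Recover the components from a feed name built with the same convention."""
    source, target, frequency = name.split("_")
    return {"source": source, "target": target, "frequency": frequency}

# Illustrative values
print(feed_name("corebanking", "datalake", "daily"))   # corebanking_datalake_daily
print(parse_feed_name("corebanking_datalake_daily"))
```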
5. Divide and rule:
If a feed involves hundreds of files, divide it into logical chunks and set up a separate feed for each chunk. This makes failures easier to isolate and rerun.
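One way to sketch this, assuming the files can be grouped by a subject-area prefix in their names (the prefix and file names below are assumptions):

```python
from collections import defaultdict
from pathlib import Path

# Illustrative assumption: file names start with a subject-area prefix,
# e.g. 'loans_2024-01-31.csv', 'cards_2024-01-31.csv'.
files = [
    Path("loans_2024-01-31.csv"),
    Path("loans_2024-01-30.csv"),
    Path("cards_2024-01-31.csv"),
    Path("deposits_2024-01-31.csv"),
]

chunks: dict[str, list[Path]] = defaultdict(list)
for f in files:
    prefix = f.name.split("_")[0]
    chunks[prefix].append(f)

# One feed per logical chunk, so a failure in 'cards'
# does not force a rerun of 'loans' and 'deposits'.
for prefix, chunk in chunks.items():
    print(f"feed: {prefix}_datalake_daily -> {len(chunk)} file(s)")
```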
6. Use a single schema per feed:
A source database can have multiple schemas. When performing database ingestion, set up a separate feed for each schema rather than pulling tables from all schemas in one feed.
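A hedged sketch of what one-feed-per-schema can look like in a simple feed configuration; the schema and table names are illustrative assumptions.

```python
# Illustrative assumption: the source database has two schemas, SALES and HR.
# Instead of one feed mixing tables from both, define one feed per schema.
feeds = {
    "sales_datalake_daily": {
        "schema": "SALES",
        "tables": ["orders", "order_items", "customers"],
    },
    "hr_datalake_daily": {
        "schema": "HR",
        "tables": ["employees", "departments"],
    },
}

for name, cfg in feeds.items():
    print(f"{name}: ingest {len(cfg['tables'])} tables from schema {cfg['schema']}")
```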
7. Choose the right timing:
If you schedule all your feeds at the same time, you will create a traffic-jam situation.
Distribute the feeds across different start times so that server utilization doesn’t peak at any single point. When utilization peaks, one or more feeds will be held up waiting for server resources before they can run. For example, if 9 AM is a peak time in your region, spread the feeds to run at, say, 6 AM, 7 AM, 8 AM and 9 AM.
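A small sketch of staggering feeds across a window of start hours; the feed names and the 6 AM to 9 AM window are illustrative assumptions.

```python
# Illustrative assumption: four daily feeds and a 6 AM - 9 AM window.
feeds = [
    "corebanking_datalake_daily",
    "cards_datalake_daily",
    "loans_datalake_daily",
    "deposits_datalake_daily",
]
start_hours = [6, 7, 8, 9]

# Round-robin the feeds across the available start hours so that
# server utilization never peaks at a single point in time.
schedule = {feed: f"{start_hours[i % len(start_hours)]:02d}:00"
            for i, feed in enumerate(feeds)}

for feed, start in schedule.items():
    print(f"{feed} starts at {start}")
```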
8. Simplify your feed:
8.1. No transformation: Ingestion should be a straightforward process. Don’t perform any complex transformation or filtering as part of ingestion. If you require filtering, do it as the next step in your data pipeline (see the sketch after this section).
8.2. One source, one feed: If not regulated, the landing area in your ingestion framework may receive files from many source systems. Make sure you set up one feed per source system.
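A minimal sketch of point 8.1: the ingestion step is a plain copy into the raw zone, and any filtering happens in a separate downstream step. The directory names and the filter condition are illustrative assumptions.

```python
import csv
from pathlib import Path

RAW_DIR = Path("raw")          # illustrative raw/landing zone
CURATED_DIR = Path("curated")  # illustrative downstream zone

def ingest(source_file: Path) -> Path:
    """Ingestion step: a straight copy into the raw zone, no transformation."""
    RAW_DIR.mkdir(exist_ok=True)
    target = RAW_DIR / source_file.name
    target.write_bytes(source_file.read_bytes())
    return target

def filter_step(raw_file: Path) -> Path:
    """Separate downstream step: apply filtering after ingestion, not during it."""
    CURATED_DIR.mkdir(exist_ok=True)
    target = CURATED_DIR / raw_file.name
    with raw_file.open() as src, target.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row.get("status") == "ACTIVE":  # illustrative filter condition
                writer.writerow(row)
    return target
```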
9. Use DNS rather than IP Address:
A DR switchover is a periodic exercise performed on the source database side to divert traffic from the production box to the disaster recovery (DR) box. If you connect to the database using its physical IP address, the connection will fail after a switchover, because the primary database will have moved. Therefore, use a DNS (Domain Name System) name instead of the physical IP address in your connection string. This ensures that the feed always points to the database that is up and running.
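A hedged sketch of the difference in the connection string; the host names, port, user and database name are illustrative assumptions, and the DNS alias is assumed to be repointed to the DR box during a switchover.

```python
def dsn(host: str, port: int = 5432, db: str = "corebanking") -> str:
    """Build a connection string; pass a DNS alias, not a physical IP."""
    return f"postgresql://etl_user@{host}:{port}/{db}"

# Fragile: points at one physical box and breaks after a DR switchover.
print(dsn("10.12.34.56"))

# Resilient: the alias is repointed to the DR box during a switchover,
# so the feed keeps connecting to whichever database is up and running.
print(dsn("corebanking-db.example.internal"))
```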
10. Ingest once, consume many times:
Instead of ingesting the same data many times, ingest it once into a staging area and consume it from there for all use cases. This conserves a lot of space and eliminates data redundancy.
-----------------------------------------------------------------------------------------------------------------------
You may already know most of these points; however, I have attempted to collate them so that they serve as a ready reckoner. Thanks for reading. If there is anything you want me to add or amend, please share your views in the comments section. I have limited this article to batch and structured data.
The content of this article is purely my own view and in no way reflects the views of my current or earlier organizations or vendor partners.
--------------------------------------------------------------------------------------------------------------------
Image Credit:
Clock: Photo by Andrey Grushnikov from Pexels
Traffic jam: Photo by Oleksandr Pidvalnyi from Pexels
Railway track: https://www.pexels.com/photo/railroad-tracks-in-city-258510/