What if you don't know the ground truth?

Sujith Chandrasekaran
Jul 1, 2022
3 min read

John, a class 12 student, has studied well, appeared for the revision tests several times, and improved his marks in internal exams. As a result, he is 100% confident that he can score 95% or above in the final board exam. This confidence is based on the assumption that all questions will come from the syllabus he has prepared.

He appeared for the exam, and the board announced that it would publish the results in six months. Most students waited for the results before starting work.

However, a Fortune 500 company hired John based on his internal marks, even though the board results were still awaited. So, what assumption is the organisation making here?

The assumption is that his score will be more or less the same as in the internal exams, as long as the syllabus is the same. So before the ground truth is available, i.e. before the board exam results are published, the organisation hires him based on how many questions came from the syllabus John prepared. As all the questions came from the syllabus, John is expected to score 95% in the board exam, in line with his internal test results.

Why are we discussing this here? Because this is what we do in machine learning as well.

The ground truth is often not immediately available to assess a model deployed in production. For example, a fraud detection model classifies customers as fraudulent or not. But whether they are actually fraudulent is not known until the bank's compliance team investigates them, which usually takes six months to a year. This means you can assess the model's performance only after six months. But can we wait until then? What if the model is not performing well in production? If it is not, you would potentially have made incorrect decisions throughout those six months. Isn't that a disaster? So what can we do? The answer is to check whether the data in production follows a pattern similar to that of the training dataset. It is like checking whether the board exam question paper contains questions from outside the syllabus.
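One way to make this check concrete is to compare the distribution of a feature in training with its distribution in production. The sketch below uses the Population Stability Index (PSI), which is one common choice and not something this article prescribes. The function name psi, the ten equal-width bins, and the small floor for empty bins are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def psi(train: Sequence[float], prod: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of one numeric feature (illustrative sketch)."""
    lo, hi = min(train), max(train)
    # Equal-width bins over the training range; fall back to 1.0 for a constant feature.
    width = (hi - lo) / bins or 1.0

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so production values outside the training range land in the edge bins.
            counts[max(0, min(bins - 1, idx))] += 1
        # Assumed floor of 1e-6 so that empty bins do not cause log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = shares(train), shares(prod)
    return sum((b - a) * math.log(b / a) for a, b in zip(p, q))


# Made-up data: production values are the training values shifted upwards.
train_amounts = [i / 100 for i in range(1000)]
prod_amounts = [x + 5 for x in train_amounts]
print(psi(train_amounts, train_amounts))  # identical samples give 0.0
print(psi(train_amounts, prod_amounts))   # a shifted sample gives a clearly positive score
```

A score near zero suggests the question paper is still "within the syllabus"; the larger the score, the more the production data has moved away from what the model saw in training.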

If you find that the data in production is different, the data has drifted. Depending on how big the drift is, you decide whether to redo the feature engineering and retrain, or to rebuild the model completely from scratch. By the way, why do we have data drift?
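The decision above can be sketched as a simple mapping from a drift score to an action. The cut-offs of 0.1 and 0.25 are often quoted as rules of thumb for a PSI-style score; they are assumptions here, not values from this article, and the function name drift_action is hypothetical. In practice you would tune them for your own use case.

```python
def drift_action(score: float, minor: float = 0.1, major: float = 0.25) -> str:
    """Map a drift score to a suggested action (thresholds are assumed, not prescribed)."""
    if score < minor:
        # Little or no drift: the production data still resembles the training data.
        return "keep the model"
    if score < major:
        # Moderate drift: the same model design may still work on refreshed data.
        return "redo feature engineering and retrain"
    # Large drift: the original modelling assumptions may no longer hold.
    return "rebuild the model from scratch"


for score in (0.05, 0.18, 0.60):
    print(score, "->", drift_action(score))
```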

There are broadly three high-level causes of data drift.

1. Sampling bias: If your training dataset is not a complete reflection of the population, then you will find data drift and a potential drop in performance very soon in production.

2. Non-stationary environment: If you are predicting sales, you should be mindful that seasonal fluctuations can impact future sales. Suppose your training data was collected in the first half of the year, and the model is deployed to predict sales for the entire year. In that case, it will encounter significant data drift in the second half due to major festivals and changes in customers' purchasing patterns.

3. Change in market conditions: This is the most crucial reason for data drift. Innovation, new product development, the digital revolution, lifestyle changes, and government policies affect how we do business, learn, purchase, bank, travel, etc. Therefore, the data is bound to change over time.

Whatever the reason, high data drift means a strong likelihood that the model's performance is compromised. But how often should we retrain or rebuild? It depends on many things, but here are the factors you should consider.

If the data drifts faster than the lag between the prediction time and the time at which the ground truth is known, it is risky to deploy the model in production.

But if the data drifts more slowly than the lag, retraining about once a year is usually a good frequency.
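The two rules above compare how quickly the data drifts with how long the ground truth takes to arrive. A minimal sketch, assuming you can estimate a typical time for the data to drift noticeably (the hypothetical drift_period below) and that a tie is treated as not risky, which the rules above do not cover:

```python
from datetime import timedelta


def deployment_advice(drift_period: timedelta, label_lag: timedelta) -> str:
    """Compare the time the data takes to drift with the lag before ground truth arrives."""
    if drift_period < label_lag:
        # The data changes before the previous predictions can even be verified.
        return "risky to deploy: data drifts faster than the ground truth arrives"
    # Assumption: equal values are treated the same as slower drift.
    return "deploy and retrain periodically, for example about once a year"


# Fraud example: ground truth takes roughly six months; the drift periods are made up.
print(deployment_advice(timedelta(days=90), timedelta(days=180)))
print(deployment_advice(timedelta(days=365), timedelta(days=180)))
```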

In some cases, the ground truth is available almost immediately, within minutes. For instance, a recommendation engine recommends something, and if the customer purchases based on the recommendation, the ground truth is generated immediately. Does that mean we need to assess the performance every minute? Well, that is overkill; maybe you can set a minimum interval and retrain the model once a day or once a week.
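The minimum interval idea can be sketched as a small guard that runs whenever new ground truth arrives. The function name should_retrain and the default of one day are assumptions chosen for illustration; a weekly schedule would simply pass timedelta(weeks=1).

```python
from datetime import datetime, timedelta


def should_retrain(last_trained: datetime, now: datetime,
                   min_interval: timedelta = timedelta(days=1)) -> bool:
    """Return True only if at least min_interval has passed since the last retraining."""
    return now - last_trained >= min_interval


last_trained = datetime(2022, 7, 1, 9, 0)
# New purchases arrive minutes later, but that is too soon to retrain.
print(should_retrain(last_trained, datetime(2022, 7, 1, 9, 5)))   # False
# A day later the guard lets the retraining job run.
print(should_retrain(last_trained, datetime(2022, 7, 2, 9, 0)))   # True
```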

I hope you enjoyed reading this article. Please like, share, and comment.

Views are personal.

Image Credit:

Photo by Eren Li: https://www.pexels.com/photo/hispanic-girl-whispering-secret-on-ear-of-friend-7168996/
Banner created using canva.com

References:

https://en.wikipedia.org/wiki/Ground_truth
https://www.oreilly.com/library/view/introducing-mlops/9781492083283/
