The devil is in the details — How your company collects data will determine your success in implementing Machine Learning

Jul 20, 2021

Monday comes around, and since you are back to working from the office, you pull out Waze and let the app dictate the best route to avoid traffic. Unfortunately, halfway through your commute, you find yourself exiting the highway on a reroute that Waze never alerted you to, adding 20 valuable minutes to your drive to work and, most likely, ruining your day. So naturally, you take your anger out on the app and how incompetent it’s been lately. However, the devil is in the details, buried deep in the company’s servers, where data is constantly being gathered.

On the surface, Waze uses data fed by its users, who constantly report what is happening on the road, to predict the best routes for reaching your destination as seamlessly as possible. Under the hood, though, the process involves a complex machine learning system of interconnected algorithms meant to ensure that no reroute, traffic jam, or road hazard is left out. But what happens if your Monday commute misfortune was due to a minuscule dataset with the wrong formatting?

Let’s look at this with a technical eye. First, even though the company’s data is represented in one large table of rows and columns, the variables in the table may have different data types.

Some variables may be numeric, such as ranks, rates, or percentages. Other variables may be names, categories, or labels represented with characters or words, and some may be binary, illustrated with 0 and 1 or True and False. (Brownlee)

The problem is that machine learning algorithms, at their core, operate on numeric data. They take numbers as input and predict a number as output. Therefore, raw data must be transformed before training, evaluating, and using machine learning models. Inconsistency within datasets is a problem many companies face: they lack the quality data needed to run predictive models properly.
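To make the point about numeric input concrete, here is a minimal sketch in plain Python of how a table with mixed data types could be turned into numbers. The rows, column names, and helper functions are hypothetical and invented for illustration; they are not Waze’s data or any library’s API. Numeric values pass through, True/False become 1 and 0, and text labels are one-hot encoded.

```python
from typing import Dict, List, Union

Value = Union[str, int, float, bool]

# Hypothetical road-report rows; names and values are invented for illustration.
rows: List[Dict[str, Value]] = [
    {"avg_speed_kmh": 42.5, "road_type": "highway", "hazard_reported": True},
    {"avg_speed_kmh": 18.0, "road_type": "street", "hazard_reported": False},
    {"avg_speed_kmh": 61.0, "road_type": "highway", "hazard_reported": False},
]


def build_categories(rows: List[Dict[str, Value]], column: str) -> List[str]:
    """Collect the distinct labels of a categorical column, in sorted order."""
    return sorted({str(row[column]) for row in rows})


def encode_row(row: Dict[str, Value], categories: Dict[str, List[str]]) -> List[float]:
    """Turn one mixed-type row into a flat list of floats."""
    features: List[float] = []
    for column, value in row.items():
        if isinstance(value, bool):
            # Check bool first: in Python, bool is a subclass of int.
            features.append(1.0 if value else 0.0)
        elif isinstance(value, (int, float)):
            features.append(float(value))
        else:
            # One-hot encode a text label: one slot per known category.
            features.extend(
                1.0 if value == label else 0.0 for label in categories[column]
            )
    return features


categories = {"road_type": build_categories(rows, "road_type")}
encoded = [encode_row(row, categories) for row in rows]

for vector in encoded:
    print(vector)
```

Each row comes out as a list of four numbers: the speed, one slot per road type, and the hazard flag. Real projects usually rely on a library for this step, but the idea is the same.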

There is a high possibility that your company is trying to introduce or has already introduced machine learning to gain a competitive advantage. But before you jump on the ML bandwagon, you need to clearly understand the formatting necessary for the prediction process, the challenges you may face along the way, and the algorithm’s effectiveness in running feedback loops over time.

Set yourself up for success

In a predictive modeling project, machine learning algorithms learn a mapping from the input variables fed into the model to a target variable. And the rule of thumb for these projects lies in a critical detail: most of the effort spent on each data project goes into data preparation.

Data preparation can make or break a model’s predictive ability. Different models have different sensitivities to the type of predictors in the model

Applied Predictive Modeling

How well an algorithm performs is directly related to the data supplied to train it. Jason Brownlee, Ph.D. in Artificial Intelligence, summarized this as “garbage in, garbage out.” Sometimes data is easy to find, for example, in public sources such as maps or weather information. Users of a product like, say, an Apple Watch are also willing to let the company gather their data if they benefit from doing so: they allow Apple to track their body’s performance through the device in exchange for data they can use to improve their overall health.

Brownlee also notes that because machine learning algorithms have been around for many years and are routine for the most part, what varies from project to project is the specific data used in the modeling.

Data quality is a common problem in data management since dirty data often leads to inaccurate data analytics results and incorrect business decisions.

Data Cleaning

So even though there is a plethora of sources to extract data from, the key to setting yourself up for success when implementing predictive machine learning models is to know your data and ensure it is well-formatted.
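As a rough illustration of what “well-formatted” can mean in practice, the sketch below cleans a handful of records in plain Python. The records, field names, and the 0 to 200 km/h validity range are assumptions made up for this example. It normalizes whitespace and capitalization, rejects values that cannot be parsed or are out of range, and drops rows that turn out to be duplicates once cleaned.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical raw records; field names and values are invented for illustration.
raw_records: List[Dict[str, str]] = [
    {"user_id": " 101 ", "city": "bogota", "speed_kmh": "42.5"},
    {"user_id": "102", "city": "Bogota ", "speed_kmh": "n/a"},    # unparseable speed
    {"user_id": "101", "city": "BOGOTA", "speed_kmh": "42.5"},    # duplicate once cleaned
    {"user_id": "103", "city": "Medellin", "speed_kmh": "-5"},    # impossible speed
]


def parse_speed(text: str) -> Optional[float]:
    """Return the speed as a float, or None if it is missing or implausible."""
    try:
        speed = float(text.strip())
    except ValueError:
        return None
    # The 0-200 km/h range is an assumed sanity check, not a real-world rule.
    return speed if 0.0 <= speed <= 200.0 else None


def clean(records: List[Dict[str, str]]) -> Tuple[List[Dict[str, object]], Dict[str, int]]:
    """Normalize formatting, reject invalid rows, and drop duplicates."""
    cleaned: List[Dict[str, object]] = []
    seen = set()
    rejected = {"invalid": 0, "duplicate": 0}
    for record in records:
        speed = parse_speed(record["speed_kmh"])
        if speed is None:
            rejected["invalid"] += 1
            continue
        user_id = int(record["user_id"].strip())
        city = record["city"].strip().title()
        key = (user_id, city, speed)
        if key in seen:
            rejected["duplicate"] += 1
            continue
        seen.add(key)
        cleaned.append({"user_id": user_id, "city": city, "speed_kmh": speed})
    return cleaned, rejected


cleaned_records, rejected_counts = clean(raw_records)
print(cleaned_records)
print(rejected_counts)
```

Counting what was rejected, and why, is a deliberate choice here: a sudden jump in invalid rows is often the first sign that something upstream has changed its format.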

Then and Now

According to a survey by CrowdFlower, data preparation and cleaning take up about 60% of the time data scientists and data analytics teams have. That figure does not even include the time needed to collect and aggregate the data required for a prediction.

New tools like Datagran have emerged to take care of this issue, giving valuable time and money back to companies so they can focus on the problem at hand. Today, a company may have multiple data sources scattered across different tools, such as a CRM and servers. With Datagran, you can centralize various streams into a single pipeline for prediction.
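To show the general idea of centralizing scattered sources, here is a small sketch in plain Python. It is not Datagran’s API or how its pipelines work internally; the two sources, their fields, and the `centralize` function are hypothetical. It combines CRM rows and server events into one table keyed by customer, which is the shape a prediction step would typically consume.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical sources; field names and values are invented for illustration.
crm_rows = [
    {"customer_id": 1, "plan": "pro"},
    {"customer_id": 2, "plan": "free"},
]
server_events = [
    {"customer_id": 1, "event": "login"},
    {"customer_id": 1, "event": "purchase"},
    {"customer_id": 3, "event": "login"},
]


def centralize(crm_rows: List[Dict], server_events: List[Dict]) -> List[Dict]:
    """Combine both sources into one row per customer."""
    # Count how many events each customer generated on the server side.
    event_counts: Dict[int, int] = defaultdict(int)
    for event in server_events:
        event_counts[event["customer_id"]] += 1

    # Index the CRM data by customer so it can be looked up directly.
    plans = {row["customer_id"]: row["plan"] for row in crm_rows}

    combined = []
    # Keep every customer seen in either source; missing values stay explicit.
    for customer_id in sorted(set(plans) | set(event_counts)):
        combined.append(
            {
                "customer_id": customer_id,
                "plan": plans.get(customer_id),              # None if not in the CRM
                "event_count": event_counts.get(customer_id, 0),
            }
        )
    return combined


combined_rows = centralize(crm_rows, server_events)
for row in combined_rows:
    print(row)
```

Customer 3 appears in the server events but not in the CRM, so the combined row keeps an explicit `None` for the plan instead of silently dropping the customer. Gaps like that are exactly what the cleaning step needs to see.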

Close the loop

So you have taken care of data quality, and you are confident your company has the correct information to build predictions that will help reach your business’s goals. But how do you tackle the rest of the process?

You may be struggling to begin your journey with machine learning projects, and you are not alone. Some of the most common issues in companies building machine learning are time-consuming deployment, overestimating result delivery, issues with data security, lack of Machine Learning experts, expensive deployment, and data unavailability.

Today, Datagran, the first tool to democratize machine learning, is offering a hand to companies that want to introduce the technology without going over budget on expensive talent or worrying about data security. With its pipeline tool, teams can now build machine learning models and put them into production with little to no code. Moreover, it is easy to use and designed to eliminate tedious processes that only a fraction of the company could previously carry out.

While data quality is key to executing successful predictive models, building and deploying those models is just as essential. Unfortunately, VentureBeat reports that 87% of data science projects never make it to production. In a previous blog post, we went into detail about this. Even though building workflows is not as challenging as deploying them, problems start piling up as workflows grow in size, because their complex and sophisticated structure creates a whole new set of challenges. The most significant chunk of work for data scientists is the set of tasks that come after building and optimizing a model. A company with data cleaning, processing, testing, and algorithms in place has only 1/3 of the pie assembled. The rest of the pie, which is essential because you can’t serve 1/3 of a pie, is deploying the results of the workflow to their final destination.

References:

Mendez, Melissa D. “Challenges of Deploying ML Models.” Datagran’s Blog, Datagran, 11 Feb. 2021, blog.datagran.io/posts/challenges-of-deploying-ml-models.

Lecture 1: Machine Learning Basics — University of Waterloo. http://wavelab.uwaterloo.ca/wp-content/uploads/2017/04/Lecture_1-1.pdf

Brownlee, Jason. “Why Data Preparation Is So Important in Machine Learning.” Why Data Preparation Is So Important, Machine Learning Mastery, 15 June 2020, machinelearningmastery.com/data-preparation-is-important/.

Written by Datagran