The dead simple definition of a label is that it is the end goal of the system's prediction.
So say you have a data set that looks something like this. It represents a small portion of a fleet management database where we have all of these trucks. All of the columns on the left-hand side (year, make, mileage, fuel, repairs, and services) are what are called features. Those are attributes of the vehicles.
Now, in the program that this was used to build, we were trying to see if the status should be active or retired for other vehicles. So this is all training data, and then we would pump in our new vehicles and see if each vehicle should be active or retired based on this training data set.
So on the left-hand side are all the features, and on the right-hand side, that Status column is the label. That was the end goal. It was what we were looking at to see what the prediction should be.
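To make the features/label split concrete, here is a minimal sketch in plain Python. The rows are invented stand-ins for the fleet table (the real values aren't given here), and `split_features_and_label` is a hypothetical helper name, not something from the original program.

```python
# Hypothetical rows standing in for the fleet table; all values are made up.
fleet = [
    {"year": 2009, "make": "Ford", "mileage": 210000, "fuel": "diesel",
     "repairs": 14, "services": 30, "status": "retired"},
    {"year": 2018, "make": "Volvo", "mileage": 60000, "fuel": "diesel",
     "repairs": 2, "services": 9, "status": "active"},
    {"year": 2012, "make": "Mack", "mileage": 175000, "fuel": "diesel",
     "repairs": 11, "services": 24, "status": "retired"},
]

LABEL_COLUMN = "status"  # the column we want the system to predict


def split_features_and_label(rows, label_column):
    """Return (features, labels): each row minus the label, plus the label values."""
    features = [
        {key: value for key, value in row.items() if key != label_column}
        for row in rows
    ]
    labels = [row[label_column] for row in rows]
    return features, labels


features, labels = split_features_and_label(fleet, LABEL_COLUMN)
print(sorted(features[0]))  # the feature column names
print(labels)               # the label for each truck
```

Everything in `features` is what the algorithm gets to look at; `labels` is the answer it is trying to reproduce.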
So in this specific example, this was used for a decision tree algorithm, and the goal was to see if a truck should be retired or if it was fine to be active. The features on the left-hand side are the attributes that helped make up the prediction, and the label on the right-hand side is the end goal that we were able to look at to tell our algorithm that one truck was active and another was retired.
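As a rough illustration of how labeled rows drive a tree, here is a sketch of a decision stump, which is a decision tree with only a single split. This is not the original program: it uses one feature (mileage), invented training values, and a simple "fewest misclassified rows" rule for choosing the split, where a full decision tree would consider many features and splits.

```python
# Hypothetical training data: (mileage, status). All values are invented.
training = [
    (40000, "active"),
    (65000, "active"),
    (90000, "active"),
    (160000, "retired"),
    (185000, "retired"),
    (220000, "retired"),
]


def predict(threshold, mileage):
    """Apply the one-split rule: above the threshold means retired."""
    return "retired" if mileage > threshold else "active"


def fit_stump(rows):
    """Pick the mileage threshold that misclassifies the fewest training rows."""
    mileages = sorted(mileage for mileage, _ in rows)
    # Candidate thresholds are the midpoints between neighbouring mileages.
    candidates = [(a + b) / 2 for a, b in zip(mileages, mileages[1:])]

    def errors(threshold):
        # Count rows where the rule disagrees with the label.
        return sum(predict(threshold, m) != status for m, status in rows)

    return min(candidates, key=errors)


threshold = fit_stump(training)
print(threshold)
print(predict(threshold, 200000))  # a new high-mileage truck
print(predict(threshold, 50000))   # a new low-mileage truck
```

The label is what makes the fitting step possible: without the status on each training row, there would be nothing to count errors against.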
So in this case our status was a text attribute of active or retired. In many other cases your label will simply be a zero or a one. You could think of another case study: if you were to build out some type of sales prediction tool, on the left-hand side all of your features would be attributes of your historical customers or your leads who didn't purchase. And then on the right-hand side you would simply have a column which would be a label, something like did purchase or didn't purchase, or a 1 or 0. So it is the end goal of the prediction.
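Turning a text label into a zero or a one can be sketched like this. The mapping, the lead fields (`visits`, `opened_email`, `outcome`), and the values are all assumptions for illustration; which class gets the 1 is a choice you make, as long as you stay consistent.

```python
# Hypothetical mapping from text labels to numbers.
STATUS_TO_INT = {"retired": 0, "active": 1}

statuses = ["active", "retired", "retired", "active"]
encoded_statuses = [STATUS_TO_INT[status] for status in statuses]
print(encoded_statuses)

# The same idea for the sales example, with made-up leads.
leads = [
    {"visits": 5, "opened_email": True, "outcome": "did purchase"},
    {"visits": 1, "opened_email": False, "outcome": "didn't purchase"},
    {"visits": 3, "opened_email": True, "outcome": "did purchase"},
]
# 1 means the lead purchased, 0 means they did not.
purchase_labels = [1 if lead["outcome"] == "did purchase" else 0 for lead in leads]
print(purchase_labels)
```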
Now, one last item that I want to leave you with when it comes to labels is understanding that labels are one of the key elements that determine if an algorithm falls into the supervised or unsupervised learning category. Most algorithms that you're going to be dealing with are either supervised or unsupervised. If they have a label, that means they fall into the supervised learning category, and the reason is that you, as the developer, are telling the algorithm what is good and what's bad.
Because you're the one giving it that label, and you're working with a labeled data set like we have in our example, the algorithm is supervised. It's supervised by you because you're telling it what the end goal is.
In an unsupervised learning algorithm, such as a clustering algorithm, you don't have the label. In other words, if you wanted to take this data set and use it in an unsupervised learning algorithm, you would remove that Status column, because you're not going to be determining if something is good or bad, if it's active or retired. You simply want the algorithm to look at all the data, analyze it, and then place the different records inside of their own clusters.
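Here is a small sketch of that unsupervised setup: drop the Status column, then let a clustering routine group the records. The rows are invented, and `two_means_1d` is a deliberately tiny, hypothetical k-means-style routine (two clusters, mileage only), standing in for whatever clustering algorithm you would actually use.

```python
# Hypothetical rows; all values are made up.
fleet = [
    {"mileage": 40000, "repairs": 1, "status": "active"},
    {"mileage": 210000, "repairs": 14, "status": "retired"},
    {"mileage": 65000, "repairs": 3, "status": "active"},
    {"mileage": 185000, "repairs": 11, "status": "retired"},
]

# Unsupervised setup: remove the label so the algorithm never sees it.
unlabeled = [
    {key: value for key, value in row.items() if key != "status"}
    for row in fleet
]


def two_means_1d(values, iterations=10):
    """Tiny two-cluster k-means on one number per record."""
    centers = [min(values), max(values)]  # start at the two extremes
    assignments = []
    for _ in range(iterations):
        # Assign each value to its nearest center.
        assignments = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Move each center to the mean of the values assigned to it.
        for cluster in (0, 1):
            members = [v for v, a in zip(values, assignments) if a == cluster]
            if members:
                centers[cluster] = sum(members) / len(members)
    return assignments


clusters = two_means_1d([row["mileage"] for row in unlabeled])
print(clusters)
```

The cluster numbers are arbitrary group ids, not active or retired. The algorithm only groups similar records; it has no notion of which group is good or bad.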
And so the reason it's important to understand what labels are is that they determine which type of algorithm you're going to be implementing, and it's all driven by the data set that you're working with.