In supervised learning, a machine learning algorithm is trained to correctly respond to questions related to feature vectors. To train an algorithm, the machine is fed a set of feature vectors and an associated label. Labels are typically provided by a human annotator, and represent the right "answer" to a given question. The learning algorithm analyzes feature vectors and their correct labels to find internal structures and relationships between them. Thus, the machine learns to correctly respond to queries.
As an example, an intelligent real estate application might be trained with feature vectors including the size, number of rooms, and respective age for a range of houses. A human labeler would label each house with the correct house price based on these factors. By analyzing that data, the real estate application would be trained to answer the question: "How much money could I get for this house?"
After the training process is over, new input data will not be labeled. The machine will be able to correctly respond to queries, even for unseen, unlabeled feature vectors.
In unsupervised learning, the algorithm is programmed to predict answers without human labeling, or even questions. Rather than predetermine labels or what the results should be, unsupervised learning harnesses massive data sets and processing power to discover previously unknown correlations. In consumer product marketing, for instance, unsupervised learning could be used to identify hidden relationships or consumer grouping, eventually leading to new or improved marketing strategies.
This article focuses on supervised machine learning, which is the most common approach to machine learning today.
Supervised machine learning
All machine learning is based on data. For a supervised machine learning project, you will need to label the data in a meaningful way for the outcome you are seeking. In Table 1, note that each row of the house record includes a label for "house price." By correlating row data to the house price label, the algorithm will eventually be able to predict market price for a house not in its data set (note that house-size is based on square meters, and house price is based on euros).
Table 1. House records
|Size of house||Number of rooms||Age of house||Estimated cost of house|
|90 m2 / 295 ft||2 rooms||23 years||249,000 €|
|101 m2 / 331 ft||3 rooms||n/a||338,000 €|
|1330 m2 / 4363 ft||11 rooms||12 years||6,500,000 €|
At early stages, you will likely label data records by hand, but you could eventually train your program to automate this process. You've probably seen this with email applications, where moving email into your spam folder results in the query "Is this spam?" When you respond, you are training the program to recognize mail that you don't want to see. The application's spam filter learns to label future mail from the same source, or bearing similar content, and dispose of it.
Sign up for Computerworld eNewsletters.