With all the different machine learning algorithms (KNN, Naive Bayes, k-means, etc.), how can you choose which one to use?
First, you need to consider your goal. What are you trying to get out of this? (Do you want a probability that it might rain tomorrow, or do you want to find groups of voters with similar interests?) What data do you have or can you collect? Those are the big questions. Let’s talk about your goal.
- If you’re trying to predict or forecast a target value, then you need to look into supervised learning. If not, then unsupervised learning is the place you want to be.
- If you’ve chosen supervised learning, what’s your target value? Is it a discrete value like Yes/No, 1/2/3, A/B/C, or Red/Black/Yellow?
- If so, then you want to look into classification.
- If the target value can take any value in a continuous range, say from 0.00 to 100.00 or from -999 to 999, then you need to look into regression.
- If you’re not trying to predict a target value, then you need to look into unsupervised learning.
- Are you trying to fit your data into some discrete groups? If so and that’s all you need, you should look into clustering.
- Do you need a numerical estimate of how strongly each item fits into each group? If so, you should probably look into a density estimation algorithm.
The rules we’ve given here should point you in the right direction but are not unbreakable laws.
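The questions above can be written down as a small decision function. This is only a sketch of the rules of thumb in this section; the function name and its parameters are made up for illustration and are not part of any library.

```python
def suggest_family(predict_target, target_is_discrete=None, need_fit_strength=False):
    """Map the goal questions to a family of algorithms (a rule of thumb, not a law)."""
    if predict_target:
        # Supervised learning: the kind of target value decides the family.
        if target_is_discrete is None:
            raise ValueError("say whether the target value is discrete or continuous")
        return "classification" if target_is_discrete else "regression"
    # No target value: unsupervised learning.
    return "density estimation" if need_fit_strength else "clustering"


print(suggest_family(True, target_is_discrete=True))    # discrete target such as Yes/No
print(suggest_family(True, target_is_discrete=False))   # target anywhere from 0.00 to 100.00
print(suggest_family(False))                            # only need discrete groups
print(suggest_family(False, need_fit_strength=True))    # need a strength of fit per group
```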
The second thing you need to consider is your data.
You should spend some time getting to know your data, and the more you know about it, the better you’ll be able to build a successful application. Things to know about your data are these:
- Are the features nominal or continuous?
- Are there missing values in the features? If so, why are they missing?
- Are there outliers in the data?
- Are you looking for a needle in a haystack, something that happens very infrequently?
All of these characteristics of your data can help you narrow the algorithm selection process.
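Most of these questions can be answered with a few lines of code before you pick an algorithm. The sketch below assumes each feature arrives as a Python list with None marking a missing value; the helper names and the outlier rule (more than five median absolute deviations from the median) are illustrative choices, not a standard.

```python
from collections import Counter
from statistics import median


def profile_column(values):
    """Summarize one feature column; None marks a missing value."""
    present = [v for v in values if v is not None]
    summary = {"missing": len(values) - len(present)}
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["kind"] = "continuous"
        # Assumed outlier rule: more than 5 median absolute deviations from the median.
        med = median(present)
        mad = median([abs(v - med) for v in present])
        summary["outliers"] = [v for v in present if mad and abs(v - med) > 5 * mad]
    else:
        summary["kind"] = "nominal"
        summary["counts"] = dict(Counter(present))
    return summary


def rarest_class_share(labels):
    """Fraction of rows in the rarest class; a tiny value means needle in a haystack."""
    counts = Counter(labels)
    return min(counts.values()) / len(labels)


# Made-up measurements for illustration only.
print(profile_column([3.1, 2.9, 3.4, None, 3.0, 41.0]))
print(profile_column(["red", "black", "red"]))
print(rarest_class_share(["ok"] * 98 + ["fraud"] * 2))
```

Code like this tells you that a value is missing or unusual, not why. That part still needs your knowledge of how the data was collected.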
Even with the choice of algorithms narrowed, there’s no single answer to what the best algorithm is or what will give you the best results. You’re going to have to try different algorithms and see how they perform. There are other machine learning techniques that you can use to improve the performance of a machine learning algorithm. The relative performance of two algorithms may change after you process the input data.
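To see how processing the input can change which algorithm looks better, here is a small sketch that compares a one-nearest-neighbor classifier with a majority-class baseline, first on raw features and then after rescaling each feature to a common range. The data is invented so that the first feature has a large scale but carries no information about the label; the function names are hypothetical.

```python
from collections import Counter
import math


def one_nn(train, point):
    """Label of the training row closest to point (Euclidean distance)."""
    return min(train, key=lambda row: math.dist(row[0], point))[1]


def majority(train, point):
    """Baseline: always answer with the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(predict, train, test):
    return sum(predict(train, x) == y for x, y in test) / len(test)


def min_max_scaler(train):
    """Build a function that rescales each feature using the training min and max.

    Assumes every feature takes at least two distinct values in the training set.
    """
    cols = list(zip(*(x for x, _ in train)))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return lambda x: tuple((v - l) / (h - l) for v, l, h in zip(x, lo, hi))


# Invented data: the label depends only on the second, small-scale feature.
train = [((1000, 1.0), "A"), ((5000, 1.2), "A"), ((9000, 0.9), "A"),
         ((2000, 5.0), "B"), ((6000, 5.2), "B"), ((8000, 4.8), "B")]
test = [((2100, 1.1), "A"), ((5900, 1.0), "A"),
        ((8900, 5.1), "B"), ((1100, 4.9), "B")]

scale = min_max_scaler(train)
scaled_train = [(scale(x), y) for x, y in train]
scaled_test = [(scale(x), y) for x, y in test]

for name, tr, te in (("raw", train, test), ("scaled", scaled_train, scaled_test)):
    print(name, "1-NN:", accuracy(one_nn, tr, te), "baseline:", accuracy(majority, tr, te))
```

On this made-up data the nearest-neighbor classifier does worse than the baseline on the raw features and better after rescaling, because the large-scale feature no longer dominates the distance.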
The algorithms differ, but there are some common steps you need to take with all of them when building a machine learning application.
Steps in developing a machine learning application
Our approach to understanding and developing an application using machine learning in this article will follow a procedure similar to this:
1. Collect data.
You could collect the samples by scraping a website and extracting data, or you could get information from an RSS feed or an API. You could have a device collect wind speed measurements and send them to you, or blood glucose levels, or anything you can measure. The number of options is endless. To save some time and effort, you could use publicly available data.
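As one example of collecting samples, the sketch below pulls items out of an RSS feed using Python’s standard XML parser. The feed text is a hypothetical stand-in; in practice it would come from an HTTP request to a real feed, which is left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, inlined so the example runs without a network.
FEED = """<rss version="2.0"><channel>
<title>Example weather feed</title>
<item><title>Wind 12 km/h</title><pubDate>Mon, 01 Jan 2024 08:00:00 GMT</pubDate></item>
<item><title>Wind 19 km/h</title><pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate></item>
</channel></rss>"""


def collect_samples(feed_xml):
    """Return one (publication date, title) pair per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("pubDate"), item.findtext("title"))
            for item in root.iter("item")]


for when, what in collect_samples(FEED):
    print(when, "|", what)
```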
2. Prepare the input data.
Once you have this data, you need to make sure it’s in a useable format. The benefit of settling on one standard format is that you can mix and match algorithms and data sources. You may need to do some algorithm-specific formatting here. Some algorithms need features in a special format, some algorithms can deal with target variables and features as strings, and some need them to be integers.
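A minimal sketch of this step, assuming the raw data is tab-delimited text with numeric features followed by a string label on each line. The layout and the function names are assumptions made for this example.

```python
def parse_rows(lines, delimiter="\t"):
    """Split each line into a list of float features and a trailing string label."""
    features, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        *values, label = line.split(delimiter)
        features.append([float(v) for v in values])
        labels.append(label)
    return features, labels


def encode_labels(labels):
    """Map string labels to integers for algorithms that need numeric targets."""
    codes = {name: i for i, name in enumerate(sorted(set(labels)))}
    return [codes[name] for name in labels], codes


raw = ["5.1\t3.5\tred", "4.9\t3.0\tblack", "", "6.2\t2.9\tred"]
features, labels = parse_rows(raw)
encoded, codes = encode_labels(labels)
print(features)
print(labels, encoded, codes)
```

Keeping the string labels alongside the integer codes lets you hand either form to an algorithm, depending on what it accepts.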
3. Analyze the input data.
Here you look at the data from the previous step. This could be as simple as looking at the data you’ve parsed in a text editor to make sure steps 1 and 2 are actually working and you don’t have a bunch of empty values. You can also look at the data to see if you can recognize any patterns or if there’s anything obvious, such as a few data points that are vastly different from the rest of the set. Plotting data in one, two, or three dimensions can also help. But most of the time you’ll have more than three features, and you can’t easily plot the data across all features at one time. You could, however, use dimensionality reduction methods to distill multiple dimensions down to two or three so you can visualize the data.
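Even without a plotting library, a rough one-dimensional picture is easy to produce. The sketch below bins a single feature and prints a text histogram; it covers only the one-dimensional case mentioned above, and the function names and the sample values are invented for illustration.

```python
def bin_counts(values, bins=5):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / span * bins), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return counts


def print_histogram(values, bins=5, width=40):
    """Print one bar of '#' characters per bin, scaled to the fullest bin."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    counts = bin_counts(values, bins)
    peak = max(counts)
    for i, count in enumerate(counts):
        left_edge = lo + span * i / bins
        print(f"{left_edge:8.2f} | {'#' * round(width * count / peak)} {count}")


wind_speeds = [3.1, 2.9, 3.4, 3.0, 3.3, 2.8, 41.0]  # made-up readings
print_histogram(wind_speeds)
```

A single value sitting alone in the last bin, far from everything else, is the kind of obviously different data point worth checking by hand.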
4. Select the data.
If you’re working with a production system and you know what the data should look like, or you trust its source, you can skip this step. This step takes human involvement, and for an automated system you don’t want human involvement. The value of this step is that it assures you that you don’t have garbage coming in.
5. Train the algorithm.
This is where the machine learning takes place. This step and the next step are where the “core” algorithms lie, depending on the algorithm. You feed the algorithm good clean data from the first two steps and extract knowledge or information. You often store this knowledge in a format that’s readily useable by a machine for the next two steps. In the case of unsupervised learning, there’s no training step because you don’t have a target value. Everything is used in the next step.
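As a concrete picture of extracting knowledge and storing it in a machine-readable form, the sketch below trains a nearest-centroid classifier and saves what it learned as JSON. The classifier is just one simple supervised example chosen for illustration, and the data and function names are made up.

```python
import json
import math


def train_centroids(features, labels):
    """Learn one mean vector (centroid) per class label."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def classify(model, x):
    """Predict the label whose centroid is closest to x."""
    return min(model, key=lambda y: math.dist(model[y], x))


model = train_centroids(
    [[1.0, 1.0], [1.0, 3.0], [8.0, 8.0], [10.0, 8.0]],
    ["low", "low", "high", "high"],
)
stored = json.dumps(model)     # the learned knowledge, ready to save to disk
restored = json.loads(stored)  # what a later step would load back
print(stored)
print(classify(restored, [2.0, 2.0]))
```

The training data is no longer needed once the centroids are stored; the later steps work from the saved model alone.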
6. Test the algorithm.
This is where the information learned in the previous step is put to use. When you’re evaluating an algorithm, you’ll test it to see how well it does. In the case of supervised learning, you have some known values you can use to evaluate the algorithm. In unsupervised learning, you may have to use some other metrics to evaluate the success. In either case, if you’re not satisfied, you can go back to the training step (step 5), change some things, and try testing again. Often the collection or preparation of the data may have been the problem, and you’ll have to go back to step 1.
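For supervised learning, a common way to use the known values is to hold some of them back and measure the error rate on them. The sketch below assumes rows of (feature, label) pairs and a toy one-feature threshold rule standing in for the trained algorithm; the names, the 25% test fraction, and the data are illustrative.

```python
import random


def split_holdout(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold back a fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def error_rate(predict, test_rows):
    """Fraction of held-out rows whose predicted label differs from the known label."""
    wrong = sum(predict(x) != y for x, y in test_rows)
    return wrong / len(test_rows)


def train_threshold(train_rows):
    """Toy training step: put a cutoff midway between the two classes seen in training."""
    top_low = max(x for x, y in train_rows if y == "low")
    bottom_high = min(x for x, y in train_rows if y == "high")
    cutoff = (top_low + bottom_high) / 2
    return lambda x: "high" if x >= cutoff else "low"


# Made-up data: values of 50 and above are labeled "high".
rows = [(v, "high" if v >= 50 else "low") for v in range(0, 100, 5)]
train_rows, test_rows = split_holdout(rows)
predict = train_threshold(train_rows)
print("held-out error rate:", error_rate(predict, test_rows))
```

The rows used for testing are never shown to the training step, so the error rate reflects how the rule does on data it has not seen.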
7. Use it.
Here you make a real program to do some task, and once again you see if all the previous steps worked as you expected. You might encounter some new data and have to revisit steps 1–6.