Some of the screenshots below are from the videos developed and presented by Ian Witten of the Machine Learning Group, University of Waikato, NZ. I do not have his permission to use them, but I cite him as the source of this information.
Earlier lesson takeaways
The Training Set and Test Set should be independent. Otherwise you cannot realistically test whether your classifier works, because you would be evaluating it on the same data it was trained on.
Statistical Mean, variance & standard deviation formulae.
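Those formulae translate directly into code. A minimal sketch in Python (my own illustration, not Weka's implementation), using the sample (n-1) denominator for variance:

```python
import math

def mean(xs):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(xs) / len(xs)

def variance(xs):
    """Sample variance: average squared deviation from the mean,
    using the n-1 (Bessel-corrected) denominator."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def std_dev(xs):
    """Sample standard deviation: square root of the variance."""
    return math.sqrt(variance(xs))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(data), variance(data), std_dev(data))
```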
The 10-fold cross-validation method breaks the data set into 10 equal pieces, uses 9 pieces to train and the 10th to test, then repeats the process with a different piece held out as the test set each time (the previous test piece returning to the training pool). It is a longer, iterative process, but it lets you cross-validate the results. There are subtleties about how you split the data into equal-sized sets: each set should have a consistent class balance (not one set all "no"s and another all "yes"s; aim for balance), which is known as stratification.
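A minimal sketch of how stratified folds could be built, assuming a hypothetical helper of my own (not Weka's code): each class's instances are dealt round-robin into the folds, so every fold keeps roughly the full data set's class proportions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=42):
    """Deal instance indices into k folds so each fold keeps roughly
    the same class proportions as the whole data set (stratification)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Round-robin dealing within each class keeps the folds balanced.
        for j, idx in enumerate(idxs):
            folds[j % k].append(idx)
    return folds

# Each cross-validation run trains on k-1 folds and tests on the held-out one.
labels = ["yes"] * 50 + ["no"] * 50
folds = stratified_folds(labels, k=10)
for test_fold in folds:
    train = [i for f in folds if f is not test_fold for i in f]
    # ... train a classifier on `train`, evaluate on `test_fold` ...
```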
The Experimenter allows you to run multiple machine learning algorithms (or variations of one algorithm) over multiple data sets. It has three tabs: the Setup tab, the Run tab and the Analyse tab. You can save setup configurations to a file, and you can save results to a file (e.g. an .arff or .csv file).
Select a NEW Experiment configuration (item 4 below).
In the Setup tab you can choose the data files to include, and you can select multiple data sets (bottom-left panel).
In the bottom-right panel you can choose the algorithms you want to test, and also the order in which to run them.
You can choose Experiment type and number of iterations too.
Once the setup is as you want it, go to the Run tab and hit Start; it will run the configurations from Setup. Then go to the Analyse tab.
In the Analyse tab, hit the EXPERIMENT button (top right) and then the PERFORM TEST button to get the results in the Test Output panel (bottom right).
In this setup the results are compared against the trees.J48 algorithm, as it was listed first. In the left-hand panel we set a significance level of 0.05 (5%). The other two methods we selected are then compared against trees.J48 to see which performs better; if a method falls outside the chosen significance level, an "*" marks it as having performed significantly worse. So, looking at the table below for the Iris dataset, rules.ZeroR has only 33.33% prediction accuracy compared with 94.73% for trees.J48 on the training set; it is significantly worse, hence the "*". rules.OneR, at 92.53% accuracy, falls within the selected 5% significance level (so it may perform better than trees.J48 on a different training set).
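Underneath, this comparison is a paired significance test on the per-fold results. Weka's Experimenter uses a corrected variant of the paired t-test, so the plain paired t-test below is only an illustrative sketch of the idea, with invented per-fold accuracies standing in for real results:

```python
import math

def paired_t_statistic(acc_a, acc_b):
    """t statistic for the per-fold accuracy differences between
    two classifiers evaluated on the same cross-validation folds."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

# Made-up per-fold accuracies over 10 folds, for illustration only.
j48   = [0.95, 0.93, 0.96, 0.94, 0.95, 0.92, 0.97, 0.94, 0.95, 0.93]
zeror = [0.33, 0.35, 0.32, 0.34, 0.33, 0.33, 0.34, 0.32, 0.33, 0.34]

t = paired_t_statistic(j48, zeror)
T_CRIT = 2.262  # two-tailed 5% critical value for 9 degrees of freedom
print("significantly different" if abs(t) > T_CRIT else "within significance level")
```

With differences this large the statistic far exceeds the critical value, which corresponds to the "*" Weka prints next to ZeroR.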
In the example above, the results are comparing the rules.ZeroR and rules.OneR against the trees.J48 algorithm. If we want to compare against the rules.OneR we can change the setup as below.
Also, if we want to change the row/column output so the rows are the different algorithms and the columns the datasets, we can change the configuration by switching Rows from "Dataset" to "Scheme" and Cols from "Scheme" to "Dataset".
Knowledge Flow Interface
This is similar to the Knime Node process. You set up a workflow then plug in a dataset and run it and visualise the results.
Select a node from the list and click in the blank area to place it there.
Right click on the node and choose configure to set up the node.
Add another node and configure it.
Right-click on the first node and connect it to the second by selecting dataset (in this example; instance in the last slide).
Connect nodes together
Create a workflow by connecting nodes together for 1/Dataset, 2/ Actions/Processes, 3/ output (Text visualiser/Graphics). Then hit the RUN icon and right click the Visualisers to show results in pop-up boxes.
Once you have the graphical display, you can choose to show it by changing X and Y axis attributes.
The workflow below is based on INSTANCE instead of dataset. This allows data to flow in continuously, so it could handle an effectively infinite dataset, as the data is not stored in memory. (A comment on output: I couldn't change the graph size; maybe I needed to save the results and modify the size in the saved file.)
Command Line Interface (CLI)
I would prefer to use the other options and would not use the CLI unless I had to, so I will not go into it. Refer to the video if you are interested; it discusses the CLI and also linking to a database.
Weka algorithms for Implementation. A comment on the tool
While going through the Data Mining with Weka and More Data Mining with Weka videos, which demonstrate using Weka's algorithms on datasets to predict outcomes, I found that after setting up and pressing the buttons, something happens and a % prediction is given. For some of the algorithms I cannot see what formula/algorithm is constructed that you could then use to build a predictor in, say, Python or some other programming tool such as Knime.
For Simple Linear Regression you can get the intercept and slope of the line used to predict the y variable, so you can plug the formula in for numeric data to get results.
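Recovering that intercept and slope is just the standard least-squares calculation; a minimal sketch of my own (not Weka's code):

```python
def simple_linear_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x.
    slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Points lying exactly on y = 1 + 2x, so the fit should recover that line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = simple_linear_regression(xs, ys)
print(intercept, slope)  # 1.0 2.0
```

Once you have these two numbers from Weka's output, predicting a new value is just `intercept + slope * x` in any language.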
In the More Data Mining with Weka videos, class 3 covers Classification Rules, Association Rules and Clustering; the second video discusses the rules.PART and rules.JRip methods of building decision rules. This is the first time I have seen something for nominal data that I believe I could implement, after testing on a training set and a test set. The demonstration in the lesson had a 74% probability of predicting the class correctly, working with 8 attributes to predict the class.
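What makes these rule learners implementable elsewhere is that their output is an ordered decision list: the first matching rule fires, with a default class at the end. A sketch with invented rules on invented weather-style attributes (not the actual rules from the lesson):

```python
def classify(instance):
    """Apply an ordered decision list of the shape PART/JRip produce:
    the first rule whose conditions match wins, otherwise the default
    class applies. These particular rules are made up for illustration."""
    if instance["outlook"] == "overcast":
        return "yes"
    if instance["humidity"] == "high" and instance["windy"]:
        return "no"
    return "yes"  # default rule

print(classify({"outlook": "sunny", "humidity": "high", "windy": True}))
```

Transcribing the rules Weka prints into a chain of if/elif tests like this (in Python, or as nodes in Knime) is one practical route from the learned model to a deployable predictor.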
So while Weka seems to be a good analysis/testing tool, I cannot see a practical way of easily turning its results into an implementation workflow. I think I would try to reproduce the algorithms in Knime to build something practical. There is the Knowledge Flow process, but I prefer Knime as it is easier to build with. So: test in Weka, build in Knime would be my approach to implementing some machine learning algorithms.
I am quite impressed with the variety of interfaces that Weka has. I like that you can start with the simple Explorer to explore your data and apply different filters and tests, then move on to the Experimenter for multiple tests and runs, and then build workflow nodes (like Knime) to set up processes and handle streaming data. The Command Line Interface is always a backstop if you have issues with an interface and just need the grunt rather than the visuals; I personally like the visuals, and I really like the flexibility of the interface.