Free WEKA machine learning algorithms for data mining tool


While exploring data analytics tools (Knime, Rapid Miner, FME, Orange, etc.), I kept coming across references to WEKA.

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. (Taken from the Weka web page.)

After coming across iNZight, which I mention here, I decided to have a look at WEKA. I found that there are some free (freemium) courses on using Weka for machine learning. These can be found here. The course is free, but if you pay $89 you get access to the course for life and can be graded. For free you get limited access to the course (7 weeks, although if you only got partway through it you could always enrol again as a different user to complete it). As I’m interested in the learning and not the marks, I’m happy with the free version.

The first course, Data Mining with Weka, is presented by Ian Witten of the University of Waikato in NZ. It is a practical course with optional quizzes that give you familiarity with working with Weka and the provided datasets (which are in a folder in the C:/program files/ directory where Weka is installed).

The step-by-step process is more practical than theoretical, and I personally like the teaching style. I am finding it a very good introduction to machine learning as well as to how Weka can be used. It is also easy to follow along with the videos for the practical exercises (so far). A bit of a pain is that the 2nd course is not being run at the moment and the 3rd course presumes exposure to the 2nd. I have since found YouTube links to the 2nd course videos (More Data Mining with Weka), the Advanced Data Mining with Weka videos, and the original Data Mining with Weka videos.

There is also a programme called MOA, also from the University of Waikato, for working with streaming data and larger data sets. I have not installed or played with it yet.

I also came across another series of machine learning videos from Bloomberg. They did not work for me, but you may find them useful.

Process

Most of the course above focused on the machine learning tools in the Preprocess and Classify tabs, looking at different algorithms and demonstrating their strengths and weaknesses. So, to date, I have only been playing with these tabs. The Preprocess tab loads your data and lets you view the data table with the Edit button. It also lets you apply filters to modify the dataset, which covers the data cleaning and transforming that may be needed for the machine learning algorithms to work effectively.

Weka’s native file format is .arff. You can import other files (e.g. .csv files) into Weka and then save them as .arff files.
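To give a feel for what an .arff file contains, here is a rough Python sketch that turns CSV text into ARFF text. It is only an illustration of the file layout (a relation name, one @attribute line per column, then the @data rows) and is not how Weka itself does the import. The function name `csv_to_arff` and the sample data are made up for this example, and it assumes simple column names with no spaces and no missing values.

```python
import csv
import io


def _is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="my_data"):
    """Convert CSV text (with a header row) into ARFF text.

    Hypothetical helper for illustration only: a column is treated as
    numeric if every value parses as a number, otherwise as nominal.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(_is_number(v) for v in values):
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attribute: list the distinct values in order of appearance.
            distinct = list(dict.fromkeys(values))
            lines.append("@attribute " + name + " {" + ",".join(distinct) + "}")

    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "outlook,temp,play\nsunny,85,no\nrainy,70,yes\n"
    print(csv_to_arff(sample, relation="weather"))
```

In practice it is simpler to let Weka do the conversion by opening the .csv and saving as .arff; the sketch is just to show what ends up in the file.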

Preprocess Tab

You can click on an attribute (bottom left) and the panel on the right will show you information about it, e.g. whether it is nominal or numeric and how many distinct values it has or, in the case of a numeric attribute, its range. There is also a graph at the bottom right that shows the distribution of the attribute’s values.
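The kind of summary shown in that panel can be approximated in a few lines of Python. This is a simplified sketch, not Weka’s own logic: the function name `summarize_attribute` is hypothetical, and it assumes a column is numeric only if every value parses as a number.

```python
from collections import Counter


def summarize_attribute(values):
    """Summarise one column: numeric range, or nominal value counts."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        # At least one value is not a number, so treat the column as nominal.
        counts = Counter(values)
        return {"type": "nominal", "distinct": len(counts), "counts": dict(counts)}
    return {
        "type": "numeric",
        "min": min(numbers),
        "max": max(numbers),
        "mean": sum(numbers) / len(numbers),
    }


if __name__ == "__main__":
    print(summarize_attribute(["sunny", "rainy", "sunny"]))
    print(summarize_attribute(["85", "70", "64"]))
```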

Classify Tab

In the Classify tab you can choose the machine learning algorithm that you want to test, to see how accurate it is on the training set.

Here the ZeroR algorithm is selected; we will select the J48 trees algorithm instead.
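ZeroR is useful as a baseline because it ignores all the attributes and always predicts the most common class, so any other algorithm should at least beat it. A minimal Python sketch of that idea (the function name `zero_r` and the labels are made up for illustration, and this is not Weka’s code):

```python
from collections import Counter


def zero_r(train_labels):
    """Return a predictor that always answers with the majority class."""
    majority = Counter(train_labels).most_common(1)[0][0]

    def predict(instance):
        # The instance's attributes are ignored entirely.
        return majority

    return predict


if __name__ == "__main__":
    labels = ["yes"] * 9 + ["no"] * 5  # hypothetical class column
    predict = zero_r(labels)
    hits = sum(predict(None) == label for label in labels)
    print(predict(None), hits / len(labels))
```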

Once the algorithm is selected, you can adjust the output: in More options, choose Output predictions and pick Plain Text, CSV or HTML (the default is None).

Here we are testing the algorithm using Cross-validation with 10 folds. We could instead choose “Use training set” or “Percentage split”, which trains on a chosen percentage of the data and uses the rest as a test set. There is also an option to supply a separate test set, to see how your algorithm performs on different data after it has been trained on the training set.
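The idea behind 10-fold cross-validation is that the data is cut into 10 parts, and each part takes a turn as the test set while the other 9 are used for training. Below is a small Python sketch of how the instances could be divided up. The function name `k_fold_indices` and the seed are assumptions for the example, and this simple version does not try to balance the classes within each fold.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists, one pair per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(20, k=5):
        print(len(train), "training instances,", len(test), "test instances")
```

Every instance ends up in a test set exactly once, which is why the accuracy from cross-validation is a fairer estimate than testing on the training set itself.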

If you click on the text J48 -C 0.25 -M 2, a pop-up appears allowing further adjustments.

When you have set it up to your requirements, click the Start button on the left. You will see the output in the Classifier output panel on the right. Towards the bottom it shows the correctly classified instances as a number and a percentage, and at the very bottom a confusion matrix. You want all the numbers on the diagonal; any off the diagonal are incorrectly classified instances.
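To make the confusion matrix less mysterious, here is a small Python sketch that builds one from lists of actual and predicted classes, with rows for the actual class and columns for the predicted class, and then works out the accuracy from the diagonal. The function names and the example labels are made up for illustration.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are the actual class, columns are the predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    actual = ["a", "a", "b", "b", "b"]
    predicted = ["a", "b", "b", "b", "a"]
    m = confusion_matrix(actual, predicted, classes=["a", "b"])
    for row in m:
        print(row)
    print("accuracy:", accuracy(m))
```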

You are looking for a good fit rather than an overfit. An overfitted model matches the training data closely but will not perform well on different datasets.

The nice thing with Weka is that after you have done a run, you can adjust settings to tune or explore the output, or try different algorithms. Each run is listed in the panel on the left, so unless you delete them you have their output to compare against later adjustments.

Visualisation

There are also visualisation tools. In the first course the Visualize tab was not used much, but the boundary visualiser was used to show how different algorithms split the data.

In the example below you can see the petal lengths for the 3 different species, coloured blue, green and red. The red area is quite discrete from the green and blue, whereas there is a bit of mingling between the blue and the green.

The way the algorithms decide which species a point falls into can be shown in Weka using the boundary visualiser (it only works on two attributes at a time, but it is good for seeing what is going on).

Using the OneR classifier algorithm, the plot has horizontal bands: red at the bottom, then green, then blue. If a point falls within a band, it is classified as that colour.

With the IBk algorithm, which is a nearest-neighbour method, the boundaries are drawn according to the single nearest neighbour. You could set the number of neighbours to 2, 3, etc., and each point would then be classified by a vote among that many neighbours. As you can see below, when it is fitting to the 4 nearest neighbours the lines are less sharp, as it’s trying to find the best fit between eight different points, so the lines are more blurred.
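The nearest-neighbour idea is simple enough to sketch in plain Python. This is an illustration of the general technique and not Weka’s IBk code: the function name `knn_predict` and the points are invented, it uses straight-line distance on two numeric attributes, and it takes a simple majority vote among the k closest training points.

```python
import math
from collections import Counter


def knn_predict(train, query, k=1):
    """Classify `query` by a vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    train = [
        ((1.0, 1.0), "red"),
        ((1.2, 0.9), "red"),
        ((5.0, 5.0), "blue"),
        ((5.2, 4.8), "blue"),
        ((4.9, 5.1), "blue"),
    ]
    # With k=1 only the closest point matters.
    print(knn_predict(train, (1.1, 1.0), k=1))
    # With k=5 every point votes, so the larger class wins.
    print(knn_predict(train, (1.1, 1.0), k=5))
```

With a small k the prediction follows the closest points very tightly; with a larger k more distant points get a say, which is why the boundaries become less sharp.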

 

 

With the J48 tree algorithm the space is divided differently again, and increasing minNumObj (the minimum number of instances per leaf) makes the boundaries more general.

 Package Add-Ins for Weka

There are a variety of add-in packages for Weka, some providing additional machine learning algorithms, that can easily be added to the basic Weka install.

 

Knime Weka Nodes

Knime has just been updated to 3.6.0, and I note that there are Weka nodes available within Knime (as well as Python scripting nodes).

 

End Comment

As a first dip into machine learning and Weka I have found this very interesting. The actual running of the tool is easy; it’s knowing what has just happened that I’m currently trying to figure out.

Weka has an awful lot of things to adjust, and they are all flying way above my head at the moment. I will do a bit more exploring and work through the 2nd series of videos to get a bit more familiar with the tool.

I can see potential for using machine learning with asset management for predicting when items will need to be replaced, based on historic data as training information.

The courses are a real help in learning to use Weka, and much better than the Knime videos, in my opinion. Having the test data to practise the processes on is also good.

Worth exploring.
