In this blog post we will try to predict chronic kidney disease using various attributes collected from hospitals. Chronic kidney disease (CKD) is a condition characterized by a gradual loss of kidney function over time, which may lead to kidney failure.
We are using the UCI Chronic Kidney Disease data set from the Data Science Experience community. You can get data into your project in two steps:
1. Go to the Data Science Experience community. You can also navigate to the community from DSX by clicking the Community tab on the top panel.
2. Select the data set from the community and click the add (+) icon on the community card. Select your project and click Add.
After you add the data set to your project you can find it in the Data assets section under the Assets tab.
Data Science Experience offers an array of options for working with your data. To model this problem and understand the factors affecting chronic kidney disease, we will use IBM SPSS Modeler Flow in DSX. IBM SPSS Modeler Flow is a graphical interface to create different machine learning flows.
Create SPSS Modeler Flow:
To create an SPSS Modeler Flow, go to the Assets tab as shown above and click the new flow icon under the SPSS Modeler Flow section. Give a name and description to the flow and select the IBM SPSS Modeler runtime.
Nodes in IBM SPSS Modeler Flow:
Before starting with the analysis, let’s have a look at different node options available in SPSS Modeler Flow.
On the left side panel (Nodes Palette) you can see the different types of nodes available for working on your data. There are six node categories:
- Record Operations: As the name suggests, you can use them to perform operations such as selecting, appending, sorting on the record (row) level.
- Field Operations: These nodes are helpful in the data preparation phase. You can filter data, rename features, and choose the type of your attributes.
- Graphs: Nodes in this section will help you with basic data exploration and understanding distribution or relationship between features.
- Modeling: These nodes provide different modeling algorithms for different types of problems.
- Outputs: These nodes are helpful in understanding your data and model. You can display results in table format or get a report on evaluation parameters of your model.
- Export: After processing and modeling, this node will help you export data from the flow editor to your DSX project.
Drag and drop a node onto the canvas and right-click it to take further actions such as open, preview, or run.
To start working on the problem, first we need to get data into the canvas. It is as easy as drag-and-drop. To preview the file, right-click on a node and select Preview.
There are a few values missing from our data. Let’s dig deeper into summary statistics of our data using the Data Audit node. Drag the Data Audit node and connect it with the data node. We have to open the node to change settings or give a custom name.
After running the node you can see the audit report on the right side panel.
In the data, some features have more missing values compared to others. Let’s drop those features using the Filter node, and then we will drop rows with missing values using the Select node. In this way, we can retain the maximum number of records.
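To see why the order matters (drop sparse features first, then drop incomplete rows), here is a minimal sketch in plain Python. It is not what the Filter and Select nodes run internally; the field names, the toy records, and the 50% missing-value threshold are all assumptions made for illustration.

```python
# Toy records; None marks a missing value. Field names and values are made up.
records = [
    {"age": 48, "bp": 80, "rbc": None, "class": "ckd"},
    {"age": 62, "bp": None, "rbc": None, "class": "ckd"},
    {"age": 35, "bp": 70, "rbc": "normal", "class": "notckd"},
    {"age": 51, "bp": 80, "rbc": None, "class": "notckd"},
]

def missing_fraction(rows, field):
    """Share of rows in which the given field is missing."""
    return sum(row[field] is None for row in rows) / len(rows)

def drop_sparse_fields(rows, max_missing=0.5):
    """Keep only fields whose missing share is at most max_missing (Filter-like step)."""
    keep = [f for f in rows[0] if missing_fraction(rows, f) <= max_missing]
    return [{f: row[f] for f in keep} for row in rows]

def drop_incomplete_rows(rows):
    """Keep only rows with no missing values (Select-like step)."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Drop sparse fields first, then incomplete rows.
clean = drop_incomplete_rows(drop_sparse_fields(records))

# For comparison: dropping incomplete rows straight away keeps far fewer records.
rows_only = drop_incomplete_rows(records)

print(len(clean), len(rows_only))
```

In this toy example, removing the mostly empty field first leaves three complete records, while dropping rows directly would leave only one.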
If you decide to impute these missing values, the Filler node will help you do that.
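As a rough idea of what imputation means, the sketch below replaces missing values in a numeric field with the mean of the observed values. This is one common strategy written in plain Python, not the Filler node's implementation; the field name and the values are hypothetical.

```python
from statistics import mean

def impute_mean(rows, field):
    """Return new rows where missing values of a numeric field are replaced by the mean."""
    observed = [row[field] for row in rows if row[field] is not None]
    fill = mean(observed)
    return [{**row, field: fill if row[field] is None else row[field]} for row in rows]

# Toy data: "bp" is an illustrative field name, None marks a missing value.
rows = [{"bp": 80}, {"bp": None}, {"bp": 70}]
imputed = impute_mean(rows, "bp")
print(imputed)
```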
Once our data is clean, we can set our class variable as the target variable using the Type node. It will help our model to distinguish between input and target features.
Let’s take a quick look at the distribution of our target variable. Drag the Distribution node from the graphs section of the node palette and provide field information under settings.
In our data we have more non-chronic kidney disease cases than chronic kidney disease cases.
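The same kind of summary can be sketched outside the flow with a simple count per class. The snippet below is plain Python on made-up rows; the field name and label values are assumptions, and the counts do not reflect the actual data set.

```python
from collections import Counter

def distribution(rows, field):
    """Map each value of the field to (count, share of all rows)."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {value: (n, n / total) for value, n in counts.items()}

# Toy rows for illustration only.
rows = [
    {"class": "notckd"},
    {"class": "notckd"},
    {"class": "ckd"},
    {"class": "notckd"},
]
dist = distribution(rows, "class")
print(dist)
```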
One more step before building the classification model is to divide data into train and test sets. We will use the Partition node for this.
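Conceptually, partitioning means shuffling the records and cutting them into two groups. Here is a minimal sketch in plain Python; the 70/30 split and the fixed seed are assumptions chosen for illustration, not settings taken from the Partition node.

```python
import random

def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into train and test lists."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))  # stand-in for ten records
train, test = partition(rows)
print(len(train), len(test))
```

The model is then fitted on the training portion only, and the held-out portion is used to check how well it generalizes.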
Now let’s fit the classification model. We will use the C5.0 algorithm to build a decision tree. A C5.0 model works by splitting the data based on the field that provides the maximum information gain. You can find the C5.0 node under the Modeling section of the nodes palette.
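To make "maximum information gain" concrete, the sketch below computes entropy-based information gain for categorical fields and picks the best field to split on. It only illustrates the splitting criterion; the real C5.0 implementation includes further refinements that are not reproduced here, and the field names and rows are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, field, target):
    """Entropy of the target minus the weighted entropy after splitting on field."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Toy rows; "htn" and "appet" are illustrative field names.
rows = [
    {"htn": "yes", "appet": "good", "class": "ckd"},
    {"htn": "yes", "appet": "poor", "class": "ckd"},
    {"htn": "no", "appet": "good", "class": "notckd"},
    {"htn": "no", "appet": "good", "class": "notckd"},
]

gains = {f: information_gain(rows, f, "class") for f in ("htn", "appet")}
best_field = max(gains, key=gains.get)
print(best_field, gains)
```

In this toy data the first field separates the two classes perfectly, so it has the higher gain and would be chosen for the first split.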
While building this model we don’t have to specify input and output variables. We have already done that in the Type node. Once you run your decision tree model you will be able to see your model in a golden color node.
Right-click on the golden color node and view the model. You can see predictor importance, the tree diagram, and other model information here.
To evaluate the performance of the model, select the Analysis node from the Output section of the node palette and connect it with the model. Similarly, use the Table node to view data in a table format with predicted labels and confidence.
This is the analysis report for our model. We have achieved 97% accuracy on our test data set with this model.
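For reference, accuracy is simply the share of records where the predicted label matches the actual label. The sketch below computes it, along with a small table of actual versus predicted counts, in plain Python. The labels are toy values made up for illustration and are unrelated to the result reported above.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Share of records whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def coincidence_matrix(actual, predicted):
    """Count each (actual, predicted) pair."""
    return Counter(zip(actual, predicted))

# Toy labels for illustration only.
actual = ["ckd", "ckd", "notckd", "notckd", "notckd"]
predicted = ["ckd", "notckd", "notckd", "notckd", "notckd"]

score = accuracy(actual, predicted)
matrix = coincidence_matrix(actual, predicted)
print(score, dict(matrix))
```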
Now it’s time to save our model. Right-click on a terminal node in the flow (e.g., the Analysis or Table node), click the save as a model option, and provide a model name to save this SPSS model to our project.