1 Decision Trees: A survey was sent to the employees of a large company, asking them the following questions:
- Do you work in the data analytics department? (Y or N)
- Are you above the age of 30? (Y or N)
- Have you spent more than 5 years in this company? (Y or N)
- Is your current gross income more than USD 50,000 per year? (Y or N)
The following table summarizes the responses to the survey. For each entry, “Number of Instances” represents the number of respondents having the corresponding values for the attributes Analytics
Department, Age>30, and Tenure>5.
Analytics Department   Age>30   Tenure>5   Number of Instances of Income > 50K   Number of Instances of Income ≤ 50K
Y                      Y        Y          25                                    0
N                      Y        Y          15                                    0
Y                      N        Y          10                                    5
Y                      Y        N          0                                     0
N                      N        Y          0                                     0
N                      Y        N          25                                    15
Y                      N        N          0                                     10
N                      N        N          0                                     20
Given the data above, answer the following questions:
(a) Find the support and confidence for the rule: if Analytics Department = Y then Income > 50K
(b) Find the support and confidence for the rule: if Analytics Department = Y and Tenure > 5 then Income > 50K
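The arithmetic behind parts (a) and (b) can be checked with a minimal Python sketch. It assumes one common convention: support is the fraction of all respondents who satisfy both the antecedent and the consequent, and confidence is the fraction of respondents satisfying the antecedent who also satisfy the consequent. If the class uses a different definition of support, adjust accordingly. The names `ROWS` and `support_confidence` are illustrative only.

```python
# Survey table: (analytics, age_over_30, tenure_over_5, n_income_gt_50k, n_income_le_50k)
ROWS = [
    ("Y", "Y", "Y", 25, 0),
    ("N", "Y", "Y", 15, 0),
    ("Y", "N", "Y", 10, 5),
    ("Y", "Y", "N", 0, 0),
    ("N", "N", "Y", 0, 0),
    ("N", "Y", "N", 25, 15),
    ("Y", "N", "N", 0, 10),
    ("N", "N", "N", 0, 20),
]


def support_confidence(antecedent):
    """Return (support, confidence) for the rule `antecedent -> Income > 50K`.

    `antecedent` is a predicate over (analytics, age_over_30, tenure_over_5).
    Assumed convention: support counts respondents matching both sides of the rule.
    """
    total = sum(hi + lo for _, _, _, hi, lo in ROWS)
    matched = [(hi, lo) for a, g, t, hi, lo in ROWS if antecedent(a, g, t)]
    n_antecedent = sum(hi + lo for hi, lo in matched)
    n_both = sum(hi for hi, _ in matched)
    return n_both / total, n_both / n_antecedent


rule_a = support_confidence(lambda a, g, t: a == "Y")
rule_b = support_confidence(lambda a, g, t: a == "Y" and t == "Y")
print("rule (a): support=%.3f confidence=%.3f" % rule_a)
print("rule (b): support=%.3f confidence=%.3f" % rule_b)
```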
(c) Using the 1-rule method discussed in class, find the relevant sets of classification rules for the target variable by testing each of the input attributes Analytics Department, Age > 30, and Tenure > 5.
Which of these three sets of rules has the lowest misclassification rate?
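One way to sanity-check part (c) is the sketch below, written under the assumption that the 1-rule method predicts the majority class for each value of a single attribute and counts the minority instances as misclassifications. Ties are broken in favor of Income > 50K, an arbitrary choice that does not arise in this table. The helper names (`one_rule`, `ATTRS`) are hypothetical.

```python
# Survey table: (analytics, age_over_30, tenure_over_5, n_income_gt_50k, n_income_le_50k)
ROWS = [
    ("Y", "Y", "Y", 25, 0),
    ("N", "Y", "Y", 15, 0),
    ("Y", "N", "Y", 10, 5),
    ("Y", "Y", "N", 0, 0),
    ("N", "N", "Y", 0, 0),
    ("N", "Y", "N", 25, 15),
    ("Y", "N", "N", 0, 10),
    ("N", "N", "N", 0, 20),
]
ATTRS = {"Analytics Department": 0, "Age>30": 1, "Tenure>5": 2}


def one_rule(attr_index):
    """Return (rules, misclassification_rate) for a single input attribute."""
    counts = {}
    for row in ROWS:
        hi, lo = counts.get(row[attr_index], (0, 0))
        counts[row[attr_index]] = (hi + row[3], lo + row[4])
    # Predict the majority class for each attribute value (ties go to ">50K").
    rules = {v: (">50K" if hi >= lo else "<=50K") for v, (hi, lo) in counts.items()}
    # Every instance in the minority class of a value is misclassified.
    errors = sum(min(hi, lo) for hi, lo in counts.values())
    total = sum(hi + lo for hi, lo in counts.values())
    return rules, errors / total


results = {name: one_rule(i) for name, i in ATTRS.items()}
best = min(results, key=lambda name: results[name][1])
for name, (rules, rate) in results.items():
    print(name, rules, round(rate, 3))
print("lowest misclassification rate:", best)
```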
(d) Considering Income>50K as the target variable, which of the attributes would you select as the root in a decision tree that is constructed using the information gain impurity measure?
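For part (d), the following sketch computes the information gain of each candidate root attribute, assuming entropy in bits (log base 2) and a weighted average of the child entropies. Two of the attributes come out very close, so it is worth carrying several decimal places in a manual calculation. Function and variable names are illustrative.

```python
from math import log2

# Survey table: (analytics, age_over_30, tenure_over_5, n_income_gt_50k, n_income_le_50k)
ROWS = [
    ("Y", "Y", "Y", 25, 0),
    ("N", "Y", "Y", 15, 0),
    ("Y", "N", "Y", 10, 5),
    ("Y", "Y", "N", 0, 0),
    ("N", "N", "Y", 0, 0),
    ("N", "Y", "N", 25, 15),
    ("Y", "N", "N", 0, 10),
    ("N", "N", "N", 0, 20),
]
ATTRS = {"Analytics Department": 0, "Age>30": 1, "Tenure>5": 2}


def entropy(hi, lo):
    """Binary entropy in bits for a node with `hi` and `lo` class counts."""
    total = hi + lo
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in (hi, lo) if n > 0)


def information_gain(attr_index):
    """Parent entropy minus the size-weighted entropy of the two children."""
    parent_hi = sum(r[3] for r in ROWS)
    parent_lo = sum(r[4] for r in ROWS)
    total = parent_hi + parent_lo
    weighted = 0.0
    for value in ("Y", "N"):
        hi = sum(r[3] for r in ROWS if r[attr_index] == value)
        lo = sum(r[4] for r in ROWS if r[attr_index] == value)
        weighted += (hi + lo) / total * entropy(hi, lo)
    return entropy(parent_hi, parent_lo) - weighted


gains = {name: information_gain(i) for name, i in ATTRS.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
```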
(e) Use the Gini index impurity measure and construct the full decision tree for this data set.
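A possible way to verify a hand-built tree for part (e) is the recursive sketch below. It assumes binary splits on the three Y/N attributes, picks the attribute with the lowest size-weighted Gini index at each node, and stops when a node is pure, empty, or has no attributes left. These stopping rules are an assumption, not something the exercise specifies. Leaves report the class counts as (Income > 50K, Income ≤ 50K).

```python
# Survey table: (analytics, age_over_30, tenure_over_5, n_income_gt_50k, n_income_le_50k)
ROWS = [
    ("Y", "Y", "Y", 25, 0),
    ("N", "Y", "Y", 15, 0),
    ("Y", "N", "Y", 10, 5),
    ("Y", "Y", "N", 0, 0),
    ("N", "N", "Y", 0, 0),
    ("N", "Y", "N", 25, 15),
    ("Y", "N", "N", 0, 10),
    ("N", "N", "N", 0, 20),
]
NAMES = ["Analytics Department", "Age>30", "Tenure>5"]


def gini(hi, lo):
    """Gini index of a node with `hi` and `lo` class counts."""
    total = hi + lo
    if total == 0:
        return 0.0
    return 1.0 - (hi / total) ** 2 - (lo / total) ** 2


def counts(rows):
    """Class counts (Income > 50K, Income <= 50K) summed over table rows."""
    return sum(r[3] for r in rows), sum(r[4] for r in rows)


def split(rows, i):
    """Partition table rows by the Y/N value of attribute i."""
    return {v: [r for r in rows if r[i] == v] for v in ("Y", "N")}


def weighted_gini(rows, i):
    """Size-weighted Gini index of the children produced by splitting on i."""
    total = sum(counts(rows))
    return sum(
        sum(counts(part)) / total * gini(*counts(part))
        for part in split(rows, i).values()
    )


def build(rows, attrs):
    """Grow the tree until nodes are pure, empty, or out of attributes."""
    hi, lo = counts(rows)
    if hi == 0 or lo == 0 or not attrs:
        return {"leaf": (hi, lo)}
    best = min(attrs, key=lambda i: weighted_gini(rows, i))
    rest = [i for i in attrs if i != best]
    children = {v: build(part, rest) for v, part in split(rows, best).items()}
    return {"split": NAMES[best], "children": children}


tree = build(ROWS, [0, 1, 2])
print(tree)
```

Comparing the root chosen here with the one favored by information gain is a useful check, since the two impurity measures need not agree on this data.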
2 Exploratory Data Analysis: This exercise relates to the household income and expense dataset available on Blackboard as “Inc Exp Data.csv”. The data was taken from Kaggle and has 7 variables related to the income and expense details of households. The following table defines the variables in the data:
Variable Name              Description
Mthly HH Income            Monthly household income
Mthly HH Expense           Monthly household expenses
No of Fly Members          Number of family members
Emi or Rent Amt            Rent or mortgage installment amount
Annual HH Income           Annual household income
Highest Qualified Member   Academic qualification of highest qualified family member
No of Earning Members      Number of earning family members
Load the dataset into R and answer the following questions:
(a) How many rows and columns are in the dataset?
(b) Convert the variable “Highest Qualified Member” to a factor variable. Print the summary of the dataset and explain the key points of the summary for “Mthly HH Income” and “Highest Qualified Member”.
(c) Calculate the mean and standard deviation of all numeric columns.
Hint: Use the dplyr package to select only the numeric columns with the is.numeric predicate, then generate summary statistics.
(d) Calculate disposable income of households as the difference between monthly income and expenses.
Plot a histogram of disposable income with 10 breaks.
Hint: Use the hist function and look at the help file for the “breaks” argument.
(e) Construct a boxplot for monthly household income against the highest qualified member in a household. Your boxplots should be in the sequence illiterate, undergraduate, professional, graduate, post-graduate.
Hint: You may need to redefine the levels of the factor variable “Highest Qualified Member”. Use the levels argument in the factor command. Use the boxplot function. You should get 5 box plots in the same chart.
(f) For families with no more than 4 family members, calculate average monthly household income by highest qualified member using dplyr. Then, create a bar chart using ggplot2 demonstrating the same information.
Hint: Chain the dplyr functions filter, group_by, and summarize, then pass the result to the ggplot function.
3 Logistic Regression: This exercise relates to the diabetes dataset available on Blackboard as diabetes.csv.
It contains demographic and medical data for 768 females over the age of 21. The variables are defined below:
Variable Name              Description
Pregnancies                Number of times pregnant
Glucose                    Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure              Diastolic blood pressure (mm Hg)
SkinThickness              Triceps skin fold thickness (mm)
Insulin                    2-hour serum insulin (mu U/ml)
BMI                        Body mass index (weight in kg / (height in m)^2)
DiabetesPedigreeFunction   Diabetes pedigree function
Age                        Age (in years)
Outcome                    Class variable (0 if no diabetes, 1 if individual has diabetes)
Answer the following questions:
(a) Load the data into R. Print the structure of the dataset and explain the output.
Hint: Use the read.csv and str commands. This can be done in 2 lines of code.
(b) Convert the variable Outcome into a factor variable. Print the frequency distribution of the Outcome variable using the table command and explain what it means.
Hint: Use the as.factor and table commands. You only need two lines of code for this.
(c) Create your training set with a random selection of 70% of the rows in the dataset and your testing set with the other 30%. Use seed value 123 for this randomization. Print the frequency distribution of the outcome variable in both train and test data. Are the two datasets similar in terms of the distribution of the outcome variable? Explain.
Hint: You can use the sample command for the split. You will also need the set.seed command.
(d) Train a logistic regression model on the training dataset. How many of the variables are significant?
Hint: Use the glm and summary commands for this part.
(e) Generate predictions on the testing dataset using the logistic regression model produced in part (d). Report the confusion matrix of your logistic regression model on the train set when the threshold is set to 0.25. Compute the accuracy, true positive rate, and false positive rate for the model.
Hint: You can use the predict function to generate testing predictions, an ifelse command to create binary predictions, and table to create a confusion matrix. This should take only 3 lines of code.
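For reference, the three metrics in part (e) are conventionally defined from the confusion matrix counts as follows, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, and an observation is classified as positive when its predicted probability exceeds the 0.25 threshold. Whether the boundary itself counts as positive is a convention the exercise does not specify.

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{TPR} = \frac{TP}{TP + FN}, \qquad
\text{FPR} = \frac{FP}{FP + TN}
```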
(f) Generate ROC plots and precision-recall plots for both the training and the testing datasets. Report the area under each curve and attach the plots in your final submission. Briefly explain what each curve and its AUC represent.
Hint: Use the ROCR library.
(g) An individual displays the following traits: pregnancies = 1, glucose = 130, blood pressure = 80, skin thickness = 22, insulin = 100, BMI = 25, diabetes pedigree function = 0.5, age = 50. According to your final model, what is the probability that the individual has diabetes? Show your working.
Note: This is a manual calculation. Do not do this part with R. You can round the coefficient estimates to 2 decimal places for ease of work.
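The manual calculation in part (g) has the general shape below. The symbols β0 through β8 are placeholders for the intercept and the coefficient estimates from your own fitted model, in the order pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age. This sketch assumes all eight predictors are kept in the final model; drop the corresponding term for any variable you removed.

```latex
z = \beta_0 + \beta_1(1) + \beta_2(130) + \beta_3(80) + \beta_4(22)
  + \beta_5(100) + \beta_6(25) + \beta_7(0.5) + \beta_8(50),
\qquad
P(\text{Outcome} = 1) = \frac{1}{1 + e^{-z}}
```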
4 Decision Trees: Following the steps defined below, create a decision tree model to predict whether an individual has diabetes:
(a) Using the training and testing partitions created in question 3, create a decision tree model on your training data to predict the “Outcome” variable using the rpart function. In your console, print the decision tree model you just created and explain how to read the output and what each value means.
You don’t have to explain every node. Just a few terminal nodes to show you understand how to interpret the output.
Hint: Use the rpart library for this part.
(b) There are some parameters that control how the decision tree model works. These can be accessed in the help file of rpart. Create a decision tree model where every terminal node has at least 25 observations. Do you notice any difference between this model and the model created in part (a) above? Explain.
Hint: Type “?rpart” to bring up the help file and scroll down to the control argument. You will see a hyperlink titled “rpart.control”. Click on the hyperlink and read the help file.
(c) Plot the decision tree model obtained in part (b) of this question using rpart.plot.
(d) Predict the probability of having diabetes for each observation in both training and test data. Create the ROC plot and precision recall curves and report the area under the curve for all curves.
Hint: You can use the predict function and the ROCR library as in question 3.
(e) Compare the output of part (d) of this question to part (f) of question 3. Which model is better? Why?
(f) An individual displays the following traits: pregnancies = 1, glucose = 130, blood pressure = 80, skin thickness = 22, insulin = 100, BMI = 25, diabetes pedigree function = 0.5, age = 50. According to your final decision tree model above, what is the probability that the individual has diabetes? Explain.