Regression by Subgroup in R: Predicting Results for New Data

Introduction

In this article, we will explore how to use a regression model in R to predict a specific outcome based on various predictor variables. We will focus on the concept of subgrouping and how it can be used to improve prediction accuracy.

We will start by creating a dummy dataset that stands in for real-world data. This dataset will contain five columns: StudentNumber, SubjectCode, the outcome hmkk, and two assessment marks (ExamMark and AssessmentMark). Our goal is to use the two marks to predict the value of hmkk, fitting a separate model for each SubjectCode.

Data Preparation

To get started, we need to prepare our dummy dataset in R.

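# Base R is sufficient here; fix the seed so the dummy data is reproducible
set.seed(123)

# Create a dummy dataset: 100 students spread evenly over five subjects
LMTESTData <- data.frame(
  StudentNumber = 1:100,
  SubjectCode = rep(c("A","B","C","D","E"), times = 20),
  hmkk = rnorm(n = 100, mean = 72),            # sd defaults to 1
  ExamMark = rnorm(n = 100, mean = 62),
  AssessmentMark = rnorm(n = 100, mean = 68)
)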

# Check the structure of our dataset
str(LMTESTData)

As the str() output shows, our dataset has five columns: StudentNumber, SubjectCode, hmkk, and the two assessment marks (ExamMark and AssessmentMark).
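Because a separate model is fitted for each SubjectCode later on, it can help to confirm that every subgroup has enough rows to estimate three coefficients (an intercept and two slopes). The following is a minimal sketch of such a check; the names d, group_sizes, and n_terms are illustrative and not part of the original code.

```r
# Rebuild a small dummy dataset (assumed seed, for reproducibility only)
set.seed(1)
d <- data.frame(
  SubjectCode = rep(c("A", "B", "C", "D", "E"), times = 20),
  hmkk = rnorm(100, mean = 72),
  ExamMark = rnorm(100, mean = 62),
  AssessmentMark = rnorm(100, mean = 68)
)

# Count rows per subgroup
group_sizes <- table(d$SubjectCode)
print(group_sizes)

# Each subgroup model estimates 3 terms, so it needs more than 3 rows
n_terms <- 3
enough_rows <- all(group_sizes > n_terms)
print(enough_rows)
```

A subgroup with too few rows would still be passed to lm(), but some of its coefficients could come back as NA, so a check like this is worth running on real data.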

Linear Regression Model

Next, we will create a linear regression model to predict the value of hmkk from the two predictor variables, ExamMark and AssessmentMark. This first model pools all subjects together.

# Create a linear regression model
LMTESTModel <- lm(hmkk ~ ExamMark + AssessmentMark, data = LMTESTData)

# Summarize our model
summary(LMTESTModel)

The output from the summary() function will provide us with information about our model, including the coefficients and standard errors for each predictor variable.
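If only the coefficient table is needed, rather than the full printed summary, it can be pulled out as a matrix. The sketch below assumes a dummy dataset like the one above; the names d, fit, and coef_table are illustrative.

```r
# Dummy data (assumed seed, for reproducibility only)
set.seed(1)
d <- data.frame(
  hmkk = rnorm(100, mean = 72),
  ExamMark = rnorm(100, mean = 62),
  AssessmentMark = rnorm(100, mean = 68)
)

fit <- lm(hmkk ~ ExamMark + AssessmentMark, data = d)

# coef(summary(fit)) is a matrix with one row per model term
coef_table <- coef(summary(fit))

# Keep just the estimates and their standard errors
print(coef_table[, c("Estimate", "Std. Error")])

# R-squared is stored on the summary object as well
print(summary(fit)$r.squared)
```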

Coefficient Extraction

Now that we have fitted a pooled model, we can fit the same model separately within each subgroup and extract its coefficients. This can be done by using split() to divide the data by SubjectCode and sapply() to fit a model to each piece and collect the coefficients.

# Use split and sapply to extract coefficients by SubjectCode
LMTESTCoefficients <- sapply(split(LMTESTData, LMTESTData$SubjectCode), function(d) 
                            coef(lm(hmkk ~ ExamMark + AssessmentMark, data = d)))

# Print our coefficients
print(LMTESTCoefficients)

This will output a matrix with one column per SubjectCode and one row per model term: (Intercept), ExamMark, and AssessmentMark.
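One possible variation is to keep the fitted model objects themselves in a named list and derive the coefficient matrix from that list. This is a sketch under the same dummy-data assumptions; the names d, models, and coefs are illustrative.

```r
# Dummy data (assumed seed, for reproducibility only)
set.seed(1)
d <- data.frame(
  SubjectCode = rep(c("A", "B", "C", "D", "E"), times = 20),
  hmkk = rnorm(100, mean = 72),
  ExamMark = rnorm(100, mean = 62),
  AssessmentMark = rnorm(100, mean = 68)
)

# One fitted lm object per subject, named by SubjectCode
models <- lapply(split(d, d$SubjectCode), function(g) {
  lm(hmkk ~ ExamMark + AssessmentMark, data = g)
})

# Collect the coefficients: 3 rows (terms) by 5 columns (subjects)
coefs <- sapply(models, coef)
print(dim(coefs))

# Coefficients for one subject, and one term across all subjects
print(coefs[, "B"])
print(coefs["ExamMark", ])
```

Keeping the model objects around means that summary() or predict() can still be called on any individual subgroup later.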

Predicting with New Data

Now that we have extracted the coefficients by SubjectCode, we can use them to predict hmkk for new data. We will create a new dataset with the same predictor columns as before, but this time without the hmkk column.

# Create a new dataset with predictor variables only
NEWDATA <- data.frame(
  SubjectCode = rep(c("A","B","C","D","E"), times = 20),
  ExamMark = rnorm(n = 100, mean = 62),
  AssessmentMark = rnorm(n = 100, mean = 68)
)

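# Predict hmkk for each new row from the coefficients of that row's SubjectCode
# (columns of LMTESTCoefficients are subject codes, rows are model terms)
SubjectCoefs <- LMTESTCoefficients[, as.character(NEWDATA$SubjectCode), drop = FALSE]
NEWHMMM <- SubjectCoefs["(Intercept)", ] +
  SubjectCoefs["ExamMark", ] * NEWDATA$ExamMark +
  SubjectCoefs["AssessmentMark", ] * NEWDATA$AssessmentMark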

# Print our predictions
print(NEWHMMM)

This will output one predicted hmkk value per row of the new dataset, each computed from the coefficients of that row's SubjectCode.
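An alternative to multiplying coefficients by hand is to keep one fitted model per subject and call predict() with the matching model for each new row. The sketch below assumes the same kind of dummy data and checks that both routes agree; the names train, new_rows, models, pred, and manual are illustrative, and the three new rows are made-up values.

```r
# Dummy training data (assumed seed, for reproducibility only)
set.seed(42)
train <- data.frame(
  SubjectCode = rep(c("A", "B", "C", "D", "E"), times = 20),
  hmkk = rnorm(100, mean = 72),
  ExamMark = rnorm(100, mean = 62),
  AssessmentMark = rnorm(100, mean = 68)
)

# Three hypothetical new students with no hmkk value
new_rows <- data.frame(
  SubjectCode = c("A", "C", "E"),
  ExamMark = c(61.5, 62.3, 63.0),
  AssessmentMark = c(68.2, 67.1, 69.4)
)

# One fitted model per subject
models <- lapply(split(train, train$SubjectCode), function(g) {
  lm(hmkk ~ ExamMark + AssessmentMark, data = g)
})

# For each new row, look up its subject's model and predict
codes <- as.character(new_rows$SubjectCode)
pred <- vapply(seq_len(nrow(new_rows)), function(i) {
  unname(predict(models[[codes[i]]], newdata = new_rows[i, ]))
}, numeric(1))
print(pred)

# Same calculation from the raw coefficients, as a cross-check
coefs <- sapply(models, coef)
manual <- coefs["(Intercept)", codes] +
  coefs["ExamMark", codes] * new_rows$ExamMark +
  coefs["AssessmentMark", codes] * new_rows$AssessmentMark
print(all.equal(pred, unname(manual)))
```

Using predict() also makes it straightforward to request intervals later (for example with its interval argument), which the hand calculation does not provide.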

Conclusion

In this article, we have explored how to use a regression model in R to predict a specific outcome from a set of predictor variables, and how to fit that model separately within subgroups. Extracting coefficients by SubjectCode lets each subject have its own intercept and slopes, which can give more accurate predictions for new data when the relationship between the marks and hmkk genuinely differs between subjects. With purely random dummy data like ours, no such difference is expected.
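One way to check whether subgrouping is worthwhile on a given dataset is to compare the pooled model with a model that interacts SubjectCode with both predictors, which gives each subject its own intercept and slopes. The following is a sketch of that comparison using an F-test via anova(); it is a swapped-in technique rather than something from the steps above, and the names d, pooled, by_subject, and cmp are illustrative.

```r
# Dummy data (assumed seed, for reproducibility only)
set.seed(7)
d <- data.frame(
  SubjectCode = rep(c("A", "B", "C", "D", "E"), times = 20),
  hmkk = rnorm(100, mean = 72),
  ExamMark = rnorm(100, mean = 62),
  AssessmentMark = rnorm(100, mean = 68)
)

# Pooled model: one intercept and one pair of slopes for everyone
pooled <- lm(hmkk ~ ExamMark + AssessmentMark, data = d)

# Subgroup model: separate intercept and slopes for each SubjectCode
by_subject <- lm(hmkk ~ SubjectCode * (ExamMark + AssessmentMark), data = d)

# F-test for whether the 12 extra subject-specific terms reduce the error
cmp <- anova(pooled, by_subject)
print(cmp)
```

A small p-value in the second row would suggest that the subject-specific coefficients explain variation the pooled model misses. On random dummy data the subgroup model will usually show no meaningful improvement.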

In future articles, we will continue to explore advanced topics such as regularization and model selection.

Last modified on 2024-03-23