Modeling uncertainty in exam scores

Published on 9th of November, 2020.

We widely use exams in education to gauge the level of students. The result of an exam is really only indicator of the students actual level, and has a certainly level of uncertainty. In this article we will try to model the uncertainty in the grades of an exam through a Bayesian model, using the amount of points obtained in each question. Such an analysis may be useful in designing good exams, as in principal this error is something we would wish to minimize.

The data

We will use scores for 8 individual questions of a university math exam. We recorded scores of 100 students. Each question is worth 12 points for a total of 96 points, and students could obtain anywhere between 0 and 12 points for each question.

Notice: this data is modified from the original in multiple ways due to privacy reasons. The modifications preserve most qualitative statistical properties._

Modeling the data

Let SS denote the total exam score, and let GjG_j denote the score obtained for the jjth question. Obviously these variables are not independent, high scores in one question are a indicator the student will get high scores in another question as well. For simplicity we want to model the question scores as conditionally independent given the total score SS. The idea behind this is that SS is an unbiased measurement of the students abilities, and the score obtained for a question should only depend on the student’s abilities. This ignores the fact that some questions test similar aspects of the course, and so are more correlated compared to others. If we plot the covariance matrix of the scores we see that there doesn’t seem to be a strong correlation between question scores. We do however see that the 6th question has a higher variance than the other questions

png

Now we need to obtain a good model for the conditional probability P(GjS)P(G_j|S). We will model GjG_j by a binomial distribution, which corresponds to assuming GjG_j consists of 1212 ‘subproblems’, of which the student has equal chance of solving. While this does not seem like a natural assumption, it is very convenient. If mjm_j denotes the total score obtainable for question GjG_j this gives model

P(GjS)P(Gjpj,S)P(pjS), P(G_j|S) \propto P(G_j|p_j,S)P(p_j|S),

with

Gjpj,SBin(pj,mj). G_j|p_j,S \sim \mathrm{Bin}(p_j,m_j).

Next we need to model the conditional distribution P(pjS)P(p_j|S). We tried linear and quadratic regression on logit(pj)\mathrm{logit}(p_j), but this tends to perform badly near the limits S=0S=0 and S=96S=96, and in general tends to shift the predictions towards the global mean. Instead we want a model such that if S=0S=0 then pjp_j is concentrated near zero. To do this we model pjp_j with a beta distribution, whose parameters depend quadratically on SS. More precisely if S^=S/96\hat S= S/96 (which supported on [0,1][0,1]), then we set

pjS,α0,α1j,β0j,β1jBeta(eα0jS+eα1jS2+τ,eβ0j(1S)+eβ1j(1S)2+τ), p_j|S,\alpha_0,\alpha^j_1,\beta^j_0,\beta^j_1 \sim \mathrm{Beta}\left(e^{\alpha^j_0} S+e^{\alpha^j_1} S^2+\tau,\, e^{\beta^j_0} (1-S)+e^{\beta^j_1} (1-S)^2+\tau\right),

where τ=0.01\tau=0.01 is a small regularization parameter. This model has pjp_j constrained close to 00 if SS is small (and similarly pj1p_j\simeq 1 if SS close to 96), and also has enough flexibility to fit our data reasonably well. The exponentials are used both to force the parameters to be positive, and this parametrization also improves the fit. The parameters αi,βi\alpha_i,\beta_i are given normal priors with mean 00 and large sigma, although the fit is not sensitive on this prior.

We will model all the parameters in this model using a Monte Carlo Markov Chain (MCMC) simulation using

pymc3
. With
pymc3
it is very straightforward to implement this model, and requires only a couple lines of code. In the end we obtain samples for the variables GjG_j, and we can model the error in the total score by looking at the distribution of the posterior j=18Gj\sum_{j=1}^{8} G_j.

Results

First we compare the distribution of the predicted exam scores to the actual exam scores. The graph below shows the real scores on the horizontal axis, and predicted scores on vertical axis. The vertically shaded area indicates one standard deviation. The model appears to by unbiased for lower scores, but picks up a slight bias for higher scores. In the original unmodified data this behavior was not apparent, and is therefore likely a byproduct of how the data was modified due to privacy concerns.

png

The goal of this project is to simulate the error in the exam score. Below we plot the the standard deviations of the predicted scores as function of total score. We compare this to a theoretical variance coming from a binomial distribution with 96 trials, and we see that the error is close to this theoretical variance. This means that most of the variance in posterior comes directly from the binomial distributions modeling the question scores GjG_j. We also plot the the variance in the posterior predictive, which is significantly higher. This makes sense, because for the posterior predictive we do not just model the error in each question score, we also have to predict the scores for each question given just the total score.

png

To measure how well our model describes the data, we compare the posterior predictive for each question to the data for each question as function of total score. Here we see that our model describes the distribution quite well for most questions, although in some questions the model underestimates the spread of the data. This is likely because the binomial model is not accurate. To model the questions scores more accurately we would need to step away from the binomial model. However this would make the model more complicated, and the current model does give reasonable predictions for the most part.

png

Evaluation the qualities of questions

We can also use the posterior predictive to qualitatively evaluate the quality of questions. Let us plot the standard deviation and mean of the posterior predictive for each individual question as function of the total score SS below. In a perfect question the variation is as low as possible, and the mean should be relatively close to the straight line. All in all this means that the question provides as much information about the students abilities. Based on this it seems that questions 2,4,8 are quite good, whereas 1,3,6 are not as good. In particular the points obtained for question 6 has the weakest correlation with the total score. Question 5 seems to have been relatively easy, whereas question 3 appears to have been difficult. Analyzing questions in such a manner can be useful for designing future tests, as well as determining what aspects of the course the students found easier or more difficult.

png

png

Other posts you may like

Rik Voorhaar © 2024