Sunday, November 11, 2012

Matching Test Information

# Imagine that you are asked to create a test from an item pool that you have.

# The item pool has 100 items and you have to create a test of 20 items in length.

# Easy right?

# The trick is that the test creators want the information on the test to match that of previous tests.

# They give you points on the information curve that your test must approximate in terms of information.  These points are given for every half of a point change in theta.

info  = c(0, .05, .1,  .5, 1.4, 3.5, 5.8,  6, 6.25,  6, 5.8, 5.5,4.5,  3,1.75, .5 , .25)
theta = c(-4, -3.5, -3,-2.5, -2 ,-1.5, -1 ,-.5,    0, .5,   1, 1.5,  2,2.5,   3, 3.5,   4)

plot(theta, info, type="b", main="Test Information Curve to Match", ylab="Information", ylim=c(0,7), lwd=3)

 lines(theta, info-.5, col="red", lwd=3)
 lines(theta, info+.5, col="red", lwd=3)

 # You are allowed a .5 above or below movement in actual information.




# You are given a set of 100 items.  I will randomly generate items for a 3PL model, however your items would be given to you.

npool = 100

pool <- data-blogger-escaped-a="abs(rnorm(npool)*.25+1.1)," data-blogger-escaped-b="rnorm(npool)," data-blogger-escaped-c="abs(rnorm(npool)/7.5+.1))</p" data-blogger-escaped-cbind="cbind" data-blogger-escaped-item="1:npool,">summary(pool)


# You have to choose 20 items from these 100 to construct a test which has information that best matches the target information function.

nitems = 20

# Looking good.

# Each item has a item characteristic curve (ICC) of:
PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))

# and information function defined as:
PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c)

# There are many potential ways of solving this problem.  I propose to devise a maximization routine that evaluates each potential item and how it helps to shrink the difference between the target test information and that of the current.  Each item will be added individually until the test has 20 items.

# First I want to calculate the information at each theta for each item.
nthetas = length(theta)

# This matrix will contain the item pool information values
pool.info <- data-blogger-escaped-matrix="matrix" data-blogger-escaped-ncol="nthetas)</p" data-blogger-escaped-nrow="npool,">

colnames(pool.info) <- data-blogger-escaped-heta=",theta,sep=" data-blogger-escaped-p="p" data-blogger-escaped-paste="paste">
# These must be calculated for each item but for all thetas simultaneously.
for(i in 1:npool) {
  pool.info[i,] <- data-blogger-escaped-i="i" data-blogger-escaped-p="p" data-blogger-escaped-pl3.info="pl3.info" data-blogger-escaped-pool="pool" data-blogger-escaped-theta="theta">}


criteria <- data-blogger-escaped-function="function" data-blogger-escaped-item.info="item.info" data-blogger-escaped-p="p" data-blogger-escaped-scale="1)" data-blogger-escaped-test.gain="test.gain">  # test.gain is the potential information that the test still needs
  # If this information is greater than 0 then this item can add as much as test.gain to the overall test information.
  if (test.gain&gt;0) return_score=min(test.gain^scale, item.info^scale)
  # If the test has nothing to gain, then this item will add nothing.
  if (test.gain&lt;=0) return_score= 0
  return_score
}

criteria <- data-blogger-escaped-function="function" data-blogger-escaped-item.info="item.info" data-blogger-escaped-p="p" data-blogger-escaped-scale="1)" data-blogger-escaped-test.gain="test.gain">  if (test.gain-item.info&gt;0) return_score=test.gain^scale
  if (test.gain-item.info<0 data-blogger-escaped-return_score="max(test.gain,0)^scale-2*(item.info-test.gain)^scale</p">  return_score
}

# I generated several different sets of criteria for evaluating each item.

# I found the second criteria to be the best because it gives weight to gains approaching the target score but also give double that weight in gains above the target score.  This is useful because information can be added but the maximization routine does not subtract.

# I have added a z loop into the program to have the program scan over multiple scale factors for solutions.
zz = zfail = zsum = zvalue=0
zseq = seq(.25,15,.10)
# This is a fairly fine gradient to search over.

results <- data-blogger-escaped-matrix="matrix" data-blogger-escaped-ncol="nitems+3," data-blogger-escaped-nrow="length(zseq))</p">colnames(results) <- data-blogger-escaped-ails="ails" data-blogger-escaped-c="c" data-blogger-escaped-ifference_sum="ifference_sum" data-blogger-escaped-nitems="nitems" data-blogger-escaped-p="p" data-blogger-escaped-paste="paste" data-blogger-escaped-sep="" data-blogger-escaped-value="value">

for (z in zseq) {
zz=zz+1

# Let us create the empty test pool.
test.info <- data-blogger-escaped-matrix="matrix" data-blogger-escaped-ncol="nthetas)</p" data-blogger-escaped-nrow="nitems,">colnames(test.info) <- data-blogger-escaped-c="c" data-blogger-escaped-nfot=",theta, sep=" data-blogger-escaped-p="p" data-blogger-escaped-paste="paste">

# The current copies start out as copies of the pool and the pool's information but they shrink as more items are selected.
pool.current = pool
pool.info.current = pool.info

# Now we are ready to think about starting to select items for our test.
# Initially we have no information.  We would like to select the item that satisfies our criteria for best item.

scale.power = z

# Generate an matrix to hold the score for the current pool of items
pool.score <- data-blogger-escaped-p="p" data-blogger-escaped-pool.info.current="pool.info.current">
# First let's figure out how much we need to gain on the test.
test.gain = (info-test.info[1,])

# Calculate the scores for all items
for(i in 1:nrow(pool.current)) for(ii in 1:nthetas) pool.score[i,ii] <- data-blogger-escaped-nbsp="nbsp" data-blogger-escaped-p="p">            criteria(pool.info.current[i,ii], test.gain[ii], scale=scale.power)

# Now we sum across items
sum.score <- data-blogger-escaped-1="1" data-blogger-escaped-apply="apply" data-blogger-escaped-p="p" data-blogger-escaped-pool.score="pool.score" data-blogger-escaped-sum="sum">
# The first item that we pick is the item with the highest sum score
item <- data-blogger-escaped-item="item" data-blogger-escaped-pool.current="pool.current" data-blogger-escaped-sum.score="=max(sum.score)][1]</p"># Placing a [1] ensures that in the unlikely event of a tie, that R picks the first item as the winner.

# We add the information from the item that we have selected to the test.
test.info[1,] <- data-blogger-escaped-item="=item[1],]+0</p" data-blogger-escaped-pool.current="pool.current" data-blogger-escaped-pool.info="pool.info"># Note here we are using the original pool to add information.  This is because the restricted pool would not be targeted correctly with item# as the item number decreases in the available pool.

# Now the tricky part is modifying the item pool to remove the first item so that it is not selected again.
pool.info.current = pool.info.current[pool.current$item!=item[1],]
pool.current = pool.current[pool.current$item!=item[1],]

 plot(theta, info, type="b", main="Test Information Curve to Match", ylab="Information", ylim=c(0,7), lwd=3)
 lines(theta, info-.5, col="red", lwd=3)
 lines(theta, info+.5, col="red", lwd=3)



 lines(theta, test.info[1,], type="b", col=gray(.45), lwd=1.5)

# Now we repeat the above process 19 more times.

for(v in 2:(nitems)) {
  # Generate an matrix to hold the score for the current pool of items
  pool.score <- data-blogger-escaped-p="p" data-blogger-escaped-pool.info.current="pool.info.current">
  # First let's figure out how much we need to gain on the test.
            test.gain = (info-test.info[v-1,])/3
  if (v&gt;9) test.gain = (info-test.info[v-1,])/(16-v)
  if (v&gt;15) test.gain = (info-test.info[v-1,])


  # Calculate the scores for all items
  for(i in 1:nrow(pool.current)) for(ii in 1:nthetas) pool.score[i,ii] <- data-blogger-escaped-nbsp="nbsp" data-blogger-escaped-p="p">                       criteria(pool.info.current[i,ii], test.gain[ii], scale=scale.power)

  # Now we sum across items
  sum.score <- data-blogger-escaped-1="1" data-blogger-escaped-apply="apply" data-blogger-escaped-p="p" data-blogger-escaped-pool.score="pool.score" data-blogger-escaped-sum="sum">
  # The item that we pick is the item with the highest sum score
  item[v] <- data-blogger-escaped-item="item" data-blogger-escaped-pool.current="pool.current" data-blogger-escaped-sum.score="=max(sum.score)]</p">
  # We add the information from the item that we have selected to the test.
  # We are adding our information to the information from the previous items.
  test.info[v,] <- data-blogger-escaped-item="=item[v],]+test.info[v-1,]</p" data-blogger-escaped-pool.current="pool.current" data-blogger-escaped-pool.info.current="pool.info.current">
  # We once again shrink the pool
  pool.info.current = pool.info.current[pool.current$item!=item[v],]
  pool.current = pool.current[pool.current$item!=item[v],]

  lines(theta, test.info[v,], type="b",lwd=1.5)
}

zvalue[zz] = z
zsum[zz] = sum(abs(info-test.info[20,]))
zfail[zz] = sum(abs(info-test.info[20,])&gt;.5)

results[zz,] <- data-blogger-escaped-c="c" data-blogger-escaped-item="item" data-blogger-escaped-p="p" data-blogger-escaped-sort="sort" data-blogger-escaped-zfail="zfail" data-blogger-escaped-zsum="zsum" data-blogger-escaped-zvalue="zvalue" data-blogger-escaped-zz="zz">}



results
results[order(results[,3]),]

# Using this search algorithm I was able to find several solutions.  However, there is no guarantee that this algorithm would have found soluations.  A better algorithm for this type of problem would be one capable of evaluating all of the outcomes simultaneously.  A genetic algorithm might be fitting though I know that linear programming is often used to solve this problem.

# Note, also, we are randomly generating the item pools so different pools of will produce different results.  Control this by setting the seed at the beginning.

Saturday, November 10, 2012

Item Information - IRT


# Interestingly, one of the founders of Item Response Theory (Fredrick Lord) developed his own concept of information.  His approach was both unique and interesting but ended up leaving us with the same Fisher Information Matrix.

# The Fisher Information equation is an equation that measures for any particular value of an estimate, what the particular amount of information you have on that estimate is.  It is a difficult subject that I struggle with.

# It might be useful to look at how it can be found.

# First keep in mind that for Fisher Information we are primarily concerned with Maximum Likelihood.

# For maximum likelihood we are primarily concerned in maximizing the ln(pdf(theta)) by choosing theta.

# Define L = ln(pdf(theta))

# The score is equal to the first derivative of the L with respect to thera.  Define Score = S = dL/dTheta = d ln(pdf(theta))/d theta

# We know that if the MLE maximization routine has worked properly then the score is equal to zero.

# Now the Fisher Information is equal to the expected value of the score squared (the second moment) given theta:

# I = E(S^2|theta)

# Now, the tricky part is thinking how to make sense of this thing we call information.

# First let's think how it is used.

# For one thing the standard error of estimation is se=1/(I)^.5
# In a similar vien, it can be shown that the variance of any unbiased estimator (f) of theta has a lower limit of 1/I: Var(f(theta)) >= 1/I

# Thus we can see that there is a direct inverse relationship between information and the variance of our estimator.

# How do we reconcile this?

# Item Response theory has some very nice applications of information that I believe shed light on other uses of information.

# For the Rasch Model the item information for one item is.  I(theta) = p(theta)(1-p(theta)

# Where p is the probability of getting the item correct.

# Let's map this out for a range of thetas assuming the difficulty parameter is one.

theta = seq(-4,6,.1)

# Now let's define the Rasch function:

rasch = function(theta,b) exp(theta-b)/(1+exp(theta-b))

# Let's also define the information value.

rasch.info = function(theta,b) rasch(theta,b)*(1-rasch(theta,b))

plot(theta, rasch.info(theta,1), ylab = "Information", main="The most information is gained at theta=1 (1PL)", type="l")



# We can see that information peaks at theta=1.  What does this mean for practical purposes?  If you want to know the most about a student's ability give them a question in which the difficulty of the item is at their ability level.  If you though give them a question that is far too easy or far too hard then even if they do well on the question or poor on the question you have not learned that much about their ability level on the theta scale.

# Why?  Because the likelihood of someone else doing equally well on that question who is close on the theta scale is equivalent.

# We can see this by looking at the item characteristic curve:

plot(theta, rasch(theta,1), main="Item Characteristic Curve", ylab="Probability of Correct Response", type="l")



# We can see that the change in the probability of getting the item correct is largest at theta=1.  However, as theta gets very large or very small there is very little change in the probabilities (within each prospective range) of getting the item correct as a result of a small change in theta.

# Another way of thinking about this is: if we had two students in the room and both of them were about the same ability and we wanted to know which was stronger, we would want to give them the question which was most closely aligned at the difficulty related to their ability.

# Let's look at a few more different IRT models.

# The 2 parameter logistic model is very similar to the single parameter:

# I(theta) = a^2*p(theta)*(1-p(theta))

# The only difference exists in that the two parameter model allows for their to be more or less information generated from each item as a result of the "discriminatory" power of the item.  Though it is worth noting that a high a parameter model does not strictly dominate a low a parameter model in terms of information for all values of theta.  This is because the p*(1-p) is also a function of theta.  Let's see this in action:

PL2 = function(theta,a, b) exp(a*(theta-b))/(1+exp(a*(theta-b)))

PL2.info = function(theta, a, b) a^2*PL2(theta,a,b)*(1-PL2(theta,a,b))

plot(theta, PL2.info(theta,2,1), ylab = "Information", main="Larger a creates a larger peak but shallower tails", type="l")

lines(theta, PL2.info(theta,.75,1))



# We can see that we greatly prefer a=2 for theta values between -1 and 3.

# However, the item with less discriminatory power may have more information in the tails.  This is somewhat difficult to understand.  On a test, a good example may be imagine two different questions.  One question, is a arithmatic question for which students must demonstrate knowledge of long division.  The alternative, is a question in which students answer a word problem by piecing together components and understanding the concepts behind the math.  The first question may be a better question for identifying if a student either knows or does not know long division.  However, the second question, may be less good at identifying specific mastery at a particular skill level, but rather demonstrates the ability of the student to pull together various math concepts into a coherent answer.  Thus we might not be able to infer much about arithmatic ability if a student answers this question correctly.  But, any student answering this question correctly whatever, their ability level tells us something about their ability.

# We will also look briefly at the 3 parameter Logistic Model:

# I(theta) = a^2 *(p-c)^2/(1-c)^2 * (1-p)/p

# It can be shown that as c increases, the information function monotonically decreases.  This makes sense in that the guessing parameter c is the opposite of information.  As c gets larger, the likelihood of the student getting the question right by pure chance also gets larger.

PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))

PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c)

plot(theta, PL3.info(theta,1.2,1,.4), ylab = "Information", main="3PL information a=1.2, b=1, c=.4", type="l")



# We can see that the 3 parameter logistic model has an asymetric form.  This is the result of the guessing parameter.  The problem with respondents having the ability to guess correctly is that we end up having less information on low ability respondents than we would like.

Wednesday, October 31, 2012

Maximum Likelihood and Information


Maximum likelihood methods can seem complex and daunting and certainly many aspects of the maximum likelihood can be daunting.  However, the general idea behind maximum likelihood is very intuitively appealing and an understanding of the generalities is sufficient for many people who use many maximum likelihood procedures without knowing the formulas behind them.

Maximum Likelihood Methods are methods that use the theoretical probability distribution of outcomes to solve the parameter estimates that maximize the probability of observing the particular outcome observed.  Let’s see this is action.

Imagine we can observe 8 potential test outcomes for a person from a test (100, 200, 300, 400, 500, 600, 700, 800).  The test outcomes has a conditional probability of occurring based on the characteristics of the person (theta).  We can observe the total test score for the person but we cannot observe the theta.

The probability of each outcome occurring can be read from the following table.
Table 1:

Score
100200300400500600700800Total
Theta-40.600.250.100.050.000.000.000.001
-30.300.500.150.050.000.000.000.001
-20.200.300.400.050.050.000.000.001
-10.100.200.300.200.100.050.050.001
00.050.100.150.200.200.150.100.051
10.000.050.100.250.300.150.100.051
20.000.000.050.150.200.300.150.151
30.000.000.050.050.150.200.300.251
40.000.000.000.050.100.250.300.301
Total1.251.41.31.051.11.110.8
Probability0.140.160.140.120.120.120.110.09



We read this conditional probability table horizontally.  That is P(T=100|theta=-4) = 60% or P(T=500|theta=4) = 10%.  Horizontally the probabilities must sum to 1 but vertically they need not.  We can interpret the vertical summing as a density measure representing the relative likelihood of observing that score if the theta's are distributed uniformly P(theta=THETA)= 1/9 given THETA={-4,-3,...3,4}.  That I mean to say by the previous notation is that the probability that any random draw of theta equals a particular draw of theta is 1/9.

Thus, given that ability is uniformly drawn, the bottom most row in the table is the probability of observing that particular score.

So what does this have to do with maximum likelihood?  Imagine that we know the information from Table 1 and we see a particular outcome T.  Can we calculate the probability that the person has a particular theta value?  Yes!

Imagine that T=100.  From the table we should be able to see that the most likely theta value is -4.  But what is the exact probability?  It is the probability of the outcome occuring if theta is -4 over the sum of the probability of the outcome occurring (Bayes theorem P(theta=-4|T=100)=P(T=100|theta=-4)/sum(across all THETAS of P(T=100|theta=THETA).

Thus:

P(theta=-4|T=100)= .6/1.25 = 48%

In other words.  Given an observed score of 100, the probability that the person has a theta=-4 is 48%. 

We can construct a new table with conditional probabilities differing based instead conditional probabilities of observing a particular theta value given a score value.

Table 2
Score
100200300400500600700800Total
Theta-40.480.180.080.050.000.000.000.000.78
-30.240.360.120.050.000.000.000.000.76
-20.160.210.310.050.050.000.000.000.78
-10.080.140.230.190.090.050.050.000.83
00.040.070.120.190.180.140.100.060.90
10.000.040.080.240.270.140.100.060.92
20.000.000.040.140.180.270.150.190.97
30.000.000.040.050.140.180.300.311.02
40.000.000.000.050.090.230.300.381.04
Total11111111
We can see that the Table 2 is somewhat adjusted from Table 1 but generally not hugely.  This is not a rule.  If there were many more categories of theta then it is likely the adjustment would be more dramatic.

So, in this example a maximum likelihood estimator would choose eight different expected values for theta for each score observed.  Let's define M as the solution to the maximum likelihood problem.  From Table 2 all we need do is read the highest probability from each column.

M(T=100) = -4
M(T=200) = -3
M(T=300) = -2

The maximum likelihood estimator need not peek at every potential theta value.  In this case the maximum likelihood estimator would jump from -2 to 1.  This is somewhat an artifact of the discrete nature of this setup.  If theta and the score were continuous then it is less likely some values of theta would be skipped.
M(T=400) = 1
M(T=500) = 1
M(T=600) = 2

M(T=700) = 3 or 4
The maximum likelihood estimator for most maximization problems needs to have a single peak.  This table would be hard to maximize across for many maximization algorithms.  This is not really a problem because this table is somewhat contrived.
M(T=800) = 4

Thus, this table illustrates some of the common problems with maximum likelihood.  Some values of the parameter are hard to identify (ie. T=0) while some problems have "flat" spots to be maximized over that cause the algorithm not to converge.

When looking at Table 2 think not just on the peaks but also on the "Information" that you have by observing particular test scores.  In other words.  How much information do you get from knowing a particular test score?  If for instance you knew that T=100 then you would know your most likely theta=-4 and that the theta has a 96% chance (48+24+16+8) of being between -1 and -4.  This can be thought of as the 96% confidence interval.  If however you have a T=400 you know that your most likely theta=1 but only that you have a 95% chance that your theta is between (-3 and 4).  This is a pretty wide confidence interval on your estimate.  Thus we can see that some test values have more "information" than other test values.

Let's imagine testing a hypothesis Table 2:
H0: theta=-4 alpha=.05
Observe:
T=100 fail to reject
T=200 fail
T=300 fail
T=400 reject
T>400 reject

Thus we have enough information from this test to potentially reject the null when H0:theta=-4.  If however, the null was H0: theta=0 then only in the event T=100 could we reject the null at a 5% level and T=800 at a 10% level.

I hope this discussion is useful.  I certainly found it useful to think through.