Thursday, December 1, 2016

Efficiently Saving and Sharing Data in R

After spending a day the other week struggling to make sense of a federal data set shared in an archaic format (an ASCII fixed-width .dat file), I decided to look at how efficiently different formats available to R users save and share data.

For data to be distributed and shared effectively, it is essential that it take up the minimum amount of disk space and be rapidly accessible to potential users.

In this post I test four different file formats available to R users: comma separated values (write.csv()), an ASCII text representation of the R object (dput()), a serialized R object (saveRDS()), and a Stata file (write.dta() from the foreign package). For reference, rds files seem to be identical to Rdata files except that they store a single object rather than potentially multiple objects.
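As a rough sketch (assuming a data frame named mydata already exists - one way of building such a simulated data frame is sketched further below - and that the foreign package is installed; the file names are illustrative), the four formats can be written and their sizes compared as follows:

library(foreign)

# Save the same data frame in each of the four formats compared in this post.
write.csv(mydata, "mydata.csv", row.names = FALSE)   # comma separated values
dput(mydata, "mydata.txt")                           # ASCII text representation of the object
saveRDS(mydata, "mydata.rds")                        # serialized R object
write.dta(mydata, "mydata.dta")                      # Stata format

# Compare the resulting (uncompressed) file sizes in bytes.
file.size(c("mydata.csv", "mydata.txt", "mydata.rds", "mydata.dta"))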

In order to get an idea of how and where different formats outperformed each other I simulated a dataset composed of different common data formats. These formats were the following:

Numeric Formats

  • Index 1 to N - ex. 1,2,3,4,...
  • Whole Numbers - ex. 30, 81, 73, 5, ...
  • Big Numbers - ex. 36374.989943, 15280.050850, 5.908210, 79.890601, 2.857904, ...
  • Continuous Numbers - ex. 1.1681155, 1.6963295, 0.8964436, -0.5227753, ...

Text Formats

  • String coded factor variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd ...
  • String coded factor variables with 16 characters
  • String coded factor variables with 64 characters
  • Factor coded variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd - coded as 1,2,4,3,2, ...
  • Factor coded variables with 16 characters
  • Factor coded variables with 64 characters
  • String variables with random 4 characters - ex. jdhd, jdjj, ienz, lsdk, ...
  • String variables with random 16 characters
  • String variables with random 64 characters

The format of a variable is predictive of how much space it takes up and therefore how time consuming it is to read or write. Variables that are easy to describe tend to take up little space. An index variable is an extreme example: it can take up almost no space because it can be expressed in an extremely compact form (1:N).

In contrast, numbers which are very long or have a great degree of precision carry more information and therefore take more resources to store and access. String variables filled with truly random or unique responses are some of the hardest data to compress, as each value may be drawn from the full character set. There is significant potential for compression when strings are repeated within a variable. Such repetitive entries can be coded as either a "factor" variable or a string variable in R.

As part of this exploration, I look at how string data is stored and saved when coded as either a string or as a factor within R.
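Here is a minimal sketch of the kind of simulated dataset described above; the exact generating code behind the figures is not shown in this post, and rstring is a hypothetical helper for drawing random character strings:

set.seed(123)
N <- 10000

# Hypothetical helper: draw n random lowercase strings of a given length.
rstring <- function(n, len)
  sapply(1:n, function(i) paste(sample(letters, len, replace = TRUE), collapse = ""))

mydata <- data.frame(
  index      = 1:N,                                                # index 1 to N
  whole      = sample(1:100, N, replace = TRUE),                   # whole numbers
  big        = runif(N, 0, 50000),                                 # big numbers with many digits
  continuous = rnorm(N),                                           # continuous numbers
  string4    = sample(rstring(20, 4), N, replace = TRUE),          # repetitive strings kept as character
  factor4    = factor(sample(rstring(20, 4), N, replace = TRUE)),  # repetitive strings coded as a factor
  random4    = rstring(N, 4),                                      # random strings
  stringsAsFactors = FALSE
)
# The 16- and 64-character versions are built the same way with len = 16 or 64.
str(mydata)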

Raw Files

Let's first look at space taken when saving uncompressed files.

Figure 1: File Size

Figure 1 shows the file size of each of the saved variables when 10,000 observations are generated. The dataframe object is the data.frame composed of all of the variables. From the height of the dataframe bars, we can see that rds is overall the winner. Looking at the individual variables, we can see that csv appears to consistently underperform for most variable formats, with the exception of random strings.

Figure 2: File Size (Log Scaled)

In Figure 2 we can see that rds consistently outperforms all of the other formats, with the one exception of the index variable, for which the txt encoding simply reads 1:10000. Apparently even serializing to bytes can't beat that.

Interestingly, there does not appear to be an effective size difference between repetitive strings encoded as factors, regardless of the length of those strings (4, 16, or 64 characters). We can also see that the inability of csv to compress factor strings dramatically penalizes the efficiency of csv relative to the other formats.

File Compression

But data is rarely shared in uncompressed formats. How does compression change things?
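As a minimal sketch of the compression step (utils::zip() requires a zip utility to be available on the system path; the file names follow the earlier sketch):

files <- c("mydata.csv", "mydata.txt", "mydata.rds", "mydata.dta")
for (f in files) zip(paste0(f, ".zip"), f)   # zip each saved file individually

# Compare compressed file sizes in bytes.
file.size(paste0(files, ".zip"))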

Figure 3: Zipped File Sizes (Log Scaled)

We can see from Figure 3 that if we zip our data after saving, the other formats end up pretty much as small as rds. Comma-delimited csv files are a bit of an exception: factor variables suffer under csv, yet random strings perform slightly better under csv than under the other formats. Interestingly, rds files seem slightly larger than the other two file types after compression. Overall though, it is pretty hard to see any significant difference in file size based on format after zipping.

So, should we stick with whatever format we prefer?

Not so fast. Sure, all of the files are similarly sized after zipping, which is useful for sharing files. But having to keep large files on a hard drive is not ideal even if they can be compressed for distribution. There is finite space on any system, and some files can be in the hundreds of MB to hundreds of GB range. Dealing with file formats and multiple file versions this large can easily drain the permanent storage capacity of most systems.

But an equally important concern is how long it takes to write and read different file formats.

Reading Speed

In order to test reading speeds, I loaded each of the full dataframe files fifty times. I also tested how long it would take to unzip and then load each file.
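The benchmarking code itself is not shown in this post; a minimal sketch of this kind of timing comparison (using the illustrative file names from above) might look like:

library(foreign)

# Time fifty reads of each format.
time.csv <- system.time(for (i in 1:50) read.csv("mydata.csv"))["elapsed"]
time.txt <- system.time(for (i in 1:50) dget("mydata.txt"))["elapsed"]
time.rds <- system.time(for (i in 1:50) readRDS("mydata.rds"))["elapsed"]
time.dta <- system.time(for (i in 1:50) read.dta("mydata.dta"))["elapsed"]

# Time unzipping and then reading the serialized R object.
time.zip <- system.time(for (i in 1:50) {
  unzip("mydata.rds.zip")   # extract into the working directory
  readRDS("mydata.rds")
})["elapsed"]

c(csv = time.csv, txt = time.txt, rds = time.rds, dta = time.dta, rds.zip = time.zip)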

Figure 4: Reading and unzipping average speeds

From Figure 4, we can see that read speeds roughly correspond with file size. Even a relatively small file (a 30 MB csv) can take as long as 7 seconds to open. Working with large files saved in an inefficient format can be very frustrating.

In contrast, saving files in efficient formats can dramatically cut down on the time taken opening those files. Using the most efficient format (rds), files could be 100 times larger than those used in this simulation and still open in less than a minute.

Conclusions

Finding common file formats that any software can access is not easy. As a result, many public data sets are provided in archaic formats which are poorly suited for end users.

This results in a wide pool of software suites having the ability to access these datasets. However, with inefficient file formats comes a higher demand on the hardware of end users. I am unlikely to be the only person struggling to open some of these large "public access" datasets.

Those maintaining these datasets will argue that sticking with the standard, inefficient format is the best of bad options. However, there is no reason they could not post datasets in rds format in addition to the outdated formats they currently use.

And no, we need not worry that selecting one software language to save data in will bias things toward users of that language. Already many federal databases come with code supplements in Stata, SAS, or SPSS, and to use those supplements one is required to have paid access to that software.

Yet R is free and its data format is public domain. Any user could download R, open an rds or Rdata file, and then save that file in a format more suited to their purposes. None of these other proprietary database formats can boast the same.


Thursday, November 3, 2016

Subjective Reprehensibility Scale

Subjective Reprehensibility Scale

Intro

Dear User,

We are conducting a public assessment of the perceived relative morality of various actions. It is hoped that such an assessment will inform the discussion of the relative morality of the high profile actions of public figures. The hope is that such tasks will be both stimulating and manageable.

By submitting your responses, you are agreeing to allow the responses to be analyzed and published anonymously on EconometricsBySimulation.com.

Thank you for your participation, EBS


Instructions

Morality Ranking

Each of the following items lists four statements which some consider immoral or unethical. Rank the statements in terms of degree of moral offensiveness, with 1 = Least Offensive, 2 = Less Offensive, 3 = More Offensive, 4 = Most Offensive.

Non-Offensive Actions

If you believe an action is not immoral or unethical still rank it but also check the “Not offensive” box.

Nonsense Responses

If you do not understand a statement, please rank it but also check the “Nonsense” box.

Optional Open Ended Response

If you have been a victim of any of the actions please list the action number and any additional information you would like to say about how that action affected you. Please be sure to not include any personally identifiable information.


Rejection Rules

Completion

All 8 items must be completed with each statement ranked.

Ground Truth

Four of the eight items have been previously calibrated with expected response patterns. If three out of these four items demonstrate a response pattern significantly different from the expected pattern, the HIT will be rejected. The probability of someone applying effort and missing three items is 6.7%. Therefore, there is a chance of false rejection. If you believe you have been incorrectly rejected, please contact the requester and your HIT will be manually reviewed.


Sunday, November 11, 2012

Matching Test Information

# Imagine that you are asked to create a test from an item pool that you have.

# The item pool has 100 items and you have to create a test of 20 items in length.

# Easy right?

# The trick is that the test creators want the information on the test to match that of previous tests.

# They give you points on the information curve that your test must approximate in terms of information.  These points are given for every half of a point change in theta.

info  = c(0, .05, .1,  .5, 1.4, 3.5, 5.8,  6, 6.25,  6, 5.8, 5.5,4.5,  3,1.75, .5 , .25)
theta = c(-4, -3.5, -3,-2.5, -2 ,-1.5, -1 ,-.5,    0, .5,   1, 1.5,  2,2.5,   3, 3.5,   4)

plot(theta, info, type="b", main="Test Information Curve to Match", ylab="Information", ylim=c(0,7), lwd=3)

 lines(theta, info-.5, col="red", lwd=3)
 lines(theta, info+.5, col="red", lwd=3)

 # You are allowed a .5 above or below movement in actual information.




# You are given a set of 100 items.  I will randomly generate items for a 3PL model, however your items would be given to you.

npool = 100

pool <- as.data.frame(cbind(item=1:npool, a=abs(rnorm(npool)*.25+1.1), b=rnorm(npool), c=abs(rnorm(npool)/7.5+.1)))
summary(pool)


# You have to choose 20 items from these 100 to construct a test which has information that best matches the target information function.

nitems = 20

# Looking good.

# Each item has a item characteristic curve (ICC) of:
PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))

# and information function defined as:
PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c)

# There are many potential ways of solving this problem.  I propose to devise a maximization routine that evaluates each potential item and how it helps to shrink the difference between the target test information and that of the current test.  Each item will be added individually until the test has 20 items.

# First I want to calculate the information at each theta for each item.
nthetas = length(theta)

# This matrix will contain the item pool information values
pool.info <- matrix(nrow=npool, ncol=nthetas)

colnames(pool.info) <- paste("theta=", theta, sep="")
# These must be calculated for each item but for all thetas simultaneously.
for(i in 1:npool) {
  pool.info[i,] <- PL3.info(theta, pool$a[i], pool$b[i], pool$c[i])
}


criteria <- function(item.info, test.gain, scale=1) {
  # test.gain is the potential information that the test still needs
  # If this information is greater than 0 then this item can add as much as test.gain to the overall test information.
  if (test.gain > 0)  return_score = min(test.gain^scale, item.info^scale)
  # If the test has nothing to gain, then this item will add nothing.
  if (test.gain <= 0) return_score = 0
  return_score
}

criteria <- function(item.info, test.gain, scale=1) {
  if (test.gain - item.info > 0)  return_score = test.gain^scale
  if (test.gain - item.info <= 0) return_score = max(test.gain, 0)^scale - 2*(item.info - test.gain)^scale
  return_score
}

# I generated several different sets of criteria for evaluating each item.

# I found the second criteria to be the best because it gives weight to gains approaching the target score but also gives double that weight to gains above the target score.  This is useful because information can be added but the maximization routine does not subtract.

# I have added a z loop into the program to have the program scan over multiple scale factors for solutions.
zz = zfail = zsum = zvalue=0
zseq = seq(.25,15,.10)
# This is a fairly fine gradient to search over.

results <- matrix(nrow=length(zseq), ncol=nitems+3)
colnames(results) <- c("value", "fails", "difference_sum", paste("item", 1:nitems, sep=""))

for (z in zseq) {
zz=zz+1

# Let us create the empty test pool.
test.info <- matrix(0, nrow=nitems, ncol=nthetas)
colnames(test.info) <- paste("InfoT=", theta, sep="")

# The current copies start out as copies of the pool and the pool's information but they shrink as more items are selected.
pool.current = pool
pool.info.current = pool.info

# Now we are ready to think about starting to select items for our test.
# Initially we have no information.  We would like to select the item that satisfies our criteria for best item.

scale.power = z

# Generate a matrix to hold the scores for the current pool of items
pool.score <- pool.info.current*0
# First let's figure out how much we need to gain on the test.
test.gain = (info-test.info[1,])

# Calculate the scores for all items
for(i in 1:nrow(pool.current)) for(ii in 1:nthetas) pool.score[i,ii] <-
            criteria(pool.info.current[i,ii], test.gain[ii], scale=scale.power)

# Now we sum across items
sum.score <- apply(pool.score, 1, sum)
# The first item that we pick is the item with the highest sum score
item <- pool.current$item[sum.score==max(sum.score)][1]
# Placing a [1] ensures that in the unlikely event of a tie, R picks the first item as the winner.

# We add the information from the item that we have selected to the test.
test.info[1,] <- pool.info[pool$item==item[1],] + 0
# Note here we are using the original pool to add information.  This is because the restricted pool would not be targeted correctly by item number as the number of items in the available pool decreases.

# Now the tricky part is modifying the item pool to remove the first item so that it is not selected again.
pool.info.current = pool.info.current[pool.current$item!=item[1],]
pool.current = pool.current[pool.current$item!=item[1],]

 plot(theta, info, type="b", main="Test Information Curve to Match", ylab="Information", ylim=c(0,7), lwd=3)
 lines(theta, info-.5, col="red", lwd=3)
 lines(theta, info+.5, col="red", lwd=3)



 lines(theta, test.info[1,], type="b", col=gray(.45), lwd=1.5)

# Now we repeat the above process 19 more times.

for(v in 2:(nitems)) {
  # Generate a matrix to hold the scores for the current pool of items
  pool.score <- pool.info.current*0
  # First let's figure out how much we need to gain on the test.
  test.gain = (info-test.info[v-1,])/3
  if (v > 9)  test.gain = (info-test.info[v-1,])/(16-v)
  if (v > 15) test.gain = (info-test.info[v-1,])


  # Calculate the scores for all items
  for(i in 1:nrow(pool.current)) for(ii in 1:nthetas) pool.score[i,ii] <-
                       criteria(pool.info.current[i,ii], test.gain[ii], scale=scale.power)

  # Now we sum across items
  sum.score <- apply(pool.score, 1, sum)
  # The item that we pick is the item with the highest sum score
  item[v] <- pool.current$item[sum.score==max(sum.score)][1]
  # We add the information from the item that we have selected to the test.
  # We are adding our information to the information from the previous items.
  test.info[v,] <- pool.info.current[pool.current$item==item[v],] + test.info[v-1,]
  # We once again shrink the pool
  pool.info.current = pool.info.current[pool.current$item!=item[v],]
  pool.current = pool.current[pool.current$item!=item[v],]

  lines(theta, test.info[v,], type="b",lwd=1.5)
}

zvalue[zz] = z
zsum[zz] = sum(abs(info-test.info[20,]))
zfail[zz] = sum(abs(info-test.info[20,]) > .5)

results[zz,] <- c(zvalue[zz], zfail[zz], zsum[zz], sort(item))
}



results
results[order(results[,3]),]

# Using this search algorithm I was able to find several solutions.  However, there is no guarantee that this algorithm would have found solutions.  A better algorithm for this type of problem would be one capable of evaluating all of the outcomes simultaneously.  A genetic algorithm might be fitting, though I know that linear programming is often used to solve this problem.

# Note, also, that we are randomly generating the item pools, so different pools will produce different results.  Control this by setting the seed at the beginning.

Saturday, November 10, 2012

Item Information - IRT


# Interestingly, one of the founders of Item Response Theory (Frederic Lord) developed his own concept of information.  His approach was both unique and interesting but ended up leaving us with the same Fisher Information Matrix.

# Fisher information measures, for any particular value of a parameter estimate, how much information you have about that estimate.  It is a difficult subject that I struggle with.

# It might be useful to look at how it can be found.

# First keep in mind that for Fisher Information we are primarily concerned with Maximum Likelihood.

# For maximum likelihood we are primarily concerned with maximizing ln(pdf(theta)) by choosing theta.

# Define L = ln(pdf(theta))

# The score is equal to the first derivative of L with respect to theta.  Define Score = S = dL/dtheta = d ln(pdf(theta))/d theta

# We know that if the MLE maximization routine has worked properly then the score is equal to zero.

# Now the Fisher Information is equal to the expected value of the score squared (the second moment) given theta:

# I = E(S^2|theta)

# Now, the tricky part is thinking how to make sense of this thing we call information.

# First let's think how it is used.

# For one thing the standard error of estimation is se=1/(I)^.5
# In a similar vein, it can be shown that the variance of any unbiased estimator (f) of theta has a lower limit of 1/I: Var(f(theta)) >= 1/I

# Thus we can see that there is a direct inverse relationship between information and the variance of our estimator.

# How do we reconcile this?

# Item Response theory has some very nice applications of information that I believe shed light on other uses of information.

# For the Rasch model, the item information for one item is: I(theta) = p(theta)*(1-p(theta))

# Where p is the probability of getting the item correct.

# Let's map this out for a range of thetas assuming the difficulty parameter is one.

theta = seq(-4,6,.1)

# Now let's define the Rasch function:

rasch = function(theta,b) exp(theta-b)/(1+exp(theta-b))

# Let's also define the information value.

rasch.info = function(theta,b) rasch(theta,b)*(1-rasch(theta,b))

plot(theta, rasch.info(theta,1), ylab = "Information", main="The most information is gained at theta=1 (1PL)", type="l")



# We can see that information peaks at theta=1.  What does this mean for practical purposes?  If you want to know the most about a student's ability, give them a question whose difficulty is at their ability level.  If instead you give them a question that is far too easy or far too hard, then whether they do well or poorly on it, you have not learned much about their ability level on the theta scale.

# Why?  Because someone else who is close on the theta scale would be about equally likely to do just as well on that question.

# We can see this by looking at the item characteristic curve:

plot(theta, rasch(theta,1), main="Item Characteristic Curve", ylab="Probability of Correct Response", type="l")



# We can see that the change in the probability of getting the item correct is largest at theta=1.  However, as theta gets very large or very small there is very little change in the probabilities (within each prospective range) of getting the item correct as a result of a small change in theta.

# Another way of thinking about this is: if we had two students in the room and both of them were about the same ability and we wanted to know which was stronger, we would want to give them the question which was most closely aligned at the difficulty related to their ability.

# Let's look at a few more different IRT models.

# The 2 parameter logistic model is very similar to the single parameter:

# I(theta) = a^2*p(theta)*(1-p(theta))

# The only difference is that the two parameter model allows for there to be more or less information generated from each item as a result of the "discriminatory" power of the item.  Though it is worth noting that a high-a item does not strictly dominate a low-a item in terms of information for all values of theta.  This is because p*(1-p) is also a function of theta.  Let's see this in action:

PL2 = function(theta,a, b) exp(a*(theta-b))/(1+exp(a*(theta-b)))

PL2.info = function(theta, a, b) a^2*PL2(theta,a,b)*(1-PL2(theta,a,b))

plot(theta, PL2.info(theta,2,1), ylab = "Information", main="Larger a creates a larger peak but shallower tails", type="l")

lines(theta, PL2.info(theta,.75,1))



# We can see that we greatly prefer a=2 for theta values between -1 and 3.

# However, the item with less discriminatory power may have more information in the tails.  This is somewhat difficult to understand.  On a test, a good example may be to imagine two different questions.  One question is an arithmetic question for which students must demonstrate knowledge of long division.  The alternative is a word problem in which students answer by piecing together components and understanding the concepts behind the math.  The first question may be better for identifying whether or not a student knows long division.  The second question may be worse at identifying mastery of a specific skill, but it demonstrates the ability of the student to pull together various math concepts into a coherent answer.  Thus we might not be able to infer much about arithmetic ability if a student answers this question correctly.  But any student answering this question correctly, whatever their ability level, tells us something about their ability.

# We will also look briefly at the 3 parameter Logistic Model:

# I(theta) = a^2 *(p-c)^2/(1-c)^2 * (1-p)/p

# It can be shown that as c increases, the information function monotonically decreases.  This makes sense in that the guessing parameter c is the opposite of information.  As c gets larger, the likelihood of the student getting the question right by pure chance also gets larger.

PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))

PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c)

plot(theta, PL3.info(theta,1.2,1,.4), ylab = "Information", main="3PL information a=1.2, b=1, c=.4", type="l")



# We can see that the 3 parameter logistic model has an asymmetric form.  This is the result of the guessing parameter.  The problem with respondents having the ability to guess correctly is that we end up having less information on low ability respondents than we would like.

Wednesday, October 31, 2012

Maximum Likelihood and Information


Maximum likelihood methods can seem complex and daunting, and certainly many aspects of maximum likelihood are.  However, the general idea behind maximum likelihood is intuitively appealing, and an understanding of the generalities is sufficient for the many people who use maximum likelihood procedures without knowing the formulas behind them.

Maximum likelihood methods use the theoretical probability distribution of outcomes to solve for the parameter estimates that maximize the probability of observing the particular outcome that was observed.  Let's see this in action.

Imagine we can observe 8 potential test outcomes for a person (100, 200, 300, 400, 500, 600, 700, 800).  Each test outcome has a conditional probability of occurring based on a characteristic of the person (theta).  We can observe the person's total test score, but we cannot observe theta.

The probability of each outcome occurring can be read from the following table.
Table 1:

Score:         100   200   300   400   500   600   700   800   Total
Theta
  -4          0.60  0.25  0.10  0.05  0.00  0.00  0.00  0.00     1
  -3          0.30  0.50  0.15  0.05  0.00  0.00  0.00  0.00     1
  -2          0.20  0.30  0.40  0.05  0.05  0.00  0.00  0.00     1
  -1          0.10  0.20  0.30  0.20  0.10  0.05  0.05  0.00     1
   0          0.05  0.10  0.15  0.20  0.20  0.15  0.10  0.05     1
   1          0.00  0.05  0.10  0.25  0.30  0.15  0.10  0.05     1
   2          0.00  0.00  0.05  0.15  0.20  0.30  0.15  0.15     1
   3          0.00  0.00  0.05  0.05  0.15  0.20  0.30  0.25     1
   4          0.00  0.00  0.00  0.05  0.10  0.25  0.30  0.30     1
Total         1.25  1.40  1.30  1.05  1.10  1.10  1.00  0.80
Probability   0.14  0.16  0.14  0.12  0.12  0.12  0.11  0.09



We read this conditional probability table horizontally.  That is, P(T=100|theta=-4) = 60% and P(T=500|theta=4) = 10%.  Horizontally the probabilities must sum to 1, but vertically they need not.  We can interpret the vertical sums as a density measure representing the relative likelihood of observing each score if the thetas are distributed uniformly, P(theta=THETA) = 1/9 for THETA = {-4,-3,...,3,4}.  What I mean by this notation is that the probability that any random draw of theta equals any particular value is 1/9.

Thus, given that ability is uniformly drawn, the bottom most row in the table is the probability of observing that particular score.

So what does this have to do with maximum likelihood?  Imagine that we know the information from Table 1 and we see a particular outcome T.  Can we calculate the probability that the person has a particular theta value?  Yes!

Imagine that T=100.  From the table we should be able to see that the most likely theta value is -4.  But what is the exact probability?  It is the probability of the outcome occurring if theta is -4 divided by the sum, across all thetas, of the probability of the outcome occurring (Bayes' theorem): P(theta=-4|T=100) = P(T=100|theta=-4) / sum over all THETA of P(T=100|theta=THETA).

Thus:

P(theta=-4|T=100)= .6/1.25 = 48%

In other words, given an observed score of 100, the probability that the person has theta=-4 is 48%.

We can construct a new table based instead on the conditional probabilities of observing a particular theta value given a score value.

Table 2
Score:         100   200   300   400   500   600   700   800   Total
Theta
  -4          0.48  0.18  0.08  0.05  0.00  0.00  0.00  0.00    0.78
  -3          0.24  0.36  0.12  0.05  0.00  0.00  0.00  0.00    0.76
  -2          0.16  0.21  0.31  0.05  0.05  0.00  0.00  0.00    0.78
  -1          0.08  0.14  0.23  0.19  0.09  0.05  0.05  0.00    0.83
   0          0.04  0.07  0.12  0.19  0.18  0.14  0.10  0.06    0.90
   1          0.00  0.04  0.08  0.24  0.27  0.14  0.10  0.06    0.92
   2          0.00  0.00  0.04  0.14  0.18  0.27  0.15  0.19    0.97
   3          0.00  0.00  0.04  0.05  0.14  0.18  0.30  0.31    1.02
   4          0.00  0.00  0.00  0.05  0.09  0.23  0.30  0.38    1.04
Total         1     1     1     1     1     1     1     1
We can see that Table 2 is somewhat adjusted from Table 1, but generally not hugely.  This is not a rule: if there were many more categories of theta, it is likely the adjustment would be more dramatic.
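As a small illustration, Table 2 can be reproduced from Table 1 in R by dividing each column by its column sum, since the uniform 1/9 prior cancels out of Bayes' rule (the matrix below simply re-enters the Table 1 probabilities):

# P(T|theta): rows are theta = -4,...,4, columns are scores 100,...,800.
P1 <- rbind(
  c(.60,.25,.10,.05,.00,.00,.00,.00),
  c(.30,.50,.15,.05,.00,.00,.00,.00),
  c(.20,.30,.40,.05,.05,.00,.00,.00),
  c(.10,.20,.30,.20,.10,.05,.05,.00),
  c(.05,.10,.15,.20,.20,.15,.10,.05),
  c(.00,.05,.10,.25,.30,.15,.10,.05),
  c(.00,.00,.05,.15,.20,.30,.15,.15),
  c(.00,.00,.05,.05,.15,.20,.30,.25),
  c(.00,.00,.00,.05,.10,.25,.30,.30))
rownames(P1) <- -4:4
colnames(P1) <- seq(100, 800, by = 100)

# P(theta|T): divide each column by its sum (Bayes' rule with a uniform prior).
P2 <- round(sweep(P1, 2, colSums(P1), "/"), 2)
P2

# The maximum likelihood estimate of theta for each score is the row that
# maximizes the conditional probability in that column.
apply(P2, 2, function(p) as.numeric(rownames(P2))[which.max(p)])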

So, in this example a maximum likelihood estimator would choose an expected value of theta for each of the eight scores that could be observed.  Let's define M as the solution to the maximum likelihood problem.  From Table 2, all we need do is read off the highest probability in each column.

M(T=100) = -4
M(T=200) = -3
M(T=300) = -2

The maximum likelihood estimator need not pick every potential theta value.  In this case the maximum likelihood estimator jumps from -2 to 1.  This is somewhat an artifact of the discrete nature of this setup.  If theta and the score were continuous, it is less likely that values of theta would be skipped.
M(T=400) = 1
M(T=500) = 1
M(T=600) = 2

M(T=700) = 3 or 4
Most maximization algorithms require the likelihood to have a single peak, so a flat maximum like this one would be hard to maximize over.  This is not really a problem here because this table is somewhat contrived.
M(T=800) = 4

Thus, this table illustrates some of the common problems with maximum likelihood.  Some values of the parameter are hard to identify (e.g. theta=0), while some problems have "flat" spots to be maximized over that cause the algorithm not to converge.

When looking at Table 2, think not just about the peaks but also about the "information" you have from observing particular test scores.  In other words, how much information do you get from knowing a particular test score?  If, for instance, you knew that T=100, then you would know that your most likely theta is -4 and that theta has a 96% chance (48+24+16+8) of being between -1 and -4.  This can be thought of as a 96% confidence interval.  If instead you have T=400, you know that your most likely theta is 1, but only that you have a 95% chance that your theta is between -3 and 4.  This is a pretty wide confidence interval on your estimate.  Thus we can see that some test values carry more "information" than others.

Let's imagine testing a hypothesis using Table 2:
H0: theta=-4 alpha=.05
Observe:
T=100 fail to reject
T=200 fail
T=300 fail
T=400 reject
T>400 reject

Thus we have enough information from this test to potentially reject the null when H0: theta=-4.  If, however, the null were H0: theta=0, then only in the event T=100 could we reject the null at the 5% level, and T=800 at the 10% level.

I hope this discussion is useful.  I certainly found it useful to think through.