After spending a day the other week struggling to make sense of a federal data set shared in an archaic format (ASCII fixed format dat
file).
It is essential for the effective distribution and sharing of data that it use the minimum amount of disk space and be rapidly accessible for use by potential users.
In this post I test four different file formats available to R users. These formats are comma separated values csv
(write.csv()
), object representation format as a ASCII txt
(dput()
), a serialized R object (saveRDS()
), and a Stata file (write.dta()
from the foreign
package). For reference, rds
files seem to be identical to Rdata
files except that they deal with only one object rather than potentially multiple.
In order to get an idea of how and where different formats outperformed each other I simulated a dataset composed of different common data formats. These formats were the following:
Numeric Formats
- Index 1 to N - ex. 1,2,3,4,...
- Whole Numbers - ex. 30, 81, 73, 5, ...
- Big Numbers - ex. 36374.989943, 15280.050850, 5.908210, 79.890601, 2.857904, ...
- Continous Numbers - ex. 1.1681155, 1.6963295 0.8964436, -0.5227753, ...
Text Formats
- String coded factor variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd ...
- String coded factor variables with 16 characters coded as strings
- String coded factor variables with 64 characters coded as strings
- Factor coded variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd - coded as 1,2,4,3,2, ...
- Factor coded variables with 16 characters
- Factor coded variables with 64 characters
- String variables with random 4 characters - ex. jdhd, jdjj, ienz, lsdk, ...
- String variables with random 16 characters
- String variables with random 64 characters
What type of format a variable is in is a predictive characteristic for how much space that variable takes up and therefore how time consuming that variable is to read or write. For variables that are easy to describe they tend to take up little space. An index variable in an extreme example and can take up almost no space as it can be expressed in an extremely compact format (1:N
).
In contrast numbers which are very long or have a great degree of precision tend to have more information and therefore take more resources to access and store. String variables when filled with truly random or unique responses are some of the hardest data to compress as each value may be sampled from the full character spectrum. There is some significant potential for compression when strings are repeated in the variable. These repetitive entries can be either coded as a "factor" variable or a string variable in R.
As part of this exploration, I look at how string data is stored and saved when coded as either a string or as a factor within R.
Raw Files Let's first look at space taken when saving uncompressed files.
Figure 1: File Size
Figure 1 shows the file size of each of the saved variables when 10,000 observations are generated. The dataframe
object is the data.frame
composed off all of the variables. From the height of the dataframe
, we can see that rds
is overall the winner. Looking at the other variable values we can see only that csv
appear to consistently underperform for most file formats except for random strings.
Figure 2: File Size Log scaled
In Figure 2 we can see that rds
is consistently outperforming all of the other formats with the one exception of index
in which the txt
encoding simply reads 1:10000
. Apparently even serializing to bytes can't beat that.
Interestingly, there does not appear to be a effective size difference between repetitive strings encoded as factors accounting for the size of the strings (4, 16, or 64). We can see that the inability of csv
to compress factor strings dramatically penalizes the efficiency of csv
relative to the other formats.
File Compression But data is rarely shared in uncompressed formats. How does compression change things?
Figure 3: Zipped File Sizes Logged
We can see from Figure 3 that if we zip our data after saving, the file size can do pretty much as well as rds
. Comma delineated csv
files are a bit of an exception with factor variables suffering under csv
. Yet random strings perform slightly better under csv
than other formats. Interesting rds
files seem slightly larger than the other two file types. Overall though, it is pretty hard to see any significant difference in file size based on format after zipping.
So, should we stick with whatever format we prefer?
Not so fast. Sure, all of the files are similarly sized after zipping. This is useful for sharing files. But having to keep large file sizes on a hard drive is not ideal even if they can be compressed for distribution. There is finite space on any system and some files can be in the hundreds of MB to hundreds of GB range. Dealing with file formats and multiple file versions which are this large can easily drain the permanent storage capacity of most systems.
But an equally important concern, is how long it takes write and read different file formats.
Reading Speed In order to test reading speeds, I loaded each of the different full dataframe files fifty times. I also tested how long it would take to unzip then load that file.
Figure 4: Reading and unzipping average speeds
From Figure 4, we can see that read speeds roughly correspond with the size of files. We can see that even a relatively small file (30 MB csv
file) can take as long as 7 seconds to open. Working with large files saved in an inefficient format can be very frustrating.
In contrast, saving files in efficient formats can dramatically cut down on the time taken opening those files. Using the most efficient format (rds
), files could be 100 times larger than those used in this simulation and still open in less than a minute.
Conclusions Finding common file formats that any software can access is not easy. As a result many public data sets are provided in archaic formats which are poorly suited for end users.
This results in a wide pool of software sweets having the ability to access these datasets. However, with inefficient file formats comes a higher demand on the hardware of end users. I am unlikely to be the only person struggling with opening some of these large "public access" datasets.
Those maintaining these datasets will argue that sticking with the standard, inefficient format is the best of bad options. However, there is no reason they could not post datasets in rds
formats in addition to the outdated formats they currently exist in.
And no we need not argue that selecting one software language to save data in will be biased toward those languages. Already many federal databases come with code supplements in Stata, SAS, or SPSS. To access these supplements, one is required to have paid access to that software.
Yet, R is free and its database format is public domain. Any user could download R, open a rds
or Rdata
file, then save that file in a format more suited to their purposes. None of these other proprietary database formats can boast the same.
No comments:
Post a Comment