Notes

Reading Large CSV Files Into R

Daniel Weibel
Created 4 Mar 2015

Experiment with an OpenCellID file of 330 MB.

     
Plain read read.csv(file) 3 min. 24 sec.
With colClasses (1) read.csv(file, colClasses) 57 sec.
With colClasses (1) and nrows read.csv(file, colClasses, nrows) 55 sec.
With colClasses (2) and nrows read.csv(file, colClasses, nrows) 55 sec.
With colClasses (3) and nrows read.csv(file, colClasses, nrows) 60 sec.

Plain read

read.csv(file)

3 min. 24 sec

Adding colClasses argument

colclasses: ‘radio’ as factor, ‘lat, ‘lon’, ‘cid’ numeric, all others integer

read.csv(file, colClasses)

57 sec.

Adding nrows argument

colclasses: ‘radio’ as factor, ‘lat, ‘lon’, ‘cid’ numeric, all others integer

read.csv(file, colClasses, nrows)

55 sec.

colClasses variations: numeric instead of integer

colclasses: ‘radio’ as factor, all others as numeric

read.csv(file, colClasses, nrows)

55 sec.

colClasses variations: factors instead of integer

colclasses: ‘lat’, ‘lon’, ‘cid’ as numeric, all others as factors

read.csv(file, colClasses, nrows)

60 sec.

Conclusion

  • Setting colClasses brings a huge performance gain
  • Setting addtionally nrows further improves it a little bit
  • For speed, it doesn’t matter much if a column has class integer or numeric
    • However, with integer, if a value is greater than $2^{31}$, there’s an error
    • For space, there might be a difference, as integer uses 32 bits and numeric probably 64 bit (numeric is the same as double)
  • Making a column with integers class factor slows the execution time down