Reading Large CSV Files Into R

Daniel Weibel

Created 4 Mar 2015

Timing experiment with a 330 MB OpenCellID file.


Method	Call	Time
Plain read	read.csv(file)	3 min. 24 sec.
With colClasses (1)	read.csv(file, colClasses)	57 sec.
With colClasses (1) and nrows	read.csv(file, colClasses, nrows)	55 sec.
With colClasses (2) and nrows	read.csv(file, colClasses, nrows)	55 sec.
With colClasses (3) and nrows	read.csv(file, colClasses, nrows)	60 sec.

Plain read

read.csv(file)

3 min. 24 sec.
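A timing like the one above can be taken with system.time(). The following is a minimal sketch, not the original measurement script: the small temporary file and its contents are hypothetical stand-ins for the 330 MB OpenCellID file, and only the column names radio, lat, lon and cid come from these notes.

```r
# Hypothetical stand-in for the real file: a tiny CSV in a temp location.
path <- tempfile(fileext = ".csv")
demo <- data.frame(
  radio = c("GSM", "UMTS", "LTE"),
  lat   = c(47.37, 46.95, 46.20),
  lon   = c(8.54, 7.45, 6.14),
  cid   = c(1001, 1002, 1003)
)
write.csv(demo, path, row.names = FALSE)

# system.time() evaluates the expression and reports user, system and
# elapsed time in seconds; the assignment inside it is kept.
timing <- system.time(cells <- read.csv(path))
print(timing[["elapsed"]])
```

On a file this small the elapsed time is close to zero; the same call wrapped around the real file would give the kind of figure reported here.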

With colClasses (1): ‘radio’ as factor, ‘lat’, ‘lon’, ‘cid’ as numeric, all others as integer

read.csv(file, colClasses)

57 sec.
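One way to build such a colClasses vector without typing out every column is to read only the header first and then assign classes by name. This is a sketch under assumptions: the temporary file and the extra integer column mcc are made up for illustration, and the original notes do not show how their colClasses vector was constructed.

```r
# Hypothetical stand-in file; 'mcc' is an assumed extra integer column.
path <- tempfile(fileext = ".csv")
demo <- data.frame(
  radio = c("GSM", "UMTS", "LTE"),
  mcc   = c(228L, 228L, 228L),
  cid   = c(1001, 1002, 1003),
  lat   = c(47.37, 46.95, 46.20),
  lon   = c(8.54, 7.45, 6.14)
)
write.csv(demo, path, row.names = FALSE)

# Read a single data row just to learn the column names.
cols <- names(read.csv(path, nrows = 1))

# Default every column to integer, then override the exceptions by name.
classes <- setNames(rep("integer", length(cols)), cols)
classes["radio"] <- "factor"
classes[c("lat", "lon", "cid")] <- "numeric"

# With the classes given, read.csv can skip its type-guessing pass.
cells <- read.csv(path, colClasses = classes)
str(cells)
```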

With colClasses (1) and nrows: ‘radio’ as factor, ‘lat’, ‘lon’, ‘cid’ as numeric, all others as integer

read.csv(file, colClasses, nrows)

55 sec.
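The notes do not say how the value for nrows was obtained. One possible approach, sketched here with a hypothetical helper count_lines() and a made-up temporary file, is to count the lines in chunks and subtract the header. It assumes no quoted field contains an embedded newline, otherwise the line count overestimates the row count.

```r
# Hypothetical stand-in file for the real data.
path <- tempfile(fileext = ".csv")
demo <- data.frame(
  radio = c("GSM", "UMTS", "LTE"),
  lat   = c(47.37, 46.95, 46.20),
  lon   = c(8.54, 7.45, 6.14),
  cid   = c(1001, 1002, 1003)
)
write.csv(demo, path, row.names = FALSE)

# Count lines chunk by chunk so the whole file is never held in memory.
# (count_lines is an illustrative helper, not part of base R.)
count_lines <- function(path, chunk = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    got <- length(readLines(con, n = chunk))
    if (got == 0L) break
    n <- n + got
  }
  n
}

n_rows <- count_lines(path) - 1L  # subtract the header line

cells <- read.csv(
  path,
  colClasses = c("factor", "numeric", "numeric", "numeric"),
  nrows = n_rows
)
```

Counting lines costs an extra pass over the file, so whether this pays off depends on how the count compares with the few seconds that nrows saved in this experiment.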

With colClasses (2) and nrows: ‘radio’ as factor, all others as numeric

read.csv(file, colClasses, nrows)

55 sec.

With colClasses (3) and nrows: ‘lat’, ‘lon’, ‘cid’ as numeric, all others as factor

read.csv(file, colClasses, nrows)

60 sec.

Setting colClasses brings a huge performance gain (from 3 min. 24 sec. to 57 sec.)
Additionally setting nrows improves it a little further (to 55 sec.)
For speed, it doesn’t matter much whether a column has class integer or numeric
- However, with integer, a value greater than $2^{31} - 1$ (the largest integer R can represent) causes an error
- For space, there is a difference: integer uses 32 bits per value and numeric uses 64 bits (numeric is the same as double)
Reading a column of integers as class factor slows execution down (60 sec. instead of 55 sec.)
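
The integer limit and the size difference between integer and numeric can be checked directly. This is an illustrative sketch: the one-column temporary file is made up, and the error is captured with tryCatch() rather than left to stop the script.

```r
# Largest value an R integer can hold: 2^31 - 1.
print(.Machine$integer.max)

# Hypothetical one-column file: the second value is 2^31, one past the limit.
path <- tempfile(fileext = ".csv")
writeLines(c("cid", "2147483647", "2147483648"), path)

# Reading the column as numeric (double) works.
as_numeric <- read.csv(path, colClasses = "numeric")

# Reading it as integer fails on the out-of-range value; keep the
# condition object instead of aborting.
as_integer <- tryCatch(
  read.csv(path, colClasses = "integer"),
  error = function(e) e
)
print(inherits(as_integer, "error"))

# Storage: an integer vector needs roughly half the memory of a numeric one.
n <- 1e6
print(object.size(integer(n)))
print(object.size(numeric(n)))
```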