readdlm and fixed column data #5391

Closed
KenziTrader opened this issue Jan 14, 2014 · 8 comments

Comments

@KenziTrader

I have a file with fixed column data. The first few lines are:

5 7 35 1.400 .400 .657 2.33 14 23 6 1
6 7 42 1.167 .429 .881 3.60 18 37 5 1
6 18 108 3.000 .287 .741 4.43 31 80 7 1

If I use

page_blocks = readdlm("page-blocks.data")

it reads every row as a single string.

If I use

page_blocks = readdlm("page-blocks.data", Float64)

it reads the data as a long array of NaNs.

If I use

page_blocks = readdlm("page-blocks.data", (Int64, Int64, Int64, Float64, Float64, Float64, Float64, Int64, Int64, Int64, Int64))

I get the error:

ERROR: file entry " 5 7 35 1.400 .400 .657 2.33 14 23 6 1" cannot be converted to (Int64,Int64,Int64,Float64,Float64,Float64,Float64,Int64,Int64,Int64,Int64)
in error at error.jl:21
in dlm_fill at datafmt.jl:135
in readdlm_string at datafmt.jl:82
in readdlm_auto at datafmt.jl:50
in readdlm at datafmt.jl:42
in readdlm at datafmt.jl:35

How can I read fixed column data with readdlm?

@acroy
Contributor

acroy commented Jan 14, 2014

You need to specify a delimiter as the second argument to readdlm; otherwise char(0xfffffffe) is used as the delimiter, for some (probably good) reason.

readdlm("../../test.dat",' ')
3x11 Array{Float64,2}:
 5.0   7.0   35.0  1.4    0.4    0.657  2.33  14.0  23.0  6.0  1.0
 6.0   7.0   42.0  1.167  0.429  0.881  3.6   18.0  37.0  5.0  1.0
 6.0  18.0  108.0  3.0    0.287  0.741  4.43  31.0  80.0  7.0  1.0

This seems to work fine.

@KenziTrader
Author

If you specify a delimiter, the padding spaces inside the fixed-width fields are interpreted as separate columns, so the rows end up with different numbers of columns.
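To make the effect concrete, here is a minimal sketch using only Base string functions. The column widths are an assumption made for illustration (the real padding in page-blocks.data may differ), and the sample rows are rebuilt by right-aligning the values quoted above.

```julia
# Assumed column widths; the real layout of page-blocks.data may differ.
widths = [4, 4, 6, 7, 6, 6, 6, 5, 5, 5, 2]

# Values from the first three lines of the file, as quoted in this issue.
raw = [
    "5 7 35 1.400 .400 .657 2.33 14 23 6 1",
    "6 7 42 1.167 .429 .881 3.60 18 37 5 1",
    "6 18 108 3.000 .287 .741 4.43 31 80 7 1",
]

# Rebuild padded, fixed-width rows by right-aligning each value in its column.
rows = [join(lpad.(split(r), widths)) for r in raw]

# Splitting on a single space yields an empty field for every padding space,
# so rows with wider values produce fewer fields than rows with narrower ones.
single = [split(r, ' ') for r in rows]

# Splitting on runs of whitespace (the default for split) ignores the padding.
runs = [split(r) for r in rows]

println(length.(single))   # field counts differ between rows
println(length.(runs))     # every row has the same number of fields
```

The first count varies from row to row because it depends on how many padding spaces each row contains, which is why a single-space delimiter cannot give a rectangular result for this kind of file.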

The original file is:
http://archive.ics.uci.edu/ml/machine-learning-databases/page-blocks/page-blocks.data.Z

I get a BoundsError:

page_blocks = readdlm("page-blocks.data", ' ')

ERROR: BoundsError()
in getindex at ascii.jl:11
in dlm_fill at datafmt.jl:116
in dlm_fill at datafmt.jl:126
in readdlm_string at datafmt.jl:82
in readdlm_auto at datafmt.jl:50
in readdlm at datafmt.jl:41
in readdlm at datafmt.jl:39

@acroy
Contributor

acroy commented Jan 14, 2014

I see. It seems readdlm cannot handle this case. As a workaround, could you convert the file to CSV format?

There is also readtable in the package DataFrames.jl. However, I just tried it with test2.dat containing

 5.0   7.0   35.0  1.4    0.4    0.657  2.33  14.0  23.0  6.0  1.0
 6.0   7.0   42.0  1.167  0.429  0.881  3.6   18.0  37.0  5.0  1.0
 6.0  18.0  108.0  3.0    0.287  0.741  4.43  31.0  80.0  7.0  1.0

which gave

julia> readtable("../../test2.dat", separator=' ', header = false)
ERROR: Saw 3 rows, 25 columns and 77 fields
 * Line 1 has 30 columns

 in error at error.jl:21
 in findcorruption at /Users/acr/.julia/DataFrames/src/io.jl:480
 in readtable! at /Users/acr/.julia/DataFrames/src/io.jl:526
 in readtable at /Users/acr/.julia/DataFrames/src/io.jl:595

Maybe someone else (cc: @johnmyleswhite) has an idea?

@johnmyleswhite
Member

We don't support fixed width fields yet. It's not that hard: I might even be able to finish a demo on the way to work today. But fixed width files have almost nothing in common with delimited files, so our existing infrastructure is only slightly usable.
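For readers who need something in the meantime, the following is a rough sketch of what a fixed-width reader could look like: it slices each line at known character positions instead of searching for delimiters. The function name `read_fixed_width` and the column widths are hypothetical, it is not the DataFrames.jl implementation, and it assumes ASCII input with every field numeric.

```julia
# Hypothetical helper (not part of Base or DataFrames.jl): slice each line at
# fixed character ranges and parse every slice as a Float64.
# Assumes ASCII input, so character positions equal byte indices.
function read_fixed_width(io::IO, ranges::Vector{UnitRange{Int}})
    rows = Vector{Float64}[]
    for line in eachline(io)
        isempty(strip(line)) && continue   # ignore blank lines
        push!(rows, [parse(Float64, strip(line[r])) for r in ranges])
    end
    # hcat gives one column per input line; permute to one row per line.
    return permutedims(reduce(hcat, rows))
end

# Assumed column widths; the real layout of page-blocks.data may differ.
widths = [4, 4, 6, 7, 6, 6, 6, 5, 5, 5, 2]
stops  = cumsum(widths)
ranges = [(stop - w + 1):stop for (stop, w) in zip(stops, widths)]

# Sample input: the values quoted in this issue, right-aligned to those widths.
raw = [
    "5 7 35 1.400 .400 .657 2.33 14 23 6 1",
    "6 7 42 1.167 .429 .881 3.60 18 37 5 1",
    "6 18 108 3.000 .287 .741 4.43 31 80 7 1",
]
sample = join([join(lpad.(split(r), widths)) for r in raw], "\n")

page_blocks = read_fixed_width(IOBuffer(sample), ranges)
println(size(page_blocks))
```

Because the slicing depends only on positions, this approach keeps working when a field is blank or when two values touch with no space between them, provided the widths are known in advance. A blank field would still need special handling before parsing.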

@JeffBezanson
Member

Maybe we could add an option to readdlm to skip empty columns; that might handle cases like this.
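As an illustration of the idea only (not of how such an option would be implemented inside readdlm), a small sketch with Base functions follows. The helper name `parse_skipping_empty` is hypothetical, and the padding in the sample text is assumed.

```julia
# Hypothetical helper sketching the "skip empty columns" idea: split on the
# delimiter, drop the empty fields created by padding, and parse what is left.
function parse_skipping_empty(text::AbstractString, delim::Char=' ')
    rows = [parse.(Float64, split(line, delim; keepempty=false))
            for line in split(text, '\n'; keepempty=false)]
    # One parsed vector per line; stack them into a rows-by-columns matrix.
    return permutedims(reduce(hcat, rows))
end

# Sample rows with assumed padding, using the values quoted in this issue.
text = """
 5   7   35  1.400  .400  .657  2.33  14  23  6  1
 6   7   42  1.167  .429  .881  3.60  18  37  5  1
 6  18  108  3.000  .287  .741  4.43  31  80  7  1
"""

m = parse_skipping_empty(text)
println(size(m))
```

This handles padded files where every field is present and separated by at least one space. It would misread true fixed-width data in which a field is left blank or two adjacent values fill their columns completely.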

@KenziTrader
Author

Well, maybe readdlm is not appropriate for reading fixed-width data, but I just want to read my data. I converted this file to a comma-separated file and could then read it. It would be nice to be able to read it without the conversion.
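For anyone wanting to do the same conversion, here is one possible sketch. It is not necessarily how the conversion above was done; the helper name `to_csv` is hypothetical and the sample padding is assumed.

```julia
# Hypothetical helper: rewrite whitespace-padded rows as comma-separated rows.
function to_csv(input::IO, output::IO)
    for line in eachline(input)
        fields = split(line)            # split on runs of whitespace
        isempty(fields) && continue     # skip blank lines
        println(output, join(fields, ','))
    end
end

# Sample rows with assumed padding, using the values quoted in this issue.
src = IOBuffer("""
 5   7   35  1.400  .400  .657  2.33  14  23  6  1
 6   7   42  1.167  .429  .881  3.60  18  37  5  1
 6  18  108  3.000  .287  .741  4.43  31  80  7  1
""")
dst = IOBuffer()
to_csv(src, dst)

csv = String(take!(dst))
print(csv)
```

With real files, the two IOBuffer objects would be replaced by streams from open, and the result could then be read with a comma delimiter.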

@johnmyleswhite
Member

I've started the work of doing this in JuliaData/DataFrames.jl#475. I started with binary files, which were more useful to me. I'll get to text files soon.

@tanmaykm
Member

PR #5400 addresses the issue of default delimiters not being applied and the BoundsError when there are empty columns. With this patch, readdlm parses both of the example files mentioned here without errors. However, it still cannot read fixed-width format. @JeffBezanson's suggestion looks like a good option for fixed-width data.

JeffBezanson added a commit that referenced this issue Jan 15, 2014
address readdlm default delimiter and boundserror. ref: #5391
tanmaykm added a commit to tanmaykm/julia that referenced this issue Jan 15, 2014
tanmaykm added a commit to tanmaykm/julia that referenced this issue Jan 16, 2014
…ingle delimiter.

fixed bug in handling empty columns.
updated tests and docs.
fixes JuliaLang#5391