Parsing Not-Quite-Comma-Separated Data

Problem

You need to parse a plain-text string or file that's in a format similar to the comma-delimited format, but its delimiters are strings other than commas and newlines.

Solution

When you call a CSV::Reader method, you can specify strings to act as a row separator (the string between rows) and a field separator (the string between columns). You can do the same with simulated keyword arguments passed into FasterCSV.parse. This should let you parse most formats similar to the comma-delimited format:

	require 'csv'

	pipe_separated="1|2ENDa|bEND"

	CSV::Reader.parse(pipe_separated, '|', 'END') { |r| r.each { |c| puts c } }
	# 1
	# 2
	# a
	# b

	require 'rubygems'
	require 'faster_csv'
	FasterCSV.parse(pipe_separated, :col_sep=>'|', :row_sep=>'END') do |r|
	  r.each { |c| puts c }
	end
	# 1
	# 2
	# a
	# b
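
The two libraries shown here date from Ruby 1.8. As a rough sketch for newer interpreters: assuming Ruby 1.9 or later, where the standard csv library was replaced by an implementation based on FasterCSV, the same separators can be passed as options to CSV.parse. The helper name parse_delimited is hypothetical and exists only for illustration.

```ruby
require 'csv'

# Hypothetical helper: parse a string whose fields and rows are
# separated by arbitrary strings. Assumes Ruby 1.9+, where CSV.parse
# accepts the :col_sep and :row_sep options.
def parse_delimited(text, col_sep, row_sep)
  CSV.parse(text, col_sep: col_sep, row_sep: row_sep)
end

rows = parse_delimited("1|2ENDa|bEND", '|', 'END')
rows.each { |row| row.each { |cell| puts cell } }
# 1
# 2
# a
# b
```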

Discussion

Value-delimited formats tend to differ along three axes:

The field separator (usually a single comma)
The row separator (usually a single newline)
The quote character (usually a double quote)

Like Reader methods, Writer methods accept custom values for the field and row separators.

	data = [[1,2,3],['A','B','C'],['do','re','mi']]

	open('first3.csv', 'w') do |output|
	  CSV::Writer.generate(output, ':', '-END-') do |writer|
	    data.each { |x| writer << x }
	  end
	end
	open('first3.csv') { |input| input.read() }
	# => "1:2:3-END-A:B:C-END-do:re:mi-END-"

	FasterCSV.open('first3.csv', 'w', :col_sep=>':', :row_sep=>'-END-') do |output|
	  data.each { |x| output << x }
	end
	open('first3.csv') { |input| input.read() }
	# => "1:2:3-END-A:B:C-END-do:re:mi-END-"
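
For comparison, here is a minimal sketch of the same write-then-read round trip, assuming Ruby 1.9 or later, where the standard csv library is the FasterCSV-derived implementation. It builds the output in memory with CSV.generate instead of writing a file. The helper name generate_delimited is hypothetical.

```ruby
require 'csv'

# Hypothetical helper: serialize rows using custom field and row
# separators. Assumes Ruby 1.9+, where CSV.generate accepts the
# :col_sep and :row_sep options.
def generate_delimited(rows, col_sep, row_sep)
  CSV.generate(col_sep: col_sep, row_sep: row_sep) do |csv|
    rows.each { |row| csv << row }
  end
end

data = [[1, 2, 3], ['A', 'B', 'C'], ['do', 're', 'mi']]
text = generate_delimited(data, ':', '-END-')
# => "1:2:3-END-A:B:C-END-do:re:mi-END-"

# Parsing with the same options recovers the rows. Every cell comes
# back as a string, so the integers 1, 2, 3 return as "1", "2", "3".
CSV.parse(text, col_sep: ':', row_sep: '-END-')
# => [["1", "2", "3"], ["A", "B", "C"], ["do", "re", "mi"]]
```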

It's rare that you'll need to override the quote character, and neither csv nor fastercsv will let you do it. Both libraries' quote characters are hardcoded to the double-quote character. If you need to parse a format that has a different quote character, the simplest thing to do is subclass FasterCSV and override its init_parsers method.

Change the regular expression assigned to @parsers[:csv_row], replacing all double quotes with the quote character you want. The most common alternate quote character is the single quote: to get that, you'd have an init_parsers method like this:

	class MyFasterCSV < FasterCSV
	  def init_parsers(options)
	    super
	    @parsers[:csv_row] =
	      / \G(?:^|#{Regexp.escape(@col_sep)})      # anchor the match
	        (?: '((?>[^']*)(?>''[^']*)*)'           # find quoted fields
	            |                                   # … or …
	            ([^'#{Regexp.escape(@col_sep)}]*)   # unquoted fields
	        )/x
	  end
	end
	MyFasterCSV.parse("1,'2,3',4") { |r| puts r }
	# 1
	# 2,3
	# 4
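
The subclassing workaround applies to the library versions discussed here. As a hedged point of comparison: assuming Ruby 1.9 or later, where the standard csv library is the FasterCSV-derived implementation, the quote character can be set with the :quote_char option and no subclass is needed. The helper name parse_single_quoted is hypothetical.

```ruby
require 'csv'

# Hypothetical helper: parse comma-separated text whose quoted fields
# use single quotes. Assumes Ruby 1.9+, where CSV.parse accepts the
# :quote_char option.
def parse_single_quoted(text)
  CSV.parse(text, quote_char: "'")
end

parse_single_quoted("1,'2,3',4").each { |row| puts row }
# 1
# 2,3
# 4
```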

Some value-delimited files are simply corrupt: they were generated by programs that didn't think to escape quote marks or to quote cells with embedded delimiters. Neither csv nor fastercsv can parse these files, because they're ambiguous or invalid.

	missing_quotes=%{20051002, Alice says, "I saw that!"}
	CSV::Reader.parse(missing_quotes) { |r| r.each { |c| puts c } }
	# CSV::IllegalFormatError: CSV::IllegalFormatError

	unescaped_quotes=%{20051002, "Alice says, "I saw that!""}
	FasterCSV.parse(unescaped_quotes) { |r| r.each { |c| puts c } }
	# FasterCSV::MalformedCSVError: Unclosed quoted field.

Your best strategy for dealing with this kind of file is to use regular expressions to massage the data into a form that fastercsv can parse, or to parse it with String#split and deal with any quoting problems afterwards. In either case, your code will have to work with the particular quirks of the data you're trying to parse.
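
As one illustration of the String#split approach, the sketch below handles a line like the missing_quotes example. It assumes every line has exactly two columns, a date followed by a free-text remark. That is an assumption about the data, not something a parser could infer. The helper names split_date_and_remark and repair_line are hypothetical.

```ruby
# Hypothetical helper for a two-column "date, remark" line in which the
# remark was never quoted. Splitting with a limit of 2 stops after the
# first delimiter, so commas inside the remark are left alone.
def split_date_and_remark(line)
  date, remark = line.chomp.split(/,\s*/, 2)
  [date, remark]
end

# Re-emit the fields as a well-formed CSV line: wrap the remark in
# double quotes and double any quote marks embedded in it.
def repair_line(line)
  date, remark = split_date_and_remark(line)
  %{#{date},"#{remark.gsub('"', '""')}"}
end

missing_quotes = %{20051002, Alice says, "I saw that!"}
split_date_and_remark(missing_quotes)
# => ["20051002", "Alice says, \"I saw that!\""]
repair_line(missing_quotes)
# => "20051002,\"Alice says, \"\"I saw that!\"\"\""
```

The repaired line quotes its second field and escapes the embedded quotes by doubling them, which is the form a CSV parser expects. A file with a different layout would need a different split rule.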
