Parsing Not-Quite-Comma-Separated Data

Problem

You need to parse a plain-text string or file that's in a format similar to the comma-delimited format, but whose delimiters are strings other than commas and newlines.

Solution

When you call a CSV::Reader method, you can specify strings to act as a row separator (the string between each row) and a field separator (the string between each column). You can do the same with simulated keyword arguments passed into FasterCSV.parse. This should let you parse most formats similar to the comma-delimited format:

	require 'csv'

	pipe_separated="1|2ENDa|bEND"

	CSV::Reader.parse(pipe_separated, '|', 'END') { |r| r.each { |c| puts c } }
	# 1
	# 2
	# a
	# b

	require 'rubygems'
	require 'faster_csv'
	FasterCSV.parse(pipe_separated, :col_sep=>'|', :row_sep=>'END') do |r|
	 r.each { |c| puts c }
	end
	# 1
	# 2
	# a
	# b
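On Ruby 1.9 and later, the standard csv library is based on FasterCSV, so the same separators can be passed as options to CSV.parse. The following is a minimal sketch under that assumption; the helper name parse_delimited is hypothetical and not part of either library.

```ruby
require 'csv'

# Hypothetical helper: parse text whose fields and rows are separated
# by arbitrary strings. Assumes the csv library from Ruby 1.9 or later.
def parse_delimited(text, col_sep, row_sep)
  CSV.parse(text, col_sep: col_sep, row_sep: row_sep)
end

rows = parse_delimited("1|2ENDa|bEND", '|', 'END')
rows.each { |row| row.each { |cell| puts cell } }
# 1
# 2
# a
# b
```

Returning the rows as an array of arrays, instead of printing inside a block, makes the result easier to reuse elsewhere.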

Discussion

Value-delimited formats tend to differ along three axes:

  • The field separator (usually a single comma)
  • The row separator (usually a single newline)
  • The quote character (usually a double quote)

Like Reader methods, Writer methods accept custom values for the field and row separators.

	data = [[1,2,3],['A','B','C'],['do','re','mi']]

	open('first3.csv', 'w') do |output|
	 CSV::Writer.generate(output, ':', '-END-') do |writer|
	  data.each { |x| writer << x }
	 end
	end
	open('first3.csv') { |input| input.read() }
	# => "1:2:3-END-A:B:C-END-do:re:mi-END-"

	FasterCSV.open('first3.csv', 'w', :col_sep=>':', :row_sep=>'-END-') do |output|
	 data.each { |x| output << x }
	end
	open('first3.csv') { |input| input.read() }
	# => "1:2:3-END-A:B:C-END-do:re:mi-END-"
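If you only need the delimited text as a string, not a file, the csv library in Ruby 1.9 and later offers CSV.generate with the same separator options. This is a sketch under that assumption; generate_delimited is a hypothetical helper name.

```ruby
require 'csv'

# Hypothetical helper: build a delimited string in memory.
# Assumes the csv library from Ruby 1.9 or later.
def generate_delimited(rows, col_sep, row_sep)
  CSV.generate(col_sep: col_sep, row_sep: row_sep) do |csv|
    rows.each { |row| csv << row }
  end
end

data = [[1, 2, 3], ['A', 'B', 'C'], ['do', 're', 'mi']]
puts generate_delimited(data, ':', '-END-')
# 1:2:3-END-A:B:C-END-do:re:mi-END-
```

Parsing the generated string again with the same two separators should give back the original rows, with every cell converted to a string.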

It's rare that you'll need to override the quote character, and neither csv nor fastercsv will let you do it. Both libraries' quote characters are hardcoded to the double-quote character. If you need to parse a format that has a different quote character, the simplest thing to do is subclass FasterCSV and override its init_parsers method.

Change the regular expression assigned to @parsers[:csv_row], replacing all double quotes with the quote character you want. The most common alternate quote character is the single quote: to get that, you'd have an init_parsers method like this:

	class MyFasterCSV < FasterCSV
	 def init_parsers(options)
	  super
	  @parsers[:csv_row] =
	   / \G(?:^|#{Regexp.escape(@col_sep)})   # anchor the match
	     (?: '((?>[^']*)(?>''[^']*)*)'        # find quoted fields
	       |                                  # … or …
	       ([^'#{Regexp.escape(@col_sep)}]*)  # unquoted fields
	     )/x
	 end
	end
	MyFasterCSV.parse("1,'2,3',4") { |r| puts r }
	# 1
	# 2,3
	# 4
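Later releases of the library may make the subclass unnecessary. As a sketch, assuming Ruby 1.9 or later, where the standard csv library is based on FasterCSV and accepts a :quote_char option, single-quoted data can be parsed directly. The helper name parse_single_quoted is hypothetical.

```ruby
require 'csv'

# Hypothetical helper: parse comma-separated text whose quote
# character is the single quote. Assumes the csv library from
# Ruby 1.9 or later, which supports the :quote_char option.
def parse_single_quoted(text)
  CSV.parse(text, quote_char: "'")
end

parse_single_quoted("1,'2,3',4").each { |row| puts row }
# 1
# 2,3
# 4
```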

Some value-delimited files are simply corrupt: they were generated by programs that didn't think to escape quote marks or to quote cells with embedded delimiters. Neither csv nor fastercsv can parse these files, because they're ambiguous or invalid.

	missing_quotes=%{20051002, Alice says, "I saw that!"}
	CSV::Reader.parse(missing_quotes) { |r| r.each { |c| puts c } }
	# CSV::IllegalFormatError: CSV::IllegalFormatError

	unescaped_quotes=%{20051002, "Alice says, "I saw that!""}
	FasterCSV.parse(unescaped_quotes) { |r| r.each { |c| puts c } }
	# FasterCSV::MalformedCSVError: Unclosed quoted field.

Your best strategy for dealing with this kind of file is to use regular expressions to massage the data into a form that fastercsv can parse, or to parse it with String#split and deal with any quoting problems afterwards. In either case, your code will have to work with the particular quirks of the data you're trying to parse.
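As an illustration of the String#split approach, here is a minimal sketch for the two corrupt lines shown earlier. It assumes every line has exactly two fields, a date and free-form text, separated by the first comma; that assumption and the helper name split_two_fields are specific to this example, and real data will need its own rules.

```ruby
# Hypothetical helper: assumes each line holds exactly two fields,
# split at the first comma. Commas after the first are treated as
# part of the text field.
def split_two_fields(line)
  date, text = line.chomp.split(/,\s*/, 2)
  text = text.to_s
  # Deal with quoting afterwards: strip one pair of enclosing
  # double quotes if the text field has them.
  if text.length >= 2 && text.start_with?('"') && text.end_with?('"')
    text = text[1..-2]
  end
  [date, text]
end

missing_quotes   = %{20051002, Alice says, "I saw that!"}
unescaped_quotes = %{20051002, "Alice says, "I saw that!""}

p split_two_fields(missing_quotes)
# ["20051002", "Alice says, \"I saw that!\""]
p split_two_fields(unescaped_quotes)
# ["20051002", "Alice says, \"I saw that!\""]
```

Passing a limit of 2 to split is what keeps the embedded comma inside the second field.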

See Also

  • Recipe 12.7, "Parsing Comma-Separated Data"



Ruby Cookbook (Cookbooks (O'Reilly))
ISBN: 0596523696
Pages: 399
