Parsing Comma-Separated Data

Problem

You have a plain-text string in a comma-delimited format. You need to parse this string, either to build a data structure or to perform some operation on the data and write it back out.

Solution

The built-in csv library can parse most common character-delimited formats. The FasterCSV library, available as the fastercsv gem, improves on csvs performance and interface. Ill show you both, but I recommend fastercsv unless you can use any software at all outside the standard library.

CSV::Reader.parse and FasterCSV.parse work the same way: they accept a string or an open file as an argument, and yield each parsed row of the comma-delimited file as an array. The csv yields a Row object that acts like an array full of Column objects. FasterCSV just yields an array of strings.

	require csv
	primary_colors = "red,green,blue
red,yellow,blue"

	CSV::Reader.parse(primary_colors) { |row| row.each { |cell| puts cell }}
	# red
	# green
	# blue
	# red
	# yellow
	# blue

	require 
ubygems
	require faster_csv
	shakespeare = %{Sweet are the uses of adversity,As You Like It
	"We few, we happy few",Henry V
	"Seems, madam! nay it is; I know not ""seems.""",Hamlet}

	FasterCSV.parse(shakespeare) { |row| puts "#{row[0]} -- #{row[1]}"}
	# Sweet are the uses of adversity -- As You Like It
	# We few, we happy few -- Henry V
	# Seems, madam! nay it is; I know not "seems." -- Hamlet

Discussion

Comma-delimited formats are among the most basic portable file formats. Unfortunately, they e also among the least standardized. There are many different formats, and some are internally inconsistent.

FasterCSV and the csv library can parse every comma-delimited format, but they will parse common formats like the one used by Microsoft Excel, and they e your best tool for making sense of the myriad.

FasterCSV and csv both model a comma-delimited file as a nested array of strings. The csv librarys CSV class uses Row objects and Column objects instead of arrays and strings, but its the same idea. The terminology is from the spreadsheet worldunderstand-ably, since a CSV file is a common way of portably storing spreadsheet data.

The complications begin when the spreadsheet cells themselves contain commas or newlines. The standard way to handle this when exporting to comma-delimited format is to surround those cells with double quotes. Then the question becomes what to do with cells that contain double-quote characters. Both Ruby CSV libraries assume that double-quote characters are escaped by doubling, turning each " into "", as in the Hamlet quotation:

	%{"Seems, madam! nay it is; I know not ""seems.""",Hamlet}

If you e certain that there are no commas or newlines embedded in your data, and thus no need for quote handling, you can use String#split to parse delimited records more quickly than csv. To output to this format, you can use Array#join:

	def parse_delimited_naive(input, fieldsep=,, rowsep="
")
	 input.split(rowsep).inject([]) do |arr, line|
	 arr << line.split(fieldsep)
	 end
	end

	def join_delimited_naive(structure, fieldsep=,, rowsep="
")
	 rows = structure.inject([]) do |arr, parsed_line|
	 arr << parsed_line.join(fieldsep)
	 end
	 rows.join(rowsep)
	end

	parse_delimited_naive("1,2,3,4
5,6,7,8")
	# => [["1", "2", "3", "4"], ["5", "6", "7", "8"]]

	join_delimited_naive(parse_delimited_naive("1,2,3,4
5,6,7,8"))
	# => "1,2,3,4
5,6,7,8"

	parse_delimited_naive(1;2;3;4|5;6;7;8, ;, |)
	# => [["1", "2", "3", "4"], ["5", "6", "7", "8"]]

	parse_delimited_naive(1,"2,3",4)
	# => [["1", ""2", "3"", "4"]]

This is not recommended unless you wrote all the relevant code yourself, or can manually inspect the code as well as the dataset. Just because you haven seen any quoted cells yet doesn mean there won be any in the future. When in doubt, use csv or fastercsv. Handwritten CSV generators and parsers are a leading cause of bad data.

To create a comma-delimited file, open an output file with CSV.open or FasterCSV.open, and append a series of arrays to the resulting file-like object. Every array you append will be converted to a comma-delimited row in the destination file.

	data = [[1,2,3],[A,B,C],[do,
e,mi]]

	writer = FasterCSV.open(first3.csv, w)
	data.each { |x| writer << x }
	writer.close
	puts open(first3.csv).read()
	# 1,2,3
	# A,B,C
	# do,re,mi

	data = []
	FasterCSV.foreach(first3.csv) { |row| data << row }
	data
	# => [["1", "2", "3"], ["A", "B", "C"], ["do", "re", "mi"]]

See Also

  • The FasterCSV documentation (http://fastercsv.rubyforge.org/)
  • Chapter 11


Strings

Numbers

Date and Time

Arrays

Hashes

Files and Directories

Code Blocks and Iteration

Objects and Classes8

Modules and Namespaces

Reflection and Metaprogramming

XML and HTML

Graphics and Other File Formats

Databases and Persistence

Internet Services

Web Development Ruby on Rails

Web Services and Distributed Programming

Testing, Debugging, Optimizing, and Documenting

Packaging and Distributing Software

Automating Tasks with Rake

Multitasking and Multithreading

User Interface

Extending Ruby with Other Languages

System Administration



Ruby Cookbook
Ruby Cookbook (Cookbooks (OReilly))
ISBN: 0596523696
EAN: 2147483647
Year: N/A
Pages: 399

Flylib.com © 2008-2020.
If you may any questions please contact us: flylib@qtcs.net