Guessing a Document's Encoding

Credit: Mauro Cicio


Problem

You want to know the character encoding of a document that doesn't declare it explicitly.


Solution

Use the Ruby bindings to the libcharguess library. Once it's installed, using libcharguess is very simple.

Here's an XML document written in Italian, with no explicit encoding:

	# Element names are illustrative.
	doc = %{<?xml version="1.0"?>
	 <piatto>spaghetti al ragù</piatto>
	}

Let's find its encoding:

	require 'charguess'

	CharGuess::guess doc
	# => "windows-1252"

This is a pretty good guess: the XML is written in the ISO-8859-1 encoding, and many web browsers treat ISO-8859-1 as Windows-1252.
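The overlap between the two encodings can be checked with Ruby's built-in transcoding. This is a minimal sketch that assumes Ruby 1.9 or later and does not use libcharguess; `decode_byte` is a hypothetical helper introduced only for illustration. The byte 0xF9 decodes to "ù" under both encodings, while a byte in the 0x80 to 0x9F range, such as 0x80, decodes differently:

```ruby
# Hypothetical helper: decode one byte under the named encoding and
# return the result as a UTF-8 string.
def decode_byte(byte, encoding)
  byte.chr.force_encoding(encoding).encode("UTF-8")
end

decode_byte(0xF9, "ISO-8859-1")    # "ù"
decode_byte(0xF9, "Windows-1252")  # "ù"

# The two encodings disagree only for bytes 0x80 to 0x9F.
decode_byte(0x80, "Windows-1252")  # "€" (U+20AC)
decode_byte(0x80, "ISO-8859-1")    # U+0080, a control character
```

For text that contains only accented Latin letters, then, either name describes the bytes equally well.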


Discussion

In XML, the character-encoding indication is optional, and may be provided as an attribute of the XML declaration in the first line of the document:

	<?xml version="1.0" encoding="utf-8"?>


If this is missing, you must guess the document encoding to process the document. You can assume the lowest common denominator for your community (usually this means assuming that everything is either UTF-8 or ISO-8859-1), or you can use a library that examines the document and uses heuristics to guess the encoding.
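The "lowest common denominator" approach can be sketched in plain Ruby without any guessing library. The example below assumes Ruby 2.0 or later (for `String#b`); the method names `declared_xml_encoding` and `fallback_encoding` are hypothetical, and the fallback order (declared encoding, then UTF-8, then ISO-8859-1) is just one reasonable choice, not a rule taken from the XML specification:

```ruby
# Hypothetical helper: return the encoding named in an XML declaration,
# or nil if the declaration is missing or names no encoding.
def declared_xml_encoding(doc)
  head = doc.b[0, 200] # inspect only the start of the document, as raw bytes
  head[/\A\s*<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/, 1]
end

# Fallback guess: use the declared encoding if there is one; otherwise
# report UTF-8 when the bytes form valid UTF-8, and ISO-8859-1 if not.
def fallback_encoding(doc)
  declared = declared_xml_encoding(doc)
  return declared if declared
  doc.dup.force_encoding("UTF-8").valid_encoding? ? "UTF-8" : "ISO-8859-1"
end

fallback_encoding(%{<?xml version="1.0" encoding="Shift_JIS"?><a/>})
fallback_encoding("<p>rag\xF9</p>".b)
```

This works because byte sequences written in ISO-8859-1 with accented characters are rarely valid UTF-8. It cannot tell ISO-8859-1 apart from other single-byte encodings, which is where a heuristic library earns its keep.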

As of the time of writing, there are no pure Ruby libraries for guessing the encoding of a document. Fortunately, there is a small Ruby wrapper around the Charguess library. This library can guess with 95% accuracy the encoding of any text whose charset is one of the following: BIG5, HZ, JIS, SJIS, EUC-JP, EUC-KR, EUC-TW, GB2312, Bulgarian, Cyrillic, Greek, Hungarian, Thai, Latin1, and UTF8.

Note that Charguess is not XML- or HTML-specific. In fact, it can guess the encoding of an arbitrary string:

	CharGuess::guess("\xA4\xCF") # => "EUC-JP"
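Once you have a guess, a common next step is to convert the text to UTF-8. The sketch below assumes Ruby 1.9 or later and takes the charset name as a plain string, so it runs without libcharguess; `to_utf8` is a hypothetical helper, and it is an assumption that every name Charguess reports is one Ruby's `Encoding` class recognizes:

```ruby
# Hypothetical helper: reinterpret raw bytes under a guessed charset name
# (for example, a string returned by CharGuess::guess) and convert the
# result to UTF-8.
def to_utf8(bytes, guessed_charset)
  bytes.dup.force_encoding(guessed_charset).encode("UTF-8")
end

to_utf8("\xA4\xCF", "EUC-JP") # the hiragana character "は" (U+306F)
```

If the name is unknown to Ruby, `force_encoding` raises an ArgumentError, so real code would probably want to rescue that and fall back to a default.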

It's fairly easy to install libcharguess, since the library is written in portable C++. Unfortunately, it doesn't take care to put its header files in a standard location. This makes it a little tricky to compile the Ruby bindings, which depend on the charguess.h header. When you run extconf.rb to prepare the bindings, you must explicitly tell the script where to find libcharguess's headers. Here's how you might compile the Ruby bindings to libcharguess:

	$ ruby extconf.rb --with-charguess-include=/location/of/charguess.h
	$ make
	$ make install

See Also

  • To find your way through the jungle of character encodings, the Wikipedia entry on character encodings makes a good reference (http://en.wikipedia.org/wiki/Character_encoding)
  • A good source for sample texts in various charsets is http://vancouver-webpages.com/multilingual/
  • The XML specification has a section on character encoding autodetection (http://www.w3.org/TR/REC-xml/#sec-guessing)
  • The Charguess library is at http://libcharguess.sourceforge.net; its Ruby bindings are available from http://raa.ruby-lang.org/project/charguess



Ruby Cookbook (Cookbooks (O'Reilly))
ISBN: 0596523696