Credit: Mauro Cicio
You want to know the character encoding of a document that doesn't declare it explicitly.
Use the Ruby bindings to the libcharguess library. Once it's installed, using libcharguess is very simple.
Here's an XML document written in Italian, with no explicit encoding:
doc = %{ }
Let's find its encoding:
require 'charguess'
CharGuess::guess doc # => "windows-1252"
This is a pretty good guess: the XML is written in the ISO-8859-1 encoding, and many web browsers treat ISO-8859-1 as Windows-1252.
In XML, the character-encoding indication is optional, and may be provided as an attribute of the XML declaration in the first line of the document, for example:
<?xml version="1.0" encoding="UTF-8"?>
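Before guessing, it can be worth checking whether the document declares an encoding after all. The following is a minimal sketch, not part of libcharguess: the helper name declared_xml_encoding is hypothetical, it assumes Ruby 2.0 or later (for String#b), and it assumes the declaration is written in an ASCII-compatible encoding at the very start of the document.

```ruby
# Hypothetical helper (not part of libcharguess): returns the encoding named
# in the XML declaration, or nil if the document does not declare one.
def declared_xml_encoding(xml)
  # Look only at the first bytes, as raw binary, so that invalid byte
  # sequences elsewhere in the document cannot break the regexp match.
  head = xml.byteslice(0, 200).to_s.b
  match = head.match(/\A<\?xml[^>]*?\sencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/)
  match && match[2]
end

declared_xml_encoding(%{<?xml version="1.0" encoding="ISO-8859-1"?><doc/>}) # => "ISO-8859-1"
declared_xml_encoding(%{<?xml version="1.0"?><doc/>})                       # => nil
```

A nil result means the document falls into the case discussed next, where the encoding has to be guessed.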
If this is missing, you must guess the document encoding to process the document. You can assume the lowest common denominator for your community (usually this means assuming that everything is either UTF-8 or ISO-8859-1), or you can use a library that examines the document and uses heuristics to guess the encoding.
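The "lowest common denominator" approach can be sketched without any external library. This is only an illustration under stated assumptions: the helper name assume_encoding is hypothetical, it relies on the String encoding API of Ruby 2.0 or later, and it assumes every document is either UTF-8 or ISO-8859-1.

```ruby
# Hypothetical helper: treat the bytes as UTF-8 if they form valid UTF-8,
# and otherwise fall back to ISO-8859-1.
def assume_encoding(bytes)
  # dup so the caller's string is not retagged with a new encoding
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? "UTF-8" : "ISO-8859-1"
end

assume_encoding("caff\xC3\xA8".b) # => "UTF-8"
assume_encoding("caff\xE8".b)     # => "ISO-8859-1"
```

Because every byte sequence is valid ISO-8859-1, the fallback branch is an assumption rather than a detection: it cannot tell ISO-8859-1 apart from other single-byte encodings. That is the gap a heuristic library tries to fill.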
As of the time of writing, there are no pure Ruby libraries for guessing the encoding of a document. Fortunately, there is a small Ruby wrapper around the Charguess library. This library can guess with 95% accuracy the encoding of any text whose charset is one of the following: BIG5, HZ, JIS, SJIS, EUC-JP, EUC-KR, EUC-TW, GB2312, Bulgarian, Cyrillic, Greek, Hungarian, Thai, Latin1, and UTF8.
Note that Charguess is not XML- or HTML-specific. In fact, it can guess the encoding of an arbitrary string:
CharGuess::guess("\xA4\xCF") # => "EUC-JP"
It's fairly easy to install libcharguess, since the library is written in portable C++. Unfortunately, it doesn't take care to put its header files in a standard location. This makes it a little tricky to compile the Ruby bindings, which depend on the charguess.h header. When you run extconf.rb to prepare the bindings, you must explicitly tell the script where to find libcharguess's headers. Here's how you might compile the Ruby bindings to libcharguess:
$ ruby extconf.rb --with-charguess-include=/location/of/charguess.h
$ make
$ make install