This is a tutorial for using REXML, a pure Ruby XML processor.
REXML was inspired by the Electric XML library for Java, which features an easy-to-use API, small size, and speed. Hopefully, REXML, designed with the same philosophy, has these same features. I've tried to keep the API as intuitive as possible, and have followed the Ruby methodology for method naming and code flow, rather than mirroring the Java API.
REXML supports both tree and stream document parsing. Stream parsing is faster (about 1.5 times as fast). However, with stream parsing, you don't get access to features such as XPath.
The API documentation also contains code snippets to help you learn how to use various methods. This tutorial serves as a starting point and quick guide to using REXML.
We'll start with parsing an XML document:
    require "rexml/document"
    file = File.new( "mydoc.xml" )
    doc = REXML::Document.new file
The last statement creates a new document and parses the supplied file. You can also parse a string:
    require "rexml/document"
    include REXML  # so that we don't have to prefix everything with REXML::...
    string = <<~EOF
      <mydoc>
        <someelement attribute="nanoo">Text, text, text</someelement>
      </mydoc>
    EOF
    doc = Document.new string
So parsing a string is just as easy as parsing a file. For future examples, I'm going to omit both the require and include lines.
Once you have a document, you can access elements in that document in a number of ways:
Here are a few examples using these methods. First is the source document used in the examples. Save this as mydoc.xml before running any of the examples that require it:
    <inventory title="OmniCorp Store #45x10^3">
      <section name="health">
        <item upc="123456789" stock="12">
          <name>Invisibility Cream</name>
          <price>14.50</price>
          <description>Makes you invisible</description>
        </item>
        <item upc="445322344" stock="18">
          <name>Levitation Salve</name>
          <price>23.99</price>
          <description>Levitate yourself for up to 3 hours per application</description>
        </item>
      </section>
      <section name="food">
        <item upc="485672034" stock="653">
          <name>Blork and Freen Instameal</name>
          <price>4.95</price>
          <description>A tasty meal in a tablet; just add water</description>
        </item>
        <item upc="132957764" stock="44">
          <name>Grob winglets</name>
          <price>3.56</price>
          <description>Tender winglets of Grob. Just add water</description>
        </item>
      </section>
    </inventory>
    doc = Document.new File.new("mydoc.xml")
    doc.elements.each("inventory/section") { |element| puts element.attributes["name"] }
    # -> health
    # -> food
    doc.elements.each("*/section/item") { |element| puts element.attributes["upc"] }
    # -> 123456789
    # -> 445322344
    # -> 485672034
    # -> 132957764
    root = doc.root
    puts root.attributes["title"]
    # -> OmniCorp Store #45x10^3
    puts root.elements["section/item[@stock='44']"].attributes["upc"]
    # -> 132957764
    puts root.elements["section"].attributes["name"]
    # -> health (returns the first encountered matching element)
    puts root.elements[1].attributes["name"]
    # -> health (returns the FIRST child element)
    root.detect {|node| node.kind_of? Element and node.attributes["name"] == "food" }
Notice the second-to-last line of code. Element children in REXML are indexed starting at 1, not 0. This is because XPath itself counts elements from 1, and REXML maintains this relationship; i.e., root.elements['*[1]'] == root.elements[1]. The last line finds the first child element whose "name" attribute is "food". As you can see in this example, accessing attributes is also straightforward.
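The indexing rule is easy to check for yourself. The following is a minimal sketch using a small inline document (a hypothetical stand-in for mydoc.xml, so it runs without any file); it also contrasts the 1-based elements accessor with the 0-based Array-style access on the element itself.

```ruby
require "rexml/document"

# Hypothetical two-section document, kept inline so the sketch is self-contained.
doc = REXML::Document.new(
  "<inventory><section name='health'/><section name='food'/></inventory>"
)
root = doc.root

first_by_index = root.elements[1]        # 1-based, like XPath
first_by_xpath = root.elements["*[1]"]   # the same node, selected by XPath
second_section = root.elements[2]
first_child    = root[0]                 # Array-style access on the element is 0-based

puts first_by_index.attributes["name"]
puts first_by_index.equal?(first_by_xpath)
puts second_section.attributes["name"]
```

Because this document has no whitespace between tags, root[0] and root.elements[1] refer to the same node; with pretty-printed input, root[0] would typically be a whitespace Text node instead.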
You can also access xpaths directly via the XPath class.
    # The invisibility cream is the first <item>
    invisibility = XPath.first( doc, "//item" )
    # Prints out all of the prices
    XPath.each( doc, "//price") { |element| puts element.text }
    # Gets an array of all of the "name" elements in the document.
    names = XPath.match( doc, "//name" )
Another way of getting an array of matching nodes is through Element.elements.to_a(). Although this is a method on elements, if passed an XPath it can return an array of arbitrary objects. This is due to the fact that XPath itself can return arbitrary nodes (Attribute nodes, Text nodes, and Element nodes).
    all_elements = doc.elements.to_a
    all_children = doc.to_a
    all_upc_strings = doc.elements.to_a( "//item/attribute::upc" )
    all_name_elements = doc.elements.to_a( "//name" )
REXML attempts to make the common case simple, but this means that the uncommon case can be complicated. This is especially true with Text nodes.
Text nodes have a lot of behavior, and in the case of internal entities, what you get may be different from what you expect. When REXML reads an XML document, it parses the DTD and creates an internal table of entities. If it finds any of these entities in the document, it replaces them with their values:
    doc = Document.new '<!DOCTYPE foo [ <!ENTITY ent "replace"> ]><a>&ent;</a>'
    doc.root.text   #-> "replace"
When you write the document back out, REXML replaces the values with the entity reference:
    doc.to_s
    # Generates:
    # <!DOCTYPE foo [
    # <!ENTITY ent "replace">
    # ]><a>&ent;</a>
But there's a problem. What happens if only some of the words are also entity reference values?
    doc = Document.new '<!DOCTYPE foo [ <!ENTITY ent "replace"> ]><a>replace &ent;</a>'
    doc.root.text   #-> "replace replace"
Well, REXML does the only thing it can:
    doc.to_s
    # Generates:
    # <!DOCTYPE foo [
    # <!ENTITY ent "replace">
    # ]><a>&ent; &ent;</a>
This is probably not what you expect. However, when designing REXML, I had a choice between this behavior, and using immutable text nodes. The problem is that, if you can change the text in a node, REXML can never tell which tokens you want to have replaced with entities. There is a wrinkle: REXML will write what it gets in as long as you don't access the text. This is because REXML does lazy evaluation of entities. Therefore,
    doc = Document.new( '<!DOCTYPE foo [ <!ENTITY ent "replace"> ]><a>replace &ent;</a>' )
    doc.to_s
    # Generates:
    # <!DOCTYPE foo [
    # <!ENTITY ent "replace">
    # ]><a>replace &ent;</a>
    doc.root.text   #-> Now accessed, entities have been resolved
    doc.to_s
    # Generates:
    # <!DOCTYPE foo [
    # <!ENTITY ent "replace">
    # ]><a>&ent; &ent;</a>
There is a programmatic solution: :raw. If you set the :raw flag on any Text or Element node, the entities within that node will not be processed. This means that you'll have to deal with entities yourself:
    doc = Document.new('<!DOCTYPE foo [ <!ENTITY ent "replace"> ]><a>replace &ent;</a>',{:raw=>:all})
    doc.root.text   #-> "replace &ent;"
    doc.to_s
    # Generates:
    # <!DOCTYPE foo [
    # <!ENTITY ent "replace">
    # ]><a>replace &ent;</a>
Again, there are a couple of mechanisms for creating XML documents in REXML. Adding elements by hand is faster than the convenience method, but which you use will probably be a matter of aesthetics.
    el1 = someelement.add_element "myel"
    # creates an element named "myel", adds it to "someelement", and returns it
    el2 = el1.add_element "another", {"id"=>"10"}
    # does the same, but also sets attribute "id" of el2 to "10"
    el3 = Element.new "blah"
    el1.elements << el3
    el3.attributes["myid"] = "sean"
    # creates el3 "blah", adds it to el1, then sets attribute "myid" to "sean"
If you want to add text to an element, you can do it either by creating Text objects and adding them to the element, or by using the convenience method text=:
    el1 = Element.new "myelement"
    el1.text = "Hello world!"
    # -> <myelement>Hello world!</myelement>
    el1.add_text "Hello dolly"
    # -> <myelement>Hello world!Hello dolly</myelement>
    el1.add Text.new("Goodbye")
    # -> <myelement>Hello world!Hello dollyGoodbye</myelement>
    el1 << Text.new(" cruel world")
    # -> <myelement>Hello world!Hello dollyGoodbye cruel world</myelement>
But note that each of these text objects is still stored as a separate object; el1.text will return "Hello world!"; el1[2] will return a Text object with the contents "Goodbye".
Please be aware that all text nodes in REXML are UTF-8 encoded, and all of your code must reflect this. You may input and output other encodings (UTF-8, UTF-16, ISO-8859-1, and UNILE are all supported, input and output), but within your program, you must pass REXML UTF-8 strings.
I can't emphasize this enough, because people do have problems with this. REXML can't possibly always guess correctly how your text is encoded, so it always assumes the text is UTF-8. It also does not warn you when you try to add text which isn't properly encoded, for the same reason. You must make sure that you are adding UTF-8 text. If you're adding standard 7-bit ASCII, which is most common, you don't have to worry. If you're using ISO-8859-1 text (characters above 0x80), you must convert it to UTF-8 before adding it to an element. You can do this with the snippet: text.unpack("C*").pack("U*"). If you ignore this warning and add 8-bit ASCII characters to your documents, your code may work... or it may not. In either case, REXML is not at fault. You have been warned.
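To make the conversion concrete, here is a small sketch that turns ISO-8859-1 bytes into UTF-8 before handing them to REXML. It assumes Ruby 1.9 or later, where String#force_encoding and String#encode are available; the variable names are only illustrative. The unpack/pack idiom from the text is shown alongside for comparison.

```ruby
require "rexml/document"

# "f\xfcr" as raw bytes: the ISO-8859-1 encoding of "für"
latin1_bytes = "f\xfcr".b

# Transcoding: label the bytes as ISO-8859-1, then convert to UTF-8
utf8 = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# The idiom from the text: each byte becomes the Unicode code point of the same value
legacy = latin1_bytes.unpack("C*").pack("U*")

# Only the converted, UTF-8 string is passed to REXML
element = REXML::Element.new("a")
element.text = utf8
puts element.to_s
```

Both approaches give the same result for ISO-8859-1 input, because that encoding maps each byte to the Unicode code point with the same number. For other source encodings, only the encode approach applies.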
One last thing: alternate encoding output support only works from Document.write() and Document.to_s(). If you want to write out other nodes with a particular encoding, you must wrap your output object with Output:
    e = Element.new "a"
    e.text = "f\u00fcr"   # 'ü', passed to REXML as UTF-8
    o = ''
    e.write( Output.new( o, "ISO-8859-1" ) )
You can pass Output any of the supported encodings.
If you want to insert an element between two elements, you can use either the standard Ruby array notation, or Parent.insert_before and Parent.insert_after.
    doc = Document.new "<a><one/><three/></a>"
    doc.root[1,0] = Element.new "two"
    # -> <a><one/><two/><three/></a>
    three = doc.elements["a/three"]
    doc.root.insert_after three, Element.new("four")
    # -> <a><one/><two/><three/><four/></a>
    # A convenience method allows you to insert before/after an XPath:
    doc.root.insert_after( "//one", Element.new("one-five") )
    # -> <a><one/><one-five/><two/><three/><four/></a>
    # Another convenience method allows you to insert after/before an element:
    four = doc.elements["//four"]
    four.previous_sibling = Element.new("three-five")
    # -> <a><one/><one-five/><two/><three/><three-five/><four/></a>
The raw flag in the Text constructor can be used to tell REXML to leave strings which have entities defined for them alone.
    doc = Document.new( %q{<?xml version='1.0'?> <!DOCTYPE foo SYSTEM 'foo.dtd' [ <!ENTITY % s "Sean"> ]> <a/>} )
    t = Text.new( "Sean", false, nil, false )
    doc.root.text = t
    t.to_s     # -> &s;
    t = Text.new( "Sean", false, nil, true )
    doc.root.text = t
    t.to_s     # -> Sean
Note that, in all cases, the value() method returns the text with entities expanded, so the raw flag only affects the to_s() method. If the raw flag is set for a text node, then to_s() will not normalize (turn into entities) entity values. You cannot create raw text nodes that contain illegal XML, so the following will generate a parse error:
t = Text.new( "&", false, nil, true )
You can also tell REXML to set the Text children of given elements to raw automatically, on parsing or creating:
    doc = REXML::Document.new( source, { :raw => %w{ tag1 tag2 tag3 } } )
In this example, all tags named "tag1", "tag2", or "tag3" will have any Text children set to raw text. If you want to have all of the text processed as raw text, pass in the :all tag:
doc = REXML::Document.new( source, { :raw => :all })
There aren't many things that are simpler than writing a REXML tree. Simply pass an object that supports <<( String ) to the write method of any object. In Ruby, both IO instances (File) and String instances support <<.
    doc.write $stdout
    output = ""
    doc.write output
If you want REXML to pretty-print output, pass write() an indent value greater than -1:
doc.write( $stdout, 0 )
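As a rough illustration of the difference the indent argument makes, the sketch below writes the same small document twice into String buffers, once with the default settings and once with an indent of 2. The document and variable names are made up for the example, and the exact layout of the pretty-printed output may vary between REXML versions.

```ruby
require "rexml/document"

doc = REXML::Document.new("<a><b>text</b></a>")

# Default write: no indentation, output mirrors the parsed structure
compact = String.new
doc.write(compact)

# Indent of 2: REXML pretty-prints, putting nested nodes on their own lines
pretty = String.new
doc.write(pretty, 2)

puts compact
puts pretty
```

Keep in mind that pretty-printing adds whitespace to the output, so it is best avoided for documents where whitespace inside elements is significant.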
REXML will not, by default, write out the XML declaration unless you specifically ask for it. If a document is read that contains an XML declaration, that declaration will be written faithfully. The other way to tell REXML to write the declaration is to add it explicitly:
    doc = Document.new
    doc.add_element 'foo'
    doc.to_s   #-> <foo/>
    doc << XMLDecl.new
    doc.to_s   #-> <?xml version='1.0'?><foo/>
There are four main methods of iterating over children: Element.each, which iterates over all the children; Element.elements.each, which iterates over just the child Elements; Element.next_element and Element.previous_element, which fetch the next and previous Element siblings; and Element.next_sibling and Element.previous_sibling, which fetch the next and previous siblings, regardless of type.
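The four approaches can be compared side by side. This is a minimal sketch on a made-up document that mixes Text and Element children, so the difference between "all children" and "child Elements" is visible.

```ruby
require "rexml/document"

doc  = REXML::Document.new("<a>text<b/><c/>tail</a>")
root = doc.root

# 1. Element#each visits every child node, Text nodes included
child_classes = []
root.each { |node| child_classes << node.class }

# 2. Element#elements.each visits only the child Elements
element_names = []
root.elements.each { |el| element_names << el.name }

b = root.elements["b"]
c = root.elements["c"]

# 3. next_element / previous_element skip over non-Element siblings
puts b.next_element.name
puts c.previous_element.name

# 4. next_sibling / previous_sibling return the adjacent node, whatever its type
puts b.previous_sibling.to_s
puts c.next_sibling.to_s
```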
REXML stream parsing requires you to supply a Listener class. When REXML encounters events in a document (tag start, text, etc.) it notifies your listener class of the event. You can supply any subset of the methods, but make sure you implement method_missing if you don't implement them all. A StreamListener module has been supplied as a template for you to use.
    list = MyListener.new
    source = File.new "mydoc.xml"
    REXML::Document.parse_stream(source, list)
Stream parsing in REXML is much like SAX, where events are generated when the parser encounters them in the process of parsing the document. When a tag is encountered, the stream listener's tag_start() method is called. When the tag end is encountered, tag_end() is called. When text is encountered, text() is called, and so on, until the end of the stream is reached. One other note: the method entity() is called when an &entity; is encountered in text, and only then.
Please look at the StreamListener API for more information.
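For reference, a listener can be quite small. The sketch below is one possible shape, not the only one: TagCollector is a hypothetical class name, and it includes the StreamListener module so that events it does not care about fall through to the module's empty default methods. It parses an inline string rather than mydoc.xml so that it runs on its own.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: records tag names and non-blank text as they stream past.
class TagCollector
  include REXML::StreamListener   # supplies no-op defaults for all other events

  attr_reader :tags, :texts

  def initialize
    @tags  = []
    @texts = []
  end

  # Called for every opening tag
  def tag_start(name, attrs)
    @tags << name
  end

  # Called for every run of character data
  def text(value)
    @texts << value unless value.strip.empty?
  end
end

listener = TagCollector.new
REXML::Document.parse_stream("<a><b>hi</b><c/></a>", listener)

puts listener.tags.inspect
puts listener.texts.inspect
```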
By default, REXML respects whitespace in your document. In many applications, you want the parser to compress whitespace in your document. In these cases, you have to tell the parser which elements should have their whitespace compressed by passing a context to the parser:
    doc = REXML::Document.new( source, { :compress_whitespace => %w{ tag1 tag2 tag3 } } )
Whitespace for tags "tag1", "tag2", and "tag3" will be compressed; all other tags will have their whitespace respected. Like :raw, you can set :compress_whitespace to :all, and have all elements have their whitespace compressed.
You may also use the tag :respect_whitespace, which flip-flops the behavior. If you use :respect_whitespace for one or more tags, only those elements will have their whitespace respected; all other tags will have their whitespace compressed.
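A quick way to see the effect of the context is to parse the same source with and without it. This sketch uses a made-up one-element document; the exact result of compression is left to REXML, so the example only prints both values for comparison.

```ruby
require "rexml/document"

source = "<a>one    two</a>"   # four spaces between the words

# Default context: whitespace is respected
preserved = REXML::Document.new(source).root.text

# :compress_whitespace => :all applies compression to every element
compressed = REXML::Document.new(source, { :compress_whitespace => :all }).root.text

puts preserved.inspect
puts compressed.inspect
```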
REXML does some automatic processing of entities for your convenience. The processed entities are &, <, >, ", and '. If REXML finds any of these characters in Text or Attribute values, it automatically turns them into entity references when it writes them out. Additionally, when REXML finds any of these entity references in a document source, it converts them to their character equivalents. All other entity references are left unprocessed. If REXML finds an &, <, or > in the document source, it will generate a parsing error.
    bad_source = "<a>Cats & dogs</a>"
    good_source = "<a>Cats &amp; dogs</a>"
    doc = REXML::Document.new bad_source
    # Generates a parse error
    doc = REXML::Document.new good_source
    puts doc.root.text
    # -> "Cats & dogs"
    doc.root.write $stdout
    # -> "<a>Cats &amp; dogs</a>"
    doc.root.attributes["m"] = "x'y\"z"
    puts doc.root.attributes["m"]
    # -> "x'y\"z"
    doc.root.write $stdout
    # -> "<a m='x&apos;y&quot;z'>Cats &amp; dogs</a>"
Namespaces are fully supported in REXML and within the XPath parser. There are a few caveats when using XPath, however:
    source = "<a xmlns:x='foo' xmlns:y='bar'><x:b id='1'/><y:b id='2'/></a>"
    doc = Document.new source
    doc.elements["/a/x:b"].attributes["id"]                     # -> '1'
    XPath.first(doc, "/a/m:b", {"m"=>"bar"}).attributes["id"]   # -> '2'
    doc.elements["//x:b"].prefix                                # -> 'x'
    doc.elements["//x:b"].namespace                             # -> 'foo'
    XPath.first(doc, "//m:b", {"m"=>"bar"}).prefix              # -> 'y'
The pull parser API is not yet stable. When it settles down, I'll fill in this section. For now, you'll have to bite the bullet and read the PullParser API docs. Ignore the PullListener class; it is a private helper class.
The original REXML stream parsing API is very minimal. This also means that it is fairly fast. For a more complex, more "standard" API, REXML also includes a streaming parser with a SAX2+ API. This API differs from SAX2 in a couple of ways, such as having more filters and multiple notification mechanisms, but the core API is SAX2.
The two classes in the SAX2 API are SAX2Parser and SAX2Listener. You can use the parser in one of five ways, depending on your needs. Three of the ways are useful if you are filtering for a small number of events in the document, such as just printing out the names of all of the elements in a document, or getting all of the text in a document. The other two ways are for more complex processing, where you want to be notified of multiple events. The first three involve Procs, and the last two involve listeners. The listener mechanisms are very similar to the original REXML streaming API, with the addition of filtering options, and are faster than the proc mechanisms.
An example is worth a thousand words, so we'll just take a look at a small example of each of the mechanisms. The first example involves printing out only the text content of a document.
    require 'rexml/parsers/sax2parser'
    parser = REXML::Parsers::SAX2Parser.new( File.new( 'documentation.xml' ) )
    parser.listen( :characters ) {|text| puts text }
    parser.parse
In this example, we tell the parser to call our block for every characters event. "characters" is what SAX2 calls Text nodes. The event is identified by the symbol :characters. There are a number of these events, including :start_element, :end_prefix_mapping, and so on; the events are named after the methods in the SAX2Listener API, so refer to that document for a complete list.
You can additionally filter for particular elements by passing an array of tag names to the listen method. In further examples, we will not include the require or parser construction lines, as they are the same for all of these examples.
    parser.listen( :characters, %w{ changelog todo } ) {|text| puts text }
    parser.parse
In this example, only the text content of changelog and todo elements will be printed. The array of tag names can also contain regular expressions which the element names will be matched against.
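Here is a self-contained variation on the filtered block mechanism. It is a sketch under a few assumptions: the parser is loaded from rexml/parsers/sax2parser and named REXML::Parsers::SAX2Parser, as in current REXML releases, and the input is a made-up inline string instead of documentation.xml.

```ruby
require "rexml/parsers/sax2parser"

# Hypothetical input: three elements, of which only two are of interest.
source = "<doc><changelog>fixed</changelog><todo>more</todo><other>skip</other></doc>"

parser = REXML::Parsers::SAX2Parser.new(source)

# Collect character data, but only for text directly inside <changelog> or <todo>
collected = []
parser.listen(:characters, %w{ changelog todo }) { |text| collected << text }
parser.parse

puts collected.inspect
```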
Finally, as a shortcut, if you do not pass a symbol to the listen method, it will default to :start_element:
    parser.listen( %w{ item }) do |uri,localname,qname,attributes|
      puts attributes['version']
    end
    parser.parse
This example prints the "version" attribute of all "item" elements in the document. Notice that the number of arguments passed to the block is larger than for :characters; again, check the SAX2Listener API for a list of what arguments are passed to the blocks for a given event.
The last two mechanisms for parsing use the SAX2Listener API. Like StreamListener, SAX2Listener is a module, so you can include it in your class to give you an adapter. To use the listener model, create a class that implements some of the SAX2Listener methods, or all of them if you don't include the SAX2Listener module. Add listeners to a parser as you would blocks, and when the parser is run, the methods will be called when events occur. Listeners do not use event symbols, but they can filter on element names.
    listener1 = MySAX2Listener.new
    listener2 = MySAX2Listener.new
    parser.listen( listener1 )
    parser.listen( %w{ changelog todo credits }, listener2 )
    parser.parse
In the previous example, listener1 will be notified of all events that occur, and listener2 will only be notified of events that occur in changelog, todo, and credits elements. We also see that multiple listeners can be added to the same parser; multiple blocks can also be added, and listeners and blocks can be mixed together.
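MySAX2Listener is not defined in the examples above, so here is one way such a class might look. This is a sketch only: ElementNameCollector is a hypothetical name, the class includes SAX2Listener to inherit empty defaults for every event it does not handle, and the parser is assumed to live at REXML::Parsers::SAX2Parser as in current REXML releases.

```ruby
require "rexml/parsers/sax2parser"
require "rexml/sax2listener"

# Hypothetical listener: remembers the qualified name of every element it sees.
class ElementNameCollector
  include REXML::SAX2Listener   # no-op defaults for all unhandled events

  attr_reader :names

  def initialize
    @names = []
  end

  # Overrides the module's empty start_element
  def start_element(uri, localname, qname, attributes)
    @names << qname
  end
end

parser   = REXML::Parsers::SAX2Parser.new("<a><b/><c/></a>")
listener = ElementNameCollector.new
parser.listen(listener)   # no filter: notified of all events
parser.parse

puts listener.names.inspect
```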
There is, as yet, no mechanism for recursion. Two upcoming features of the SAX2 API will be the ability to filter based on an XPath, and the ability to specify filtering on an element and all of its descendants.
WARNING: The SAX2 API for dealing with doctype (DTD) events almost certainly will change.
Michael Neumann contributed some convenience functions for nodes, and they are general enough that I've included them. Michael's use-case examples follow:
    # Starting with +root_node+, we recursively look for a node with the given
    # +tag+, the given +attributes+ (a Hash) and whose text equals or matches the
    # +text+ string or regular expression.
    #
    # To find the following node:
    #
    #   <td class='abc'>text</td>
    #
    # We use:
    #
    #   find_node(root, 'td', {'class' => 'abc'}, "text")
    #
    # Returns +nil+ if no matching node was found.
    def find_node(root_node, tag, attributes, text)
      root_node.find_first_recursive {|node|
        node.name == tag and
          attributes.all? {|attr, val| node.attributes[attr] == val} and
          text === node.text
      }
    end

    #
    # Extract specific columns (specified by the position of its corresponding
    # header column) from a table.
    #
    # Given the following table:
    #
    #   <table>
    #     <tr>
    #       <td>A</td>
    #       <td>B</td>
    #       <td>C</td>
    #     </tr>
    #     <tr>
    #       <td>A.1</td>
    #       <td>B.1</td>
    #       <td>C.1</td>
    #     </tr>
    #     <tr>
    #       <td>A.2</td>
    #       <td>B.2</td>
    #       <td>C.2</td>
    #     </tr>
    #   </table>
    #
    # To extract the first (A) and last (C) column:
    #
    #   extract_from_table(root_node, ["A", "C"])
    #
    # And you get this as result:
    #
    #   [
    #     ["A.1", "C.1"],
    #     ["A.2", "C.2"]
    #   ]
    #
    def extract_from_table(root_node, headers)
      # extract and collect all header nodes
      header_nodes = headers.collect { |header| find_node(root_node, 'td', {}, header) }
      raise "some headers not found" if header_nodes.compact.size < headers.size

      # assert that all headers have the same parent 'header_row', which is the row
      # in which the header_nodes are contained. 'table' is the surrounding table tag.
      header_row = header_nodes.first.parent
      table = header_row.parent
      raise "different parents" unless header_nodes.all? {|n| n.parent == header_row}

      # we now iterate over all rows in the table that follow the header_row.
      # for each row we collect the elements at the same positions as the header_nodes.
      # this is what we finally return from the method.
      (header_row.index_in_parent+1 .. table.elements.size).collect do |inx|
        row = table.elements[inx]
        header_nodes.collect { |n| row.elements[ n.index_in_parent ].text }
      end
    end
This isn't everything there is to REXML, but it should be enough to get started. Check the API documentation for particulars and more examples. There are plenty of unit tests in the test/ directory, and these are great sources of working examples.