REXML - FAQ

REXML is hanging while parsing one of my XML files.

Your XML is probably malformed. Some malformed XML, especially XML that contains literal '<' embedded in the document, causes REXML to hang. REXML should be throwing an exception, but it doesn't; this is a bug. I'm aware that it is an extremely annoying bug, and it is one I'm trying to solve in a way that doesn't significantly reduce REXML's parsing speed.

I'm using the XPath '//foo' on an XML branch node X, and keep getting all of the 'foo' elements in the entire document. Why? Shouldn't it return only the 'foo' element descendants of X?

No. XPath specifies that '/' returns the document root, regardless of the context node, and '//' also starts at the document root. If you want to limit your search to a branch, you need to use the self:: axis, e.g. 'self::node()//foo', or the shorthand './/foo'.
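As a minimal sketch of the difference (the sample document and the variable names are invented for illustration), the two queries can be compared against the same branch node:

```ruby
require 'rexml/document'

# Hypothetical sample document: one 'foo' outside branch <b>, two inside it.
xml = '<root><a><foo id="1"/></a><b><foo id="2"/><c><foo id="3"/></c></b></root>'
doc = REXML::Document.new(xml)
branch = doc.root.elements['b']

# '//foo' starts at the document root, so the context node is ignored.
all_foos = REXML::XPath.match(branch, '//foo')

# './/foo' starts at the context node, so only descendants of <b> match.
branch_foos = REXML::XPath.match(branch, './/foo')

puts all_foos.size
puts branch_foos.map { |e| e.attributes['id'] }.inspect
```

Here all_foos should contain every 'foo' in the document, while branch_foos should contain only the two under the 'b' element.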

I want to parse a document both as a tree, and as a stream. Can I do this?

Yes, and no. There is no mechanism that directly supports this in REXML. However, short of writing your own traversal layer, there is a way of doing it. To turn a tree into a stream, convert the branch you want to process as a stream back into a string, and re-parse it with your preferred API. For example: pp = PullParser.new( some_element.to_s ). The other direction is more difficult; you basically have to build a tree from the events. REXML will have one of these builders, eventually, but it doesn't currently exist.
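The tree-to-stream direction might look something like the following sketch. The sample document is made up, and the loop only collects element names and text; a real consumer would handle whichever event types it cares about.

```ruby
require 'rexml/document'
require 'rexml/parsers/pullparser'

# Hypothetical sample tree; any Element can stand in for 'branch'.
doc = REXML::Document.new('<root><item>one</item><item>two</item></root>')
branch = doc.root

# Serialize the branch and re-parse it as a stream of pull events.
parser = REXML::Parsers::PullParser.new(branch.to_s)

names = []
texts = []
while parser.has_next?
  event = parser.pull
  if event.start_element?
    names << event[0]   # first argument of a start_element event is the tag name
  elsif event.text?
    texts << event[0]   # first argument of a text event is the text as written
  end
end

puts names.inspect
puts texts.inspect
```

Note that this costs a serialization plus a second parse of the branch, so it is probably best kept to small subtrees.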

Why is Element.elements indexed off of '1' instead of '0'?

Because of XPath. The XPath specification states that the index of the first child node is '1'. Although it may be counter-intuitive to base elements on 1, it would be even less desirable to have element.elements[0] == element.elements[ 'node()[1]' ]. Since I can't change the XPath specification, the result is that Element.elements[1] is the first child element.
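A small illustration of the 1-based indexing, using an invented three-element document with no whitespace between the children:

```ruby
require 'rexml/document'

# Hypothetical sample: three child elements and no text nodes between them.
doc = REXML::Document.new('<list><a/><b/><c/></list>')
children = doc.root.elements

first_by_index = children[1]       # integer index, counted from 1
first_by_xpath = children['*[1]']  # XPath predicate, also counted from 1
last_by_index  = children[3]

puts first_by_index.name
puts last_by_index.name
puts first_by_index.equal?(first_by_xpath)
```

Both lookups should return the same 'a' element, which is the consistency the 1-based index is meant to preserve.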

Why isn't REXML a validating parser?

Because validating parsers must include code that parses and interprets DTDs. I hate DTDs. REXML supports the barest minimum of DTD parsing, and even that isn't complete. There is DTD parsing code in the works, but I only work on it when I'm really, really bored. Rumor has it that a contributor is working on a DTD parser for REXML; rest assured that any such contribution will be included with REXML as soon as it is available.

I'm trying to create an ISO-8859-1 document, but when I add text to the document it isn't being properly encoded.

Regardless of what the encoding of your document is, when you add text programmatically to a REXML document you must ensure that you are only adding UTF-8 to the tree. In particular, you can't add ISO-8859-1 encoded text that contains characters at or above 0x80 to REXML trees -- you must convert it to UTF-8 before doing so. Luckily, this is easy: text.unpack('C*').pack('U*') will do the trick. 7-bit ASCII is encoded identically in UTF-8, so if your text is pure ASCII you won't need to worry about this.
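The conversion can be sketched as follows. The sample string is invented, and the String#encode line is an alternative available in current Ruby versions rather than something this FAQ prescribes.

```ruby
require 'rexml/document'

# Hypothetical input: "café" as raw ISO-8859-1 bytes (0xE9 is e-acute).
latin1 = "caf\xE9".b

# The conversion from the answer above: each byte becomes one code point,
# which is then written out as UTF-8.
utf8 = latin1.unpack('C*').pack('U*')

# Alternative on current Ruby versions (an assumption, not from the FAQ):
utf8_alt = "caf\xE9".b.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# Only the UTF-8 version is added to the tree.
doc = REXML::Document.new('<doc/>')
doc.root.add_text(utf8)

puts doc.root.text
puts utf8 == utf8_alt
```

The byte-to-code-point trick works for ISO-8859-1 specifically because its byte values coincide with the first 256 Unicode code points; other legacy encodings would need a real transcoding step.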

How do I get the tag name of an Element?

You take a look at the APIs, and notice that Element includes Namespace. Then you click on the Namespace link and look at the methods that Element includes from Namespace. One of these is name(). Another is expanded_name(). Yet another is prefix(). Then, you email the author of rdoc and ask him to extend rdoc so that it lists methods in the API that are included from other files, so that you don't have to do all of that looking around for your method.
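For the impatient, the three methods mentioned above behave roughly like this on a namespaced element (the element name and namespace URI are placeholders):

```ruby
require 'rexml/document'

# Hypothetical element with a namespace prefix; the URI is a placeholder.
doc = REXML::Document.new('<x:item xmlns:x="http://example.com/ns"/>')
element = doc.root

local_name = element.name           # tag name without the prefix
prefix     = element.prefix         # namespace prefix only
full_name  = element.expanded_name  # prefix and name as written in the source

puts local_name
puts prefix
puts full_name
```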

3.1.7.3