Why is simple XML processing so painful in Java?
Note: this article is of little use for learning the Scala XML APIs; there is far better coverage of them elsewhere on the web. It's more of a rant against Java, which makes things painful exactly where it should shine...
Nowadays, XML is more or less everywhere, especially when there is data to dispatch between applications, protocols, programming languages and other technologies - and no, JSON is not (yet?) as ubiquitous as XML for that.
And yet, even today, parsing simple XML documents in Java is a pain.
Well, actually, I'm not talking about complex, normalized documents with huge, well-defined XSD schemas (in that situation, you can perhaps afford to invest time in JAXB or JiBX to do it the right way), nor about simpler scenarios where you still want a real XML/object mapping - XStream is king there, and really does a good job.
I'm talking about the kind of XML documents that are more like a database dump, that may be long, with rather deep tree structures, and where you just want to cherry-pick some values - in different parts of the tree, of course. You know, when you just want to test ideas, and you have to implement a quick, working thing to see if the overall architecture holds up [1], and you really don't want to build a full POJO tree only to change or erase it an hour later.
That's the kind of place where the Java XPath API (JAXP) should shine. But it doesn't. I'm not saying that it's difficult, nor that it doesn't actually work, but it's painful, and you end up with lines and lines of boilerplate (casts, expression compilation, redefining higher-level functions than the ones the API provides for common things, etc.) in code that should just expose your intention at first sight.
Well, let's take a super simple example.
Let's say that I have this kind of XML data:
<?xml version="1.0" ?>
<request>
<id>463516</id>
<timestamp>1250240149028</timestamp>
<information>
<person>
<id>463</id>
<firstname>Alex</firstname>
<lastname>Bar</lastname>
<age>34</age>
<gender>male</gender>
<address>
<street>136 W 9th St</street>
<city>Casper</city>
<country>United States</country>
</address>
</person>
</information>
</request>
And I only want to take the timestamp, add the city to a male or female list depending on the gender, and, if the age is 18 or over, increment the count of adults.
I have a data container that looks like this [2]:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class Data {
public static final String MALE = "male";
public static final String FEMALE = "female";
private final Long timestamp;
private final Map<String, List<String>> stats;
private int adults;
public Data(Long timestamp) {
this.timestamp = timestamp;
this.stats = new HashMap<String, List<String>>();
this.stats.put(MALE, new ArrayList<String>());
this.stats.put(FEMALE, new ArrayList<String>());
}
public void addMale(String city) { this.stats.get(MALE).add(city); }
public void addFemale(String city) { this.stats.get(FEMALE).add(city); }
public Long getTimestamp() { return timestamp; }
public void incAdults() { this.adults = this.adults + 1; }
@Override
public String toString() {
StringBuilder sb = new StringBuilder();
sb.append("time: ").append(this.timestamp).append("\n");
sb.append("female: ");
for(String s : stats.get(FEMALE)) {
sb.append(s).append("; ");
}
sb.append("\n");
sb.append("male: ");
for(String s : stats.get(MALE)) {
sb.append(s).append("; ");
}
sb.append("\n");
sb.append("adults: ").append(adults);
return sb.toString();
}
}
OK, now this is the simplest Java class I came up with to implement this logic:
/*
* So, you need a lot of imports,
* and you must have jaxp-api
* somewhere in your path
*/
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class ParseData {
/*
* the "throws Exception" is here to try to remove a
* lot of burden, but of course, don't do that at home !
*/
public static void main(String[] args) throws Exception {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse("data.xml");
XPath xpath = XPathFactory.newInstance().newXPath();
// now, I start to actually do something interesting
Data data = new Data(Long.parseLong(
s(xpath, doc, "//request/timestamp/text()")));
XPathExpression xe = xpath.compile("//request/information/person");
NodeList nodes = (NodeList)xe.evaluate(doc,XPathConstants.NODESET);
for(int i = 0; i < nodes.getLength() ; i++) {
// paths are relative to the current person node; a leading "//" would search from the document root
String gender = s(xpath, nodes.item(i), "gender/text()");
String city = s(xpath, nodes.item(i), "address/city/text()");
if(Data.MALE.equals(gender.toLowerCase())) {
data.addMale(city);
} else if (Data.FEMALE.equals(gender.toLowerCase())) {
data.addFemale(city);
}
if(Integer.parseInt(s(xpath, nodes.item(i), "age/text()")) >= 18) {
data.incAdults();
}
}
System.out.println(data);
}
/*
* Why do I have to do that? Even if it's only two lines,
* I'll let you imagine what the main loop would look
* like without this function...
* But why doesn't the XPath API define the four or five
* similar functions, one for each XPathConstants type?
* Before I can actually begin to use the API, I have to redefine it...
*/
public static String s(XPath xpath, Object root, String expr) throws Exception {
XPathExpression xe = xpath.compile(expr);
return (String)xe.evaluate(root, XPathConstants.STRING);
}
}
As you can see, there are a lot of type casts, and I quickly lose track of what I'm looking for, even in such a simple class with so few cases and so little data to retrieve.
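For what it's worth, here is a rough sketch of the "four or five functions" the comment above asks for, one per XPathConstants result type. The class name XPathHelper and its method names are made up for this illustration and are not part of JAXP; it only relies on the standard XPath.evaluate(String, Object, QName) call. (To be fair, JAXP does offer XPath.evaluate(String, Object), which returns a String directly, but nothing equivalent for the other types.) The inline XML in main is a shortened copy of the example above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of JAXP): one typed method per XPathConstants type. */
final class XPathHelper {
    // XPath instances are not thread-safe; sharing one is fine for a single-threaded prototype.
    private static final XPath XPATH = XPathFactory.newInstance().newXPath();

    private XPathHelper() {}

    static String string(Object context, String expr) throws XPathExpressionException {
        return (String) XPATH.evaluate(expr, context, XPathConstants.STRING);
    }

    static double number(Object context, String expr) throws XPathExpressionException {
        return (Double) XPATH.evaluate(expr, context, XPathConstants.NUMBER);
    }

    static boolean bool(Object context, String expr) throws XPathExpressionException {
        return (Boolean) XPATH.evaluate(expr, context, XPathConstants.BOOLEAN);
    }

    // Returns null when nothing matches.
    static Node node(Object context, String expr) throws XPathExpressionException {
        return (Node) XPATH.evaluate(expr, context, XPathConstants.NODE);
    }

    static NodeList nodes(Object context, String expr) throws XPathExpressionException {
        return (NodeList) XPATH.evaluate(expr, context, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<request><information><person>"
                + "<gender>male</gender><age>34</age>"
                + "<address><city>Casper</city></address>"
                + "</person></information></request>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList persons = nodes(doc, "/request/information/person");
        for (int i = 0; i < persons.getLength(); i++) {
            Node person = persons.item(i);
            // Relative paths are evaluated against the current person node.
            System.out.println(string(person, "gender") + " / "
                    + string(person, "address/city") + " / "
                    + (number(person, "age") >= 18));
        }
        // prints: male / Casper / true
    }
}
```

With something like this on the classpath, the main loop reads a bit closer to the intention, but it is still code you have to write and carry around before doing anything useful.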
So, in the end, what did I do? I just used the Scala XML library. Same logic, in a Scala class:
import scala.xml.XML
object ScalaParseData {
  def main(args: Array[String]): Unit = {
    val doc = XML.loadFile("data.xml")
    val data = new Data((doc \\ "request" \ "timestamp").text.toLong)
    for (person <- doc \\ "request" \ "information" \ "person") {
      val city = (person \ "address" \ "city").text
      (person \ "gender").text.toLowerCase match {
        case Data.MALE => data.addMale(city)
        case Data.FEMALE => data.addFemale(city)
        case _ => // any other value is ignored, as in the Java version
      }
      if ((person \ "age").text.toInt >= 18) data.incAdults()
    }
    println(data)
  }
}
In both cases, the same output is printed to stdout:
time: 1250240149028
female:
male: Casper;
adults: 1
Even if you have never looked at Scala, you understand the overall logic and which pieces of data are being looked for. Nothing to add :)
[1]: OK, the question here is: is Java the right language for that? Well, it seems that the answer, for those who still had doubts, is DEFINITELY NOT.
[2]: OK, even the data container is complex, and Java really misses a tuple structure for efficient prototyping. In such a process, you just don't want to spend time writing POJO after POJO that are nothing but meaningless containers for two string lists and a long, even if your IDE does 90% of the job. If you are interested in more efficient data structures for Java, you should go and look at Functional Java; that's a really cool project, and at least you will have tuples (named "products" there) and a function class (so that you don't have to define, again and again, that
Filter<E> { boolean filter(E element); }
class).
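To make this footnote a bit more concrete, here is a minimal sketch of the two things it wishes for: a generic two-element tuple and the Filter interface. The names Pair, select and TupleSketch are invented for this example and are not Functional Java's API; Filter is the interface quoted above. The lambda assumes Java 8 or later (where java.util.function.Predicate already covers the Filter case); on older versions an anonymous class does the same job.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hand-rolled stand-ins, not Functional Java classes. */
final class TupleSketch {

    /** Hypothetical immutable 2-tuple. */
    static final class Pair<A, B> {
        final A first;
        final B second;

        private Pair(A first, B second) {
            this.first = first;
            this.second = second;
        }

        static <A, B> Pair<A, B> of(A first, B second) {
            return new Pair<A, B>(first, second);
        }
    }

    /** The single-method interface from the footnote. */
    interface Filter<E> {
        boolean filter(E element);
    }

    /** Returns a new list holding the elements accepted by the filter. */
    static <E> List<E> select(List<E> input, Filter<E> f) {
        List<E> out = new ArrayList<E>();
        for (E e : input) {
            if (f.filter(e)) {
                out.add(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Pair<String, Integer>> people = new ArrayList<Pair<String, Integer>>();
        people.add(Pair.of("Casper", 34)); // (city, age)
        people.add(Pair.of("Reno", 12));

        // Java 8+ lambda; an anonymous Filter works on older versions.
        List<Pair<String, Integer>> adults = select(people, p -> p.second >= 18);
        System.out.println(adults.size() + " adult(s), first city: " + adults.get(0).first);
        // prints: 1 adult(s), first city: Casper
    }
}
```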
3 comments:
vtd-xml may be worth looking into
http://vtd-xml.sf.net
That was a project I didn't know, thanks for the link.
But from what I see in the examples (e.g. http://vtd-xml.sourceforge.net/codeSample/RSSReader2.java ), I think it's not exactly what I was hoping for. I will keep it in mind if I need a really efficient XML processor, when performance matters more than readability / ease of use (and so, not for a prototype).
If you have any questions on usability, check out the following:
http://myvtdpasrer.blogspot.com/2009/07/vtd-parser-first-look.html