Ways to Rome: Processing XML with Perl

Original version by Ingo Macherius, < macherius@gmd.de>
Maintained by Michel Rodriguez, < m.v.rodriguez@ieee.org>
Version: 2.1: 2002-09-17


1. Contributors


One of Perl's key features is: things can be done more then one way. This holds when processing XML using Perl. This article solves a simple task again and again using different, XML-related CPAN modules and programming styles. The candidates are:

Version table
Method CPAN Code Remark
Perl regular expressions   ex_ps_regexp dont do it this way!
XML::Parser XML::Parser ex_ps_parser using Handlers
XML::Parser   ex_ps_subs using Subs-style
XML::Parser::Lite XML::Parser::Lite ex_ps_lite included in SOAP::Lite
XML::DOM XML::DOM ex_ps_dom2 using perlish shortcuts to the DOM
XML::XQL XML::XQL ex_ps_xql versions older than 0.59 won't work
XML::Twig XML::Twig ex_ps_twig  
XML::PYX XML::PYX ex_ps_pyx uses only the pyx tool, see also XML::TiePYX
XML::XPath XML::XPath ex_ps_xpath  
XML::DT XML::DT ex_ps_dt  
XML::TokeParser XML::TokeParser ex_ps_tokeparser XML equivalent of HTML::TokeParser
XML::Grove XML::Grove ex_ps_grove in libxml-perl
XML::LibXML XML::LibXML ex_ps_libxml based on libxml2
XML::LibXML::SAX XML::LibXML::SAX ex_ps_sax_libxml
XML::SAX::Expat XML::SAX::Expat ex_ps_sax_expat
XML::XSLT XML::XSLT ex_ps_xslt the style sheet is in ex_ps_xslt.xslt
XML::LibXSLT XML::LibXSLT ex_ps_libxslt based on libxslt

You can get the latest versions of all those modules (and more!) from CPAN. A good place to get some documentation is the XML Documentation page at the University of Winnipeg.

This article was first written for a talk on the German Perl workshop 1.0 on February 17th, 1999 in Bonn, then further developed as new modules were created. The focus in on the code examples, not the explanatory text. All code is tested and should work "cut-and-paste" if you have the above modules installed and a copy of REC-XML in your working folder. ways_to_rome.tar.gz includes all of the exemples, the REC, the original XML for this article and the generation script.

2. The Task

The task is to filter the REC-xml-19980210.xml specification for grammar productions. They are contained in special markup, a typical one that looks like this:

...
<prod id="NT-PubidLiteral"><lhs>PubidLiteral</lhs>
<rhs>'"' <nt def='NT-PubidChar'>PubidChar</nt>* 
'"' 
| "'" (<nt def='NT-PubidChar'>PubidChar</nt> - "'")* "'"</rhs>
</prod>
...

So a grammar rule consists of a production, which consists of a right-hand-side and a left hand side. In the right-hand-side markup for describing hyperlinks between productions may be contained, so it is mixed content. The DTD fragment for this is:

<!ELEMENT prod (lhs, (rhs, (com|wfc|vc)*)+)>
<!--    ID attribute:
        The production must have an ID so that cross-references
        (specref) and mentions of nonterminals (nt) can link to
        it. -->
<!ATTLIST prod
        %common-idreq.att;>

<!ELEMENT lhs (#PCDATA)>
<!ATTLIST lhs %common.att;>

<!ELEMENT rhs (#PCDATA|nt|xnt|com)*>
<!ATTLIST rhs %common.att;>

The DTD can be downloaded from http://www.w3.org/XML/1998/06/xmlspec-v21a.dtd.

The example is a little trickier than it appears at first:

The example programs will all produce a standard EBNF representation of the 89 productions contained in the XML specification, like this:

...
[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
[12] PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"
[13] PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]
[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
[15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
...

3. The Code

So, here we go with the code ...

3.1 Perl regular expressions

As the name "Practical Extraction and Report Language" implies, Perl was used for extraction of data long time before the advent of XML. The tool of choice for this task are regular expressions. The code below just follows the well known trail. It slurps in the file and iterates over regular expressions until the grammar text is extracted. Please note that the code heavily exploits the "non-greedy" option for regular expressions.

This is a example of how NOT to process XML using Perl. Please don't use regular expressions on XML, in the very short run you will be bitten. This was by far the most painful example to write and although it does the job it will break for the next version of the RFC. Entity resolution especially is much easier if you use a parser.

[download ex_ps_regexp]

#!/usr/bin/perl -w
# using Perl Regular Expressions
# Author: Ingo Macherius <macherius@gmd.de>
# modified by Michel Rodriguez <mirod@xmltwig.com>
use strict;

open( REC, "/article/ways_to_rome/ltREC_xml_19980210.xml") or die $!;

# slurp the whole document into memory
undef $/; 
my $doc=<REC>;

# remove comments NOW
# fails for <!-- in an attribute?
$doc =~ s{<!--.*?-->}{}sg;

# a semi generic way to get the entities
# fails miserably for entities using other entities
my %ent= ( amp  => '&', quot => '"', apos => "'", lt   => '<', gt   => '>',
	   xmlpio => "'<?xml'",             # uses &lt;
	   hcro   => "&#x",                 # uses &amp
	   nbsp   => ' ', '#160' => ' ',    # def is commented out in the REC   
         );

while( $doc=~ /<!ENTITY\s+(\w+)\s+(["'])(.*?)\2\s*>/g)
  { $ent{$1} ||= $3; } # use ||= to avoid redefining entities

my $i = 0;
foreach ( $doc =~ m{<prod.*?>.*?</prod>}gs ) {
	my( $lhs) = m{<lhs>(.*?)</lhs>} or die "no lhs in prod $_";;

        my $rhs='';
	while( m{<rhs.*?>(.*?)</rhs>}sg)
          { $rhs .= $1; }

	$rhs =~ s{</?nt.*?>}{}sg;          # remove nt tags
	#$rhs =~ s{<com.*?>.*?</com>}{}sg; # remove com elements, not needed here

	$i++;
	print clean( "[$i] $lhs ::= $rhs"), "\n";
}

sub clean { 
        my( $string)= @_;
        # yes, you have to replace the entities yourself
        $string=~ s{&(\w+);}{ $ent{$1} || die "unknown entity $1"}eg;
        $string =~ s{\xc2\xa0}{ }g; # weird character in source
	$string =~ s{\s+}{ }g; $string =~ s{\s$}{};
	return $string;
}

3.2 XML::Parser and Handlers

This solution rides XML::Parser in the most simple possible form, using handlers. This is very close to the original Expat API, so the code should be fast, it used to be even faster than the regular expressions in solution 2.1. With the increase of features offered by the latest versions of XML::Parser it is now slower, but it is still the fastest way to do safe XML processing. This is because we are actually processing an event stream. No in-memory representation of the XML document or the results is ever built. The control flow in event based programming is weird and repetetive, so this API is not suited for the casual programmer. People used to think in state machines and automata, however, will be very happy using handlers.

[download ex_ps_parser]

#!/usr/bin/perl -w
# Using XML::Parser and Handlers
# Author: Ingo Macherius <macherius@gmd.de>
# updated by Michel Rodriguez <mirod@xmltwig.com>
use strict;

use XML::Parser;

# variables used to track the state of the parser
my( $in_lhs, $lhs, $in_rhs, $rhs, $i);

my $parser = XML::Parser->new(Handlers => { Start => \&tag_start,
	  End   => \&tag_end,
	  Char  => \&characters,
	});

$parser->parsefile('/article/ways_to_rome/REC_xml_19980210.xml');

sub tag_start {
	my ($xp, $el) = @_;

	if ($el eq 'rhs')     { $in_rhs = 1; } 
	elsif ($el eq 'lhs')  { $in_lhs = 1; }
	elsif ($el eq 'prod') { $rhs = ''; $lhs = ''; }
}

sub tag_end {
	my ($xp, $el) = @_;
	if ($el eq 'rhs')     { $in_rhs = 0; } 
	elsif ($el eq 'lhs')  { $in_lhs = 0; }
	elsif ($el eq 'prod') { print_production( ++$i, $lhs, $rhs); }
}

sub characters {
	my ($xp, $txt) = @_;
	if ($in_lhs)    { $lhs .= $txt }
	elsif ($in_rhs) { $rhs .= $txt }
}

sub print_production {
        my( $i, $lhs, $rhs)= @_;
	my $prod = "[$i] $lhs ::= $rhs";
	print clean( $prod) . "\n";
}


sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}g;
        return $string;
}

Note that this first version is fast but quite crude, it uses a lot of global variables for example. A cleaner version, below, using closures to pass the state to handlers and using the in_context method to identify the element in which characters are, turns out to be much slower! This is mostly due to the cost of calling the in_context method.

[download ex_ps_parser_2]

#!/usr/bin/perl -w
# Using XML::Parser and Handlers
# Author: Ingo Macherius <macherius@gmd.de>
use strict;
use XML::Parser;

my $state={}; # holds the "global" values like rhs, lhs and counter

my $parser = XML::Parser->new( 
               Handlers => 
	        { Start => sub { tag_start(  $state, @_) } ,
	          End   => sub { tag_end(    $state, @_) } ,
	          Char  => sub { characters( $state, @_) },
	          Init  => sub { init(       $state, @_) },
	        });

$parser->parsefile('/article/ways_to_rome/REC_xml_19980210.xml');

sub tag_start {
	my ($state, $p, $el) = @_;
	if ($el eq 'prod') { $state->{rhs} = ''; $state->{lhs} = ''; }
}

sub tag_end {
	my ($state, $p, $el) = @_;
	if ($el eq 'prod') {
               $state->{counter}++;
	       my $prod= "[$state->{counter}] $state->{lhs} ::=  $state->{rhs}";
	       print clean( $prod), "\n";
        }
}

sub characters {
	my ( $state, $p, $txt) = @_;
	if    ($p->within_element( 'lhs')) { $state->{lhs} .= $txt }
	elsif ($p->within_element( 'rhs')) { $state->{rhs} .= $txt }
}

sub init {
        my( $state, $p)= @_;
        $state->{counter} = 0;
}

sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}g;
        return $string;
}

3.3 XML::Parser and the Subs-style

The "Subs"-style of XML::Parser is syntactic sugar for the bare handler interface. It auto-generates handlers for element-events and maps them to Perl subroutines. So the logic of example 2.3 is used unchanged, but as the control-flow is managed by the module, all those if statements have vanished. Characters must still be handled by your own handler.

[download ex_ps_subs]

#!/usr/bin/perl -w
# Using XML::Parser and the Subs-style
# Author: Ingo Macherius <macherius@gmd.de>
# modified by Michel Rodriguez <mirod@xmltwig.com>

use XML::Parser;
use strict;

my( $in_lhs, $lhs, $in_rhs, $rhs, $i);

my $parser = XML::Parser->new('Style' => 'Subs' );
   $parser->setHandlers('Char', \&characters);

$parser->parsefile('/article/ways_to_rome/REC_xml_19980210.xml');

sub rhs   { $in_rhs = 1 }
sub lhs   { $in_lhs = 1 }
sub prod  { $rhs = ''; $lhs = ''; }
sub rhs_  { $in_rhs = 0; }
sub lhs_  { $in_lhs = 0; }
sub prod_ { $i++;
            my $prod = "[$i] $lhs ::= $rhs";
            print clean( $prod) . "\n";
          }

sub characters {
        my ($xp, $txt) = @_;
        if    ($in_lhs) { $lhs .= $txt; }
        elsif ($in_rhs) { $rhs .= $txt; }
}

sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}g;
        return $string;
}

3.4 XML::Parser::Lite

XML::Parser::Lite is included in SOAP::Lite. It uses regexps to parse the subset of XML used in SOAP.

I consider using XML::Parser::Lite to be quite dangerous when not used to parse SOAP messages. It is not a complete XML parser, which means that it is likely to choke, maybe silently, on valid XML. It also lacks some features like entity expansion (a real problem in this code as I had to go and manually add a couple of entities that use embedded entities).

That said the code was easy to write, it is nearly identical to he one for XML::Parser, with the exception that I could not use regexps within handlers (another limitation, apparently due to a bug in Perl fixed in 5.8.0) and that I just built the list of production rules and cleaned it afterwards.

XML::Parser::Lite is fast (but dangerous!), it is even faster than XML::Parser in this example.

[download ex_ps_lite]

#!/usr/bin/perl -w

# Using XML::Parser::Lite
# Author: Michel Rodriguez <mirod@xmltwig.com>
#         based on a stub by Josh Narins <josh@large.com>
#         and help from Jeff Gleixner 




use strict;
use XML::Parser::Lite;

# we need to replace the entities as XML::Parser::Lite does not do it
my %ent= ( amp  => '&', quot => '"', apos => "'", lt   => '<', gt   => '>', 
	   xmlpio => "'<?xml'",             # uses &lt;
	   hcro   => "&#x",                 # uses &amp
	   nbsp   => ' ', '#160' => ' ',    # def is commented out in the REC 
         );

open( REC, "/article/ways_to_rome/ltREC_xml_19980210.xml") or die $!;
my $doc;
{
 local $/ = undef;
 $doc=<REC>;
}
close REC;

# load entities, breaks for entities using other entities
while( $doc=~ /<!ENTITY\s+(\w+)\s+(["'])(.*?)\2\s*>/sg)
  { $ent{$1} ||= $3; } # use ||= to avoid redefining entities


my $flags={};
my $parser = new XML::Parser::Lite  Handlers => {
        Start => sub {  
            my ( $p, $el) = @_;
            if    ( $el eq 'rhs' ) { $flags->{in_rhs}=1 }
            elsif ( $el eq 'lhs' ) { $flags->{in_lhs}=1 }
        },
        Char  => sub {
            my ( $p, $txt) = @_;
            if    ($flags->{in_lhs}) {$flags->{lhs} .= $txt}
            elsif ($flags->{in_rhs}) {$flags->{rhs} .= $txt}
        },
        End   => sub {
            my ( $p, $el) = @_;
            if    ( $el eq 'rhs' ) { $flags->{in_rhs}=0; }
            elsif ( $el eq 'lhs' ) { $flags->{in_lhs}=0; }
            elsif ( $el eq 'prod') { push @{$flags->{production}}, 
                                          production(++$flags->{i},$flags->{lhs},$flags->{rhs});
                                     $flags->{lhs}= ''; $flags->{rhs}= '';
                                   }
        }
};
$parser->parse( $doc);

foreach my $prod ( @{$flags->{production}})
  { print clean( $prod), "\n"; }
    
sub production {
  my ($i,$lhs,$rhs) = @_;
  return "[$i] $lhs ::= $rhs";
}


sub clean { 
        my( $string)= @_;

	# replace entities
        $string=~ s{&(.*?);}{$ent{$1} or die "unknown entity $1\n"}eg; 
 
  	$string =~ s{\xc2\xa0}{ }g; # weird characters in the original document
        $string =~ s{\s+}{ }g; $string=~ s{\s$}{};
        return $string;
}

3.5 XML::DOM

The DOM is W3Cs standard API for XML. The Perl implementation (well, any implementation of the DOM I know to be exact) offers a bunch of convenience methods. In the example below none of them are used, just to keep things clean. Although DOM looks sometimes quite un-Perlish, especially in its method names, it has become very popular (contrary to what the original version of this article predicted ;--).

As XML::DOM has to build an in memory representation of the document, there is a huge impact on execution time. The example takes 4 times longer then the slowest event-based approach. There may be optimization to reduce this, but building a DOM will cost.

The first example uses features from the Perl implementation of the DOM, that stray from the strict Java API defined by the W3C. The DOM implementation in Perl offers some extra features that make it more legible (at the cost of deviating slightly from the standard). In the following example we take advantage of the fact that getElementsByTagName and getChildNodes both return a list of nodes in a list context, thus simplifying the code a lot.

[download ex_ps_dom2]

#!/bin/perl -w

# Using XML::DOM
# Authors: Ingo Macherius <macherius@gmd.de>
# modified by Michel Rodriguez <mirod@xmltwig.com>
# with input from the XML::DOM author Enno Derksen <enno@erols.com>
# This example uses features from the Perl implementation of the DOM,
# namely the fact that you can get an array from the getElementsByTagName
# method, as in @nodes = $doc->getElementsByTagName ("prod")

use strict;
use XML::DOM;

my $parser = XML::DOM::Parser->new;
my $doc = $parser->parsefile ("/article/ways_to_rome/REC_xml_19980210.xml");

my @nodes = $doc->getElementsByTagName ("prod");

my $i=0;

foreach my $node(@nodes) {
    my $lhs = $node->getElementsByTagName("lhs")->item(0);
    my @rhs = $node->getElementsByTagName("rhs");

    $i++;
    my $prod=   "[$i] "
              . $lhs->getFirstChild->getNodeValue()
              . " ::= "
              . rhs(@rhs);
    print clean( $prod), "\n";
}

sub rhs {
    my $text;
    foreach my $rhs (@_)
    {
     my @nodes = $rhs->getChildNodes();

     foreach my $node (@nodes ) {
         if ($node->getNodeType() == XML::DOM::Node::ELEMENT_NODE()) {
           $text .= $node->getFirstChild()->getNodeValue()
             unless( $node->getFirstChild()->getNodeName eq '#comment');
     } else {
        $text .= $node->getNodeValue()
          unless( $node->getNodeName eq '#comment');
        }
      }
    }
    return $text;
}

sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}g;
        return $string;
}

The next version is "pure" DOM, using only methods that exist in the Java API.

[download ex_ps_dom]

#!/usr/bin/perl -w

# Using XML::DOM
# Authors:     Ingo Macherius <macherius@gmd.de>
#          and Michel Rodriguez <m.v.rodriguez@ieee.org>
# with input from the XML::DOM author Enno Derksen <enno@erols.com>

use strict;
use XML::DOM;

my $parser = XML::DOM::Parser->new;
my $doc = $parser->parsefile ("/article/ways_to_rome/REC_xml_19980210.xml");

my $node_list = $doc->getElementsByTagName ("prod");
my @nodes;
my $len_node_list = $node_list->getLength();
for (my $j = 0; $j < $len_node_list; $j++) {
  push @nodes, $node_list->item($j); }

my $i=0;

foreach my $node(@nodes) {
    my $lhs = $node->getElementsByTagName("lhs")->item(0);

    my $rhs_list = $node->getElementsByTagName("rhs");
    my $len = $rhs_list->getLength();
    my @rhs;
    for (my $j = 0; $j < $len; $j++) {
      push @rhs, $rhs_list->item($j);
   }

    $i++;
    my $prod=   "[$i] "
              . $lhs->getFirstChild->getNodeValue()
              . " ::= "
              . rhs(@rhs);
    $prod= clean( $prod);
    print $prod, "\n";
}

sub rhs {
    my $text;
    foreach my $rhs (@_)
    {
     my $node_list = $rhs->getChildNodes();
     my $len_node_list = $node_list->getLength();
     my @nodes;
     for (my $j = 0; $j < $len_node_list; $j++) {
       push @nodes, $node_list->item($j); }

     foreach my $node (@nodes ) {
         if ($node->getNodeType() == XML::DOM::Node::ELEMENT_NODE()) {
           $text .= $node->getFirstChild()->getNodeValue()
             unless( $node->getFirstChild()->getNodeName eq '#comment');
     } else {
        $text .= $node->getNodeValue()
          unless( $node->getNodeName eq '#comment');
        }
      }
    }
    return $text;
}

sub clean { 
        my( $string)= @_;
	$string =~ s/\xc2\xa0/ /sg; # weird characters in the original document
        $string =~ s/\s+/ /g; $string=~ s{^\s}{}g; $string=~ s{\s$}{}g;
        return $string;
}

3.6 XML::XQL

XML::XQL implements an XML query language once proposed to the W3C. It looks very much like a draft for XPath. It is quite powerful, and the code is quite simple. Alas the language was never adopted by the W3C, and the module is not maintained any more. It is still usable though, being sometimes easier to use than XML::DOM (on which it is based) and only slightly slower.

A nice feature of XML::XQL is the fact that all the results are XML::DOM nodes, so the results of a query can be processed further using DOM methods.

[download ex_ps_xql]

#!/usr/bin/perl -w
# Using XML::XQL
# Author: Michel Rodriguez <mirod@xmltwig.com>
# note that this code behaves slightly differently than the rest of
# the examples, some spaces are added in place. Patch welcome ;--)

use strict;
use XML::XQL;
use XML::XQL::DOM;

my $parser = XML::DOM::Parser->new();
my $doc = $parser->parsefile ("/article/ways_to_rome/REC_xml_19980210.xml");

# get all the prods
my @prod=  $doc->xql('//prod');

# define the queries used to get the lhs and rhs from the prod
my $lhs_query = XML::XQL::Query->new (Expr => "./lhs");
my $rhs_query = XML::XQL::Query->new (Expr => "./rhs");


my $i=0;
foreach my $prod (@prod) {
        my @lhs= $lhs_query->solve( $prod); # solve the query in $prod context
        my $lhs= $lhs[0]->xql_toString;     # only one lhs

        my @rhs= $rhs_query->solve( $prod); 
        my $rhs= join( '', map { $_->xql_toString } @rhs); # maybe several rhs

        $i++;
        my $prod_text= "[$i] $lhs ::= $rhs";
        print clean( $prod_text), "\n";
}

sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{};
        return $string;
}

3.7 XML::Twig

XML::Twig is an hybrid module, that uses a push-pull model to process XML. As it parses the document it loads it into a tree that can be purged when parts of it are no longer needed. Handlers are called for each element once they have been parsed and their sub-tree has been built. Furthermore you don't need to build the entire tree but you can limit the processing to the actual set of elements you are really interested in. This makes this exemple an optimal one for XML::Twig. [Full disclosure from the maintainer of the article: I am the author of XML::Twig ;--]

XML::Twig allows you to define a handler for the prod element, which receives the element subtree including the lhs and rhs elements. The text of those elements is retrieved using the field and text methods. The rest of the document is ignored during the parse.

[download ex_ps_twig]

#!/usr/bin/perl -w
# Using XML::Twig
# Author: Michel Rodriguez <m.v.rodriguez@ieee.org>

use strict;
use XML::Twig;

my $i=0;

my $twig = XML::Twig->new(
                twig_roots =>          # will build the tree only for prod
                  { prod => \&prod },  # elements, and call the prod sub 
                keep_spaces => 1,      # spaces will be dealt with by clean()
               );

$twig->parsefile('/article/ways_to_rome/REC_xml_19980210.xml');


sub prod 
  { my( $twig, $prod)= @_;
    my $lhs= $prod->field( 'lhs');
    my $rhs= join '', map {$_->text} $prod->children( 'rhs');

    $i++;
    my $prod_text = "[$i] $lhs ::= $rhs";
    print clean( $prod_text) . "\n";
 }


sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}g;
        return $string;
}

3.8 XML::PYX

XML::PYX generates a line-oriented output from the XML source.The module itself is not used but the XML file is filtered by the pyx tool.

PYX is really convenient for this kind of task, that require extracting data from an XML document. It allows programers to use familiar line-oriented techniques without bothering about the syntax of XML.

[download ex_ps_pyx]

#!/usr/bin/perl -w

# Using XML::PYX
# Author: Michel Rodriguez <m.v.rodriguez@ieee.org>

use strict;

my $i=0;
my $write=0;
my $first_rhs;
my $prod='';

# REC gets the output of pyx REC-xml-19980210.xml
open( REC, 'pyx REC-xml-19980210.xml |');

while( <REC>)
  { chomp;
    if(    m/^\(prod$/)     { $i++;
                              $prod= "[$i] "; 
                              $first_rhs=1;
                            }
    elsif( m/^\(lhs$/)      { $write=1 ;  }
    elsif( m/^\)lhs$/)      { $write=0 ;  }
    elsif( m/^\(rhs$/)      { $write=1 ;   
                              $prod .= " ::= " if( $first_rhs); 
                              $first_rhs=0;          
                            }
    elsif( m/^-/ && $write) { if( /^-\\n/) { $prod .= ' ';            }
                              else         { $prod .= substr( $_, 1); }
                            }

    elsif( m/^\)rhs$/)      { $write=0 ; }
    elsif( m/^\)prod$/)     { print clean( $prod), "\n";              }
  }

close REC;

sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g;
        $string=~ s{\s$}{}g;
        return $string;
}                               

3.9 XML::DT

XML::DT provides stream-oriented processing in an Omnimark-like way.

XML::DT allows a script to set handlers on elements. A global variable, $c receives the current state of the parser, depending on how the handler is set. In this code the -type => {prod => MMAPON("rhs", "com","wfc","vc" )} option means that prod children are mapped by element names, with their values being stored in a list (retrieved by the @{$c->{rhs}} part in the handler.

[download ex_ps_dt]

#!/usr/bin/perl -w
# Using XML::DT
# Author: Jose Joao Dias de Almeida <jj@di.uminho.pt>
# modified by Michel Rodriguez <mirod@xmltwig.com>

use strict;
use XML::DT ;
my $i=0;

my %handler=(
     -type => { prod => MMAPON("rhs","com","wfc","vc") },
     -default => sub{"$c"},
     com  => sub{""},                            #remove comments
     prod => sub{ $i++;
         my $prod= "[$i] $c->{lhs} ::= ". join("" , @{$c->{rhs}});
         print clean( $prod), "\n";
       }
     );

dt('/article/ways_to_rome/REC_xml_19980210.xml',%handler);  

sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}g;
        return $string;
}      

XML::DT leads to very compact, if somehow cryptic (but I have no Omnimark programming experience) code.

3.10 XML::XPath

XML::XPath is an implementation of W3C's XPATH language. After writing most of the other examples it was really easy to build this one, even without knowing much about XPATH. It is definitely slow but very powerful.

[download ex_ps_xpath]

#!/usr/bin/perl -w
# Author: Michel Rodriguez <m.v.rodriguez@ieee.org>
 
use strict;
use XML::XPath;
use XML::XPath::XMLParser;
 
my $xpp = XML::XPath::XMLParser->new(filename => '/article/ways_to_rome/REC_xml_19980210.xml');
my $tree = $xpp->parse;                            # build the XPATH tree structure
my $xp = XML::XPath->new(context => $tree);        # xp is the xpath object

my @prods = $xp->find("//prod")->get_nodelist; 

my $i=0;
foreach my $prod (@prods) 
#{ my $lhs = $xp->find('string(./lhs/text())', $prod);
  { my $lhs = $xp->find('./lhs', $prod)->to_literal;
    my $rhs = $xp->find('./rhs', $prod)->to_literal;

    $i++;
    my $prod_text= "[$i] $lhs ::= $rhs\n";
    $prod_text= clean( $prod_text);
    print $prod_text, "\n";
  }

sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}g;
        return $string;
}

3.11 XML::Grove

XML::Grove is a tree-based module that allows the creation of visitors on a document. Visitors are defined in a separate Perl package, with fixed function names such as visit_document, visit_element... or names derived from generic names such as visit_name_prod, called when an element prod is visited. The module selects which function is called, depending on the nature (and sometimes element name) of the node being visited. Access to the properties is done through a regular Perl hash reference, as in $characters->{Data}.

At first it is a bit surprising to use (at least for someone who does not regularly use visitors), but it turned out to be quite easy to write the example.

The result is slow, as with most of the tree-base modules.

[download ex_ps_grove]

#!/bin/perl -w

# author Michel Rodriguez <m.v.rodriguez@ieee.org>

use strict;


# Basic parsing and grove building
use XML::Grove::Builder;
use Data::Grove::Visitor;
use XML::Grove::AsString;
use XML::Parser::PerlSAX;

my $store= 0;
my $i=0;
my $prod;

my $grove_builder = XML::Grove::Builder->new;
my $parser        = XML::Parser::PerlSAX->new ( Handler => $grove_builder );
my $grove;

my $visitor = new TitleVisitor;              # create the visitor

# create the grove object
$grove= $parser->parse ( Source => { SystemId => '/article/ways_to_rome/REC_xml_19980210.xml' }  );  
$grove->accept( $visitor);                   # visit the grove


# This package is used by the $grove->accept call

package TitleVisitor;

sub new 
  { my $class = shift;
    return bless {}, $class;
  }

sub visit_document                            # first method called
  { my ($self, $grove)= @_;
    $grove->children_accept_name ($self);     # visit all children of the doc
  }

sub visit_element                             # called if no visit_element_<tag>exists for the element
  { my( $self, $element)= @_;                
    $element->children_accept_name ($self);   # visit children
  }

sub visit_characters 
  { my( $self, $characters)= @_;              # called for characters in elements
    $prod .= $characters->{Data} if( $store); # store the text when required
  }

sub visit_name_prod                           # called for elements prod
  { my( $self, $element )= @_;
    $i++;
    $prod= "[$i] ";
    $element->children_accept_name ($self);   # visit children, updates $prod
    print clean( $prod), "\n";
}

sub visit_name_lhs                            # called for elements lhs
  { my( $self, $element )= @_;
    $store= 1;                                # store text in lhs
    $element->children_accept_name ($self);
    $store= 0;
    $prod .= " ::= ";
  }

sub visit_name_rhs                            # called for elements lhs
  { my( $self, $element )= @_;
    $store= 1;                                # store text in lhs
    $element->children_accept_name ($self);
    $store= 0;
  }

sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}g;
        return $string;
}

3.12 XML::TokeParser

XML::TokeParser is interesting for 2 main reasons:

XML::TokeParser's most important method is get_token, which returns the next token as a data structure (an arry reference), with the type of token (the first element of the array) and additional values depending on the type of token.

XML::TokeParser is extremelly well-suited for the kind of task at hand: the code is very simple, especially as the get_text method gives the text of an element including the text of sub-elements.

The benchmark shows it is quite slow for a non-tree module.

[download ex_ps_tokeparser]

#!/usr/bin/perl -w
# Using XML::TokeParser
# Author: D.H. http://search.cpan.org/author/PODMASTER
# additional comments by Michel Rodriguez

use strict;

use XML::TokeParser;

my $file = '/article/ways_to_rome/REC_xml_19980210.xml';
my $i = 0;
my $p = XML::TokeParser->new($file);


my $Ret = "";

# go through the document, reading tokens
while(defined(my $t = $p->get_token() )){
    if($t->[0] eq 'S' and $t->[1] eq 'lhs') {
        # found the start tag for an 'lhs' element: get its text
        $i++;
        $Ret = join '', "[$i] ", $p->get_text('/lhs'), " ::= ";
    }elsif( $t->[0] eq 'S' and $t->[1] eq 'rhs'){
        # start tag for a 'rh's element: get its text
        $Ret .= $p->get_text('/rhs');
    }elsif($t->[0] eq 'E' and $t->[1] eq 'prod'){
        # end tag for a 'prod' element: output the rule 
        print clean($Ret),"\n";
        $Ret = "";
    }
}

undef $Ret;
undef $p;

## mirod already did this, so I'm borrowing

sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}g;
        return $string;
}

3.13 XML::LibXML

XML::LibXML is the Perl interface to Gnome's libxml2 library (which needs to be installed). It offers both DOM and SAX interfaces and it is very fast! DOM and SAX interfaces mean that there is a reference definition for the API of the module, contrary to most others, and especially to XML::Parser (and to the expat library it is based on).

The fact that XML::LibXML is not based on XML::Parser, is also important as this module is not well maintained these days.

The first example uses the DOM mode and was very easy to write and looks exactly like the one XML::XPath, as XML::LibXML includes a SAX engine, with just a couple of different method calls.

XML::LibXML is fast: despite building the whole DOM for the document it still outperforms even the fastest XML::Parser based code.

This is definitely the fastest and one of the easiest to use module available.

[download ex_ps_libxml]

#!/usr/bin/perl -w
# Author: Michel Rodriguez <m.v.rodriguez@ieee.org>
 
use strict;

use XML::LibXML;

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file( '/article/ways_to_rome/REC_xml_19980210.xml');
 
my @prods = $doc->findnodes("//prod");

my $i=0;

foreach my $prod (@prods) 
  { my $lhs =  $prod->findvalue('./lhs');
    my $rhs =  $prod->findvalue('./rhs');

    $i++;
    my $prod_text= "[$i] $lhs ::= $rhs";
    print clean( $prod_text), "\n";
  }

sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}g;
        return $string;
}

3.14 XML::LibXML::SAX

This code uses XML::LibXML in SAX mode, so it does not have to build the DOM for the entire document before starting processing.

I found setting up a SAX handler a tad cumbersome (the package needs to be an object) but overall using SAX is very similar to using XML::Parser's native mode, with the benefit of a standard API, better namespace support and the ability to chain SAX filters, possibly implemented using different modules. Once used to it it is undoubtly a better choice than XML::Parser's specific interface.

Curiously this is much slower than "native" XML::LibXML, despite not having to build the entire DOM. It is still pretty fast though.

[download ex_ps_sax_libxml]

#!/usr/bin/perl -w
# Using XML::LibXML::SAX
# author: Michel Rodriguez <mirod@xmltwig.com>
# with help from Aaron Straup Cope and Wolfgang Loch

use strict;

use XML::LibXML::SAX::Parser;
my $handler= my_handler->new();

my $parser = XML::LibXML::SAX::Parser->new( Handler => $handler);
$parser->parse_uri( '/article/ways_to_rome/REC_xml_19980210.xml');

package my_handler;

use base qw(XML::SAX::Base);

sub new { return bless {}; }

sub start_element {
	my ($h, $el) = @_;
	if ($el->{Name} eq 'rhs')     { $h->in_rhs = 1; } 
	elsif ($el->{Name} eq 'lhs')  { $h->in_lhs = 1; }
	elsif ($el->{Name} eq 'prod') { $h->rhs = ''; $h->lhs = ''; }
}

sub end_element {
	my ($h, $el) = @_;
	if ($el->{Name} eq 'rhs') {
		$h->in_rhs = undef;
	} 
	elsif ($el->{Name} eq 'lhs') {
		$h->in_lhs = undef;
	}
	elsif ($el->{Name} eq 'prod') {
		$h->counter++;
		my $prod = "[" . $h->counter. "] " . $h->lhs ." ::= " .$h->rhs ;
                $prod= clean( $prod);
	        print $prod . "\n";
	}
}

sub characters {
	my ($h, $characters) = @_;
	if ($h->in_lhs) {
		$h->lhs .= $characters->{Data};
	}
	elsif ($h->in_rhs) {
		$h->rhs .= $characters->{Data};
	}
}

# accessors, to be completely OO kosher
# works only for perl 5.6.0 and above
sub in_lhs  : lvalue { my $self= shift; $self->{in_lhs} ; }
sub lhs     : lvalue { my $self= shift; $self->{lhs}    ; }
sub in_rhs  : lvalue { my $self= shift; $self->{in_rhs} ; }
sub rhs     : lvalue { my $self= shift; $self->{rhs}    ; }
sub counter : lvalue { my $self= shift; $self->{counter}; }

sub clean { 
        my( $s)= @_;
        $s=~ s/\xc2\xa0/ /sg; 
	$s=~ s/\s+/ /g; $s=~ s{\s$}{};
        return $s;
}


3.15 XML::SAX::Expat

This module is a SAX layer over XML::Parser, giving it a standard SAX API.

The code is nearly the same as the one for XML::LibXML::SAX. I just had to change the call to parse_file to a call to parse_uri as parse_file does not seem to be working in version 0.35 of the module. The code also spits out lots of warnings, due to namespaces not being used in the XML document.

Note that installing the module was quite a pain, because a similar module, XML::SAX::Expat-XS, which should replace XML::SAX::Expat when it's ready for prime time, was installed as XML::SAX::Expat. Re-installing XML::SAX::Expat did not work as it was installed in a lower-priority directory, so I had to go and remove the annoying file from the Perl library directory. This should be fixed in a future release of XML::SAX::Expat-XS (and thanks to Petr Cimprich for pointing this out).

Of course XML::SAX::Expat is slower than XML::Parser, but you gain a standard API. XML::SAX::Expat-XS should be faster.

[download ex_ps_sax_expat]

#!/usr/bin/perl -w
# Using XML::SAX::Expat
# author: Michel Rodriguez <mirod@xmltwig.com>

use strict;

use XML::SAX::Expat;
my $handler= my_handler->new();

my $parser = XML::SAX::Expat->new( Handler => $handler,
                                   Features => { 'http://xml.org/sax/features/namespaces' => 0, },

);
$parser->parse_uri( '/article/ways_to_rome/REC_xml_19980210.xml');

exit;

package my_handler;

use base qw(XML::SAX::Base);

sub new { return bless {}; }

sub start_element {
	my ($h, $el) = @_;
	if ($el->{Name} eq 'rhs')     { $h->in_rhs = 1; } 
	elsif ($el->{Name} eq 'lhs')  { $h->in_lhs = 1; }
	elsif ($el->{Name} eq 'prod') { $h->rhs = ''; $h->lhs = ''; }
}

sub end_element {
	my ($h, $el) = @_;
	if ($el->{Name} eq 'rhs') {
		$h->in_rhs = undef;
	} 
	elsif ($el->{Name} eq 'lhs') {
		$h->in_lhs = undef;
	}
	elsif ($el->{Name} eq 'prod') {
		$h->counter++;
		my $prod = "[" . $h->counter. "] " . $h->lhs ." ::= " .$h->rhs ;
                $prod= clean( $prod);
	        print $prod . "\n";
	}
}

sub characters {
	my ($h, $characters) = @_;
	if ($h->in_lhs) {
		$h->lhs .= $characters->{Data};
	}
	elsif ($h->in_rhs) {
		$h->rhs .= $characters->{Data};
	}
}

# accessors, to be completely OO kosher
# works only for perl 5.6.0 and above
sub in_lhs  : lvalue { my $self= shift; $self->{in_lhs} ; }
sub lhs     : lvalue { my $self= shift; $self->{lhs}    ; }
sub in_rhs  : lvalue { my $self= shift; $self->{in_rhs} ; }
sub rhs     : lvalue { my $self= shift; $self->{rhs}    ; }
sub counter : lvalue { my $self= shift; $self->{counter}; }

sub clean { 
        my( $s)= @_;
        $s=~ s/\xc2\xa0/ /sg; 
	$s=~ s/\s+/ /g; $s=~ s{\s$}{};
        return $s;
}


3.16 XML::XSLT

XML::XSLT is a pure Perl implementation of XSLT. It is still a work-in-progress but is already quite convenient. The example includes an xslt script, launched from a Perl script that also does some post-processing, general formatting (XSLT is more aimed at an XML output where additional whitespaces can be ignored) and adding the rule number (the position function that should display that number is not yet implemented, it might become available in a future version, the output is not text-oriented).

Considering XML::XSLT is based on XML::DOM (the XSLT script is applied to a DOM of the original document) the performances are pretty good, about twice as slow as XML::DOM.

Other modules provide interfaces with non-perl XSLT engines: XML::LibXSLT (see below), XML::Sablotron and XML::Xalan. These engines are generally faster and implement more of the XSLT spec than XML::XSLT.

The XSL script is:

ex_ps_xslt.xslt

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />
  <xsl:template match='/index.html'>
        <xsl:for-each select=".//prod">
		[<xsl:value-of select="position()" />] <xsl:value-of select="lhs" /> ::= 
		<xsl:for-each select=".//rhs"> <xsl:value-of select="." /></xsl:for-each>
        </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

The Perl wrapper:

[download ex_ps_xslt]

#!/usr/bin/perl -w

# author G.P.H. Josten <gjosten@sci.kun.nl>
# post processing by Michel Rodriguez <m.v.rodriguez@ieee.org>

use strict;
use XML::XSLT qw(serve); 

# we need to replace the entities as <xml:output mode="text" /> does not work
my %ent= ( amp  => '&', quot => '"', apos => "'",
	   lt   => '<', gt   => '>', '#160' => ' ', 
         );

set_filter();    # sets the post-processing filter
my $xslt   = XML::XSLT->new( '/article/ways_to_rome/ex_ps_xslt.xslt');
print $xslt->serve( Source => '/article/ways_to_rome/REC_xml_19980210.xml');

exit;

# filter applied to the output of the XSLT transformation
sub set_filter
  { return if my $pid= open( STDOUT, "|-");    # magic open
    die "cannot fork: $!" unless defined $pid;
    $|=1;
    my $i=0;
    my $prod='';
    while( <STDIN>)
      { if( m/^\s*\[\]/)
          { unless( $i==0)
              { $prod= add_pos( $prod, $i); 
                print clean( $prod), "\n" ;
              }
            $i++;
            $prod= '';
          }
          $prod .= $_;
        }
    $prod= add_pos( $prod, $i); 
    print clean( $prod), "\n"; 
    exit;
  }
        
        
sub clean { 
        my( $string)= @_;

	# replace entities
        $string=~ s{&(.*?);}{$ent{$1}}g;  
       	
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}; $string=~ s{^\s}{};
        return $string;
  }

# needed because XSLT does no yet implement position()
sub add_pos
  { my( $prod, $i)= @_;
    $prod=~ s{^\s*\[\]}{\n[$i]};
    return $prod;
  }

3.17 XML::LibXSLT

XML::LibXSLT is the Perl interface to the GNOME libxslt library. You need to install libxml2 and libxslt to use it (I found them very easy to install).

The stylesheet is exactly the same as in the previous example, the code is straight from the module documentation, plus a simplified version of the post-processing filter used for XML::XSLT. It took about 2 minutes to write and to get to work. Of course the fact that the stylesheet and the post-processing function were already written helped a lot!

The post-processing code is a good example of why you want to use Perl and XSLT together: XSLT handles the XML processing and Perl takes care of dealing easily with the details of the text output.

XML::LibXSLT (in fact libxslt itself) is really fast, just as fast as XML::LibXML in this case.

[download ex_ps_libxslt]

#!/usr/bin/perl -w
# Author: Michel Rodriguez <m.v.rodriguez@ieee.org>
 
use strict;

use XML::LibXSLT;
use XML::LibXML;

my $parser = XML::LibXML->new();
my $xslt = XML::LibXSLT->new();

my $source = $parser->parse_file('/article/ways_to_rome/REC_xml_19980210.xml');
my $style_doc = $parser->parse_file('/article/ways_to_rome/ex_ps_xslt.xslt');

my $stylesheet = $xslt->parse_stylesheet($style_doc);

my $results = $stylesheet->transform($source);

set_filter();    # sets the post-processing filter
print $stylesheet->output_string($results);


# filter applied to the output of the XSLT transformation
sub set_filter
  { return if my $pid= open( STDOUT, "|-");    # magic open
    die "cannot fork: $!" unless defined $pid;
    $|=1;
    my $prod='';
    while( <STDIN>)
      { if( m/^\s*\[\d+\]/) # found the beginning of a production
          { print( clean($prod), "\n") if( $prod=~ /\S/);
            $prod= '';
          }
          $prod .= $_;
        }
    print clean( $prod),"\n";
    exit;
  }
        
sub clean { 
        my( $string)= @_;
        $string =~ s/\xc2\xa0/ /sg;
        $string =~ s/\s+/ /g; $string=~ s{\s$}{}; $string=~ s{^\s}{};
        return $string;
}

3.18 To Do

There should certainly be more SAX examples here, with XML::SAX::PurePerl, and XML::SAX::Expat.

I also plan to add other XSLT transformations, using the other XSLT modules (I have found Sablotron and Xalan harder to install than libxslt).

Implementations in other languages are also welcome!.

4. Benchmark

The benchmark was run on a Dual-Athlon 1800 PC running Linux Mandrake 9.0, with perl 5.6.1. Results are generated by the ps_benchmark script.

Here are the timings for running the different versions:

ModuleVersionTiming (benchmark)Factor
XML::Parser2.310.12100
XML::Parser (alt) 0.73608
Perl regexp 0.03 24
XML::Parser (subs) 0.16133
XML::DOM1.410.46383
XML::XQL0.670.57474
XML::Twig3.090.33275
XML::PYX0.070.28233
XML::DT0.200.29241
XML::XPath1.121.261050
XML::Grove0.46alpha0.85708
XML::LibXML1.530.15124
XML::LibXML::SAX 0.15125
XML::SAX::Expat0.350.46383
XML::XSLT0.400.63524
XML::LibXSLT1.520.14116
XML::Parser::Lite0.550.1 83
XML::TokeParser0.040.37308

benchmark environment: perl 5.006001 on linux - Wed Nov 20 12:32:42 2002

5. Conclusions

Since the first version of XML::Parser, in 1998, a lot of work has gone into building powerful and easy-to-use modules to process XML in Perl. A lot of the modules shown here are really useful and can easily perform the kind of task presented here.

I think that in the last year or two there has been quite a change in the Perl/XMl landscape though. New modules are moving away from XML::Parser and using XML::LibXML (for power) or XML::SAX::PurePerl (for ease of installation). This allows users and module developers to rely on standard APIs instead of using the old (pre-SAX) XML::Parser one.

As it is Perl is very well suited to processing XML and offers (way!) more than one way to do it.

Notes:


copyright 1999-2002 by GMD-IPSI (Ingo Macherius) and Michel Rodriguez.
This article is free documentation; you can redistribute it and/or modify it under the same terms as Perl itself.