Essentially per Arjun Ray's suggestion of 20Feb2000:
Taking the examples from the pyxie article, example 1:
<Person> <?A4TypeSetter PageBreak?> <Surname>McGrath</Surname> <Given>Sean</Given> <e-mail type="internet">sean@digitome.com</e-mail> </Person>
becomes
<Person
> <?A4TypeSetter PageBreak
?> <Surname
>McGrath</Surname
> <Given
>Sean</Given
> <e-mail
type="internet"
>sean@digitome.com</e-mail
> </Person
>
Example 2:
<?xml version="1.0" encoding="us-ascii"?> <!DOCTYPE foo SYSTEM "http://www.digitome.com/foo.dtd"> <!-- This document has a <foo> element --> <foo not="not"> <![CDATA[ Although this looks like another <foo> start-tag it is not. ]]>  Hello </foo>
becomes
<?xml version="1.0" encoding="us-ascii"
?><!DOCTYPE foo
SYSTEM "http://www.digitome.com/foo.dtd">
<!-- This document has a <foo> element
--> <foo
not="not"
> <![CDATA[ Although this looks like another <foo> start-tag it is not. ]]>  Hello </foo
>
counting the number of foo elements:
grep '<foo$' my.xml
finding 'not' as an attribute name:
grep '^not=' my.xml
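The same line-anchored matching works in any language with regular expressions. Below is a minimal Python sketch of the two greps, assuming the document has already been converted so that line breaks fall only inside markup. The sample text and the helper names `count_start_tags` and `has_attribute` are made up for illustration.

```python
import re

# Hypothetical sample in the line-oriented form: newlines occur only inside markup.
SAMPLE = """<!-- This document has a <foo> element
--><foo
not="not"
><![CDATA[ Although this looks like another <foo> start-tag it is not. ]]>  Hello </foo
>
"""


def count_start_tags(text, name):
    """Count lines that end with '<name', like grep -c '<name$'."""
    pattern = re.compile(r"<" + re.escape(name) + r"$", re.MULTILINE)
    return len(pattern.findall(text))


def has_attribute(text, name):
    """Return True if some line starts with 'name=', like grep '^name='."""
    pattern = re.compile(r"^" + re.escape(name) + r"=", re.MULTILINE)
    return pattern.search(text) is not None


print(count_start_tags(SAMPLE, "foo"))
print(has_attribute(SAMPLE, "not"))
```

The `<foo>` inside the comment and the one inside the CDATA section are followed by `>` on the same line, and `</foo` has a `/` between `<` and the name, so none of them is counted as a start tag.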
the general case, in Perl:
while (<>) {
    chomp;    # drop the newline so it is never mistaken for data

    if (s/^(-->|\?>|>)//) {    # end of markup
        if (s/^([^<]+)//) {
            my ($data) = $1;
            # process data
        }
    }
    elsif (s/^SYSTEM "([^"]+)">//) {
        my ($sysid) = $1;    # may need unescaping?
    }
    elsif (s/^PUBLIC "([^"]+)"//) {
        my ($pubid) = $1;    # may need unescaping?
    }

    if (s/^<([^\s\/!?]\S*)//) {    # start tag (not </, <! or <?)
        my ($name) = $1;
    }
    elsif (s/^([a-zA-Z_][^=]*)="([^"]*)"//) {    # attribute
        my ($name, $val) = ($1, $2);
        # note that $val still has escaped <s and such
    }
    elsif (s,^</(\S+),,) {    # end tag
        my ($name) = $1;
    }
    elsif (s,^<!--(.*),,) {    # comment
        my ($comment) = $1;
    }
    elsif (s,^<\?(.*),,) {    # processing instruction
        my ($pi) = $1;
    }
}
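For comparison, here is a rough Python sketch of the same line-at-a-time idea, written as a generator of events. It is a simplification under stated assumptions: attribute values are always double-quoted, there are no CDATA sections, and SYSTEM/PUBLIC identifiers are ignored. The function name `parse_lines` and the event names are hypothetical.

```python
import re


def parse_lines(lines):
    """Yield (kind, value) events from line-oriented XML, one line at a time."""
    for line in lines:
        line = line.rstrip("\n")
        # A leading -->, ?> or > closes markup opened on an earlier line.
        m = re.match(r"(-->|\?>|>)", line)
        if m:
            line = line[m.end():]
            m = re.match(r"[^<]+", line)
            if m:
                yield ("data", m.group(0))  # still escaped
                line = line[m.end():]
        # Whatever remains either opens new markup or is an attribute.
        if line.startswith("<!--"):
            yield ("comment", line[4:])
        elif line.startswith("<?"):
            yield ("pi", line[2:])
        elif line.startswith("</"):
            yield ("end", line[2:])
        elif line.startswith("<!"):
            yield ("declaration", line[2:])
        elif line.startswith("<"):
            yield ("start", line[1:])
        else:
            m = re.match(r'([A-Za-z_][^=]*)="([^"]*)"', line)
            if m:
                yield ("attribute", (m.group(1), m.group(2)))


sample = [
    "<Person\n",
    "><Surname\n",
    ">McGrath</Surname\n",
    "><e-mail\n",
    'type="internet"\n',
    ">sean@digitome.com</e-mail\n",
    "></Person\n",
    ">\n",
]
for event in parse_lines(sample):
    print(event)
```

Because a line break never occurs inside data, each line can be classified from its first few characters without keeping any parser state between lines.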
see also: XMLWriter.py, a SAX Handler that implements a tweak on this algorithm (plus some indentation stuff).
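XMLWriter.py itself is not reproduced here. As a rough idea of what such a handler could look like, the following is an independent minimal sketch using Python's standard `xml.sax` module, without the indentation logic. The class name `LineOrientedWriter` is hypothetical, and the choice to write newlines in data as `&#10;` is an assumption made here so that every physical line break stays inside markup.

```python
import io
import xml.sax
from xml.sax.saxutils import escape


class LineOrientedWriter(xml.sax.ContentHandler):
    """Write XML with line breaks only inside markup (illustrative sketch)."""

    def __init__(self, out):
        super().__init__()
        self._out = out

    def startElement(self, name, attrs):
        self._out.write("<" + name + "\n")
        for attr_name in attrs.getNames():
            # Always use double quotes so that grep '^name="' style matching works.
            value = escape(attrs.getValue(attr_name),
                           {'"': "&quot;", "\n": "&#10;"})
            self._out.write(attr_name + '="' + value + '"\n')
        self._out.write(">")

    def endElement(self, name):
        self._out.write("</" + name + "\n>")

    def characters(self, content):
        # Escape &, < and >, and keep newlines out of the data (assumption).
        self._out.write(escape(content, {"\n": "&#10;"}))

    def processingInstruction(self, target, data):
        self._out.write("<?" + target + " " + data + "\n?>")


source = (b'<Person><Surname>McGrath</Surname>'
          b'<e-mail type="internet">sean@digitome.com</e-mail></Person>')
out = io.StringIO()
xml.sax.parseString(source, LineOrientedWriter(out))
print(out.getvalue())
```

This sketch drops comments, the XML declaration, and the DOCTYPE, since a plain `ContentHandler` never sees them.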