Sunday, October 20, 2013

XML processing in shell.

Q: Have you ever struggled with XML processing in the shell?
A: There's an elegant tool: XMLStarlet (http://xmlstar.sourceforge.net/).

Why would you care?

Well, if you:
  • use the shell environment (bash in my case),
  • need to extract or transform XML data, and
  • know or are willing to learn XPath/XSLT,
then you should. In fact, it might be the perfect match.

Download/Installation

Follow the official docs. According to them, if you're on Linux it's as easy as it should be: "Bundled with your nearest Linux distribution"
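
A quick way to confirm the tool is actually installed is sketched below. It assumes the binary is called either xmlstarlet (the name used throughout this post) or xml (the name the built-in help text uses); the helper name find_xmlstarlet is made up for illustration.

```shell
# Hypothetical helper: print the name of the first XMLStarlet binary found on PATH.
# Assumption: the binary is installed as either "xmlstarlet" or "xml".
find_xmlstarlet() {
  for candidate in xmlstarlet xml; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

if bin=$(find_xmlstarlet); then
  "$bin" --version          # show which version is installed
else
  echo "XMLStarlet not found; install it with your distribution's package manager" >&2
fi
```

If the binary on your system turns out to be xml, substitute that name in the commands that follow.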

Usage

I'd recommend checking:
man xmlstarlet
and
xmlstarlet --help
I'm going to focus on just one part of the functionality, namely data extraction. To see the official help for it, run:
xmlstarlet sel
The output is quite impressive and even includes some examples. For the lazy (like me), here is a copy:
XMLStarlet Toolkit: Select from XML document(s)
Usage: xmlstarlet sel <global-options> {<template>} [ <xml-file> ... ]
where
  <global-options> - global options for selecting
  <xml-file> - input XML document file name/uri (stdin is used if missing)
  <template> - template for querying XML document with following syntax:
 
<global-options> are:
  -Q or --quiet             - do not write anything to standard output.
  -C or --comp              - display generated XSLT
  -R or --root              - print root element <xsl-select>
  -T or --text              - output is text (default is XML)
  -I or --indent            - indent output
  -D or --xml-decl          - do not omit xml declaration line
  -B or --noblanks          - remove insignificant spaces from XML tree
  -E or --encode <encoding> - output in the given encoding (utf-8, unicode...)
  -N <name>=<value>         - predefine namespaces (name without 'xmlns:')
                              ex: xsql=urn:oracle-xsql
                              Multiple -N options are allowed.
  --net                     - allow fetch DTDs or entities over network
  --help                    - display help
 
Syntax for templates: -t|--template <options>
where <options>
  -c or --copy-of <xpath>   - print copy of XPATH expression
  -v or --value-of <xpath>  - print value of XPATH expression
  -o or --output <string>   - output string literal
  -n or --nl                - print new line
  -f or --inp-name          - print input file name (or URL)
  -m or --match <xpath>     - match XPATH expression
  --var <name> <value> --break or
  --var <name>=<value>      - declare a variable (referenced by $name)
  -i or --if <test-xpath>   - check condition <xsl:if test="test-xpath">
  --elif <test-xpath>       - check condition if previous conditions failed
  --else                    - check if previous conditions failed
  -e or --elem <name>       - print out element <xsl:element name="name">
  -a or --attr <name>       - add attribute <xsl:attribute name="name">
  -b or --break             - break nesting
  -s or --sort op xpath     - sort in order (used after -m) where
  op is X:Y:Z,
      X is A - for order="ascending"
      X is D - for order="descending"
      Y is N - for data-type="numeric"
      Y is T - for data-type="text"
      Z is U - for case-order="upper-first"
      Z is L - for case-order="lower-first"
 
There can be multiple --match, --copy-of, --value-of, etc options
in a single template. The effect of applying command line templates
can be illustrated with the following XSLT analogue
 
xml sel -t -c "xpath0" -m "xpath1" -m "xpath2" -v "xpath3" \
        -t -m "xpath4" -c "xpath5"
 
is equivalent to applying the following XSLT
 
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
  <xsl:call-template name="t1"/>
  <xsl:call-template name="t2"/>
</xsl:template>
<xsl:template name="t1">
  <xsl:copy-of select="xpath0"/>
  <xsl:for-each select="xpath1">
    <xsl:for-each select="xpath2">
      <xsl:value-of select="xpath3"/>
    </xsl:for-each>
  </xsl:for-each>
</xsl:template>
<xsl:template name="t2">
  <xsl:for-each select="xpath4">
    <xsl:copy-of select="xpath5"/>
  </xsl:for-each>
</xsl:template>
</xsl:stylesheet>
 
XMLStarlet is a command line toolkit to query/edit/check/transform
XML documents (for more information see http://xmlstar.sourceforge.net/)
 
Current implementation uses libxslt from GNOME codebase as XSLT processor
(see http://xmlsoft.org/ for more details)
Please note the switch:
-C or --comp              - display generated XSLT
This can really help if you're struggling with a particular problem and want to see what's really happening in the background.

Let's jump directly to examples.

XmlStarlet in action

Imagine a sample XML file (copied from http://www.w3schools.com/xml/xml_attributes.asp) called sample.xml:
<messages>
  <note id="501">
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
  </note>
  <note id="502">
    <to>Jani</to>
    <from>Tove</from>
    <heading>Re: Reminder</heading>
    <body>I will not</body>
  </note>
</messages>
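
To follow along, the document above can be written to disk with a here-document. This is only a convenience sketch; the optional check at the end uses the val subcommand, which tests well-formedness, and is skipped when xmlstarlet is not on the PATH.

```shell
# Write the sample document to sample.xml in the current directory.
# The quoted 'EOF' keeps the shell from expanding anything inside the document.
cat > sample.xml <<'EOF'
<messages>
  <note id="501">
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
  </note>
  <note id="502">
    <to>Jani</to>
    <from>Tove</from>
    <heading>Re: Reminder</heading>
    <body>I will not</body>
  </note>
</messages>
EOF

# Optional sanity check: "val -w" only checks that the file is well-formed.
if command -v xmlstarlet >/dev/null 2>&1; then
  xmlstarlet val -w sample.xml
fi
```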


Element value selection

Let's say we want to extract the values of all 'from' elements. Using:
xmlstarlet sel -t -m "messages" -m "note" -v "from" < sample.xml
we get:
JaniTove
So what have we done?
  • sel - select data or query XML document
  • -t - template definition
  • -m "messages" - match "messages" XPath expression
  • -m "note" - match "note" XPath expression
  • -v "from" - print value of "from" XPath expression
  • < sample.xml - the test input file, fed in via stdin
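
Notice that the values run together in the output. A variation that puts each value on its own line is sketched below: it adds -n (print new line, from the help above), folds the two -m steps into a single XPath, and passes the file as an argument instead of via stdin. The function name list_senders is made up for illustration.

```shell
# Hypothetical helper: print each sender on its own line.
# -n emits a newline after every matched <note>, so the values no longer run together.
# $1 = XML file (defaults to sample.xml)
list_senders() {
  xmlstarlet sel -t -m "/messages/note" -v "from" -n "${1:-sample.xml}"
}
```

Calling list_senders sample.xml prints Jani and Tove on separate lines, which is far easier to consume from a while read loop.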
Do you want to see the XSLT used in the background? No problem, just run:
xmlstarlet sel -C -t -m "messages" -m "note" -v "from" < sample.xml
to see:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:exslt="http://exslt.org/common" version="1.0" extension-element-prefixes="exslt">
  <xsl:output omit-xml-declaration="yes" indent="no"/>
  <xsl:template match="/">
    <xsl:for-each select="messages">
      <xsl:for-each select="note">
        <xsl:call-template name="value-of-template">
          <xsl:with-param name="select" select="from"/>
        </xsl:call-template>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>
  <xsl:template name="value-of-template">
    <xsl:param name="select"/>
    <xsl:value-of select="$select"/>
    <xsl:for-each select="exslt:node-set($select)[position()&gt;1]">
      <xsl:value-of select="'&#10;'"/>
      <xsl:value-of select="."/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
Easy, right?

Selecting an XML element

Do you need the whole element, not just its value? Go for:
xmlstarlet sel -t -m "messages" -m "note" -c "from" < sample.xml
to see:
<from>Jani</from><from>Tove</from>
The only new flag introduced here is:
  • -c "from" - print a copy of the nodes matched by the "from" XPath expression
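
Since -c takes any XPath expression, it can also copy one specific element. The sketch below picks a whole note by its id with an XPath predicate; the function name copy_note is made up, and the id is pasted straight into the expression, so it assumes ids never contain a single quote.

```shell
# Hypothetical helper: copy the whole <note> element that has the given id.
# $1 = note id, $2 = XML file (defaults to sample.xml)
# Assumption: the id contains no single quote, since it is inserted into the XPath as-is.
copy_note() {
  xmlstarlet sel -t -c "/messages/note[@id='$1']" -n "${2:-sample.xml}"
}
```

For example, copy_note 502 sample.xml prints only the second note, including its child elements.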

Attribute values

Let's end the examples with attribute values. Assume we're interested in the id attribute of each note:
xmlstarlet sel -t -m "messages" -m "note" -v "@id" < sample.xml
should result in:
501502
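
Attribute and element values can be combined in one template. The following sketch pairs each id with its sender, using -o for a literal separator and -n for the line break (both listed in the help above); the function name list_ids_and_senders is made up for illustration.

```shell
# Hypothetical helper: print "id: sender" for every note, one pair per line.
# -o outputs a string literal, -n ends the line after each matched <note>.
# $1 = XML file (defaults to sample.xml)
list_ids_and_senders() {
  xmlstarlet sel -t -m "/messages/note" -v "@id" -o ": " -v "from" -n "${1:-sample.xml}"
}
```

For sample.xml this prints "501: Jani" and "502: Tove" on two lines.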

Conclusion

XMLStarlet is much more powerful than the examples above show.
However, my goal was not to cover it all (if interested, see the official docs) but rather to show you a tool that can help with XML processing in shell scripts.
It can save you from the unneeded complexity that sed/awk/grep solutions tend to introduce, as it understands the XML structure (it respects commented-out sections, for example) and brings the full power of XPath to your scripting.
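
The point about commented sections can be illustrated with a small sketch. The demo document and the sender "Nobody" are made up for this example: a plain text match counts the commented-out element, while an XPath count does not.

```shell
# Made-up demo document: one real <from> element and one that is commented out.
demo=$(mktemp)
cat > "$demo" <<'EOF'
<messages>
  <note id="501">
    <from>Jani</from>
  </note>
  <!-- <from>Nobody</from> -->
</messages>
EOF

# Text match: counts every line containing "<from>", comment included.
grep_count=$(grep -c '<from>' "$demo")

# XML-aware count: comments are not elements, so they are ignored.
xpath_count=""
if command -v xmlstarlet >/dev/null 2>&1; then
  xpath_count=$(xmlstarlet sel -t -v "count(//from)" "$demo")
fi

echo "grep sees $grep_count, XPath sees ${xpath_count:-n/a}"
rm -f "$demo"
```

Here grep reports 2 matching lines, whereas the XPath count is 1.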

I have to admit that the tool's development has somewhat stalled. Usually that's not a good sign. Still, I consider it mature and extremely useful.
