Count element occurrences in a group of XML files and provide a report

Andrew Welch
There is a requirement to provide a statistical analysis of a group of
XML files (1700 of them) all with a very similar structure, the
majority with the same values, but a few with different values than
the rest.

For example, 1699 of the files will have:

<root>
  <node value="A"/>
</root>

But 1 might have:

<root>
  <node value="B"/>
</root>

I need to generate a report like:

node value 'A' 1699
node value 'B' 1

This is reasonably simple to do in Java, but I was wondering whether it
is possible using XQuery and Saxon 8.5.1B?

The problem could be narrowed down to specific known elements, such as
"give me all the distinct values for //node/@value (across all 1700
files) and how many times they occur" - is this possible?


-------------------------------------------------------
This SF.Net email is sponsored by:
Power Architecture Resource Center: Free content, downloads, discussions,
and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
saxon-help mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/saxon-help

Re: Count element occurrences in a group of XML files and provide a report

David Carlisle

something like

select="count(collection('whatever you need to pick up all the xml')/
  saxon:discard-document(.)/root/node[@value='A'])"

I think...


RE: Count element occurrences in a group of XML files and provide a report

Michael Kay
In reply to this post by Andrew Welch
Something like this perhaps:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:saxon="http://saxon.sf.net/"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="saxon xs">

<xsl:output indent="yes"/>

<xsl:template name="main">
  <xsl:variable name="root-names" as="xs:QName*"
     select="for $x in collection('file:///c:/MyJava/tests/testsuite?select=*.xml;recurse=yes;on-error=ignore')
             return saxon:discard-document($x)/*/node-name(.)"/>
  <xsl:for-each-group select="$root-names" group-by=".">
    <xsl:sort select="count(current-group())" order="descending"/>
    <element name="{current-grouping-key()}"
             count="{count(current-group())}"/>
  </xsl:for-each-group>
</xsl:template>
   
</xsl:stylesheet>

Producing output:

<?xml version="1.0" encoding="UTF-8"?>
<element name="doc" count="1850"/>
<element name="far-north" count="599"/>
<element name="a" count="98"/>
<element name="docs" count="55"/>
<element name="t04" count="34"/>
<element name="xsl:stylesheet" count="25"/>
<element name="doc" count="25"/>
<element name="config" count="23"/>
<element name="data" count="19"/>
<element name="sales" count="19"/>
<element name="foo" count="18"/>
<element name="dummy" count="18"/>
<element name="root" count="17"/>
etc...


The logic here is designed to extract the required data from each document and
then discard the document from memory as soon as it is no longer needed.

This example processes about 3000 small XML files in 12 seconds.

I was puzzled about doc appearing twice in the list. It's actually two
different QNames with the same lexical representation, verified by adding
the attribute

uri="{namespace-uri-from-QName(current-grouping-key())}"

to the output.
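
For anyone taking the XQuery route, the same disambiguation could be sketched
roughly as follows. This is an untested adaptation, not part of the stylesheet
above: it reports the local name and the namespace URI of each root QName
separately, so two QNames that print identically end up on distinct rows. The
element name e and the overall query shape are illustrative assumptions; the
collection URI is the one used in the stylesheet.

```xquery
declare namespace saxon = "http://saxon.sf.net/";

(: Collect the QName of every root element, discarding each document
   once its name has been extracted. :)
let $root-names :=
    for $x in collection('file:///c:/MyJava/tests/testsuite?select=*.xml;recurse=yes;on-error=ignore')
    return saxon:discard-document($x)/*/node-name(.)
(: distinct-values compares QNames by namespace URI and local name,
   so same-looking names in different namespaces stay separate. :)
for $n in distinct-values($root-names)
let $c := count($root-names[. eq $n])
order by $c descending
return
    <e name="{local-name-from-QName($n)}"
       uri="{namespace-uri-from-QName($n)}"
       count="{$c}"/>
```

Names in no namespace should show an empty uri attribute.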

Michael Kay
http://www.saxonica.com/


Re: Count element occurrences in a group of XML files and provide a report

Andrew Welch
Thanks - really good examples - I wasn't thinking about XSLT but I'll
use it now :)

Out of interest - is there an XQuery equivalent?


RE: Count element occurrences in a group of XML files and provide a report

Michael Kay
In reply to this post by Andrew Welch
> This is reasonably simple to do in Java, but I was wondering whether it
> is possible using XQuery and Saxon 8.5.1B?

I just spotted you wanted an XQuery solution.

declare namespace saxon="http://saxon.sf.net/";
let $root-names :=
    for $x in collection('file:///c:/MyJava/tests/testsuite?select=*.xml;recurse=yes;on-error=ignore')
    return saxon:discard-document($x)/*/node-name(.)
let $distinct-root-names := distinct-values($root-names)
for $n in $distinct-root-names
let $c := count($root-names[. eq $n])
order by $c descending
return
    <e name="{$n}" count="{$c}"/>
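
The same pattern can presumably be pointed at the narrower question from the
original post, the distinct values of //node/@value across all the files. The
following is a minimal, untested sketch under stated assumptions: the
directory file:///c:/data/xml-files is hypothetical, and the output is plain
text lines in the shape of the report requested in the first message.

```xquery
declare namespace saxon = "http://saxon.sf.net/";

(: Hypothetical location; replace with the directory holding the files. :)
let $values :=
    for $doc in collection('file:///c:/data/xml-files?select=*.xml;on-error=ignore')
    (: Keep only the attribute strings so the document itself can be discarded. :)
    return saxon:discard-document($doc)//node/@value/string(.)
return
    string-join(
        for $v in distinct-values($values)
        let $c := count($values[. eq $v])
        order by $c descending
        return concat("node value '", $v, "' ", $c),
        "&#10;")
```

Because only strings are retained in $values, no references to the source
documents should remain once each one has been scanned.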


Re: Count element occurrences in a group of XML files and provide a report

Andrew Welch
Ahh ok thanks.  saxon:discard-document() is still needed (?) - I
thought this would be an ideal problem to solve using XQuery.... is
there any reason to use XQuery over XSLT when an XML database isn't
involved?


Re: Count element occurrences in a group of XML files and provide a report

Andrew Welch
Ahh ok thanks.

For this kind of problem would you say XSLT or XQuery is best - they
seem pretty interchangeable...


RE: Count element occurrences in a group of XML files and provide a report

Michael Kay
In reply to this post by Andrew Welch
 
> Ahh ok thanks.  saxon:discard-document() is still needed (?) - I
> thought this would be an ideal problem to solve using XQuery.... is
> there any reason to use XQuery over XSLT when an XML database isn't
> involved?

It's more concise, if that's something you value, and it probably comes more
naturally to people with an SQL background. Also query products tend to be
more optimized for data applications (e.g. joins).

Possibly, because XQuery is a smaller language with no polymorphism, it's more
amenable to the kind of static analysis that would enable one to insert the
discard-document() call automatically.

RE: Count element occurrences in a group of XML files and provide a report

Michael Kay
In reply to this post by Andrew Welch
> For this kind of problem would you say XSLT or XQuery is best - they
> seem pretty interchangeable...

I tend to reach for XSLT first out of habit, but I agree, they're pretty
interchangeable.

Re: Count element occurrences in a group of XML files and provide a report

Andrew Welch
Thanks

(gmail told me my first reply didn't send so I sent another reply...
sorry about that.)


Re: Count element occurrences in a group of XML files and provide a report

Andrew Welch
In reply to this post by Michael Kay
I spent an hour this morning painstakingly going over each line of code
(almost each - the ones I thought were important anyway) trying to
find out why this stylesheet wasn't producing any output for me.
Surely something about saxon:discard-document() wasn't right... maybe
collection() - were my paths correct?  I was using an xsl:message
after the variable to output some *** and the contents of the
variable - the *** came out but the variable was empty every time....
and then I spotted it:

<xsl:template name="main">
              ^^^^^^^^^^^

...a named template, rather than a root-matching template with the
stylesheet applied to itself (which is what I was doing).  It makes you
smile afterwards, but while you can't figure out why something that
should just work doesn't, it's so annoying.

Anyway, it works really well now (and is surprisingly fast), thanks again.


RE: Count element occurrences in a group of XML files and provide a report

Michael Kay
 
> I spent an hour this morning painstakingly going over each line of code
> (almost each - the ones I thought were important anyway) trying to
> find out why this stylesheet wasn't producing any output for me.

The -T (trace) option is always worth trying.
