python - How to improve EXSLT performance problems when using functions


tl;dr:

It seems that running EXSLT is way slower than its counterpart in XSLT 2 (18 hours vs 7 minutes).

Below I explain the problem, writing down both implementations of the same transform, in EXSLT and in XSLT 2.

Of course, the engines are different: for XSLT 2 I use Saxon-HE, and for EXSLT I use Python's lxml.

I am asking how to improve the speed of the EXSLT part, since I'd prefer to use Python rather than Java.


I have to convert a large XML file (~200k tier 1 elements) to CSV.

I've got 2 implementations:

  • one uses Python, with libxml underneath, and uses EXSLT.
  • another uses Saxon-HE, and uses an XSLT 2 transformation.

Since, when writing a CSV, I have to print the separators even if there is no value for an element, I've taken this approach:

I've created 2 functions:

myf:printelement receives an element and a number that represents the number of separators that must be written if the element is empty.

myf:printattr receives an attribute, and prints it plus a separator.

If I define the separator as:

<xsl:param name="delim" select="','"/> 

the functions are declared in each file as follows:

XSLT 2

<!-- shortcut function to print an attribute plus a delimiter -->
<xsl:function name="myf:printattr" as="xs:string">
    <xsl:param name="pattr" as="attribute()*"/>
    <xsl:value-of select="concat($pattr,$delim)"/>
</xsl:function>

<!-- function to call apply-templates if the given elements exist;
     else, return as many delimiters as the number given in the second parameter -->
<xsl:function name="myf:printelement" as="item()*">
    <xsl:param name="pelement" as="element()*"/>
    <xsl:param name="pcount" as="xs:integer"/>
    <xsl:choose>
        <xsl:when test="$pelement">
            <xsl:apply-templates select="$pelement"/>
        </xsl:when>
        <xsl:otherwise>
            <!-- explicit empty separator, or it will add a space -->
            <xsl:value-of select="for $i in 1 to $pcount return $delim" separator=""/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:function>

EXSLT

<!-- shortcut function to print an attribute plus a delimiter -->
<func:function name="myf:printattr">
    <xsl:param name="pattr"/>
    <func:result select="concat($pattr,$delim)"/>
</func:function>

<!-- function to call apply-templates if the given elements exist;
     else, return as many delimiters as the number given in the second parameter -->
<func:function name="myf:printelement" as="item()*">
    <xsl:param name="pelement" as="element()*"/>
    <xsl:param name="pcount" as="xs:integer"/>
    <xsl:choose>
        <xsl:when test="$pelement">
            <func:result>
                <xsl:apply-templates select="$pelement"/>
            </func:result>
        </xsl:when>
        <xsl:otherwise>
            <!-- explicit empty separator, or it will add a space -->
            <func:result select="str:padding($pcount,$delim)"/>
        </xsl:otherwise>
    </xsl:choose>
</func:function>

The rest of the two documents is the same.

So, let's say I have an XML like this:

<root>
  <tier1 attr1="a" attr2="b"/>
  <tier1 attr1="c" attr2="d">
    <child2 type="1" val="abc"/>
    <child2 type="3" val="123"/>
  </tier1>
  <tier1 attr1="e" attr2="f">
    <child2 type="2" val="pancakes"/>
    <child2 type="1" val="42"/>
    <child3 a="h">
        <child4 month="jun" day="3" average="1200"/>
    </child3>
  </tier1>
</root>

with:

<xsl:param name="break" select="'&#xa;'"/>

<xsl:template match="/">
    <xsl:apply-templates select="root/tier1"/>
</xsl:template>

<xsl:template match="tier1">
    <xsl:value-of select="myf:printattr(@attr1)"/>
    <xsl:value-of select="myf:printattr(@attr2)"/>
    <xsl:value-of select="myf:printattr(child2[@type='1']/@val)"/>
    <xsl:value-of select="myf:printattr(child2[@type='2']/@val)"/>
    <xsl:value-of select="myf:printattr(child2[@type='3']/@val)"/>
    <xsl:apply-templates/>
    <!-- line break after each tier1 -->
    <xsl:if test="following-sibling::*">
        <xsl:value-of select="$break"/>
    </xsl:if>
</xsl:template>

<xsl:template match="child3">
    <xsl:value-of select="myf:printattr(@a)"/>
    <xsl:value-of select="myf:printelement(child4,3)"/>
</xsl:template>

<xsl:template match="child4">
    <xsl:value-of select="myf:printattr(@month)"/>
    <xsl:value-of select="myf:printattr(@day)"/>
    <!-- don't want a comma after the last element -->
    <xsl:value-of select="@average"/>
</xsl:template>

I get the desired CSV output:

t1_attr1, t1_attr2, c2_t1, c2_t2, c2_t3, c3_a, c4_month, c4_day, c4_average
a,b,,,,,,,
c,d,abc,,123,,,,
e,f,42,pancakes,,h,jun,3,1200

Some notes on the above:

  • child2 can be repeated under tier1, but there is a given set of values for type, and they are not repeated.

  • Also, there is no text inside the elements, which makes the approach with the 2 functions cover all the possible cases I can encounter. That said, printattr would work with text nodes too.

  • I've added the column names to make it easier to read. In the code I add them at the start, as an inner node set for EXSLT, and as a simple array of strings for XSLT 2.

So, now, the problem:

As I said at the start, I have to run this transform against a huge file, with more than 200k tier1 elements.

  • with Saxon-HE it takes 7 minutes
  • with Python, it takes 18 hours

Both transform scripts/programs do the same:

  1. open the file
  2. open the XSLT
  3. apply the latter to the former
  4. save the result

I know I'm talking about different implementations of the transform engine, but the difference is too notable to be only because of this. The only way to test with the same engine would be to use EXSLT under Saxon-PE or Saxon-EE, as it is not available in Saxon-HE. And, of course, there is no XSLT 2 implementation in Python.

I would like to know why the Python version takes so long. Is it inherent to the use of EXSLT? Or is there a way to improve this?

Of course this is an example XML; the real one has a lot more elements and is more complex.

This is part of a larger project and I wouldn't like to depend on a JVM for this, but with the difference this huge right now, Python is not an option.

Thanks!

To me it looks as if you are massively over-engineering the problem.

The following simple XSLT 1.0 transformation

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="utf-8" />

  <xsl:template match="/root">
    <xsl:text>t1_attr1,t1_attr2,c2_t1,c2_t2,c2_t3,c3_a,c4_month,c4_day,c4_average</xsl:text>
    <xsl:apply-templates select="tier1" />
  </xsl:template>

  <xsl:template match="tier1">
    <xsl:text>&#xa;</xsl:text>
    <xsl:value-of select="@attr1" />                   <xsl:text>,</xsl:text>
    <xsl:value-of select="@attr2" />                   <xsl:text>,</xsl:text>
    <xsl:value-of select="child2[@type = '1']/@val" /> <xsl:text>,</xsl:text>
    <xsl:value-of select="child2[@type = '2']/@val" /> <xsl:text>,</xsl:text>
    <xsl:value-of select="child2[@type = '3']/@val" /> <xsl:text>,</xsl:text>
    <xsl:value-of select="child3/@a" />                <xsl:text>,</xsl:text>
    <xsl:value-of select="child3/child4/@month" />     <xsl:text>,</xsl:text>
    <xsl:value-of select="child3/child4/@day" />       <xsl:text>,</xsl:text>
    <xsl:value-of select="child3/child4/@average" />
  </xsl:template>
</xsl:transform>

when applied to

<root>
  <tier1 attr1="a" attr2="b">
  </tier1>
  <tier1 attr1="c" attr2="d">
    <child2 type="1" val="abc" />
    <child2 type="3" val="123" />
  </tier1>
  <tier1 attr1="e" attr2="f">
    <child2 type="2" val="pancakes" />
    <child2 type="1" val="42" />
    <child3 a="h">
        <child4 month="jun" day="3" average="1200" />
    </child3>
  </tier1>
</root>

produces

t1_attr1,t1_attr2,c2_t1,c2_t2,c2_t3,c3_a,c4_month,c4_day,c4_average
a,b,,,,,,,
c,d,abc,,123,,,,
e,f,42,pancakes,,h,jun,3,1200
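The transformation above works because selecting a missing attribute simply yields an empty string, so every delimiter can be written unconditionally. If XSLT is not a hard requirement, the same flat mapping can also be sketched in plain Python. This is a swapped-in alternative rather than the approach shown above: it uses only the standard library (xml.etree.ElementTree.iterparse and csv), and the helper names attr, tier1_row and xml_to_csv are hypothetical. No timing is claimed for it; it would need to be measured against the real file.

```python
import csv
import xml.etree.ElementTree as ET

# Column names taken from the header line of the XSLT above.
HEADER = ["t1_attr1", "t1_attr2", "c2_t1", "c2_t2", "c2_t3",
          "c3_a", "c4_month", "c4_day", "c4_average"]


def attr(parent, path, name):
    """Return attribute `name` of the first node matching `path`, or ''.

    An empty `path` means the attribute is read from `parent` itself.
    A missing node or attribute yields '', mirroring xsl:value-of.
    """
    node = parent.find(path) if path else parent
    return node.get(name, "") if node is not None else ""


def tier1_row(t):
    """Map one <tier1> element to the nine CSV columns."""
    return [
        attr(t, "", "attr1"),
        attr(t, "", "attr2"),
        attr(t, "child2[@type='1']", "val"),
        attr(t, "child2[@type='2']", "val"),
        attr(t, "child2[@type='3']", "val"),
        attr(t, "child3", "a"),
        attr(t, "child3/child4", "month"),
        attr(t, "child3/child4", "day"),
        attr(t, "child3/child4", "average"),
    ]


def xml_to_csv(source, out):
    """Read `source` (a path or binary file object) incrementally and
    write one CSV row per <tier1> element to the text stream `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tier1":
            writer.writerow(tier1_row(elem))
            # Drop the element's children and attributes once written,
            # which reduces memory use on large inputs.
            elem.clear()
```

Calling xml_to_csv("input.xml", open("output.csv", "w", newline="")) (both file names are placeholders) would cover the open, transform and save steps in one pass. Note that csv.writer quotes values containing commas or quotes, which the XSLT version does not.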
