python - How to improve EXSLT performance problems when using functions
tl;dr:
It seems that running EXSLT is far slower than its counterpart in XSLT 2 (18 hours versus 7 minutes).
Below I explain the problem, writing down both implementations of the same transform, in EXSLT and in XSLT 2.
Of course, the engines are different: for XSLT 2 I use Saxon-HE, and for EXSLT I use Python with lxml.
I am asking how to improve the speed of the EXSLT part, since I would prefer to use Python rather than Java.
I have to convert a large XML file (~200k tier-1 elements) to CSV.
I've got 2 implementations:
- One uses Python, with libxml underneath, and uses EXSLT.
- The other uses Saxon-HE, and uses an XSLT 2 transformation.
Since, when writing CSV, I have to print the separators even if there is no value for an element, I've taken this approach:
I've created 2 functions:
myf:printelement
Receives an element and a number, which represents the number of separators that must be written if the element is empty.
myf:printattr
Receives an attribute, and prints it plus a separator.
If I define the separator as:
<xsl:param name="delim" select="','"/>
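To make the intended behaviour of the two functions concrete, here is a minimal, language-neutral sketch in Python. It is only an illustration: the names `DELIM`, `print_attr`, `print_element`, and `render` are hypothetical and do not appear in either stylesheet, and `render` merely stands in for `xsl:apply-templates`.

```python
DELIM = ","  # plays the role of the $delim stylesheet parameter


def print_attr(value, delim=DELIM):
    """Model of myf:printattr: the attribute value (or '') plus one delimiter."""
    return (value or "") + delim


def print_element(element, count, render, delim=DELIM):
    """Model of myf:printelement.

    If the element exists, delegate to `render` (a stand-in for
    xsl:apply-templates); otherwise emit `count` bare delimiters.
    """
    if element is None:
        return delim * count
    return render(element)


# A present attribute followed by a missing one still yields two separators.
print(print_attr("a") + print_attr(None))   # a,,
# A missing element that would have filled three columns.
print(print_element(None, 3, render=str))   # ,,,
```

The point is only that a missing value must still contribute its separators, so that every row ends up with the same number of columns.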
The functions are declared in each file as follows:
XSLT 2
<!-- Shortcut function to print an attribute plus the delimiter -->
<xsl:function name="myf:printattr" as="xs:string">
  <xsl:param name="pattr" as="attribute()*"/>
  <xsl:value-of select="concat($pattr,$delim)"/>
</xsl:function>

<!-- Function that calls apply-templates if the given elements exist.
     Otherwise, returns as many delimiters as the number given in the second parameter -->
<xsl:function name="myf:printelement" as="item()*">
  <xsl:param name="pelement" as="element()*"/>
  <xsl:param name="pcount" as="xs:integer"/>
  <xsl:choose>
    <xsl:when test="$pelement">
      <xsl:apply-templates select="$pelement"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- Explicit empty separator; otherwise a space is added between items -->
      <xsl:value-of select="for $i in 1 to $pcount return $delim" separator=""/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>
EXSLT
<!-- Shortcut function to print an attribute plus the delimiter -->
<func:function name="myf:printattr">
  <xsl:param name="pattr"/>
  <func:result select="concat($pattr,$delim)"/>
</func:function>

<!-- Function that calls apply-templates if the given elements exist.
     Otherwise, returns as many delimiters as the number given in the second parameter -->
<func:function name="myf:printelement" as="item()*">
  <xsl:param name="pelement" as="element()*"/>
  <xsl:param name="pcount" as="xs:integer"/>
  <xsl:choose>
    <xsl:when test="$pelement">
      <func:result>
        <xsl:apply-templates select="$pelement"/>
      </func:result>
    </xsl:when>
    <xsl:otherwise>
      <!-- Pad with $pcount delimiters when the element is missing -->
      <func:result select="str:padding($pcount,$delim)"/>
    </xsl:otherwise>
  </xsl:choose>
</func:function>
The rest of the two documents is the same.
So, let's say I have an XML like this:
<root>
  <tier1 attr1="a" attr2="b"/>
  <tier1 attr1="c" attr2="d">
    <child2 type="1" val="abc"/>
    <child2 type="3" val="123"/>
  </tier1>
  <tier1 attr1="e" attr2="f">
    <child2 type="2" val="pancakes"/>
    <child2 type="1" val="42"/>
    <child3 a="h">
      <child4 month="jun"/>
    </child3>
  </tier1>
</root>
With:
<xsl:param name="break" select="'&#xA;'"/>

<xsl:template match="/">
  <xsl:apply-templates select="root/tier1"/>
</xsl:template>

<xsl:template match="tier1">
  <xsl:value-of select="myf:printattr(@attr1)"/>
  <xsl:value-of select="myf:printattr(@attr2)"/>
  <xsl:value-of select="myf:printattr(child2[@type='1']/@val)"/>
  <xsl:value-of select="myf:printattr(child2[@type='2']/@val)"/>
  <xsl:value-of select="myf:printattr(child2[@type='3']/@val)"/>
  <xsl:apply-templates/>
  <!-- Line break after each tier1 -->
  <xsl:if test="following-sibling::*">
    <xsl:value-of select="$break"/>
  </xsl:if>
</xsl:template>

<xsl:template match="child3">
  <xsl:value-of select="myf:printattr(@a)"/>
  <xsl:value-of select="myf:printelement(child4,3)"/>
</xsl:template>

<xsl:template match="child4">
  <xsl:value-of select="myf:printattr(@day)"/>
  <xsl:value-of select="myf:printattr(@month)"/>
  <!-- I don't want a comma after the last element -->
  <xsl:value-of select="@average"/>
</xsl:template>
This is the desired CSV output:
t1_attr1,t1_attr2,c2_t1,c2_t2,c2_t3,c3_a,c4_month,c4_day,c4_average
a,b,,,,,,,
c,d,abc,,123,,,,
e,f,42,pancakes,,h,jun,3,1200
Some notes on the above:
child2 can be repeated under tier1, but its type comes from a given set of values, and those values are not repeated.
Also, there is no text inside the elements, which makes the approach with 2 functions cover all the possible cases I can encounter, although printattr would work with text nodes also.
I've added the column names to make it easier to read. In the code I add them at the start, as an inner node set in EXSLT, and as a simple array of strings in XSLT 2.
So, now, the problem:
As I said at the start, I have to run this transform on a huge file, with more than 200k tier1 elements.
- With Saxon-HE, it takes 7 minutes
- With Python, it takes 18 hours
Both transform scripts/programs do the same:
- Open the file
- Open the XSLT
- Apply the latter to the former
- Save the result
I know we are talking about different implementations of the transform engine, but the difference seems too large to be explained by that alone. The only way to test with the same engine would be to use EXSLT under Saxon-PE or Saxon-EE, since it is not available in Saxon-HE. And, of course, there is no XSLT 2 implementation in Python.
I would like to know why the Python version takes so long. Is it inherent to the use of EXSLT? Or is there a way to improve this?
Of course this is an example XML; the real one has a lot more elements and is more complex.
This is part of a larger project and I wouldn't like to depend on the JVM for this, but with the difference as huge as it is now, Python is not an option.
thanks!
To me it looks as if you are massively over-engineering the problem.
The following simple XSLT 1.0 transformation
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="utf-8" />

  <xsl:template match="/root">
    <xsl:text>t1_attr1,t1_attr2,c2_t1,c2_t2,c2_t3,c3_a,c4_month,c4_day,c4_average</xsl:text>
    <xsl:apply-templates select="tier1" />
  </xsl:template>

  <xsl:template match="tier1">
    <xsl:text>&#xA;</xsl:text>
    <xsl:value-of select="@attr1" />
    <xsl:text>,</xsl:text>
    <xsl:value-of select="@attr2" />
    <xsl:text>,</xsl:text>
    <xsl:value-of select="child2[@type = '1']/@val" />
    <xsl:text>,</xsl:text>
    <xsl:value-of select="child2[@type = '2']/@val" />
    <xsl:text>,</xsl:text>
    <xsl:value-of select="child2[@type = '3']/@val" />
    <xsl:text>,</xsl:text>
    <xsl:value-of select="child3/@a" />
    <xsl:text>,</xsl:text>
    <xsl:value-of select="child3/child4/@month" />
    <xsl:text>,</xsl:text>
    <xsl:value-of select="child3/child4/@day" />
    <xsl:text>,</xsl:text>
    <xsl:value-of select="child3/child4/@average" />
  </xsl:template>
</xsl:transform>
when applied to
<root>
  <tier1 attr1="a" attr2="b">
  </tier1>
  <tier1 attr1="c" attr2="d">
    <child2 type="1" val="abc" />
    <child2 type="3" val="123" />
  </tier1>
  <tier1 attr1="e" attr2="f">
    <child2 type="2" val="pancakes" />
    <child2 type="1" val="42" />
    <child3 a="h">
      <child4 month="jun" day="3" average="1200" />
    </child3>
  </tier1>
</root>
produces
t1_attr1,t1_attr2,c2_t1,c2_t2,c2_t3,c3_a,c4_month,c4_day,c4_average
a,b,,,,,,,
c,d,abc,,123,,,,
e,f,42,pancakes,,h,jun,3,1200
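If staying in Python matters more than staying in XSLT, the same flat mapping can also be written as a streaming pass with the standard library alone. What follows is only a minimal sketch, not part of the transformation above: it assumes the real file has the same shape as the sample, the names `HEADER`, `tier1_row`, and `xml_to_csv` are made up for illustration, and its speed on the real 200k-element file has not been measured.

```python
import csv
import io
import xml.etree.ElementTree as ET

HEADER = ["t1_attr1", "t1_attr2", "c2_t1", "c2_t2", "c2_t3",
          "c3_a", "c4_month", "c4_day", "c4_average"]


def tier1_row(tier1):
    """Build one CSV row; missing attributes or children become ''."""
    # Map each child2 type to its val; the first occurrence of a type wins.
    vals = {}
    for child2 in tier1.findall("child2"):
        vals.setdefault(child2.get("type"), child2.get("val", ""))
    child3 = tier1.find("child3")
    child4 = child3.find("child4") if child3 is not None else None

    def attr(elem, name):
        return elem.get(name, "") if elem is not None else ""

    return [
        attr(tier1, "attr1"), attr(tier1, "attr2"),
        vals.get("1", ""), vals.get("2", ""), vals.get("3", ""),
        attr(child3, "a"),
        attr(child4, "month"), attr(child4, "day"), attr(child4, "average"),
    ]


def xml_to_csv(source, out):
    """Stream tier1 elements from `source` and write CSV rows to `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tier1":
            writer.writerow(tier1_row(elem))
            elem.clear()  # drop children and attributes once the row is written


SAMPLE = """<root>
  <tier1 attr1="a" attr2="b"> </tier1>
  <tier1 attr1="c" attr2="d">
    <child2 type="1" val="abc" />
    <child2 type="3" val="123" />
  </tier1>
  <tier1 attr1="e" attr2="f">
    <child2 type="2" val="pancakes" />
    <child2 type="1" val="42" />
    <child3 a="h"><child4 month="jun" day="3" average="1200" /></child3>
  </tier1>
</root>"""

out = io.StringIO()
xml_to_csv(io.BytesIO(SAMPLE.encode("utf-8")), out)
print(out.getvalue(), end="")
```

On the sample input this prints the same four lines as the XSLT 1.0 transformation. Because `csv.writer` quotes values that contain the delimiter, it also covers a case the plain string concatenation in the stylesheet does not.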