[Closed] Fast reading of XML files
Below I’ve written some code that searches recursively through a selected directory for XML files, reads them, and adds one piece of information from each file to an array:
dir = @"D:\Test"
TimeStart = timestamp()
XML_Array = #()
Name_Array = #()
fn getFilesRecursive root =
(
    dir_array = GetDirectories (root + @"\*")
    -- dir_array grows while the loop runs, so newly joined subdirectories get visited too
    for d in dir_array do
    (
        xml_name = (getfiles (d + @"*.xml"))
        if xml_name.count == 0 then join dir_array (GetDirectories (d + @"\*"))
        else join XML_Array xml_name
    )
)
getFilesRecursive dir
TimeEnd = timestamp()
print ("XML Count: " + XML_Array.count as string)
print ("File finding took " + ((TimeEnd - TimeStart) / 1000.0) as string + "s")
TimeStart = timestamp()
for i in XML_Array do
(
    xmlDoc = dotNetObject "system.xml.xmlDocument"
    xmlDoc.load i
    xmlroot = xmlDoc.DocumentElement.FirstChild
    myNodes = xmlroot.selectNodes "name"
    if myNodes.itemOf[0] != undefined then append Name_Array myNodes.itemOf[0].innertext
)
TimeEnd = timestamp()
print ("XML reading took " + ((TimeEnd - TimeStart) / 1000.0) as string + "s")
It works, but it is very slow when dealing with large numbers of files. I want to make this snippet as fast as possible. Could any of you point me towards a way of doing that? External programs, C# assemblies, DLL files, anything.
I have next to no knowledge of programming languages outside of maxscript and I have a feeling the solution will require the knowledge of one. I would be extremely grateful for any help you offer me!
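For the file-finding half of the script, one option that stays inside MAXScript is to let .NET walk the directory tree in a single call. The following is only a sketch under assumptions: it assumes every *.xml file under the root is wanted (including files in the root folder itself and in folders that already contain XML files, which the loop above skips), and that no folder in the tree is access-restricted, since GetFiles throws in that case.

```maxscript
-- Sketch: collect all *.xml files below a root folder with System.IO.Directory
dir = @"D:\Test"
TimeStart = timestamp()
searchOption = dotNetClass "System.IO.SearchOption"
-- AllDirectories makes GetFiles descend into every subfolder in one call
XML_Array = (dotNetClass "System.IO.Directory").GetFiles dir "*.xml" searchOption.AllDirectories
TimeEnd = timestamp()
print ("XML Count: " + XML_Array.count as string)
print ("File finding took " + ((TimeEnd - TimeStart) / 1000.0) as string + "s")
```

Whether this is actually faster than the GetDirectories loop would need to be timed on the real folder structure.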
dir = GetDir #maxroot
TimeStart = timestamp()
XML_Array = #()
Name_Array = #()
fn getFilesRecursive root =
(
    dir_array = GetDirectories (root + @"\*")
    for d in dir_array do
    (
        xml_name = (getfiles (d + @"*.xml"))
        if xml_name.count == 0 then join dir_array (GetDirectories (d + @"\*"))
        else join XML_Array xml_name
    )
)
getFilesRecursive dir
TimeEnd = timestamp()
print ("XML Count: " + XML_Array.count as string)
print ("File finding took " + ((TimeEnd - TimeStart) / 1000.0) as string + "s")
TimeStart = timestamp()
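fn parseXml filename = (
    local f = openFile filename mode:"r"
    if f != undefined do (
        -- skipToString stops right after "<name>", or returns undefined at end of file
        if (skipToString f "<name>") != undefined do (
            local n = readLine f
            local closePos = findString n "<"
            -- only handles a <name> element whose text and closing tag sit on the same line
            if closePos != undefined do append Name_Array (substring n 1 (closePos - 1))
        )
        -- close the stream whether or not a name was found
        close f
    )
)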
for i in XML_Array do
(
    -- these files throw some XML errors when loaded through dotNet, so I excluded them to compare parsing performance
    case filenameFromPath i of (
        ("AdlmIntRes.xml"): continue
        ("umbrella-books-data.xml"): continue
        ("mrMetaSLSupportList.xml"): continue
        ("NodetoMSL_MappingTable.xml"): continue
        default: ()
    )
    parseXml i
    /*
    xmlDoc = dotNetObject "system.xml.xmlDocument"
    xmlDoc.load i
    xmlroot = xmlDoc.DocumentElement.FirstChild
    if xmlroot == undefined then continue
    myNodes = xmlroot.selectNodes "name"
    if myNodes.itemOf[0] != undefined then append Name_Array myNodes.itemOf[0].innertext
    */
)
TimeEnd = timestamp()
print ("XML reading took " + ((TimeEnd - TimeStart) / 1000.0) as string + "s")
I'm not sure whether my attempt returns valid names, but it is at least five times as fast.
I also found that “System.Xml.XmlReader” could be used for the exact same purpose. I will look into it when I have some spare time.
fn regexXml filename = (
    pattern = "<name>(.*?)</name>"
    match = (dotNetClass "System.Text.RegularExpressions.RegEx").Match ((dotnetclass "System.IO.File").readalltext filename) pattern
    if match.Success then (
        append Name_Array match.Groups.item[1].value
    )
)
fn parseXmlReader filename = (
    reader = (dotNetclass "system.xml.xmlreader").create filename
    reader.MoveToContent()
    while reader.Read() do (
        if reader.Name == "name" then (
            append Name_Array (reader.ReadElementContentAsString())
            break
        )
    )
    reader.dispose()
)
Regex seems like the winner, while XmlReader is actually about twice as slow as XmlDocument.
Definitely an improvement! Thank you for giving my problem a shot. I thought for sure this would need a custom C# assembly or something, since in this thread, http://forums.cgsociety.org/archive/index.php?t-861178.html, someone else is having trouble with XML reading speeds as well, although in that case it is one big XML file rather than many small ones.
I have a question regarding the regexXml function. How should I modify the pattern variable so that it can read more than one line? For example if the xml structure was this:
<root>
    <item>
        <name>
            <info1>Tex1</info1>
            <info2>Tex2</info2>
        </name>
    </item>
</root>
and I wanted to append all text between the name nodes, including the brackets and new lines, how would I do that? I’ve been trying to find a way to do that but I just can’t do it.
fn regexXml filename = (
    pattern = "<name>(.*?)</name>"
    -- the Singleline option makes "." match line breaks as well, so the capture can span several lines
    regex = dotNetobject "System.Text.RegularExpressions.RegEx" pattern (dotnetclass "System.Text.RegularExpressions.RegexOptions").singleline
    match = regex.match ((dotnetclass "System.IO.File").readalltext filename)
    if match.Success then (
        append Name_Array match.Groups.item[1].value
    )
)
You should probably trim the leading and trailing whitespace after parsing.
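As a small illustration of that trimming step, here is a sketch using MAXScript's trimLeft and trimRight. The raw string is a made-up stand-in for what the Singleline capture might return for the structure above; the actual whitespace depends on how the files are indented.

```maxscript
-- Hypothetical capture: the text between <name> and </name>, including line breaks
raw = "\n    <info1>Tex1</info1>\n    <info2>Tex2</info2>\n"
-- strip spaces, tabs and line breaks from both ends; inner line breaks are kept
whitespaceChars = " \t\r\n"
clean = trimLeft (trimRight raw whitespaceChars) whitespaceChars
print clean
```

Inside regexXml the same call could wrap match.Groups.item[1].value before it is appended to Name_Array.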
I have another question. Hope I’m not troubling you.
for i in XML_Array do
(
    alltxt = ((dotnetclass "System.IO.File").readalltext i)
    match = (dotNetClass "System.Text.RegularExpressions.RegEx").Match alltxt "<info1>(.*?)</info1>"
    if match.Success then append Info1_Array match.Groups.item[1].value
    match = (dotNetClass "System.Text.RegularExpressions.RegEx").Match alltxt "<info2>(.*?)</info2>"
    if match.Success then append Info2_Array match.Groups.item[1].value
)
If I search for more than one piece of info in the XML files, it seems to take more time than literally reading all the text and finding the info I want by filtering with filterString and substituteString, which I imagine shouldn’t be the case. What am I doing wrong here?
You can use the ‘Matches’ method, which returns a collection:
s = "<info113312>Tex1</info113312>
<info2>Tex2</info2>"
m = (dotNetClass "System.Text.RegularExpressions.RegEx")
r = m.matches s "<info\d+>(.*?)</info\d+>"
r.item[0].groups.item[1].value -- Tex1
r.item[1].groups.item[1].value -- Tex2
I wonder if it’s faster than your filterString approach.
Hmmm, your example seems to assume that the two nodes we are searching for are right next to each other and in a specific order. Unfortunately, my XML files aren’t all that consistent. Looks like I will have to depend on ye olde filterString method. Thanks anyway for your help ^^
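If the ordering of the nodes is the main obstacle, one possible variation is to capture the tag name as well and sort each match by it, so the nodes can appear in any order and need not be adjacent. This is only a sketch, not something tested in this thread: collectInfo and infoRegex are hypothetical names, Info1_Array and Info2_Array are assumed to be the arrays from the question, and unlike the loop above it collects every occurrence in a file rather than just the first. No timing claim is made for it.

```maxscript
-- Sketch: one pass per file, dispatching on the captured tag name
Info1_Array = #()
Info2_Array = #()

-- Build the Regex object once, outside the file loop.
-- \1 is a backreference, so the closing tag must match the opening tag name.
regexOptions = dotNetClass "System.Text.RegularExpressions.RegexOptions"
infoRegex = dotNetObject "System.Text.RegularExpressions.Regex" @"<(info1|info2)>(.*?)</\1>" regexOptions.Singleline

fn collectInfo filename =
(
    local txt = (dotNetClass "System.IO.File").ReadAllText filename
    local found = infoRegex.Matches txt
    -- .NET collections are zero-based; the loop body is skipped when nothing matched
    for k = 0 to found.Count - 1 do
    (
        local m = found.Item[k]
        local tagName = m.Groups.Item[1].Value
        local tagText = m.Groups.Item[2].Value
        case tagName of
        (
            "info1": append Info1_Array tagText
            "info2": append Info2_Array tagText
        )
    )
)

for i in XML_Array do collectInfo i
```

Creating the Regex object once and reading each file a single time keeps the per-file work to one ReadAllText call and one Matches call; whether that beats the filterString approach would have to be measured on the real files.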