[Closed] Fast reading of XML files

May 13, 2016 6:52 pm

Below is some code I've written that searches recursively through a selected directory for XML files, reads them, and adds one piece of information from each file to an array:

dir = @"D:\Test"
TimeStart = timestamp()
XML_Array = #()
Name_Array = #()
fn getFilesRecursive root =
(
	dir_array = GetDirectories (root + @"\*")
	for d in dir_array do
	(
		xml_name = (getfiles (d + @"*.xml"))
		if xml_name.count == 0 then join dir_array (GetDirectories (d + @"\*"))
		else join XML_Array xml_name
	)
)
getFilesRecursive dir
TimeEnd = timestamp()
print ("XML Count: " + XML_Array.count as string)
print ("File finding took " + ((TimeEnd - TimeStart) / 1000.0) as string + "s")

TimeStart = timestamp()
for i in XML_Array do
(
	xmlDoc = dotNetObject "system.xml.xmlDocument"
	xmlDoc.load i
	xmlroot = xmlDoc.DocumentElement.FirstChild
	myNodes = xmlroot.selectNodes "name"
	if myNodes.itemOf[0] != undefined then append Name_Array myNodes.itemOf[0].innertext
)
TimeEnd = timestamp()
print ("XML reading took " + ((TimeEnd - TimeStart) / 1000.0) as string + "s")

It works, however it is very slow when dealing with large numbers of files. I want to make this snippet as fast as humanly possible. Could any of you point me towards a way of doing that? External programs, C# assemblies, dll files, anything.
I have next to no knowledge of programming languages outside of maxscript and I have a feeling the solution will require the knowledge of one. I would be extremely grateful for any help you offer me!

8 Replies

Serejah

May 13, 2016 6:52 pm

dir = GetDir #maxroot
TimeStart = timestamp()
XML_Array = #()
Name_Array = #()
fn getFilesRecursive root =
(
	dir_array = GetDirectories (root + @"\*")
	for d in dir_array do
	(
		xml_name = (getfiles (d + @"*.xml"))
		if xml_name.count == 0 then join dir_array (GetDirectories (d + @"\*"))
		else join XML_Array xml_name
	)
)
getFilesRecursive dir
TimeEnd = timestamp()
print ("XML Count: " + XML_Array.count as string)
print ("File finding took " + ((TimeEnd - TimeStart) / 1000.0) as string + "s")


TimeStart = timestamp()


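fn parseXml filename = (
	
	local f = openFile filename mode:"r"

	while not eof f do (
		
		-- skipToString leaves the stream just after "<name>", or returns undefined at end of file
		if skipToString f "<name>" != undefined then (
			
			local n = readLine f
			
			-- keep the text up to the closing tag (assumes </name> is on the same line)
			append Name_Array (substring n 1 ((findString n "<")-1))
			exit -- leave the while loop after the first <name>
			
		)
		
	)
	
	-- close the stream so the file handle is released even when no <name> was found
	close f
	
)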
for i in XML_Array do
(
        -- these files throw some xml errors while using dotnet, so i excluded them to compare parsing performance
	case filenameFromPath i of (
		
		("AdlmIntRes.xml"):continue
		("umbrella-books-data.xml"):continue
		("mrMetaSLSupportList.xml"): continue
		("NodetoMSL_MappingTable.xml"):continue
		default:()		
		
	)
	
	parseXml i
	/*
	xmlDoc = dotNetObject "system.xml.xmlDocument"
	xmlDoc.load i	
	xmlroot = xmlDoc.DocumentElement.FirstChild
	if xmlroot == undefined then continue
	myNodes = xmlroot.selectNodes "name"
	if myNodes.itemOf[0] != undefined then append Name_Array myNodes.itemOf[0].innertext
	*/
)
TimeEnd = timestamp()


print ("XML reading took " + ((TimeEnd - TimeStart) / 1000.0) as string + "s")

I'm not sure whether my attempt returns valid names, but it is at least five times as fast.
I also found that “System.Xml.XmlReader” could be used for the exact same purpose. I will look into it when I have some spare time.

Serejah

May 13, 2016 6:52 pm

fn regexXml filename = (
	
	pattern = "<name>(.*?)</name>"
	match = (dotNetClass "System.Text.RegularExpressions.RegEx").Match ((dotnetclass "System.IO.File").readalltext filename) pattern
	
	if match.Success then (

		append Name_Array match.Groups.item[1].value
		
	)
	
	
)

fn parseXmlReader filename = (
	
	reader = (dotNetclass "system.xml.xmlreader").create filename
	reader.MoveToContent()
	
	while reader.Read() do (
      
		  if reader.Name == "name" then (
			  
				append Name_Array (reader.ReadElementContentAsString())
				exit -- leave the while loop after the first <name> element
			  
		  )	 
	  
	)
	
	reader.dispose()
)

Regex seems like the winner, while XmlReader is actually about twice as slow as XmlDocument.

linkbird

May 13, 2016 6:52 pm

Definitely an improvement! Thank you for giving my problem a shot. I thought for sure this would have needed a custom C# assembly or something, since in this thread http://forums.cgsociety.org/archive/index.php?t-861178.html someone is having trouble with XML reading speeds as well, although in his case it is one big XML file being read instead of many small ones.

linkbird

May 13, 2016 6:52 pm

I have a question regarding the regexXml function. How should I modify the pattern variable so that it can read more than one line? For example if the xml structure was this:

<root>
	<item>
		<name>
			<info1>Tex1</info1>
			<info2>Tex2</info2>
		</name>
	</item>
</root>

and I wanted to append all text between the name nodes, including the brackets and new lines, how would I do that? I’ve been trying to find a way to do that but I just can’t do it.

Serejah

May 13, 2016 6:52 pm

fn regexXml filename = (
	
	pattern = "<name>(.*?)</name>"
	regex = dotNetobject "System.Text.RegularExpressions.RegEx" pattern (dotnetclass "System.Text.RegularExpressions.RegexOptions").singleline
	match = regex.match ((dotnetclass "System.IO.File").readalltext filename)
	
	if match.Success then (

		append Name_Array match.Groups.item[1].value
		
	)
)

You should probably trim the whitespace from the captured text after parsing.
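A minimal sketch of that trimming step, assuming a MAXScript version that provides the built-in trimLeft and trimRight functions (3ds Max 2008 or later). The `raw` string is a made-up stand-in for `match.Groups.item[1].value`:

```maxscript
-- Hypothetical captured value: with the singleline option the group
-- includes the newlines and tabs around the inner nodes.
raw = "
			<info1>Tex1</info1>
			<info2>Tex2</info2>
		"

-- With no second argument, trimLeft/trimRight strip spaces, tabs and newlines
-- from the start and the end; whitespace between the inner nodes is kept.
clean = trimRight (trimLeft raw)
```

Inside regexXml the same call could wrap the captured group, e.g. `append Name_Array (trimRight (trimLeft match.Groups.item[1].value))`.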

linkbird

May 13, 2016 6:52 pm

I have another question. Hope I’m not troubling you.

for i in XML_Array do 
	(
	alltxt = ((dotnetclass "System.IO.File").readalltext i)
	match = (dotNetClass "System.Text.RegularExpressions.RegEx").Match alltxt "<info1>(.*?)</info1>"
	if match.Success then append Info1_Array match.Groups.item[1].value
	match = (dotNetClass "System.Text.RegularExpressions.RegEx").Match alltxt "<info2>(.*?)</info2>"
	if match.Success then append Info2_Array match.Groups.item[1].value
	)

If I search for more than one piece of info in the XML files, it seems to take more time than literally reading all the text and finding the info I want by filtering with filterString and substituteString, which I imagine shouldn't be the case. What am I doing wrong here?

Serejah

May 13, 2016 6:52 pm

You can use the ‘Matches’ method, which returns a collection of all matches:


s = "<info113312>Tex1</info113312>
<info2>Tex2</info2>"
m = (dotNetClass "System.Text.RegularExpressions.RegEx")
r = m.matches s @"<info\d+>(.*?)</info\d+>" -- verbatim string (@) so the backslashes reach the regex unchanged

r.item[0].groups.item[1].value -- Tex1
r.item[1].groups.item[1].value -- Tex2

I wonder if it's faster than your filterString approach.
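If the nodes are not next to each other, or can appear in any order, one possible variation is to capture the tag name as well and sort the values by it. This is only a sketch and has not been benchmarked. The function name `collectInfo` and the sample string are made up for illustration, and the two arrays reuse the `Info1_Array` / `Info2_Array` names from the question:

```maxscript
Info1_Array = #()
Info2_Array = #()

fn collectInfo txt = (
	local rx = dotNetClass "System.Text.RegularExpressions.Regex"
	local opts = (dotNetClass "System.Text.RegularExpressions.RegexOptions").Singleline
	
	-- group 1 = tag name, group 2 = content;
	-- \1 requires the closing tag to repeat the opening tag's name
	local found = rx.Matches txt @"<(info1|info2)>(.*?)</\1>" opts
	
	for i = 0 to found.Count - 1 do (
		local m = found.Item[i]
		local tagName = m.Groups.Item[1].Value
		local val = m.Groups.Item[2].Value
		
		-- route each value by the tag it came from, not by its position
		if tagName == "info1" then append Info1_Array val else append Info2_Array val
	)
)

-- info2 comes first here and an unrelated node sits in between
collectInfo "<info2>Tex2</info2><other>x</other><info1>Tex1</info1>"
```

The file is still read and scanned only once per call, so the cost should not grow with the number of tags listed in the alternation the way separate Match calls do.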

linkbird

May 13, 2016 6:52 pm

Hmmm, your example seems to assume that the two nodes we are searching for are right next to each other and in a specific order. Unfortunately, my XML files aren't all that consistent. Looks like I will have to depend on ye olde filterString method. Thanks anyway for your help ^^