[Closed] Fast search through text file

Jan 16, 2023 1:31 am

Thank you.
Finally no error when compiling.
Here are the times – the code from my first post and the c#:

Lines: 48574

Find words 0.294 sec.
kappaArr: 1866
omicronArr: 0
upsilonArr: 0

Find words 0.037 sec.
: 0

Strange: the C# version returns 0 kappa matches, while there should be 1866.

(@serejah)

Jan 16, 2023 1:31 am

if the input data is the same in both cases then it is something with the c# code

Jan 16, 2023 1:31 am

The input data is the strToCheck, generated in the maxscript. I pass it without any modifications.

dp = dotnetobject "DocProcessor"
dp.ProcessDocument strToCheck
kp = dp.kappa
format ": %\n" kp.count

If I change the c# code to check for the first word on the first line (for example, “phi”), the returned result is still always 0.
Can we force the c# code to print directly in maxscript listener?

upd

Even if I remove all empty spaces at the beginning of the lines the result is 0.

(@serejah)

Jan 16, 2023 1:31 am

did you try the link I posted above? It worked well and lists weren’t empty

yes, you can print to the listener, but you’ll have to add autodesk.max.dll as a dll reference (dll of the max where you run the script)
and call
Autodesk.Max.GlobalInterface.Instance.TheListener.EditStream.Wputs & Autodesk.Max.GlobalInterface.Instance.TheListener.EditStream.Flush
or
Autodesk.Max.GlobalInterface.Instance.TheListener.EditStream.Printf

Jan 16, 2023 1:31 am

could you post a test case and the current and desired numbers?

(@denist)

Jan 16, 2023 1:31 am

as I understand it, the task is to find all occurrences of a specified substring in a large text file (~50k lines) by getting the line and position information. right?

Jan 16, 2023 1:31 am

I’m pretty sure finding all occurrences is very fast if you only get the position. so I would first find all the positions of the end of the lines and then the positions of the substring… after that it’s quick and easy to find the line of the substring.
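
A minimal Python sketch of that idea, not code from the thread: `find_with_lines` and the sample text are made up for illustration. It collects the line-start offsets once, finds the substring offsets, and then maps each offset to its line with a binary search.

```python
import re
from bisect import bisect_right

def find_with_lines(text, word):
    """Return (line_number, column, offset) for every occurrence of `word`."""
    # Offset of the first character of each line (line 1 starts at offset 0).
    line_starts = [0] + [m.end() for m in re.finditer("\n", text)]
    hits = []
    for m in re.finditer(re.escape(word), text):
        pos = m.start()
        # Binary search: count of line starts that are <= pos, i.e. the 1-based line.
        line = bisect_right(line_starts, pos)
        col = pos - line_starts[line - 1]
        hits.append((line, col, pos))
    return hits

sample = "alpha kappa\n\tkappa beta\nomega\n"
print(find_with_lines(sample, "kappa"))  # [(1, 6, 6), (2, 1, 13)]
```

Line numbers here are 1-based and columns 0-based; that choice is arbitrary and easy to change.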

Jan 16, 2023 1:31 am

re = python.import "re"
(
	t0 = timestamp()
	h0 = heapfree

	n = (re.finditer "\n" strToCheck)

	k = (re.finditer "kappa" strToCheck)
	o = (re.finditer "omicron" strToCheck)
	u = (re.finditer "upsilon" strToCheck)

	format "time:% heap:%\n" (timestamp() - t0) (h0 - heapfree)
)

--time:9 heap:1008L

python is very well suited for all string related methods… so we can target those numbers

(@serejah)

Jan 16, 2023 1:31 am

looks very promising. how do you convert these iterators to a mxs array?

Jan 16, 2023 1:31 am

ok, managed to test the code in max, and it turns out the issue with zero matches was in the regex patterns: they needed another couple of backslashes for the word boundary

var reOmicron = new Regex( \"^omicron\\\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );
var reKappa   = new Regex( \"^kappa\\\\b\",   RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled ); 
var reUpsilon = new Regex( \"^upsilon\\\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );
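
The same kind of escaping trap can be shown in plain Python, as a rough analogy rather than the exact MAXScript/C# situation (there the C# source sits inside a MAXScript string, so the backslashes get consumed twice). In an ordinary Python string literal `"\b"` is the backspace character, so the regex never sees a word-boundary token. The sample lines are made up.

```python
import re

line = "kappa beta gamma"

# "\b" in an ordinary string literal is the backspace character (0x08),
# so this pattern looks for "kappa" followed by a backspace.
broken = re.compile("^kappa\b", re.IGNORECASE)

# A raw string keeps the backslash, so the regex engine receives \b (word boundary).
fixed = re.compile(r"^kappa\b", re.IGNORECASE)

print(broken.search(line))            # None
print(bool(fixed.search(line)))       # True
print(bool(fixed.search("kappas")))   # False: no word boundary after "kappa"
```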

code:

(
	::dp_assembly = (
	local source = "using System;
	using System.Text;
	using System.Collections.Generic;
	using System.Text.RegularExpressions;

	public class DocProcessor
	{
		List<int> _kappa = new List<int>();
		List<string> _omicron = new List<string>();
		List<string> _upsilon = new List<string>();
		
		public int[] kappa { 
			get 
			{ 
				return _kappa.ToArray(); 
			}
		}
		
		public string[] omicron { 
			get 
			{ 
				return _omicron.ToArray(); 
			}
		}
		
		public string[] upsilon { 
			get 
			{ 
				return _upsilon.ToArray(); 
			}
		}

		public void ProcessDocument( string doc )
		{			
			_kappa.Clear();
			_omicron.Clear();
			_upsilon.Clear();
			
			var spaces = new char[]{' ','	'};
		
			var reOmicron = new Regex( \"^omicron\\\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled );
			var reKappa   = new Regex( \"^kappa\\\\b\",   RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled ); 
			var reUpsilon = new Regex( \"^upsilon\\\\b\", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase | RegexOptions.Compiled ); 
			
			using (System.IO.StringReader sr = new System.IO.StringReader(doc))
			{
				int index = 0;
				string line;
				while ((line = sr.ReadLine()) != null)
				{
					var lline = line.TrimStart( spaces );
			
					if ( reOmicron.IsMatch( lline ) )
					{		
						_omicron.Add(line);						
					}
					else			
					if ( reKappa.IsMatch( lline ) )
					{
						_kappa.Add( index );
					}
					else
					if ( reUpsilon.IsMatch( lline ) )
					{
						_upsilon.Add( line );
					}
					
					index++;
				}
			}
		}
	}"

		csharpProvider = dotnetobject "Microsoft.CSharp.CSharpCodeProvider"
		compilerParams = dotnetobject "System.CodeDom.Compiler.CompilerParameters"
		compilerParams.ReferencedAssemblies.Add("System.dll");
		compilerParams.ReferencedAssemblies.Add("System.Windows.Forms.dll");

		compilerParams.GenerateInMemory = on
		compilerResults = csharpProvider.CompileAssemblyFromSource compilerParams #(source)


		if (compilerResults.Errors.Count > 0 ) then
		(
			local errs = stringstream ""
			for i = 0 to (compilerResults.Errors.Count-1) do
			(
				local err = compilerResults.Errors.Item[i]
				format "Error:% Line:% Column:% %\n" err.ErrorNumber err.Line err.Column err.ErrorText to:errs
			)
			format "%\n" errs
			undefined
		)
		else
		(
			compilerResults.CompiledAssembly		
		)

	)

	gc()
	newLineStr = "\n"
	newTabStr = "\t "	
	--( 	generate string
	space = " "
	nl = "\n"
	tab = "\t"
	tabNl = "\t\n"
	fmt = "%\n"
	
	wordsArr = #("alpha", "beta", "gama", "delta", "Epsilon", "Zeta", "Eta", "Theta", "Iota", "kaPPa", "LamBda", "mU", "Nu", "xi", "omicron", "pi", "rHo", "siGma", "Tau", "UpSiLoN", "pHi", "chi", "psi", "omega")
	ss = stringStream ""
	
	seed 12345
	for i = 1 to 50000 do
	(
		wordsCnt = random 10 20
		wArr = for j = 1 to wordsCnt collect (wordsArr[random 1 24])
		tabCnt = random 0 5
		
		str = ""
		if mod i 17 == 0 then
			str = tabNl
		else
		(
			if mod i 33 == 0 then
				str = nl
			else
			(
				for t = 1 to tabCnt do str += tab
				
				for w in wArr do str += space + w
			)
		)
		
		format fmt str to:ss
	)
	--)
	
	
	strToCheck = toLower (ss as string)
		

	gc()
	t0 = timestamp()
	
	

	/*
	ss = strToCheck as stringstream 
	seek ss 0
	while not eof ss do
	(
		ln = trimLeft (readline ss) " 	"
		
		if MatchPattern ln pattern:"kappa*"   then count += 1 else
		if MatchPattern ln pattern:"omicron*" then count += 1 else
		if MatchPattern ln pattern:"upsilon*" do count += 1		
	)
	*/
	
	dp = (dotNetClass "System.Activator").CreateInstance (dp_assembly.GetType("DocProcessor"))
	dp.ProcessDocument strToCheck
	
	t1 = timestamp()
	format "Find words %  sec.\n" ((t1-t0)/1000.0)
	
	format "kappaArr: %\n" dp.kappa.count
	format "omicronArr: %\n" dp.omicron.count
	format "upsilonArr: %\n" dp.upsilon.count
)
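
For comparison, a rough Python equivalent of what `ProcessDocument` does. This is only a sketch: `process_document` and the test string are hypothetical, and splitting on "\n" is a simplification of `StringReader.ReadLine`. It classifies each line by its first word after leading spaces and tabs, storing line indices for kappa and whole lines for the other two.

```python
import re

# Leading spaces/tabs, then one of the three words, then a word boundary.
LEADING_WORD = re.compile(r"[ \t]*(kappa|omicron|upsilon)\b", re.IGNORECASE)

def process_document(doc):
    kappa, omicron, upsilon = [], [], []
    for index, line in enumerate(doc.split("\n")):
        m = LEADING_WORD.match(line)   # match() only succeeds at the start of the line
        if m is None:
            continue
        word = m.group(1).lower()
        if word == "kappa":
            kappa.append(index)        # zero-based line index, as in the C# code
        elif word == "omicron":
            omicron.append(line)       # the untrimmed line, as in the C# code
        else:
            upsilon.append(line)
    return kappa, omicron, upsilon

doc = "\t kappa mu\nalpha kappa\nOmicron pi\n\nupsilon\nkappas xi\n"
print(process_document(doc))  # ([0], ['Omicron pi'], ['upsilon'])
```

Note that this counts only lines that start with the word, which is not the same number as counting every occurrence of the substring.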

Jan 16, 2023 1:31 am

there might be a smarter way, but I use the first that comes to mind:

cmd = python.import "__builtin__"
cmd.list k as array
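
On the Python side this is what `list` is doing: `re.finditer` returns a lazy iterator, so the search work happens only when it is consumed. A small sketch with made-up sample text; collecting plain integer offsets instead of match objects is my assumption about what is convenient to hand back to MAXScript. Also, in Python 3 the `__builtin__` module is named `builtins`.

```python
import re

text = "kappa mu\nalpha kappa\n"

it = re.finditer("kappa", text)      # lazy: the text has not been scanned yet
offsets = [m.start() for m in it]    # consuming the iterator performs the search
print(offsets)                       # [0, 15]

# The iterator is now exhausted; a second pass yields nothing.
print(list(it))                      # []
```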

Jan 16, 2023 1:31 am

compared to the wall of text needed for c#, python is definitely a winner

Jan 16, 2023 1:31 am

though (very likely) we might have a memory issue with python in MXS