[Closed] Fast search through text file

Jan 17, 2023 10:14 pm

So, which method is preferable – to preinitialize the dict or to use your last update?

(@serejah)

Jan 17, 2023 10:14 pm

Depends on how you plan to use it.
If you’re sure that words that aren’t in the list shouldn’t be collected, why add them? Go with the preinitialized dict.
It is all about performance.
If there isn’t much difference, collect everything and then retrieve from the dict by key; in that case, of course, you’ll have to keep the else statement.

…
Just to be clear: if you go with the predefined dict, you won’t have the ---- key in the dict after you process the file.
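
A minimal sketch of the two options discussed above, in case it helps to compare them side by side. The class and method names (FirstWordIndex, Preinitialized, CollectAll) are made up for illustration, only line numbers are stored to keep it short, and the words list is assumed to be lowercase already.

```csharp
using System;
using System.Collections.Generic;

public static class FirstWordIndex
{
    // Option A: pre-create one entry per wanted word; any other first word is ignored.
    public static Dictionary<string, List<int>> Preinitialized(string[] lines, string[] words)
    {
        var data = new Dictionary<string, List<int>>();
        foreach (var w in words) data[w] = new List<int>();

        for (int i = 0; i < lines.Length; i++)
        {
            string word = FirstWord(lines[i]);
            List<int> hits;
            // No else branch: unknown words are simply skipped.
            if (data.TryGetValue(word, out hits)) hits.Add(i + 1);
        }
        return data;
    }

    // Option B: collect every first word; the else branch creates missing entries.
    public static Dictionary<string, List<int>> CollectAll(string[] lines)
    {
        var data = new Dictionary<string, List<int>>();
        for (int i = 0; i < lines.Length; i++)
        {
            string word = FirstWord(lines[i]);
            if (word.Length == 0) continue; // blank line
            List<int> hits;
            if (data.TryGetValue(word, out hits)) hits.Add(i + 1);
            else data[word] = new List<int> { i + 1 };
        }
        return data;
    }

    // First whitespace-delimited token, lowercased; empty string for blank lines.
    static string FirstWord(string line)
    {
        var parts = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return parts.Length > 0 ? parts[0].ToLower() : "";
    }
}
```

With option A, a line starting with ---- never creates a key; with option B it does, and you filter by key afterwards.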

upd
This one must be the fastest version so far: ~45 ms.

using System.Collections.Generic;
using System.IO;

public class Line
{
	public int count = 0;
	public List<string> lines;
	public List<int> indexes;
	
	public Line()
	{
		count = 0;
		lines   = new List<string>(){};
		indexes = new List<int>();
	}
	
	
	public Line( string line, int index )
	{
		count   = 1;
		lines   = new List<string>(){ line };
		indexes = new List<int>(){ index };
	}
	
	public void AddLine( string line, int index )
	{
		lines.Add( line );
		indexes.Add( index );
		count++;
	}
}

public class TextProcessor
{
	public Dictionary<string, Line> data;
	public string[] keys;
	public Line[] lines;
	
	public void ProcessFile(string file , string[] words)
	{	
		data = new Dictionary<string, Line>();
		
		//foreach( var w in words) data[w] = new Line();
		
		var spaces = new char[]{' ','\t','\r','\n'};
		string[] lines = File.ReadAllLines(file);

		if (lines != null)
		{
			int line_index = 0;
			
			string word;
			
			foreach(string line in lines)
			{
				line_index++;
				
				// #1
				//word = line.TrimStart(spaces).Split()[0].ToLower();
				
				// #2
				//word = line.TrimStart(spaces);
				//int len  = word.IndexOfAny( spaces );
				//word = word.Substring( 0, len < 0 ? word.Length : len ).ToLower();
				
				// #3
				//var chars = line.ToCharArray();						
				//int f1 = Array.FindIndex(chars,     x => !char.IsWhiteSpace(x));						
				//if ( f1 < 0 ) continue;						
				//int f2 = Array.FindIndex(chars, f1, x => char.IsWhiteSpace(x));
				//word = line.Substring( f1, f2 - f1 ).ToLower();
				
				// #4						
				int f1 = -1;
				for ( int i = 0; i < line.Length; i++ )
				{
					if ( !char.IsWhiteSpace(line[i]) )
					{
						f1 = i;
						break;
					}
				}
				
				if ( f1 < 0 ) continue;
				int f2 = -1; // stays -1 when no whitespace follows the word, so Substring below gets a negative length
				for ( int i = f1; i < line.Length; i++ )
				{
					if ( char.IsWhiteSpace(line[i]) )
					{
						f2 = i;
						break;
					}
				}
				word = line.Substring( f1, f2 - f1 ).ToLower();
										
				if (data.ContainsKey(word))
				{
					data[word].AddLine( line, line_index );							
				}
				else
				{
					data[word] = new Line( line, line_index );							
				}
			}
		}
		keys = new string[data.Keys.Count];
		data.Keys.CopyTo(keys, 0);				
		this.lines = new Line[data.Values.Count];
		data.Values.CopyTo(this.lines, 0);
	}
	
}

denisT

Jan 17, 2023 10:14 pm

// #1.5
var word = line.TrimStart().Split(null, 2)[0].ToLower();

Jan 17, 2023 10:14 pm

Yep, the last one is the fastest.

When I make it work with a string, some files produce an error:
– Runtime error: .NET runtime exception: Length cannot be less than zero.

Found why. If a line starts with:

/

or with:
----
there is an error.

The same code here works with no errors: https://dotnetfiddle.net/9KqZve

The times I have with different versions:

pure maxscript: 0.68 sec
first C# version with regEx: 0.33 sec.
last C# version: 0.26 sec – including time to fill maxscript arrays with the data. Only C# code is 0.05 sec

Jan 17, 2023 10:14 pm

well, I doubt that this test file that you provided is a good test case to ensure that the solution is bug-free.

Change this line as follows and it will work:
int f2 = line.Length;
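
To spell out why that helps: with f2 starting at -1, a line such as ---- has no whitespace after the first word, so f2 stays -1 and Substring( f1, f2 - f1 ) is called with a length of -1, which is the "Length cannot be less than zero" exception. A small sketch of variant #4 with the change applied (FirstWordDemo and FirstWord are illustrative names, not from the original code):

```csharp
using System;

public static class FirstWordDemo
{
    // Same scan as variant #4, but f2 defaults to line.Length,
    // so a word that runs to the end of the line is still valid.
    public static string FirstWord(string line)
    {
        int f1 = -1;
        for (int i = 0; i < line.Length; i++)
        {
            if (!char.IsWhiteSpace(line[i])) { f1 = i; break; }
        }
        if (f1 < 0) return ""; // blank or whitespace-only line

        int f2 = line.Length;
        for (int i = f1; i < line.Length; i++)
        {
            if (char.IsWhiteSpace(line[i])) { f2 = i; break; }
        }
        return line.Substring(f1, f2 - f1).ToLower();
    }
}
```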

nah, still two times slower. ~100ms

(@miauu)

Jan 17, 2023 10:14 pm

Fixed, thank you.

For now everything works as expected – fast and accurate.

Jan 17, 2023 10:14 pm

cool, perhaps there’s nothing left to optimize
even the latest .NET 7 version shows 28 ms, which isn’t far from the 40+ ms on my device

curious, what is it all for? some kind of obfuscator / source file analyzer?

(@miauu)

Jan 17, 2023 10:14 pm

To make navigating in scripts more user friendly.
You know that Ctrl+RMB click gives you a menu where you can see the controls, functions, events, etc. used in the current script, and you can go wherever you want by clicking the desired item.
When I see the whole screen (literally) full of items to click, and there are more items not shown… it’s not an easy task to find what you need.
So, I created this:

For the currently opened script, the [ </> ] button shows buttons for all available controls + local and global vars + struct names + functions + lines where I have format and print. Clicking a button shows the data. Clicking a row in the list shows the line number of the selected text; double-clicking a row selects the same line in the MaxScript editor and makes it visible, so navigation is easy and fast.
The button with the bookmark icon shows all bookmarks for the current script. The navigation is the same – click a row in the list to go to the desired bookmark.
The [ /* ] button shows all comments that start with --. The same way of navigation.
With the filter box I can find what I need much faster.
And most importantly, with your help and Denis’s help, collecting the data is much faster than Ctrl+RMB click inside the MaxScript Editor.

What I have in my ToDo list:

collect variables inside structs
find a faster way to populate the listview. It takes more than 1.5 sec to fill it with 4000+ items.

denisT

Jan 17, 2023 10:14 pm

what a bore you are!

			public static string GetFirstWord(string line)
			{
				int i = 0;
				while (i < line.Length && char.IsWhiteSpace(line, i)) { i++; }
				int k = i;
				while (k < line.Length && !char.IsWhiteSpace(line, k)) { k++; }
				return line.Substring(i, k-i);
			}

btw… on my machine the difference for pure searching and using trim and split is 25 vs 40. So it’s not a big deal.

Jan 17, 2023 10:14 pm

shouldn’t have said that

nice, sometimes it drops even below 40ms

Jan 17, 2023 10:14 pm

looks really good. must be very convenient to use for anyone who is too lazy to switch to vscode or similar.

listview supports virtualization, maybe that’s the way to go. (never used it myself)
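
A rough sketch of what virtual mode could look like, assuming the control is the WinForms System.Windows.Forms.ListView (the thread doesn’t say which one is used). VirtualListDemo and Create are made-up names; the idea is that the list only asks for the rows it actually has to draw instead of being filled with 4000+ items up front.

```csharp
using System;
using System.Windows.Forms;

public static class VirtualListDemo
{
    // Builds a details-view ListView that pulls rows on demand from `rows`.
    public static ListView Create(string[] rows)
    {
        var list = new ListView
        {
            View = View.Details,
            VirtualMode = true, // items are requested via RetrieveVirtualItem
            Dock = DockStyle.Fill
        };
        list.Columns.Add("Line", 400);

        // Called only for rows the control needs (typically the visible ones).
        list.RetrieveVirtualItem += (sender, e) =>
        {
            e.Item = new ListViewItem(rows[e.ItemIndex]);
        };

        // Tell the control how many rows exist without creating any of them.
        list.VirtualListSize = rows.Length;
        return list;
    }
}
```

In virtual mode the Items collection can’t be filled directly, so filtering would mean swapping the backing array and updating VirtualListSize.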

did you mean struct properties?
it shouldn’t be a big deal: find a struct definition, then find the opening ( and closing ), and tokenize everything that’s inside that range

just the idea

...
re_options_m = dotNet.combineEnums (dotNetClass "System.Text.RegularExpressions.RegexOptions").MultiLine (dotNetClass "System.Text.RegularExpressions.RegexOptions").IgnoreCase,
	
...
				
--  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
--		2. Collect struct defs
--  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
(
local pattern_struct = "\bstruct\s(\w+)\b"
matches = (dotNetClass "System.Text.RegularExpressions.RegEx").Matches code pattern_struct re_options_m

for i = 0 to matches.count - 1 do 
(
	local item        = matches.item[i].groups.item[1]
	local match_start = 1 + item.index
	local match_end   = match_start + item.value.count
	
	if valid[ match_start ] and valid[ match_end ] do
	(
		append STRUCT_TOKENS ( Token type:"struct" value:item.value start:match_start end:match_end )				
	)			
)
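
For the struct-members part, here is a hedged C# sketch of the "find the opening ( and closing ), then tokenize what’s inside" idea. StructScanner, GetStructBody and GetMemberNames are hypothetical names, and the sketch assumes no parentheses, brackets or commas hide inside strings or comments; a real version would have to mask those first (the valid[] check in the snippet above looks like it serves that purpose).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StructScanner
{
    // Returns the text between the "(" following `struct <name>` and its matching ")".
    public static string GetStructBody(string code, string name)
    {
        var m = Regex.Match(code, @"\bstruct\s+" + Regex.Escape(name) + @"\b",
                            RegexOptions.IgnoreCase);
        if (!m.Success) return null;

        int open = code.IndexOf('(', m.Index + m.Length);
        if (open < 0) return null;

        int depth = 0;
        for (int i = open; i < code.Length; i++)
        {
            if (code[i] == '(') depth++;
            else if (code[i] == ')' && --depth == 0)
                return code.Substring(open + 1, i - open - 1);
        }
        return null; // unbalanced parentheses
    }

    // Splits the body on top-level commas and keeps the leading identifier
    // of every member that is not a function definition.
    public static List<string> GetMemberNames(string body)
    {
        var names = new List<string>();
        int depth = 0, start = 0;
        for (int i = 0; i <= body.Length; i++)
        {
            bool atEnd = i == body.Length;
            if (!atEnd && (body[i] == '(' || body[i] == '[')) depth++;
            else if (!atEnd && (body[i] == ')' || body[i] == ']')) depth--;

            if (atEnd || (body[i] == ',' && depth == 0))
            {
                var m = Regex.Match(body.Substring(start, i - start), @"^\s*(\w+)");
                if (m.Success)
                {
                    string first = m.Groups[1].Value.ToLower();
                    if (first != "fn" && first != "function") names.Add(m.Groups[1].Value);
                }
                start = i + 1;
            }
        }
        return names;
    }
}
```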

(@miauu)

Jan 17, 2023 10:14 pm

Thank you.
I use vscode for Python and PowerShell, but I can’t force myself to use it (or notepad++) for maxscript.

Thank you. Will check it in the coming days. First I have to “fix and arrange” the code of the whole script.

Yes. The functions defined inside structs are collected by the script, but the variables do not start with local or global, so they are not collected.